Inferring Global Chlorophyll-a Depth Profiles from PACE Hyperspectral Rrs

Eli Holmes

Motivation: Global CHLA Depth Profiles

Satellite chlorophyll products primarily represent near-surface waters.
Ecosystems and biogeochemical processes depend on the vertical distribution of chlorophyll-a.
Vertical structure provides information on:
- Subsurface chlorophyll maxima
- Light–nutrient compensation
- Mixed-layer dynamics and stratification
At present, there is no global, routinely produced product of global daily CHLA depth profiles.

Relevance for Fisheries

Many fish species use habitats that are structured vertically.
Early life stages and forage species often depend on prey fields associated with subsurface productivity layers, especially across the full mixed layer depth.
Knowing the vertical distribution of CHLA can contribute to:
- Habitat models and species distribution models
- Recruitment and survival studies
- Understanding the phenology of phytoplankton vertical migration

Relevance for the Carbon Cycle

The depth distribution of phytoplankton influences:
- Depth of primary production
- Carbon export efficiency and remineralization
- Radiative transfer and heating within the upper ocean
Subsurface chlorophyll maxima can contribute substantially to net primary production.
Accurate CHLA profiles are therefore important inputs for biogeochemical and carbon-cycle models.

PACE Mission: Hyperspectral Ocean Color

PACE provides hyperspectral water-leaving radiance reflectances, Rrs(λ), across the visible spectrum.
These spectra capture:
- Pigment composition and concentration
- Particle size and concentration
- Optical properties that are influenced by subsurface structure

What Does Hyperspectral Rrs Look Like Compared to Multi-Spectral?

Why Hyperspectral Rrs Gives New Insights into CHLA at depth

Traditional ocean color sensor provided 3-6 spectra so limited information on the full shape.
The additional spectral information relative to multispectral sensors offers new leverage for inferring vertical CHLA structure.

Why Rrs Contains Subsurface Information

Even though Rrs is measured above the surface:

Photons penetrate the water column and interact with particles at depth.
Returned Rrs integrates contributions from multiple depths, shaped by inherent optical properties.
Hyperspectral structure preserves subtle depth-linked signals.

PACE’s spectral richness gives us new power to infer what’s happening at depth.

Results spoiler. It works!

Caveat: Mostly trained on data from open ocean (BGC-Argo). Most data was for surface CHLA < 1. Need more coastal data. Though it does do a decent job for the surface CHLA > 10 data that we do have.

General Idea

Get CHLA depth (z) profiles from in-situ platforms (BGC-Argo and Moorings)
Train CHLA(z) on hyperspectral Rrs using Boosted Regression Trees
Predict CHLA(z)
Create daily global maps of CHLA(z) using PACE Rrs daily files

BGC-Argo

In-situ Platforms: BGC-Argo

Continuous vertical CHLA profiles
Global
Deep Water (>1000m) not coastal

A Chlorophyll Profile from an BGC-Argo Buoy

Ocean Observing Initiative (OOI)

Fixed moorings with fluorometers at discrete depths
High-frequency CHLA, temperature, salinity
Key for shallow structure, seasonal cycles, and diel variability

OOI Provides Coastal and Shelf Data

Preparing Data for Training

Binning
Matchups to PACE Rrs

Depth Binning in-situ Data (Critical Step)

Raw profiles come at irregular depths → ML models need consistent targets.

Standardize to bins (e.g., 0–10 m, 10–20 m, …).
For each profile:
- Interpolate or average CHLA within each bin.
- Keep bins even if other depths are missing.
Many profiles have partial coverage but still provide useful training data.

👉 This is why we train one BRT per depth bin instead of forcing a single parametric profile shape.

Matching Step: Match in-situ lat/lon/day to PACE Level 3 daily data

Get Rrs for all 128 wavelengths for every row of data
Remove any rows with no Rrs. 80% has no Rrs since PACE data is from swaths and is not gap-filled.
Remove data with that seems to be bio-fouled (sensor issue). Ca 20 rows of data
Result: 5540 in-situ CHLA depth profiles. Roughly 450 to 700 for each depth bin.

The Modeling Step

Add on solar_hour to the feature set because that has a big impact on in-situ CHLA estimates
Add on the type of platform (Argo versus OOI) since those are little different
Divide the data into training (80%) and testing (20%)
Fit BRT to the training data
Evaluate performance using the test data (CHLA depth profiles)

What is a Boosted Regression Tree (BRT)?

Ensemble of many shallow decision trees
Each tree learns the residuals of the previous ones (“boosting”)
Handles:
- Nonlinear relationships
- High-dimensional spectral predictors
- Heterogeneous covariates
Makes no assumptions about the vertical shape of CHLA profiles.
I used sklearn HistGradientBoostingRegressor() function. Standard. Easy.

One BRT Per Depth Bin

Each model predicts CHLA in a single depth range:

Input:
- Hyperspectral PACE Rrs(λ)
- Extras (solar hour, platform type)
Output:
- CHLA in that bin (e.g., 0–10 m, 10–20 m, …)

Benefits:

Uses all data because it can use partial profiles (only bins with valid CHLA).
Avoids imposing a fixed profile shape.
Enables simple diagnostics: R², RMSE, bias vs depth.

Results


Loaded ML bundle from: models/brt_chla_profiles_bundle.zip
  model_kind : pickle
  model_type : collection (dict), n_submodels=20
  example key: CHLA_0_10
  target     : log10_CHLA_A_B depth bins
  features   : 174 columns
  train/test : 4408 / 1102 rows
  dataset    : 5510 rows stored in bundle

Usage example (Python):
  bundle = load_ml_bundle('path/to/bundle.zip')
  # Predict using helper 'predict_all_depths_for_day'
  # Example: predict all depths for one day from a BRF dataset R
  pred = bundle.predict(
      R_dataset,                  # xr.DataArray/xr.Dataset with lat/lon + predictors
      brt_models=bundle.model,    # dict of models by depth bin
      feature_cols=bundle.meta['feature_cols'],
      consts={'solar_hour': 12.0, 'type': 1},
  )  # -> e.g. CHLA(time?, z, lat, lon)

  # Plot using helper 'make_plot_pred_map'
  fig, ax = bundle.plot(pred_da, pred_label='Prediction')

Performance

Overall we see a close match between our predictions and our test data for CHLA in the upper 10m — for non-coastal data.

Prediction of CHLA 0 to 10m

Using PACE data for hyperspectral Rrs, we can make a prediction of surface CHLA. Let’s compare to PACE’s surface CHLA product as a first pass check, but note that these are different products trained on different in-situ data. Note surface CHLA is not our goal.

We do not expect these to be identical as the PACE chlor_a is based on the classic Rrs ratio algorithm while the BRT uses the whole spectrum but most importantly was trained on Argo and OOI florometer measurements.

Scatter plot

PACE chlor_a to BRT with type = 1 (ooi) and solar_hour = 0 (midnight)

Notice that at high PACE chlor_a, the BRT model predicts lower CHLA_0_10. We might be able to correct this by using the whole CHLA depth profile (next section).

Global prediction

We can do this for the whole globe.

Let’s predict CHLA depth profiles

CHLA Profiles: Observed vs BRT

When surface CHLA is high?

Coastal waters?

Not enough data yet. No correction was done for bathymetry (scattering off bottom), so that will need to be fixed for shallow water.

Depth-wise Metrics for Each Bin

Errors for depth bin prediction
Location of peak CHLA
Height of peak CHLA
Integrated CHLA from 0 to 200m

Absolute Errors for Depth Bins

Proportional Errors for Depth Bins

Peak Depth and Integrated CHLA Metrics (Code)

	peak_obs_m	peak_pred_m	peak_error_m	peak_obs_val	peak_pred_val	peak_error_val	int_obs_0_200	int_pred_0_200	int_error_0_200
1433	45.0	65.0	20.0	0.552569	0.269209	-0.283360	60.357637	29.869666	-30.487972
1200	45.0	45.0	0.0	0.422407	0.552620	0.130213	24.065508	38.637324	14.571816
4501	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4552	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1867	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Peak Depth and Height

Integrated CHLA (0–200 m)

Toward Global Maps of CHLA Profiles

Once the per-depth-bin BRTs are trained:

Apply them to each PACE pixel with valid Rrs(λ).
Retrieve a full CHLA profile at each ocean location.
Derive summary products:
- Integrated CHLA 0–200 m
- Depth of maximum CHLA
- Biomass-weighted mean depth

This moves us toward estimates of global, depth-resolved CHLA density from space.