Match-up basics - Points and PACE

Match-up basics - Points and PACE#

Author: Ryan Vandermeulen and Eli Holmes (NOAA)
Last updated: January 9, 2026

Fisheries research is rich in point-based observations: survey tows, catch locations, plankton samples, animal tag detections, CTD casts, acoustics, underwater imagery, what have you… These observations carry valuable biological information, and we often want to relate them to broader environmental conditions. The challenge is matching datasets efficiently across space and time. It’s relatively straightforward to extract a point from a single satellite image, but going through multiple points and time stamps can be a major time-suck. So, let’s take it one step at a time and set up a workflow to help us along. We’re going to get some help from our friends at the Southeast Area Monitoring & Assessment Program, or SEAMAP. There is a lot of really cool fisheries data pertaining to the U.S. Gulf Coast in here (species, size, abundance, environmental parameters, and more) for us to experiment with.

Learning Objectives:

Import data from a NOAA Fisheries survey and show the locations on a satellite image

Extract PACE reflectance data across multiple corresponding dates/locations

Perform a rudimentary prediction model of fish abundance from the data

Are you ready?#

Let’s dive in! 🐠#

Sidebar: Why L3 Data for Fisheries Applications?#

Any optics nerds in the crowd may be scratching their heads as to why we’re using Level-3 (L3) data instead of Level-2 (L2), which is typically preferred for pixel-level matchups and algorithm development. That instinct is absolutely correct. If the goal is to reproduce or validate a biogeochemical parameter at a specific place and time, L2 is the right tool. L2 data preserve the highest spatial and temporal fidelity and are essential when developing or evaluating bio-optical algorithms.

Fisheries applications, however, often ask a different kind of question. Rather than predicting the exact biogeochemical state of the water at a precise moment, we are frequently interested in characterizing habitat, or the broader environmental context within which fish tend to occur. Unlike phytoplankton or dissolved constituents, fish are free-swimming organisms. They can move kilometers in a day, actively responding to environmental gradients rather than passively drifting with the water. As a result, the absolute value of a biogeochemical variable at a single pixel and instant is unlikely to be the sole (or even primary) driver of fish behavior. Instead, fish distributions often reflect spatially and temporally integrated conditions: recent productivity, water mass characteristics, fronts, or persistent habitat features.

In this context, L3 products — which represent spatially and/or temporally aggregated fields — are often well suited to the problem. They trade fine-scale precision for stability, coverage, and ecological relevance, aligning more naturally with the scales at which mobile organisms experience their environment. To be clear, this is not about algorithm development. Here, we are building models, not validating retrievals. Accepting a degree of spatial and temporal lag is not a limitation, it is an intentional assumption that reflects the biology of the system.

Back to our programming - How fast does this thing go?#

Let’s see how fast a single pixel can be extracted from the image. Then we’ll plot it up.

# See how fast this goes
%time

# Quick look at the water Reflectance 
pt = ds["Rrs"].sel(lat=lat, lon=lon, method="nearest")
pt.plot.line(x="wavelength")

CPU times: user 3 μs, sys: 0 ns, total: 3 μs
Wall time: 6.44 μs

[<matplotlib.lines.Line2D at 0x7fa76cdafa70>]

../../_images/5dbf5cc5dc2a4de06513e65341ffd2d9b9399e6bb3e1ec1befef308129a1275c.png

What if we have multiple points?#

Let’s just set up a simple loop to extact multiple points from a single image.

# Example list of points
points = [(30.00, -89.00), (30.30, -87.00), (29.5, -92.00)]

for i, (lat, lon) in enumerate(points, start=1):
    pt = ds["Rrs"].sel(lat=lat, lon=lon, method="nearest")
    pt.plot.line(x="wavelength", label=f"Point {i}")   
plt.ylabel("Rrs")
plt.legend()
plt.show()

../../_images/da5da5aba316945925fac0a246cef03a0387ef9ea2c91e562cfddcf54f4797e6.png

Real example with lots of points#

Realistically, you will be dealing with more than one in situ data point on more than one day. When we start to work across multiple time stamps, we increase the potential of running into memory issues. Let’s illustrate this with a real world example using fish survey data in the Gulf of Mexico, and show some strategies for dealing with satellite matchups over time.

Get survey points#

We are going to extract PACE satellite data at individual point locations, sampling single pixels across multiple time steps and spatial domains. As mentioned at the start of this tutorial, we got our paws on some ship-based NOAA Fisheries survey data from SEAMAP (2024–present). We’re going to extract PACE remote-sensing reflectance (Rrs) at each survey station, allowing us to sample satellite data at discrete point locations.

To keep things straightforward, we’ll start by importing a clean .csv file containing latitude, longitude, sampling date, as well as several in situ variables collected on the surveys, including temperature, salinity, oxygen, chlorophyll (fluorometry), turbidity, Secchi depth, and more. In addition, we want some form of fisheries data, so we also extracted the variable ‘TOT_LIVE’, which represents the combined wet weight (kg) of all fish, crustaceans, and other organisms caught. Is this the most meaningful fisheries metric available? Probably not, but it works well for illustrating the workflow, which is our main focus here. The SEAMAP dataset has plenty more to explore including species abundances and size information.

If you’d like to try this out with your own data, any .csv containing columns labeled ‘Latitude’, ‘Longitude’, and ‘Loc_date’ (formatted as MM/DD/YYYY) should work seamlessly with the code below.

import pandas as pd
# Load ship data
url = (
    "https://raw.githubusercontent.com/"
    "fish-pace/2025-tutorials/main/"
    "Supporting_files/SEAMAP_summary_2024_present.csv"
)
track_df = pd.read_csv(url)
print(track_df.columns.tolist())
track_df

['STATIONID', 'CRUISEID', 'Loc_date', 'DECSLAT', 'DECELAT', 'DECSLON', 'DECELON', 'Latitude', 'Longitude', 'DEPTH_SSTA', 'WIND_SPD', 'WIND_DIR', 'MIN_FISH', 'TOT_LIVE', 'FIN_CATCH', 'CRUS_CATCH', 'OTHR_CATCH', 'CLD_COVER', 'SECCHI_DSK', 'WECOLOR', 'DEPTH_EMID', 'DEPTH_EMAX', 'DEPTH_EWTR', 'TEMPSURF', 'TEMPMID', 'TEMPMAX', 'SALSURF', 'SALMID', 'SALMAX', 'CHLORSURF', 'CHLORMID', 'CHLORMAX', 'OXYSURF', 'OXYMID', 'OXYMAX', 'TURBSURF', 'TURBMID', 'TURBMAX']

	STATIONID	CRUISEID	Loc_date	DECSLAT	DECELAT	DECSLON	DECELON	Latitude	Longitude	DEPTH_SSTA	...	SALMAX	CHLORSURF	CHLORMID	CHLORMAX	OXYSURF	OXYMID	OXYMAX	TURBSURF	TURBMID	TURBMAX
0	235611	1139	6/13/2024	27.378	27.389	-82.725	-82.750	27.3835	-82.7375	11.0	...	35.72	NaN	NaN	NaN	5.5	5.6	5.5	NaN	NaN	NaN
1	235612	1139	6/14/2024	27.115	27.123	-82.725	-82.700	27.1190	-82.7125	20.1	...	36.31	NaN	NaN	NaN	5.7	5.8	5.7	NaN	NaN	NaN
2	235613	1139	6/14/2024	26.931	26.956	-82.819	-82.815	26.9435	-82.8170	28.9	...	36.46	NaN	NaN	NaN	5.9	5.9	5.9	NaN	NaN	NaN
3	235614	1139	6/14/2024	26.700	26.675	-82.806	-82.807	26.6875	-82.8065	30.9	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	235615	1139	6/14/2024	26.678	26.657	-82.637	-82.654	26.6675	-82.6455	25.2	...	36.48	NaN	NaN	NaN	5.8	5.8	5.7	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
590	236323	1150	6/14/2024	28.669	28.653	-91.198	-91.176	28.6610	-91.1870	10.2	...	36.18	5.547	4.828	7.123	5.8	6.0	4.3	78.5948	86.1352	29.5028
591	236324	1150	6/14/2024	28.897	28.901	-90.681	-90.653	28.8990	-90.6670	56.0	...	32.09	18.683	15.016	12.228	8.8	6.9	0.9	24.7280	63.2462	62.5680
592	236325	1150	6/14/2024	28.882	28.872	-90.535	-90.509	28.8770	-90.5220	91.0	...	35.81	12.675	8.132	9.457	7.8	3.8	0.2	43.5251	83.6422	65.8241
593	236326	1150	6/11/2024	28.513	28.494	-90.154	-90.136	28.5035	-90.1450	31.0	...	36.50	3.277	2.247	7.253	6.5	6.9	4.2	79.9584	86.4101	9.1834
594	236327	1150	6/12/2024	28.534	28.541	-91.694	-91.722	28.5375	-91.7080	23.6	...	36.72	4.243	4.423	5.362	4.5	4.6	5.9	88.5203	88.8865	72.4507

595 rows × 38 columns

track_df['TOT_LIVE']

    173.1
    142.0
     44.6
     56.5
    193.7
       ...  
   32.3
    0.7
    0.2
    NaN
    NaN
Name: TOT_LIVE, Length: 595, dtype: float64

The first order of business is to load the .csv file and identify all unique sampling dates. This is important because we’ll be retrieving data from NASA Earthdata, and it’s far more efficient to request each date only once rather than repeatedly downloading the same files.

# Load the dates and find all unique instances
track_df["Loc_date"] = pd.to_datetime(track_df["Loc_date"])
unique_dates = track_df["Loc_date"].sort_values().unique()
print(f"Number of unique dates to extract: {len(unique_dates)}")

Number of unique dates to extract: 61

Map it out#

Let’s take a look at where all these stations are, and “slice” our data down to the bounds of the Gulf, so we’re not taking up so much memory by loading in the entire planet (L3 data). For now, we’re just loading in the single time point that we defined at the beginning of the tutorial, just to get a lay of the land.

import cartopy.crs as ccrs
import cartopy.feature as cfeature

# Gulf bounds: (lat_max, lat_min, lon_min, lon_max)
ds_gulf = ds.sel(lat=slice(31, 18), lon=slice(-98, -80))
Rrs_443 = ds_gulf["Rrs"].sel(wavelength=443.0, method="nearest")

fig, ax = plt.subplots(figsize=(12, 10), subplot_kw={"projection": ccrs.PlateCarree()})
Rrs_443.plot(ax=ax, cmap="jet", add_colorbar=True, vmin=0, vmax=0.02)
ax.coastlines(resolution="50m")
ax.scatter(track_df["Longitude"], track_df["Latitude"], s=40, c="black", edgecolor="white", linewidth=0.5, label="Stations")
ax.legend(loc="upper right")
plt.show()

../../_images/e264156538c9a48a1484670eea13e4119fc03e8686ba9ed3e675eeb0e698401d.png

DAY or 8D?#

You have a bit of an existential decision to make here. At what temporal resolution will you extract your data? It will depend on your application, on your project goals, and the characteristics of your data. Trying to characterize general fish habitat preferences for highly migratory species? You can probably grab the closest 8-day average and make the assumption that most mesoscale features they interact with are coherent at weekly time scales. Looking at plankton tow information or counting fish species that are heavily reliant on tidal cycles? That data is likely a little more tied to the day it was collected, and you would want to look at daily values. The tradoff is that cloud cover is going to significantly chop down your data, and the weekly averages will help you get a lot more data back for model development. As an example, if ran the code below using 8-day versus 1-day composites, you would go from 532 valid matchups to 184 matchups, respectively. If your scientific question can tolerate loosening of the spatial/temporal time constraints, you may opt to use the longer composites - just be aware of your assumptions.

We’re not tracking tuna across the Pacific here, this is trawl data along shelf environments in a highly dynamic system. In this case, I might opt to keep those temporal constraints a little tighter, assuming that the location of e.g., river pulses or other submesoscale physical mixing processes are changing drastically from day to day.

Getting multiple matchups#

For every unique date found in the .csv, we’re going to query NASA for that corresponding daily Level 3 file. Then, for that file, we’re going to loop through all the points within that day and find the nearest lat/lon pixel and get the reflectance spectrum. We’re going to collect this info in a ‘dictionary’ and then append the data to our .csv in one swoop at the end.

We will use an approach that allows us to release memory after loading each dataset. This avoids stacking up memory and crashing your kernel. For one, we’re streaming via earthaccess.open(), which returns EarthAccessFile objects that stream the data on-demand. Only the chunks of data you actually access with xarray (ds["Rrs"].isel(...)) are read into memory. Secondly, using with statements ensures that datasets are closed immediately after processing. Once the block ends, memory used by ds is released, and we avoid keeping multiple files open at the same time.

Still, if we want to keep memory down (and prevent our kernel from crashing), we’re going to query satellite files one (unique) day at a time. This is the largest bottleneck. If you have the memory, it would likely be quicker to download the files all at once and then extract the data in one swoop. We prefer keeping in cloud space here. It doesn’t take but 2-4 minutes for this example, but now is a good time for a beverage/bio break. For the sake of this tutorial, feel free to skip this section, since we provide the output for the next section.

import earthaccess
import xarray as xr
import pandas as pd
import numpy as np
from tqdm import tqdm
import re

# Login
auth = earthaccess.login()
short_name = "PACE_OCI_L3M_RRS"
granule_name = "*.DAY.*.4km.*"
# Want 8D composites instead?
#granule_name = "*.8D.*.4km.*"
results_list = []

# Loop through unique dates
for date in tqdm(unique_dates, desc="Processing unique dates"):
    start = end = date.strftime("%Y-%m-%d")
    # Search & open files for exact date
    results = earthaccess.search_data(short_name=short_name, temporal=(start, end), granule_name=granule_name)
    fileset = earthaccess.open(results)

    for ea_file in fileset:  # iterate EarthAccessFile objects
        with xr.open_dataset(ea_file, engine="h5netcdf") as ds:
            points_today = track_df[track_df["Loc_date"] == date]
            
            for idx, row in tqdm(points_today.iterrows(), total=len(points_today), desc="Points in granule", leave=False):
                lat_center, lon_center = row["Latitude"], row["Longitude"]
                rrs = ds["Rrs"].sel(lat=lat_center, lon=lon_center, method="nearest")
                result_row = {'Original_Index': idx, **row.to_dict()}
                result_row.update({f"Rrs_{int(wl)}": val for wl, val in rrs.to_series().items()})
                results_list.append(result_row)

# Combine results & save
if results_list:
    final_df = pd.DataFrame(results_list).set_index("Original_Index").sort_index()
    output_file = "SEAMAP_summary_2024_present_PACE_RRS_1D_centerpixel.csv"
    final_df.to_csv(output_file)
    print(f"CSV saved: {output_file}")
else:
    print("No data extracted. Check date ranges and data availability.")

3x3 averages instead of one pixels#

In many L2 workflows, it is standard practice to extract a 3×3 or 5×5 pixel window rather than a single center pixel. This provides a mean and standard deviation that help characterize small-scale spatial heterogeneity, increase the likelihood of retrieving valid data, and reduce sensitivity to clouds or isolated bad pixels. Although L3 products have already been spatially binned (here to 4-km resolution), the same conceptual motivation can still apply. Depending on the science question, you may be less interested in the exact value at a single grid cell and more interested in the broader environmental context of the surrounding waters. For habitat-oriented fisheries applications, averaging over a small neighborhood can better represent the conditions a mobile organism may be experiencing, rather than assuming it responds to a single pixel value. Below, we show how the previous center-pixel extraction can be modified to compute a simple 3×3 pixel mean around each location.

import earthaccess
import xarray as xr
import pandas as pd
import numpy as np
from tqdm import tqdm
import re

# Login
auth = earthaccess.login()
# Provide product parameters
short_name = "PACE_OCI_L3M_RRS"
granule_name = "*.DAY.*.4km.*"

results_list = []

# Loop through unique dates
for date in tqdm(unique_dates, desc="Processing unique dates"):

    start = end = date.strftime("%Y-%m-%d")

    # Search & open files for exact date
    results = earthaccess.search_data(short_name=short_name, temporal=(start, end), granule_name=granule_name)
    fileset = earthaccess.open(results)

    for ea_file in fileset:  # iterate EarthAccessFile objects
        with xr.open_dataset(ea_file, engine="h5netcdf") as ds:
            points_today = track_df[track_df["Loc_date"] == date]
            lat_vals = ds.lat.values
            lon_vals = ds.lon.values
            
            for idx, row in tqdm(points_today.iterrows(), total=len(points_today), desc="Points in granule", leave=False):
                lat_center, lon_center = row["Latitude"], row["Longitude"]
                lat_idx = np.abs(lat_vals - lat_center).argmin()
                lon_idx = np.abs(lon_vals - lon_center).argmin()
                lat_slice = slice(max(lat_idx-1,0), min(lat_idx+2,len(lat_vals)))
                lon_slice = slice(max(lon_idx-1,0), min(lon_idx+2,len(lon_vals)))
    
                rrs_box = ds["Rrs"].isel(lat=lat_slice, lon=lon_slice)
                rrs_mean = rrs_box.mean(dim=["lat","lon"], skipna=True)

                # Prepare row dict
                result_row = {'Original_Index': idx, **row.to_dict()}
                result_row.update({f"Rrs_{int(wl)}": val for wl, val in rrs_mean.to_series().items()})
                results_list.append(result_row)

# Combine results & save
if results_list:
    final_df = pd.DataFrame(results_list).set_index("Original_Index").sort_index()
    output_file = "SEAMAP_summary_2024_present_PACE_RRS_1D_3x3mean.csv"
    final_df.to_csv(output_file)
    print(f"CSV saved: {output_file}")
else:
    print("No data extracted. Check date ranges and data availability.")

Okay, one more iteration (2D products)#

Fine, we’ll come down off of our Remote Sensing Reflectance box and show you how to extract single variables instead. Rrs is a 3D variable, while most other products are 2D variables (e.g., chlor_a, avw, poc, etc.). Same basic code, but we “flatten” the values before adding to the dictionary. Just modify the PRODUCT inputs to match your product of interest.

import earthaccess
import xarray as xr
import pandas as pd
import numpy as np
from tqdm import tqdm
import re

PRODUCT = {
    "short_name": "PACE_OCI_L3M_AVW",
    "var": "avw",          # variable name inside the dataset
    "label": "AVW",        # used in column names & output file
    "granule": "*.DAY.*.4km.*"
}

# Login
auth = earthaccess.login()
short_name = PRODUCT["short_name"]
granule_name = PRODUCT["granule"]
results_list = []

# Loop through unique dates
for date in tqdm(unique_dates, desc="Processing unique dates"):
    start = end = date.strftime("%Y-%m-%d")
    # Search & open files for exact date
    results = earthaccess.search_data(short_name=short_name, temporal=(start, end), granule_name=granule_name)
    fileset = earthaccess.open(results)

    for ea_file in fileset:  # iterate EarthAccessFile objects
        with xr.open_dataset(ea_file, engine="h5netcdf") as ds:
            points_today = track_df[track_df["Loc_date"] == date]
            
            for idx, row in tqdm(points_today.iterrows(), total=len(points_today), desc="Points in granule", leave=False):
                lat_center, lon_center = row["Latitude"], row["Longitude"]
                da = ds[PRODUCT["var"]].sel(lat=lat_center, lon=lon_center, method="nearest")
                result_row = {'Original_Index': idx, **row.to_dict()}
                values = np.atleast_1d(da.values).flatten()
                result_row.update({f"{PRODUCT['label']}_{i}": val for i, val in enumerate(values)})
                results_list.append(result_row)

# Combine results & save
if results_list:
    final_df = pd.DataFrame(results_list).set_index("Original_Index").sort_index()
    output_file = f"SEAMAP_summary_2024_present_PACE_{PRODUCT['label']}_centerpixel.csv"
    final_df.to_csv(output_file)
    print(f"CSV saved: {output_file}")
else:
    print("No data extracted. Check date ranges and data availability.")

Let’s explore a bit#

For every data point that we read in which had a lat/lon/date, we now have a corresponding Rrs spectrum (or whatever PACE products you chose to pull). For something as complex and nuanced as habitat modeling, we’re not likely to see a tight correlation between Rrs and total fish catch (“TOT_LIVE”), but it’s a good exercise to explore the relative significance of including PACE data. Before we do anything with this data, let’s just see what the average shape of Rrs looks like at various fish biomass levels. First, let’s open up that spreadsheet we just created:

import numpy as np
import pandas as pd
url = (
    "https://raw.githubusercontent.com/"
    "fish-pace/2025-tutorials/main/"
    "Supporting_files/SEAMAP_summary_2024_present_PACE_RRS_1D_3x3mean.csv"
)
df2 = pd.read_csv(url)

tot_live = df2["TOT_LIVE"].replace(0, np.nan)  # avoid log(0)
rrs_cols = [c for c in df2.columns if c.startswith("Rrs_")]
wavelengths = np.array([int(c.split("_")[1]) for c in rrs_cols])

print(f"Min value in TOT_LIVE: {np.min(tot_live)}")
print(f"Max value in TOT_LIVE: {np.max(tot_live)}")

Min value in TOT_LIVE: 0.2
Max value in TOT_LIVE: 456.7

Since we’re dealing with living things, and the biomass data extends across 3 orders of magnitude, let’s go ahead and log-transform the biomass data so that it normalizes the distribution a bit. Below, what this allows us to do is define a custom number of equally spaced bins (in log10 space), in which we can take the mean of all Rrs spectra within each bin, and plot it out. This is qualitative exploration at this point, but it feeds into our devious plot to demystify spectral reflectance data and get a general sense of patterns.

import matplotlib.pyplot as plt
import matplotlib.cm as cm

# Define log-space bins
safe_tot = tot_live.clip(lower=1e-6) #this avoids issues with zeros
num_bins = 10
bins_log = np.logspace(np.log10(safe_tot.min()), np.log10(safe_tot.max()), num_bins + 1)
df2["bin_log"] = pd.cut(safe_tot, bins_log, include_lowest=True)

# Compute mean spectra per bin
mean_spectra = df2.groupby("bin_log", observed=True)[rrs_cols].mean().dropna(how="all")

# Plot
cmap = plt.colormaps.get_cmap("viridis")
colors = cmap(np.linspace(0, 1, len(mean_spectra)))
plt.figure(figsize=(10,6))

for (bin_interval, row), color in zip(mean_spectra.iterrows(), colors):
    label = f"{bin_interval.left:.2f}-{bin_interval.right:.2f}"
    plt.plot(wavelengths, row.values, color=color, linewidth=2, label=label)

plt.xlabel("Wavelength (nm)")
plt.xlim(400, 715)
plt.ylabel("Mean Rrs (1/sr)")
plt.title("Mean Rrs Spectra Across log(TOT_LIVE) Bins")
plt.legend(title="TOT_LIVE bins (kg of catch)", bbox_to_anchor=(1.05,1), loc="upper left")
plt.grid(True)
plt.tight_layout()
plt.show()

../../_images/f90aa2113c0dffb9466845bbcbc81cdcee83ffac264239c01e6d1e08dd28e689.png

Yay! More squiggly lines!#

Generally speaking, catch biomass increases in greener, phytoplankton-dominated waters and decreases in bluer waters (no big surprise). Interestingly, we find an exception to this trend in the lowest biomass bin (0.43 - 0.94 kg) - this contains average green-shifted reflectance spectra, but the magnitude of total reflectance is significantly lower than the others. What’s the deal? As we discussed in the Rrs tutorial, the low magnitude likely suggests the presence of highly absorbing materials… CDOM plume perhaps? High CDOM levels tends to be a result of allochthonous inputs, i.e., river runoff. If this is the case, we’re likely to see lower salinity in these waters as well. That still doesn’t explain the low biomass though. Just for funzies, let’s take a look at some of those other environmental parameters collected using the same biomass bin parameters. We’ll pull surface temperature, surface salinity, bottom salinity, and bottom oxygen.

# Compute mean salinity + oxygen per bin
mean_vals = df2.groupby("bin_log", observed=True)[["TEMPSURF", "SALSURF", "SALMAX", "OXYMAX"]].mean()

# Print each bin with both means
for bin_interval, row in mean_vals.iterrows():
    print(
        f"bin interval: {bin_interval.left:.2e} - {bin_interval.right:.2e} --->    "
        f"mean TEMPSURF = {row['TEMPSURF']:.2f},    "
        f"mean SALSURF = {row['SALSURF']:.2f},    "
        f"mean SALBTM = {row['SALMAX']:.2f},    "
        f"mean OXYBTM = {row['OXYMAX']:.2f}"
    )

bin interval: 1.99e-01 - 4.33e-01 --->    mean TEMPSURF = 31.18,    mean SALSURF = 26.72,    mean SALBTM = 35.81,    mean OXYBTM = 0.20
bin interval: 4.33e-01 - 9.39e-01 --->    mean TEMPSURF = 30.51,    mean SALSURF = 24.22,    mean SALBTM = 32.94,    mean OXYBTM = 0.83
bin interval: 9.39e-01 - 2.04e+00 --->    mean TEMPSURF = 30.29,    mean SALSURF = 33.34,    mean SALBTM = 36.48,    mean OXYBTM = 6.30
bin interval: 2.04e+00 - 4.41e+00 --->    mean TEMPSURF = 29.04,    mean SALSURF = 34.51,    mean SALBTM = 36.14,    mean OXYBTM = 5.80
bin interval: 4.41e+00 - 9.56e+00 --->    mean TEMPSURF = 28.10,    mean SALSURF = 34.16,    mean SALBTM = 35.87,    mean OXYBTM = 5.43
bin interval: 9.56e+00 - 2.07e+01 --->    mean TEMPSURF = 29.07,    mean SALSURF = 33.90,    mean SALBTM = 36.32,    mean OXYBTM = 5.50
bin interval: 2.07e+01 - 4.49e+01 --->    mean TEMPSURF = 28.68,    mean SALSURF = 33.79,    mean SALBTM = 36.32,    mean OXYBTM = 5.54
bin interval: 4.49e+01 - 9.73e+01 --->    mean TEMPSURF = 28.25,    mean SALSURF = 34.16,    mean SALBTM = 35.51,    mean OXYBTM = 5.67
bin interval: 9.73e+01 - 2.11e+02 --->    mean TEMPSURF = 27.06,    mean SALSURF = 33.84,    mean SALBTM = 34.69,    mean OXYBTM = 5.89
bin interval: 2.11e+02 - 4.57e+02 --->    mean TEMPSURF = 27.15,    mean SALSURF = 34.62,    mean SALBTM = 35.66,    mean OXYBTM = 5.78

Holy Hypoxia!#

Yessiree Bob. \(\Delta\)Salinity between surface and bottom implies a freshwater lense has stratified the water column. With (presumably) high nutrients, low benthic light (no photosynthesis), and trapped bottom waters, we see the system has become hypoxic/borderline anoxic (low to no oxygen) owing to the cumulative effect of microbial respiration in waters that are not being ventilated to the surface. Classic “Dead Zone”. No wonder there are no fishies there! Anyhow, we’re getting away from our main point, but just by looking at the reflectance data patterns, we were able to initially flag that environmental parameters are a bit different here. Now back to our programming.

Oooooh, let’s make a correlation matrix!!#

What? It’s exciting. We have a lot of spectral data, what else can we tease out here? Again, we’re not expecting any sort of robust relationship here, but let’s keep using TOT_LIVE, and see if there are additional inferences that emerge between hyperspectral PACE reflectances and fish catch. First, let’s pull out all that new data we got running the matchup script.

# Extract Rrs bands + wavelengths between 400–700 nm 
rrs_cols = [c for c in df2.columns if c.startswith("Rrs_") and 400 <= float(c.split("_")[1]) <= 700]
rrs_cols = sorted(rrs_cols, key=lambda c: float(c.split("_")[1]))
wavelengths = np.array([float(c.split("_")[1]) for c in rrs_cols])

# Clipping outliers and zeros to avoid division errors
rrs = df2[rrs_cols].clip(lower=0.000001, upper=0.5)
rrs.columns = wavelengths  # ensure numeric labels

After taking a look at the mean hyperspectral Rrs across the different biomass catch bins, we want to dig a little deeper and be more quantitative. To do that, we’re going to play with a simple two-band algorithm approach, where we combine pairs of spectral bands using basic math. Focusing on two bands allows us to make a really useful plot that showcases the utility of hyperspectral data: a pairwise correlation matrix. This provides basic information on narrowband features that may be related to our given parameter of interest. We’re not ignoring the hyperspetral data, we’re actually leveraging it to optimize the best answer.

For this round, we’ll use a normalized difference index (NDI), mostly because it’s simple and kind of fun:

\(NDI = ({\lambda}_{1} - {\lambda}_{2}) / ({\lambda}_{1} + {\lambda}_{2})\)

There’s no particular scientific reason we have to use NDI here. For a pairwise correlation matrix, you could just as easily use any two-band mathematical combination, e.g. ratios, subtractions, exponents, etc. NDI is just a convenient, well-behaved choice to start with.

Using the hyperspectral bands we have, we’ll compute every possible two-band NDI combination. Then, for each band pair, we’ll look at how well that NDI correlates with our parameter of interest (in this case, TOT_LIVE). Plotting those correlation coefficients as a heatmap gives us a quick way to see which band combinations look most promising.

With over 100 bands at our disposal, the hyperspectral capability of PACE really opens things up, letting us test band combinations that simply haven’t been explored before. Fair warning: running all those combinations might take a minute.

import seaborn as sns

# Identify target variable
param = "TOT_LIVE"
target = np.log10(df2[param].clip(lower=1e-4))

# Set up NDI correlation matrix
ndi_corr = pd.DataFrame(index=wavelengths, columns=wavelengths, dtype=float)

# Loop NDI calculation + correlation with target
for wl1 in wavelengths:
    for wl2 in wavelengths:
        if wl1 == wl2:
            continue
        ndi = (rrs[wl1] - rrs[wl2]) / (rrs[wl1] + rrs[wl2])
        ndi_corr.at[wl1, wl2] = ndi.corr(target)

# Plot
mask = np.triu(np.ones_like(ndi_corr, dtype=bool))
cmax = max(0.1, np.nanmax(np.abs(ndi_corr.values)))

plt.figure(figsize=(10, 8))
sns.heatmap(
    ndi_corr, cmap="coolwarm", center=0,
    mask=mask | ndi_corr.isna().values,
    vmin=-cmax, vmax=cmax,
    xticklabels=50, yticklabels=50,
    cbar_kws={"label": f"Correlation with {param}"}
)
plt.title(f"Band Ratio Correlation with {param}", fontsize=16)
plt.xlabel("λ₂ (nm)")
plt.ylabel("λ₁ (nm)")
plt.tight_layout()
plt.show()

../../_images/48a9f6f1726720fe47cc35f63b29faaf5a1dc52192a9790134e301a1d846afca.png

Cool, cool. We’re kind of eye-balling it at this point, so let’s sort out what the 5 best combos were. It is likely that the first 5 combinations will be very close to one another due to cross-correlation of the nearby color bands, so you are welcome to put out a hundred values if you’d like (it’ll just be a ltitle messy).

# How many correlations do you want to see?
order_index = 5

# Flatten the correlation matrix and drop NaNs
ndi_flat = ndi_corr.stack().dropna()

# Sort by absolute correlation
top_5_ndi = ndi_flat.reindex(ndi_flat.abs().sort_values(ascending=False).index).head(order_index)

# Print the results
print(f"\nTop 5 NDI combinations by absolute correlation with {param}:")
for (wl1, wl2), corr in top_5_ndi.items():
    print(f"λ₁ = {int(wl1)} nm, λ₂ = {int(wl2)} nm --> r = {corr:.4f}")

Top 5 NDI combinations by absolute correlation with TOT_LIVE:
λ₁ = 400 nm, λ₂ = 647 nm --> r = -0.4245
λ₁ = 647 nm, λ₂ = 400 nm --> r = 0.4245
λ₁ = 648 nm, λ₂ = 400 nm --> r = 0.4236
λ₁ = 400 nm, λ₂ = 648 nm --> r = -0.4236
λ₁ = 400 nm, λ₂ = 646 nm --> r = -0.4233

Will PACE actually improve a rudimentary fish abundance model?#

As we’ve iterated a few times through this tutorial, we’re not expecting PACE alone to be a silver bullet for predicting fish abundance. That said, the top correlations aren’t abhorrent, so let’s see what happens when we take our top contender and combine it with in situ environmental variables collected on the ship (temperature, salinity, oxygen, chlorophyll, and turbidity).

To do this, we’ll use a boosted regression tree (BRT). A BRT is an ensemble method that builds lots of small decision trees and then combines them, with each new tree focusing on correcting the errors of the previous ones. This makes it good at capturing nonlinear relationships and interactions between predictors without needing to specify them upfront.

We’ll also use a train/test split (bootstrapping) so the model is fit on one subset of the data and evaluated on another, which helps reduce overfitting and gives us a better sense of how well the model might perform on unseen data. Let’s get things set up.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_squared_error

# Column names of the Rrs bands to combine
wavelength1 = "Rrs_400"
wavelength2 = "Rrs_647"

# Simple environmental predictors
ENV_VARS = ["TEMPSURF", "SALSURF", "OXYSURF", "TURBSURF", "CHLORSURF"]

df3 = df2[df2["TOT_LIVE"].notna()].copy()

# Create a new column in df for the ratio
Rrs_ratio = f"ndi_{wavelength1}_{wavelength2}"
df3[Rrs_ratio] = (df3[wavelength1] - df3[wavelength2]) / (df3[wavelength1] + df3[wavelength2])

predictors = [Rrs_ratio] + ENV_VARS
# To see what the model would look like without PACE, use:
#predictors = ENV_VARS
X = df3[predictors]
y = np.log10(df3["TOT_LIVE"])

# Get rid of NaNs, BRT gets mad otherwise
mask = X.notna().all(axis=1)
X = X[mask]
y = y[mask]

print(f"Predictors used: {predictors}")
print(f"Number of samples: {len(y)}")

Predictors used: ['ndi_Rrs_400_Rrs_647', 'TEMPSURF', 'SALSURF', 'OXYSURF', 'TURBSURF', 'CHLORSURF']
Number of samples: 97

You’ll notice that we only have 97 total samples that got a PACE return + had all the environmental parameters simultaneously collected. This is a bit of a spotty dataset, so take the results here with a grain of salt.

Boosted Regression Tree setup#

First, we’ll split the dataset into training and testing subsets using train_test_split. Here, test_size=0.2 means that 20% of the data are held out for testing, while the remaining 80% are used to train the model. The random_state=42 just fixes the random seed so the split is reproducible; using 42 is arbitrary (the answer to the universe and everything), but it ensures we get the same results every time we run the code. Remove random_state if you want the results to change with every run.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Next, we’ll set up a boosted regression tree using GradientBoostingRegressor. This model builds an ensemble of many shallow decision trees (n_estimators=800, max_depth=3), added sequentially so that each new tree focuses on correcting errors made by the previous ones. The small learning_rate=0.01 slows the learning process to improve generalization, while subsample=0.8 introduces randomness by training each tree on 80% of the data, which helps reduce overfitting. We again set random_state=42 to keep the model behavior reproducible.

brt = GradientBoostingRegressor(n_estimators=800, learning_rate=0.01, max_depth=3, subsample=0.8, random_state=42)
#brt = GradientBoostingRegressor(n_estimators=800, learning_rate=0.01, max_depth=3, subsample=0.8)
brt.fit(X_train, y_train)

GradientBoostingRegressor(learning_rate=0.01, n_estimators=800, random_state=42,
                          subsample=0.8)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

After fitting the model to the training data, we use it to predict the held-out test data. Model performance is then evaluated using Pearson’s correlation coefficient (r), the coefficient of determination (R²), the root mean squared error (RMSE), and the median absolute percent error (MdAPE) giving us a quick sense of how well the model captures both the strength of the relationship and the overall prediction error.

y_pred = brt.predict(X_test)

r = np.corrcoef(y_test, y_pred)[0, 1]
r2 = r2_score(y_test, y_pred)

# For error metrics, let's get back in linear space so we can interpret
y_test_lin = 10 ** y_test
y_pred_lin = 10 ** y_pred
rmse_lin = np.sqrt(mean_squared_error(y_test_lin, y_pred_lin))
mdape = np.median(np.abs((y_pred_lin - y_test_lin) / y_test_lin)) * 100

print(f"BRT Pearson r : {r:.3f}")
print(f"BRT R²        : {r2:.3f}")
print(f"BRT RMSE (linear space)   : {rmse_lin:.3f} kg")
print(f"BRT MdAPE (linear space)   : {mdape:.1f}%")

BRT Pearson r : 0.709
BRT R²        : 0.404
BRT RMSE (linear space)   : 44.898 kg
BRT MdAPE (linear space)   : 49.8%

How’d we do?#

Finally, let’s visualize how well the model is doing and see which predictors the BRT relied on most. We’ll make a scatter plot of observed vs. predicted values as well as a STAR plot of “feature” importance.

# Prepare data for STAR plot
importances = brt.feature_importances_
importances = importances / importances.sum()
labels = predictors
angles = np.linspace(0, 2 * np.pi, len(importances), endpoint=False)

# Close the loop for polar plot
angles = np.concatenate([angles, angles[:1]])
importances_plot = np.concatenate([importances, importances[:1]])

# Create side-by-side figure
fig = plt.figure(figsize=(12, 6))

# Scatter plot (left)
ax1 = fig.add_subplot(1, 2, 1)
ax1.scatter(y_test_lin, y_pred_lin, alpha=0.6, edgecolor="k")
ax1.plot([y_test_lin.min(), y_test_lin.max()], [y_test_lin.min(), y_test_lin.max()], "r--")
ax1.set_xlabel("Observed TOT_LIVE")
ax1.set_ylabel("Predicted TOT_LIVE")
ax1.set_title("BRT: Observed vs Predicted")
ax1.set_xscale("log")
ax1.set_yscale("log")
ax1.grid(True)

# Polar plot (right)
ax2 = fig.add_subplot(1, 2, 2, polar=True)
ax2.plot(angles, importances_plot, linewidth=2)
ax2.fill(angles, importances_plot, alpha=0.3)
ax2.set_theta_direction(-1)
ax2.set_theta_offset(np.pi / 2)
ax2.set_xticks(angles[:-1])
ax2.set_xticklabels(labels)
#ax2.set_title("BRT Feature Importance", pad=25)

plt.tight_layout()
plt.show()

../../_images/8bd803e7b9018f2a982aafaae00178406f884a5f052772113374ba4bda73d16b.png

On the left, we have a scatter plot comparing the BRT’s predicted values to the observed values of our target. On the right, the STAR (polar) plot shows the relative importance of each predictor. In a BRT, a variable’s importance reflects how much it contributed to reducing prediction error across all the trees. Higher values mean the model “relied on” that variable more. These values are normalized to sum to 1, so you can see at a glance which predictors had the biggest influence on the model, even when interactions and nonlinear effects are present.

Turns out the PACE NDI was the second most influential factor in the model, behind surface temperature. I wouldn’t hang your hat on predicting fish catch with this model, but we did seem to at least capture a directional trend, with a modest, but respectible statistical correlation. Again, keep in mind, we only had 97 points, so our test dataset only consists of 20 points. If you try running the same model above without the presence of PACE data, you’ll notice that the relationship is not as strong, but also that it picks up ~150 additional observations becuase it didn’t have to filter by co-occuring satellite data (lot’s of which was missing due to cloud cover).

Now the science begins!#

Overall, this tutorial illustrates a workflow that’s very common in satellite data work: matching spatially and temporally explicit observations to environmental data, exploring relationships, and trying out simple predictive models. The bigger takeaway isn’t the specific numbers or predictions, but rather the approach. We learned how to handle satellite matchups, extract relevant data, and start exploring how environmental and spectral variables relate to biological patterns. With more data and further tuning, these workflows could be extended to better understand habitat use, fishing hotspots, or other ecological questions. In short, we hope that the tools and concepts you practiced here provide a solid foundation for integrating remote sensing and fisheries science in your own work.

	loss loss: {'squared_error', 'absolute_error', 'huber', 'quantile'}, default='squared_error' Loss function to be optimized. 'squared_error' refers to the squared error for regression. 'absolute_error' refers to the absolute error of regression and is a robust loss function. 'huber' is a combination of the two. 'quantile' allows quantile regression (use `alpha` to specify the quantile). See :ref:`sphx_glr_auto_examples_ensemble_plot_gradient_boosting_quantile.py` for an example that demonstrates quantile regression for creating prediction intervals with `loss='quantile'`.	'squared_error'
	learning_rate learning_rate: float, default=0.1 Learning rate shrinks the contribution of each tree by `learning_rate`. There is a trade-off between learning_rate and n_estimators. Values must be in the range `[0.0, inf)`.	0.01
	n_estimators n_estimators: int, default=100 The number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting so a large number usually results in better performance. Values must be in the range `[1, inf)`.	800
	subsample subsample: float, default=1.0 The fraction of samples to be used for fitting the individual base learners. If smaller than 1.0 this results in Stochastic Gradient Boosting. `subsample` interacts with the parameter `n_estimators`. Choosing `subsample < 1.0` leads to a reduction of variance and an increase in bias. Values must be in the range `(0.0, 1.0]`.	0.8
	criterion criterion: {'friedman_mse', 'squared_error'}, default='friedman_mse' The function to measure the quality of a split. Supported criteria are "friedman_mse" for the mean squared error with improvement score by Friedman, "squared_error" for mean squared error. The default value of "friedman_mse" is generally the best as it can provide a better approximation in some cases. .. versionadded:: 0.18	'friedman_mse'
	min_samples_split min_samples_split: int or float, default=2 The minimum number of samples required to split an internal node: - If int, values must be in the range `[2, inf)`. - If float, values must be in the range `(0.0, 1.0]` and `min_samples_split` will be `ceil(min_samples_split * n_samples)`. .. versionchanged:: 0.18 Added float values for fractions.	2
	min_samples_leaf min_samples_leaf: int or float, default=1 The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least ``min_samples_leaf`` training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression. - If int, values must be in the range `[1, inf)`. - If float, values must be in the range `(0.0, 1.0)` and `min_samples_leaf` will be `ceil(min_samples_leaf * n_samples)`. .. versionchanged:: 0.18 Added float values for fractions.	1
	min_weight_fraction_leaf min_weight_fraction_leaf: float, default=0.0 The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided. Values must be in the range `[0.0, 0.5]`.	0.0
	max_depth max_depth: int or None, default=3 Maximum depth of the individual regression estimators. The maximum depth limits the number of nodes in the tree. Tune this parameter for best performance; the best value depends on the interaction of the input variables. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. If int, values must be in the range `[1, inf)`.	3
	min_impurity_decrease min_impurity_decrease: float, default=0.0 A node will be split if this split induces a decrease of the impurity greater than or equal to this value. Values must be in the range `[0.0, inf)`. The weighted impurity decrease equation is the following:: N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity) where ``N`` is the total number of samples, ``N_t`` is the number of samples at the current node, ``N_t_L`` is the number of samples in the left child, and ``N_t_R`` is the number of samples in the right child. ``N``, ``N_t``, ``N_t_R`` and ``N_t_L`` all refer to the weighted sum, if ``sample_weight`` is passed. .. versionadded:: 0.19	0.0
	init init: estimator or 'zero', default=None An estimator object that is used to compute the initial predictions. ``init`` has to provide :term:`fit` and :term:`predict`. If 'zero', the initial raw predictions are set to zero. By default a ``DummyEstimator`` is used, predicting either the average target value (for loss='squared_error'), or a quantile for the other losses.	None
	random_state random_state: int, RandomState instance or None, default=None Controls the random seed given to each Tree estimator at each boosting iteration. In addition, it controls the random permutation of the features at each split (see Notes for more details). It also controls the random splitting of the training data to obtain a validation set if `n_iter_no_change` is not None. Pass an int for reproducible output across multiple function calls. See :term:`Glossary `.	42
	max_features max_features: {'sqrt', 'log2'}, int or float, default=None The number of features to consider when looking for the best split: - If int, values must be in the range `[1, inf)`. - If float, values must be in the range `(0.0, 1.0]` and the features considered at each split will be `max(1, int(max_features * n_features_in_))`. - If "sqrt", then `max_features=sqrt(n_features)`. - If "log2", then `max_features=log2(n_features)`. - If None, then `max_features=n_features`. Choosing `max_features < n_features` leads to a reduction of variance and an increase in bias. Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than ``max_features`` features.	None
	alpha alpha: float, default=0.9 The alpha-quantile of the huber loss function and the quantile loss function. Only if ``loss='huber'`` or ``loss='quantile'``. Values must be in the range `(0.0, 1.0)`.	0.9
	verbose verbose: int, default=0 Enable verbose output. If 1 then it prints progress and performance once in a while (the more trees the lower the frequency). If greater than 1 then it prints progress and performance for every tree. Values must be in the range `[0, inf)`.	0
	max_leaf_nodes max_leaf_nodes: int, default=None Grow trees with ``max_leaf_nodes`` in best-first fashion. Best nodes are defined as relative reduction in impurity. Values must be in the range `[2, inf)`. If None, then unlimited number of leaf nodes.	None
	warm_start warm_start: bool, default=False When set to ``True``, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just erase the previous solution. See :term:`the Glossary `.	False
	validation_fraction validation_fraction: float, default=0.1 The proportion of training data to set aside as validation set for early stopping. Values must be in the range `(0.0, 1.0)`. Only used if ``n_iter_no_change`` is set to an integer. .. versionadded:: 0.20	0.1
	n_iter_no_change n_iter_no_change: int, default=None ``n_iter_no_change`` is used to decide if early stopping will be used to terminate training when validation score is not improving. By default it is set to None to disable early stopping. If set to a number, it will set aside ``validation_fraction`` size of the training data as validation and terminate training when validation score is not improving in all of the previous ``n_iter_no_change`` numbers of iterations. Values must be in the range `[1, inf)`. See :ref:`sphx_glr_auto_examples_ensemble_plot_gradient_boosting_early_stopping.py`. .. versionadded:: 0.20	None
	tol tol: float, default=1e-4 Tolerance for the early stopping. When the loss is not improving by at least tol for ``n_iter_no_change`` iterations (if set to a number), the training stops. Values must be in the range `[0.0, inf)`. .. versionadded:: 0.20	0.0001
	ccp_alpha ccp_alpha: non-negative float, default=0.0 Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ``ccp_alpha`` will be chosen. By default, no pruning is performed. Values must be in the range `[0.0, inf)`. See :ref:`minimal_cost_complexity_pruning` for details. See :ref:`sphx_glr_auto_examples_tree_plot_cost_complexity_pruning.py` for an example of such pruning. .. versionadded:: 0.22	0.0