Creating optical water types from hyperspectral PACE data#

Author: Ryan Vandermeulen (NOAA)
Last updated: October 16, 2025

Colab Badge JupyterHub Badge Download Badge JupyterHub

Introduction#

In the previous tutorial, we explored how to analyze sensitive metrics of change in the ocean using R\(_{rs}\) and derived indices.

We’re going to stay on that same theme, but take a slightly different approach. Optical water type (OWT) classification provides a way to isolate distinct reflectance signatures and track them across time and space. A few years ago, our astute NOAA colleague Jianwei Wei compiled a large database of in situ and simulated hyperspectral reflectance data and, using k-means clustering, statistically determined that marine water masses around the globe generally fall into 23 distinct optical water classes (see below). You can also check it out in Wei et al. (2022).

23 optical water types

Using the average shape of these optical water classes as a reference point, we can essentially “match” our hyperspectral PACE data at every pixel, and assign it to one of those 23 shapes. The key advantage here is that we can put our data into a rigid and consistent classification system that clearly demarks areas with specific optical features. It’s like being on a kitchy tour group and having everyone wear matching bright t-shirts so you can find one another… sort of. Note, this is not currently a distributed PACE data product, but it has been generating interest among our scientists at NOAA Fisheries, so now, we extend this capability to you all!

So here’s the plan. We’re going to load those 23 distinct reference spectra and try our hand at a very simple, but effective, spectral matching technique. We will then assign each pixel in a PACE image to a water class number and voila! - We will have the ability to generate a mapped image of hyperspectral optical water types.

Let’s dive in and mess around with some PACE data then. By the end of this section, you will have:

  1. Calculated the “cosine distance” between a spectral reference library against PACE data

  2. Assigned Optical Water Types (OWT) from PACE data and made a map

  3. Compared PACE spectra against the reference OWT spectra

Are you ready?#

Let’s go get some data first#

import earthaccess
import xarray as xr
auth = earthaccess.login()
# are we authenticated?
if not auth.authenticated:
    # ask for credentials and persist them in a .netrc file
    auth.login(strategy="interactive", persist=True)

Note, just to get a feel for the differences in data handling, let’s explore a standard mapped Level-3 file instead of the individual satellite scene (Level-2) that we explored in the last tutorial. There are small differences that will make you want to bang your head agaisnt the wall if you’re not familiar with them. I have banged my head for you, and now I share what has been learned:

results = earthaccess.search_data(
    short_name = "PACE_OCI_L3M_RRS",
    temporal = ("2024-09-01", "2024-09-29"),
    granule_name="*.MO.*.0p1deg.*"
)

A few key differences to note:

  • short_name: Product based (e.g. RRS, CHL) in Level-3, not the product suite (e.g., AOP, BGC) like in Level-2.

  • bounding_box: Not necessary at this stage, as Level-3 files are inherently global

  • granule_name: .YR.,.MO.,.8D.,.DAY. for time; .0p1deg., .9KM., .4KM. for spatial

Alright, let’s open this up! Libraries first!

# Load Libraries 
import numpy as np
import xarray as xr
import pandas as pd
import cartopy
import cartopy.crs as ccrs
from matplotlib import colors
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
from matplotlib.patches import Patch
from matplotlib.colors import ListedColormap, BoundaryNorm
from scipy.integrate import trapezoid
# Create a fileset
fileset = earthaccess.open(results);

# Load relevant datasets
dataset = xr.open_dataset(fileset[0])

# Extract hyperspectral Rrs
rrs_array = dataset['Rrs']  # shape: (wavelength, lat, lon)
wavelengths = rrs_array['wavelength'].values  # should be in nm

# Spatial Subset (optional)
lat_bounds = slice(46, 35)
lon_bounds = slice(-78, -63)
rrs_subset = rrs_array.sel(lat=lat_bounds, lon=lon_bounds)

Subsetting#

To make things more efficient, let’s subset a region of interest around U.S. side of the North Atlantic ocean.

# Spatial Subset (optional)
lat_bounds = slice(46, 35)
lon_bounds = slice(-78, -63)
rrs_subset = rrs_array.sel(lat=lat_bounds, lon=lon_bounds)

Let’s make a match!#

Here, we bring in our 23 reference spectra from Wei et al. (2022). They are provided at 1 nm resolution, while PACE is at ~2.5 nm resolution, so first we’ll have to extract and match those to the PACE wavelengths. Then we’ll proceed to make the cosine distance calculation to find the closest spectral match. Without going into too much detail, the cosine distance measures the magnitude of the dot product between the PACE spectrum and the 23 reference spectra - the lowest value = the closest match.

# Load reference spectra
nRrs_ref = pd.read_csv('/home/jovyan/hyperdawg/tutorials/python/Supporting_files/Hyperspectral_nRrs.csv', index_col=0)

# Match satellite and reference wavelengths
sat_wavelengths_str = [str(int(w)) for w in wavelengths]
ref_columns = nRrs_ref.columns
common_wavelengths = sorted(set(sat_wavelengths_str).intersection(ref_columns), key=int)
print(f"Using {len(common_wavelengths)} common wavelengths: {common_wavelengths}")

# Extract matched reference spectra
ref = nRrs_ref[common_wavelengths].values

# Subset satellite reflectance to common wavelengths
common_wavelengths_float = [float(w) for w in common_wavelengths]
rrs_subset_common = rrs_subset.sel(wavelength=common_wavelengths_float)

# Normalize satellite data
norm = np.sqrt((rrs_subset_common ** 2).sum(dim='wavelength'))
rrs_norm = rrs_subset_common / norm

# Normalize reference spectra
ref_norm = ref / np.linalg.norm(ref, axis=1, keepdims=True)

# Flatten satellite data
rrs_flat = rrs_norm.stack(pix=('lat', 'lon')).transpose('pix', 'wavelength').values

# Compute cosine distance
dot_products = np.dot(rrs_flat, ref_norm.T)
cosine_distances = 1 - dot_products

# Assign water class with smallest distance (index + 1 for 1-based)
mask= np.where(~np.isnan(cosine_distances).all(axis=1))[0] #create a mask of all NON- nan values (don't want to classify NaNs)
class_ids = np.full((cosine_distances.shape[0],),np.nan) # create an array of nan values in the desired output shape

# the first index is [0] so we add +1 since the first water class is "1"
class_ids[mask] = np.argmin(cosine_distances[mask],axis=1)+1 # fill in NON-nan values of class ID using the mask
#class_ids = np.argmin(cosine_distances, axis=1) + 1 # the first index is [0] so we add +1 since the first water class is "1"

# Reshape to 2D map
lat_dim, lon_dim = rrs_norm.sizes['lat'], rrs_norm.sizes['lon']
class_map = xr.DataArray(
    class_ids.reshape(lat_dim, lon_dim),
    dims=('lat', 'lon'),
    coords={'lat': rrs_norm['lat'], 'lon': rrs_norm['lon']},
    name='water_class'
)
Using 135 common wavelengths: ['403', '405', '408', '410', '413', '415', '418', '420', '422', '425', '427', '430', '432', '435', '437', '440', '442', '445', '447', '450', '452', '455', '457', '460', '462', '465', '467', '470', '472', '475', '477', '480', '482', '485', '487', '490', '492', '495', '497', '500', '502', '505', '507', '510', '512', '515', '517', '520', '522', '525', '527', '530', '532', '535', '537', '540', '542', '545', '547', '550', '553', '555', '558', '560', '563', '565', '568', '570', '573', '575', '578', '580', '583', '586', '588', '613', '615', '618', '620', '623', '625', '627', '630', '632', '635', '637', '640', '641', '642', '643', '645', '646', '647', '648', '650', '651', '652', '653', '655', '656', '657', '658', '660', '661', '662', '663', '665', '666', '667', '668', '670', '671', '672', '673', '675', '676', '677', '678', '679', '681', '682', '683', '684', '686', '687', '688', '689', '691', '692', '693', '694', '696', '697', '698', '699']

Mappy time#

Now, let’s make a map of the optical water type for our single monthly image. Some more important distinctions here: In Level-3 files, “lat” is latitude and “lon” is longitude. These are actually 1D arrays, since the Level-3 files are on a standard mapped grid by which a straight horizontal line = parellel and straight vertical line = meridian. No biggy, but if you try to lift code from reading in a Level-2 file, it will get a little angry without some minor swapouts.

plt.figure(figsize=(10, 8))
ax = plt.axes(projection=ccrs.PlateCarree())

# Define a colormap with 23 distinct colors (you can customize this list)
colors = plt.cm.jet(np.linspace(0, 1, 23))  # 23 colors from the 'jet' colormap
cmap = ListedColormap(colors)

# Create a BoundaryNorm to discretize the colorbar and stretch it across all 23 classes
# Note: If you have 23 classes, you need 24 boundaries because each class is an interval between two boundaries.
boundaries = np.arange(1, 25)  # 1 through 24 boundaries, plus one extra boundary for the last class
norm = BoundaryNorm(boundaries, ncolors=23, clip=True)

# Plot the class_map with discretized colors
c = ax.pcolormesh(class_map.lon, class_map.lat, class_map,cmap=cmap, shading='auto', norm=norm)
ax.coastlines()
ax.gridlines(draw_labels={"bottom": "x", "left": "y"})
cbar = plt.colorbar(c, ax=ax, label='Water Class')
cbar.set_ticks(np.arange(1, 24))  # Show ticks for whole numbers (1 through 23)
cbar.ax.invert_yaxis()  # Invert colorbar to match class order (if desired)
plt.title('Water Class Map')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()
../../_images/7b6d4be272bf65ccec783547a75aa167a8c6977e8506f5592d54c745c968eb14.png

But what does it all mean?#

Let’s put some context to these water classes. First, we’re going to take the mean + std of all the PACE spectra that fall within each specific optical water class. Basically, we’ll create a mask, and then pull out all the PACE spectra that meet certain criteria.

# Ensure number of classes matches your setup
n_classes = 23
n_bands = len(common_wavelengths)

# Reuse your class map color scheme
colors = plt.cm.jet(np.linspace(0, 1, n_classes))
cmap = ListedColormap(colors)
boundaries = np.arange(1, n_classes + 2)
norm = BoundaryNorm(boundaries, ncolors=n_classes, clip=True)

# First "flatten" the arrays for mean + std calculation prep
rrs_flat = rrs_norm.stack(pix=('lat', 'lon')).transpose('pix', 'wavelength').values
class_vals = class_map.stack(pix=('lat', 'lon')).values  

Just a quick pause, becuase “flattening” confused me a bit at first. Here’s what is going on:

rrs_norm originally has shape (wavelength, lat, lon) rrs_norm.stack(pix=(‘lat’, ‘lon’)) combines the lat and lon dimensions into a single new dimension called pix. Now the shape is (wavelength, pix) .transpose(‘pix’, ‘wavelength’) rearranges it to (pix, wavelength) — now each row is a pixel, each column is a wavelength .values converts the xarray DataArray into a plain NumPy array.

After stacking/flattening, you can iterate over pixels or use vectorized operations easily without worrying about 2D/3D indexing. It makes the subsequent calculations more efficient, as opposed to looping over every pixel. Okay, back to our program:

# Compute empty arrays to fill in
mean_spectra = np.full((n_classes, n_bands), np.nan)
std_spectra = np.full((n_classes, n_bands), np.nan)

# Compute mean spectra + std from satellite
for class_id in range(1, n_classes + 1):
    mask = (class_vals == class_id)
    if np.any(mask):
        spectra = rrs_flat[mask]  # shape (n_pixels_in_class, bands)
        mean_spectra[class_id - 1] = np.nanmean(spectra, axis=0)
        std_spectra[class_id - 1] = np.nanstd(spectra, axis=0)

Comparing PACE data with reference data#

Finally, we’re going to compare the mean satellite data spectral shapes against the reference library. This is a bit of a sanity check, but also allows you to get an idea of the variance within each class.

wavelengths_plot = np.array(common_wavelengths_float)

# Setup subplot grid
n_classes = 23
cols = 5
rows = int(np.ceil(n_classes / cols))
fig, axes = plt.subplots(rows, cols, figsize=(cols * 3, rows * 2.5), sharex=True, sharey=True)
axes = axes.flatten()

# Plotting loop
for class_id in range(1, 24):
    ax = axes[class_id - 1]
    color = colors[class_id - 1]
    
    ref_spec = ref_norm[class_id - 1]
    sat_spec = mean_spectra[class_id - 1]
    sat_std = std_spectra[class_id - 1]
    
    if not np.any(np.isnan(sat_spec)):
        ax.fill_between(
            wavelengths_plot,
            sat_spec - sat_std,
            sat_spec + sat_std,
            color='lightgray',
            alpha=0.9,
            label='±1 Std. Dev.'
        )
    ax.plot(wavelengths_plot, ref_spec, '--', color='black', linewidth=1.2, label='Reference')
    ax.plot(wavelengths_plot, sat_spec, '-', color=color, linewidth=1.2, label='Satellite')
    ax.set_ylim(bottom=0)
    ax.set_title(f'Class {class_id}', fontsize=10)
    ax.grid(True)
    ax.label_outer()

# Remove unused subplots and use one for legend
for i in range(n_classes, len(axes)):
    fig.delaxes(axes[i])

# Create a dummy legend in the bottom-right corner (last empty subplot position)
legend_ax = fig.add_subplot(rows, cols, cols * rows)  # last cell
legend_ax.axis('off')  # Hide axis

# Legend elements
legend_lines = [
    Line2D([0], [0], linestyle='--', color='black', label='Reference'),
    Line2D([0], [0], linestyle='-', color='gray', label='Satellite Mean'),
    Patch(facecolor='lightgray', edgecolor='none', label='±1 Std. Dev.')
]

legend_ax.legend(handles=legend_lines, loc='center', fontsize=10, frameon=False)

# Global labels
fig.suptitle("Reference vs. Satellite Normalized Spectra by Water Class", fontsize=14)
fig.supxlabel("Wavelength (nm)")
fig.supylabel("Normalized Rrs")
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()
../../_images/6fa85f9fc3a6def2d9954dae63253b68f1736ff36c59477c10874222a0526f45.png

Bring it full circle#

Just for giggles, let’s compute AVW by “hand” and compare it to the water mass classification so we can compare the two. Note, it is a pretty small lift to just compute AVW, if you ever want to avoid reading it in as a separate product. By default, it uses the visible range of the spectrum (400 - 700 nm), as does the OWT classification. One advatnage of calculating AVW independently is that you can set your own custom range of wavelengths if you are interested in extending the spectrum (e.g. 350 - 710).

# Subset wavelengths
rrs_subset_range = rrs_subset.sel(wavelength=slice(400, 700))

# Compute AVW: sum(Rrs) / sum(Rrs / wavelength)
avw = rrs_subset_range.sum(dim='wavelength') / (rrs_subset_range / rrs_subset_range['wavelength']).sum(dim='wavelength')
# User-defined AVW range
avw_min = 450
avw_max = 600

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 7), subplot_kw={'projection': ccrs.PlateCarree()})

# --- Class map ---
colors = plt.cm.jet(np.linspace(0, 1, 23))
cmap_classes = ListedColormap(colors)
boundaries = np.arange(1, 25)  # 23 classes → 24 boundaries
norm_classes = BoundaryNorm(boundaries, ncolors=23, clip=True)

c1 = ax1.pcolormesh(class_map.lon, class_map.lat, class_map,
                    cmap=cmap_classes, norm=norm_classes, shading='auto')
ax1.coastlines()
ax1.set_title('Optical Water Class')
cbar1 = plt.colorbar(c1, ax=ax1, ticks=np.arange(1, 24), fraction=0.035, pad=0.03)
cbar1.set_label('Water Class')
cbar1.ax.invert_yaxis()

# --- AVW map ---
c2 = ax2.pcolormesh(avw.lon, avw.lat, avw,
                    cmap='jet', vmin=avw_min, vmax=avw_max, shading='auto')
ax2.coastlines()
ax2.set_title('Apparent Visible Wavelength (AVW)')
cbar2 = plt.colorbar(c2, ax=ax2, fraction=0.035, pad=0.03)
cbar2.set_label('AVW (nm)')

plt.show()
../../_images/67d36c36c77be7f203817317934a4bb3b040a0db4123dff0877c4ca4229e72d2.png

You can that the maps largely mirror one another, but each have distinct utility. As a quick guide:

  • OWTs give you a more concrete classification of spectral shape across the globe. For example, the R\(_{rs}\) spectra contained in “Class 16” will more precisely resemble one another across multiple water types relative to conditions where “AVW = 550 nm.”

  • AVW is gradational in time/space and has physical units, so it would be more useful, for example, in a time-series analysis where you can say that color is changing by XX nm / year. Given the nature of the AVW calculation as a weighted harmonic mean, a higher/lower shift in AVW values unequivicoally represents a proportional shift in the weighted “color” of the water, regardless of absolute spectral shape.

That’s a wrap!#

There you have it! You are now an OWT + AVW expert. You are now entitled to cut to the front of the line in all ocean color related events. You may raise your chin high and look down from your perch of wisdom to the others who do not possess the power of water mass classification and indexing. Enjoy the rest of your day, majesty.