Get Bio-ARGO data

Author: Eli Holmes (NOAA)
Last updated: Jan 29, 2026

📘 Learning Objectives

Get Argo data using the argopy Python package

Learn the basic format of the data grouped into platform and cycles

Learn how to filter to just surface data

Learn how to get data in batches

Biogeochemical Argo (“Bio-Argo” or BGC-Argo) floats are autonomous profiling instruments that drift with ocean currents and periodically dive from the surface to depths of 1,000–2,000 meters, collecting a vertical profile of physical and biogeochemical properties. These floats are spread across the world’s oceans, and the data are freely available via the Argo Global Data Assembly Centers (GDACs). See also: video on the Argo system.

Each float has a unique PLATFORM_NUMBER, and every time it dives and returns to the surface it produces a new profile, identified by its CYCLE_NUMBER. A single float may produce hundreds of profiles over several years. Bio-Argo floats carry optical, chemical, and physical sensors that measure variables such as chlorophyll-a (CHLA), temperature (TEMP), salinity (PSAL), pressure (PRES, which is used as a proxy for depth), dissolved oxygen (DOXY), and nitrate (NITRATE). These data are distributed in a consistent, profile-based format: each profile (platform + cycle) contains measurements at multiple depths, along with time, latitude, longitude, and quality-control flags. To use these data for surface matching with satellite products, we will extract the shallow measurements from each profile (pressure less than 20 dbar or 10 dbar). We will use the argopy package to download Argo data from the Coriolis Argo Global Data Assembly Center.
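To make the profile-based format concrete, here is a minimal sketch using a made-up table with two profiles from one float. The values are invented for illustration; only the column names (PLATFORM_NUMBER, CYCLE_NUMBER, PRES, CHLA) and the 20 dbar threshold come from the description above.

```python
import pandas as pd

# Invented measurements: one float (platform), two cycles, several depths each
df = pd.DataFrame({
    "PLATFORM_NUMBER": [1902383] * 6,
    "CYCLE_NUMBER": [78, 78, 78, 79, 79, 79],
    "PRES": [4.0, 15.0, 50.0, 6.0, 25.0, 60.0],   # dbar, roughly meters
    "CHLA": [0.02, 0.03, 0.10, 0.01, 0.04, 0.12],
})

# Keep only shallow measurements (pressure below 20 dbar)
shallow = df[df["PRES"] < 20]

# Number of shallow measurements in each profile (platform + cycle)
n_shallow = shallow.groupby(["PLATFORM_NUMBER", "CYCLE_NUMBER"]).size()
print(n_shallow)
```

A profile with no measurement above the threshold simply drops out of the grouped result, which is one reason surface values can be missing for some profiles.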

Argo float cycle

Workflow#

Here is our basic workflow.

Use argopy to fetch data for Bio-Argo profiles in a given region and time.
For each profile, save one point: the shallowest reading.
Process the globe in monthly chunks, so as not to overload the ERDDAP server, and save the monthly shallow points to a parquet file.
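The second step, keeping the shallowest reading of each profile, can be sketched with pandas alone. This is only an illustration on invented numbers, not the code used later in this tutorial.

```python
import pandas as pd

# Invented measurements for three profiles (two floats)
df = pd.DataFrame({
    "PLATFORM_NUMBER": [1, 1, 1, 2, 2],
    "CYCLE_NUMBER": [10, 10, 11, 5, 5],
    "PRES": [12.0, 3.0, 8.0, 30.0, 7.5],
    "CHLA": [0.05, 0.02, 0.03, 0.20, 0.04],
})

# Row label of the minimum pressure within each profile
idx = df.groupby(["PLATFORM_NUMBER", "CYCLE_NUMBER"])["PRES"].idxmin()

# One row per profile: the shallowest reading
shallowest = df.loc[idx].reset_index(drop=True)
print(shallowest)
```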

!pip install argopy

Successfully installed aiohttp-3.12.15 argopy-1.4.0 xarray-2025.9.0

Step 2 Compute point estimates#

We will compute two point estimates from the CHLA profiles: 1) the surface average from 0 to 10 m and 2) binned averages from 0 to 200 m in bins 10 m wide. The example below shows how to filter on CHLA_QC if desired; the global file built later is not filtered, so as not to lose too much data.
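As a cross-check on the idea of binned averages, here is a small sketch that uses `pd.cut` instead of the helper functions defined in this tutorial. The profile values are invented. `pd.cut` uses right-closed intervals such as (0, 10] by default, so a sample at exactly 10 dbar falls in the first bin.

```python
import numpy as np
import pandas as pd

# Invented single profile: pressure (dbar) and chlorophyll-a
pres = np.array([2.0, 6.0, 10.0, 14.0, 18.0, 25.0])
chla = np.array([0.10, 0.20, 0.30, 0.40, 0.60, 0.50])

bins = np.arange(0, 210, 10)  # bin edges 0, 10, ..., 200

# Assign each pressure to a right-closed interval such as (0, 10]
depth_bin = pd.cut(pres, bins=bins)

# Mean and sample count per occupied bin
binned = (
    pd.DataFrame({"bin": depth_bin, "CHLA": chla})
    .groupby("bin", observed=True)["CHLA"]
    .agg(["mean", "count"])
)

# Surface average over the same (0, 10] range, computed directly
surface_mean = chla[(pres > 0) & (pres <= 10)].mean()
print(binned)
```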

Create functions to make these calculations from a profile#

average_profile takes the depth (PRES) and a variable (CHLA) and returns the average over a depth range, along with the number of points used.
summarize_profile_binned takes a dataframe filtered to a single profile (PLATFORM_NUMBER/CYCLE_NUMBER) and returns the means for depth bins such as 0 to 10 m.

# average over a depth band
import numpy as np

def average_profile(depth, var, z_min=0.0, z_max=None):
    """
    Average a var profile over a depth range.

    Parameters
    ----------
    depth : array-like
        1D array of depth or pressure (m or dbar ≈ m), increasing downward.
    var : array-like
        1D array of variable same shape as `depth`.
    z_min : float, optional
        Shallow bound of the depth range (e.g. 0 for surface), in the same units
        as depth. Exclusive: samples at exactly z_min are not used.
    z_max : float or None, optional
        Deep bound of the depth range (inclusive). If None, use max(depth).

    Returns
    -------
    mean : float
        Mean of var over (z_min, z_max]. np.nan if no valid data in the range.
    n : int
        Number of data points in (z_min, z_max].
    """
    depth = np.asarray(depth, dtype=float)
    var   = np.asarray(var,   dtype=float)

    # Mask out NaNs
    m = np.isfinite(depth) & np.isfinite(var)
    depth = depth[m]
    var   = var[m]

    if depth.size < 1:
        return np.nan, 0

    # Set z_max if not provided
    if z_max is None:
        z_max = depth.max()

    # Subset to (z_min, z_max]: z_min is exclusive, z_max is inclusive
    sel = (depth > z_min) & (depth <= z_max)
    depth = depth[sel]
    var   = var[sel]

    if depth.size < 1:
        return np.nan, 0

    # Mean
    return np.mean(var), len(var)

# function to create our metrics for a platform/cycle group
import numpy as np
import pandas as pd

def summarize_profile_binned(group, var, depth_bins=None, include_meta=False):
    """
    Summarize a single Bio-Argo profile into depth-binned means of `var` and per-bin sample counts.

    Parameters
    ----------
    group : dataframe
        dataframe with PRES, TIME, LATITUDE, LONGITUDE, var. Expected to be from
        filtering a Argo dataframe by PLATFORM_NUMBER and CYCLE_NUMBER
    var : str
        name of variable like `CHLA`.
    depth_bins : list, optional
        list of bins. Default is 0,10,...,200
    include_meta : Bool, optional
        whether to include TIME, LATITUDE, LONGITUDE in the returned dataframe

    Returns
    -------
    pd.Series with the mean and sample count (`_N`) of `var` in each depth bin, plus metadata (if include_meta=True)
    """
    # Only fall back to the default bins when none were passed in
    if depth_bins is None:
        depth_bins = np.arange(0, 210, 10)  # 0,10,...,200

    z = group["PRES"].values
    c = group[var].values

    out = {}

    if include_meta:  # metadata
        out["TIME"]      = group["TIME"].iloc[0]
        out["LATITUDE"]  = group["LATITUDE"].iloc[0]
        out["LONGITUDE"] = group["LONGITUDE"].iloc[0]

    # depth-binned means
    for z0, z1 in zip(depth_bins[:-1], depth_bins[1:]):
        col1 = f"{var}_{int(z0)}_{int(z1)}"
        col2 = f"{var}_{int(z0)}_{int(z1)}_N"
        out[col1], out[col2] = average_profile(z, c, z_min=z0, z_max=z1)

    return pd.Series(out)

summarize_profile_binned returns the depth-binned averages for each profile.

# convert the xarray dataset from Argo to dataframe
df = ds_na.to_dataframe().reset_index()

# QC first if desired
df_qc = df[df["CHLA_QC"].isin([1, 2])]

# Get the binned averages and add on some metadata
df_points = (
    df_qc
    .groupby(["PLATFORM_NUMBER", "CYCLE_NUMBER"], as_index=False)
    .apply(summarize_profile_binned, var="CHLA", include_meta=True, include_groups=False)   # default 0–200m in 10m bins
    .reset_index(drop=True)
)

df_points.head()

	PLATFORM_NUMBER	CYCLE_NUMBER	TIME	LATITUDE	LONGITUDE	CHLA_0_10	CHLA_0_10_N	CHLA_10_20	CHLA_10_20_N	CHLA_20_30	...	CHLA_150_160	CHLA_150_160_N	CHLA_160_170	CHLA_160_170_N	CHLA_170_180	CHLA_170_180_N	CHLA_180_190	CHLA_180_190_N	CHLA_190_200	CHLA_190_200_N
0	1902383	78	2024-03-02 16:13:24.002000128	23.0209	-53.2220	NaN	0	0.017404	1	0.016929	...	0.129163	5	0.101144	5	0.054446	5	0.042098	5	0.024844	5
1	1902383	79	2024-03-12 15:19:00.002000128	22.7279	-52.3932	0.011863	1	0.013578	6	0.012971	...	0.126472	5	0.095604	5	0.058087	5	0.041782	5	0.025952	5
2	1902383	80	2024-03-22 15:28:33.002000128	23.2649	-52.0005	NaN	0	0.023577	5	0.015662	...	0.065052	5	0.057928	5	0.062677	5	0.045739	5	0.027851	5
3	1902384	79	2024-03-08 20:58:18.002000128	20.0154	-42.7101	0.027295	5	0.028305	5	0.029748	...	0.136242	5	0.112432	5	0.074770	5	0.042880	5	0.028594	5
4	1902384	80	2024-03-18 20:31:41.002000128	20.3018	-42.5525	0.016256	4	0.018060	5	0.019359	...	0.201177	5	0.124553	5	0.078233	5	0.050095	5	0.037541	5

5 rows × 45 columns
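The 45 columns follow from the naming scheme in summarize_profile_binned: five identifier and metadata columns, plus a mean and a sample count (`_N`) for each of the 20 depth bins. A short sketch that rebuilds the expected column names:

```python
import numpy as np

var = "CHLA"
depth_bins = np.arange(0, 210, 10)  # bin edges 0, 10, ..., 200

meta_cols = ["PLATFORM_NUMBER", "CYCLE_NUMBER", "TIME", "LATITUDE", "LONGITUDE"]

metric_cols = []
for z0, z1 in zip(depth_bins[:-1], depth_bins[1:]):
    metric_cols.append(f"{var}_{int(z0)}_{int(z1)}")    # bin mean
    metric_cols.append(f"{var}_{int(z0)}_{int(z1)}_N")  # number of samples in the bin

all_cols = meta_cols + metric_cols
print(len(all_cols))  # 5 + 20 * 2 = 45
```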

Summary#

Now you can query the Bio-Argo ERDDAP server and get Argo profiles for any region and time frame. You can filter the data by quality and get averages for different depths.

Stop here if you just need to know how to get Bio-Argo data. In the next section, I show how I queried for the whole globe and assembled the results into monthly 10-20 MB netCDF files, which I process into a 2 MB dataframe with one row for each profile (a single descend/ascend cycle for a buoy) containing the binned CHLA averages. This dataframe is small enough to keep on GitHub with the tutorials. It has all profiles: the CHLA data were not filtered by quality, and CHLA may be missing for many depth bins, including the surface. You can load and plot this file as follows.

# Load data from GitHub
import pandas as pd
url = (
    "https://raw.githubusercontent.com/"
    "fish-pace/fish-pace-datasets/main/"
    "datasets/chla_z/data/CHLA_argo_profiles.parquet"
)
df = pd.read_parquet(url)
print(f"\nNumber of profiles in the dataset {len(df)}\n")
df.head()

Number of profiles in the dataset 15833

	profile_id	PLATFORM_NUMBER	CYCLE_NUMBER	TIME	LATITUDE	LONGITUDE	CHLA_0_10	CHLA_0_10_N	CHLA_10_20	CHLA_10_20_N	...	CHLA_150_160	CHLA_150_160_N	CHLA_160_170	CHLA_160_170_N	CHLA_170_180	CHLA_170_180_N	CHLA_180_190	CHLA_180_190_N	CHLA_190_200	CHLA_190_200_N
0	1902304_0155	1902304	155	2024-03-01 21:23:16.002000128	54.6582	-19.2434	0.150386	4	0.158043	5	...	0.157366	5	0.158550	5	0.154490	5	0.153305	5	0.154151	5
1	1902304_0156	1902304	156	2024-03-11 20:45:53.002000128	54.9187	-18.9609	0.155674	4	0.156858	5	...	0.131986	5	0.135201	5	0.135539	5	0.136385	5	0.134524	5
2	1902304_0157	1902304	157	2024-03-21 21:21:39.002000128	55.2967	-18.8331	0.195647	4	0.202542	5	...	0.202204	5	0.200004	5	0.198820	5	0.194590	5	0.198651	5
3	1902304_0158	1902304	158	2024-03-31 21:31:53.002000128	55.7268	-18.8653	0.190783	4	0.193575	5	...	0.155336	5	0.153982	5	0.143492	5	0.135201	5	0.143830	5
4	1902380_0079	1902380	79	2024-03-07 18:01:17.002000128	17.6665	-46.0155	0.008062	4	0.009940	5	...	0.105501	5	0.066193	5	0.053091	5	0.043657	5	0.031253	5

5 rows × 46 columns
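Each row of this file is a wide record with one column per depth bin. To look at a single profile as CHLA against depth, the bin columns can be paired with the bin midpoints. The sketch below uses an invented stand-in row so that it runs without downloading anything; with the real file you would use a row such as `df.iloc[0]` instead.

```python
import numpy as np
import pandas as pd

depth_bins = np.arange(0, 210, 10)  # bin edges 0, 10, ..., 200
mean_cols = [
    f"CHLA_{int(z0)}_{int(z1)}"
    for z0, z1 in zip(depth_bins[:-1], depth_bins[1:])
]

# Stand-in for one row of the file; the values are invented
row = pd.Series(np.linspace(0.05, 0.24, len(mean_cols)), index=mean_cols)
row["CHLA_0_10"] = np.nan  # bins can be missing, including the surface bin

# Long format: one line per depth bin, dropping bins with no data
profile = pd.DataFrame({
    "depth_mid": (depth_bins[:-1] + depth_bins[1:]) / 2,  # 5, 15, ..., 195
    "CHLA": row[mean_cols].to_numpy(dtype=float),
}).dropna()
print(profile.head())
```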

# plot the data
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature

proj = ccrs.PlateCarree()

fig = plt.figure(figsize=(8, 6))
ax = plt.axes(projection=proj)

# Add coastlines and land
ax.coastlines(resolution="110m")
ax.add_feature(cfeature.LAND, facecolor="0.9")
ax.add_feature(cfeature.OCEAN, facecolor="white")
ax.gridlines(draw_labels=True, linestyle="--", alpha=0.5)

df_clean = df.dropna(subset=["CHLA_0_10"])

# Add the Argo points from df
ax.scatter(
    df_clean["LONGITUDE"].values,
    df_clean["LATITUDE"].values,
    s=1,
    marker=".",
    transform=proj,
)

ax.set_title("Argo CHLA data")

plt.show()

[Figure: map titled "Argo CHLA data"]

Each dot is a profile (descend/ascend cycle for one buoy). Some regions are better represented than others.
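One simple way to quantify how evenly the profiles are spread is to count them by latitude band. The sketch below uses a handful of invented latitudes; with the real file you would pass `df_clean["LATITUDE"]` instead. The 30° band width is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Invented profile latitudes; replace with df_clean["LATITUDE"] for the real data
lat = pd.Series([54.7, 55.3, 17.7, 20.0, -42.0, -65.5, 23.0])

bands = np.arange(-90, 91, 30)  # band edges -90, -60, ..., 90

# Count profiles in each band, keeping the bands in south-to-north order
counts = pd.cut(lat, bins=bands).value_counts(sort=False)
print(counts)
```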

If you want to see how I assembled the global file, read below.

Process data from the whole globe#

I will show 2 workflows. The first one gets Argo data and processes it into a dataframe with CHLA binned averages for each profile (descend/ascend cycle for a buoy). The dataframe has one row per Argo profile. Then these dataframes are assembled into one parquet file for the whole globe. The problem with this is that if you change your mind about how to summarize the Argo data, you have to query the Argo ERDDAP server again. If you are doing a lot of experimentation, you keep hitting their server over and over, which is slow and an excessive use of their server. So in the second workflow, I saved 10-20 MB netCDF files with all Bio-Argo data for a month and then worked with those as I experimented with how to summarize the profile data (measurements at each depth).
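The idea behind the second workflow is "fetch once, summarize many times". Here is a minimal sketch of that pattern. The names `load_or_fetch` and `fake_fetch` are hypothetical, and CSV is used only to keep the sketch free of extra dependencies; the workflow described above stores netCDF files.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_or_fetch(path, fetch):
    """Return cached raw data if present; otherwise call fetch() once and cache it."""
    path = Path(path)
    if path.exists():
        return pd.read_csv(path)
    df = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return df


calls = []

def fake_fetch():
    # Hypothetical stand-in for a slow server query
    calls.append(1)
    return pd.DataFrame({"PRES": [5.0, 15.0], "CHLA": [0.1, 0.2]})


cache_dir = Path(tempfile.mkdtemp())
raw1 = load_or_fetch(cache_dir / "raw_2024-03.csv", fake_fetch)  # queries the "server"
raw2 = load_or_fetch(cache_dir / "raw_2024-03.csv", fake_fetch)  # read from disk
print(len(calls))
```

With the raw data cached locally, changing the summary (different bins, different QC rules) only requires rereading the local file.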

Workflow 1: Process into profile summaries directly#

There are 3 steps, which take in total about 4 hours.

Get profile summaries for a region and month. To do this, create the function get_bgc_profile(). It fetches Bio-Argo profile data for a region and month and optionally saves it as parquet. It returns a dataframe with one row per profile (a descend/ascend cycle for a buoy) containing the whole binned CHLA profile (averages for different depth bins), using our summarize_profile_binned() function from above.
Use a for loop to work through the whole globe and save the monthly parquet files. This is the step that takes about 4 hours.
Merge these files all together into one file to put on GitHub.

A function to get one month `get_bgc_profile()`#

Get a month for a region. Return a dataframe with one row per profile. The function can fetch several variables, such as CHLA, DOXY, NITRATE, BBP700, TEMP and PSAL. However, if we pass in multiple variables, then all must be present, so to maximize the data, pass in only one variable at a time unless you need paired measurements.
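Inside get_bgc_profile(), the month is converted into a start date and an end date that are appended to the spatial box before it is handed to argopy. That step can be sketched on its own; the helper name `month_region` is hypothetical and exists only for this illustration.

```python
import pandas as pd


def month_region(box, mon):
    """Append month start/end dates to [lon_min, lon_max, lat_min, lat_max, z_min, z_max]."""
    start = pd.to_datetime(mon).to_period("M").start_time
    end = start + pd.offsets.MonthBegin(1)  # first day of the following month
    return list(box) + [start.strftime("%Y-%m-%d"), end.strftime("%Y-%m-%d")]


region = month_region([-55, -50, 20, 25, 0, 200], "2024-03")
print(region)
```

Using the first day of the following month as the end date means December rolls over into January of the next year without special handling.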

# get_bgc_profile() function
from pathlib import Path

import numpy as np
import pandas as pd
from argopy import DataFetcher as ArgoDataFetcher


def get_bgc_profile(reg, mon, data_dir="data", vars=None, save=False):
    """
    Fetch profile Bio-Argo data for a region and month and (optionally) save as parquet.
    Returns dataframe with one row per profile (a descend/ascend cycle for a buoy)
    with the whole CHLA binned profile (averages for different depth bins).

    Parameters
    ----------
    reg : sequence
        [lon_min, lon_max, lat_min, lat_max, depth_min, depth_max]
        (same as argopy region, but without time bounds).
    mon : str or datetime-like
        Month to fetch, e.g. "2024-03" or "2024-03-01".
    data_dir : str, optional
        Directory where the parquet file will be saved.
    vars : list of str, optional
        Bio-Argo variables to fetch and summarize (e.g. ["CHLA", "BBP700", "DOXY"]).
        Defaults to ["CHLA"].
    save : bool, optional
        If True, save a parquet file and return (df_points, path).
        If False, just return (df_points, None).

    Returns
    -------
    df_points : pandas.DataFrame
        One row per profile with the depth-binned means of the requested variables.
    out_path : str or None
        Path to the saved parquet file, or None if save=False.
    """
    if vars is None:
        vars = ["CHLA"]  # default behaviour

    # Required base columns in the data from Bio-Argo
    base_cols = [
        "PLATFORM_NUMBER", "CYCLE_NUMBER",
        "TIME", "LATITUDE", "LONGITUDE", "PRES",
    ]

    # De-duplicate and drop any that are base columns (e.g. PRES)
    extra_vars_raw = list(dict.fromkeys(vars))
    extra_vars = [v for v in extra_vars_raw if v not in base_cols]

    def extra_cols(var):
        depth_bins = np.arange(0, 210, 10)  # 0,10,...,200
        cols = []
        for v in var:
            cols.append(f"{v}_INT_150")
            for z0, z1 in zip(depth_bins[:-1], depth_bins[1:]):
                cols.append(f"{v}_{int(z0)}_{int(z1)}")
        return cols
    # column names of the metrics, like CHLA_0_10
    var_metric_cols = extra_cols(extra_vars)

    # Columns that we need for processing the argo data
    argo_cols = base_cols + extra_vars
    # Final schema for the output DataFrame: base - PRES + colnames of points (like CHLA_INT_150)
    final_cols = base_cols + var_metric_cols
    final_cols.remove("PRES")

    lon_min, lon_max, lat_min, lat_max, z_min, z_max = reg

    # Compute start/end of the month
    mon_start = pd.to_datetime(mon).to_period("M").start_time
    mon_end = (mon_start + pd.offsets.MonthBegin(1))

    region = [
        lon_min, lon_max,
        lat_min, lat_max,
        z_min, z_max,
        mon_start.strftime("%Y-%m-%d"),
        mon_end.strftime("%Y-%m-%d"),
    ]

    # STEP 1: Get data for region + month
    # We always request PRES, plus user extras (not TEMP/PSAL unless user asked)
    param_vars = list(dict.fromkeys(extra_vars + ["PRES"]))

    fetcher = ArgoDataFetcher(
        ds="bgc",
        src="erddap",
        params=param_vars,
    )

    try:
        ds = fetcher.region(region).to_xarray()
    except Exception as exc:
        print(f"No data for region={reg}, month={mon}: {exc}")
        # Return empty frame with full schema
        return pd.DataFrame(columns=final_cols), None

    # Bail if argopy returned an empty ds
    if "N_POINTS" in ds.sizes and ds.sizes["N_POINTS"] == 0:
        print(f"No data rows (N_POINTS=0) for region={reg}, month={mon}")
        return pd.DataFrame(columns=final_cols), None
    
    # Also guard against “all dims zero”
    if all(size == 0 for size in ds.sizes.values()):
        print(f"No data rows (all dims zero) for region={reg}, month={mon}")
        return pd.DataFrame(columns=final_cols), None    
        
    # STEP 2: Compute point metrics
    ds_vars = set(ds.data_vars) | set(ds.coords)

    # We only select columns that exist, but we *remember* full schema in argo_cols
    qc_cols = [f"{v}_QC" for v in extra_vars if f"{v}_QC" in ds_vars]
    available_cols = [c for c in argo_cols if c in ds_vars] + qc_cols
    available_cols = list(dict.fromkeys(available_cols))  # de-dup

    if not available_cols:
        print(f"No requested columns in dataset for region={reg}, month={mon}")
        return pd.DataFrame(columns=final_cols), None
        
    df_all = ds[available_cols].to_dataframe().reset_index(drop=True)

    if df_all.empty:
        print(f"No data rows for region={reg}, month={mon}")
        return pd.DataFrame(columns=final_cols), None

    # Ensure all required base columns are present before aggregating
    missing_base = [c for c in base_cols if c not in df_all.columns]
    if missing_base:
        raise ValueError(
            f"Missing required base columns {missing_base} in argo data "
            f"for region={reg}, month={mon}"
        )

    df_clean = df_all.copy()

    # STEP 2a: Base per-profile aggregation
    agg_dict = {
        "TIME": ("TIME", "first"),
        "LATITUDE": ("LATITUDE", "first"),
        "LONGITUDE": ("LONGITUDE", "first"),
    }

    df_points = (
        df_clean
        .groupby(["PLATFORM_NUMBER", "CYCLE_NUMBER"], as_index=False)
        .agg(**agg_dict)
    )

    # STEP 2b: per-variable metrics (using QC if available)
    for v in extra_vars:
        if v not in df_clean.columns:
            # We will add it later as NaN for schema consistency
            continue

        df_var = df_clean.copy()
        qc_col = f"{v}_QC"
        # If a QC column exists, filter by good values
        if qc_col in df_var.columns:
            df_var = df_var[df_var[qc_col].isin([1, 2])]

        if df_var.empty:
            # no good data for this variable; we'll add NaNs later
            continue

        df_points_var = (
            df_var
            .groupby(["PLATFORM_NUMBER", "CYCLE_NUMBER"], as_index=False)
            .apply(summarize_profile_binned, var=v, include_groups=False)   # default 0–200m in 10m bins
        )

        df_points = df_points.merge(
            df_points_var,
            on=["PLATFORM_NUMBER", "CYCLE_NUMBER"],
            how="left",
        )

    # STEP 2c: ensure all extras exist as columns (NaN if no data)
    for v in var_metric_cols:
        if v not in df_points.columns:
            df_points[v] = np.nan

    # Sanity check: all cols in final_cols must exist now
    missing_base_final = [c for c in final_cols if c not in df_points.columns]
    if missing_base_final:
        raise ValueError(
            f"After aggregation, missing base columns {missing_base_final} in df_points "
            f"for region={reg}, month={mon}"
        )

    # Reorder columns to match target schema exactly
    df_points = df_points.reindex(columns=final_cols)

    # STEP 3: Save to parquet in data/ dir (optional)
    # Only create the output directory when we actually write a file
    if save:
        data_path = Path(data_dir)
        data_path.mkdir(parents=True, exist_ok=True)

        out_fname = (
            f"argo_bgc_{lat_min}_{lat_max}_{lon_min}_{lon_max}_"
            f"{mon_start.strftime('%Y%m')}.parquet"
        )
        out_path = data_path / out_fname
        df_points.to_parquet(out_path, index=False)
        return df_points, str(out_path)

    return df_points, None
    

%%time
# Example
params = ["CHLA"]
# [lon_min, lon_max, lat_min, lat_max, depth_min, depth_max]
region = [-55, -50, 20, 25, 0, 200]
month = "2024-03"
df_surf, _ = get_bgc_profile(region, month, vars=params)
df_surf.head()

CPU times: user 2.9 s, sys: 94.8 ms, total: 3 s
Wall time: 8.05 s

	PLATFORM_NUMBER	CYCLE_NUMBER	TIME	LATITUDE	LONGITUDE	CHLA_INT_150	CHLA_0_10	CHLA_10_20	CHLA_20_30	CHLA_30_40	...	CHLA_100_110	CHLA_110_120	CHLA_120_130	CHLA_130_140	CHLA_140_150	CHLA_150_160	CHLA_160_170	CHLA_170_180	CHLA_180_190	CHLA_190_200
0	1902383	78	2024-03-02 16:13:24.002000128	23.0209	-53.2220	0.085970	NaN	0.017404	0.016929	0.021678	...	0.107001	0.176812	0.258653	0.211954	0.170480	0.129163	0.101144	0.054446	0.042098	0.024844
1	1902383	79	2024-03-12 15:19:00.002000128	22.7279	-52.3932	0.109202	0.011863	0.013578	0.012971	0.017404	...	0.293004	0.237599	0.196283	0.168263	0.160982	0.126472	0.095604	0.058087	0.041782	0.025952
2	1902383	80	2024-03-22 15:28:33.002000128	23.2649	-52.0005	0.087394	NaN	0.023577	0.015662	0.018353	...	NaN	0.288097	0.201032	0.145627	0.105102	0.065052	0.057928	0.062677	0.045739	0.027851
3	1902385	78	2024-03-04 07:40:06.000999936	24.4592	-54.8044	0.075974	0.026190	0.026150	0.026465	0.025993	...	0.080926	0.139006	0.220854	0.195828	0.165449	0.120276	0.062982	0.037483	0.022845	0.010410
4	1902385	79	2024-03-14 09:51:52.000999936	24.5896	-54.7481	0.076172	NaN	NaN	NaN	0.023632	...	0.104378	0.171273	0.229354	0.190161	0.136803	0.106267	0.067389	0.045511	0.027724	0.013716

5 rows × 26 columns

df_surf.columns

Index(['PLATFORM_NUMBER', 'CYCLE_NUMBER', 'TIME', 'LATITUDE', 'LONGITUDE',
       'CHLA_INT_150', 'CHLA_0_10', 'CHLA_10_20', 'CHLA_20_30', 'CHLA_30_40',
       'CHLA_40_50', 'CHLA_50_60', 'CHLA_60_70', 'CHLA_70_80', 'CHLA_80_90',
       'CHLA_90_100', 'CHLA_100_110', 'CHLA_110_120', 'CHLA_120_130',
       'CHLA_130_140', 'CHLA_140_150', 'CHLA_150_160', 'CHLA_160_170',
       'CHLA_170_180', 'CHLA_180_190', 'CHLA_190_200'],
      dtype='object')

A for loop to work through the whole globe#

Loop through the globe and save monthly files. We will do this for all months when there is PACE data. This will take about four hours, but we only have to do it once. The %%script false line is added to prevent accidentally running the cell.
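The loop tiles the globe into boxes of 60° of longitude by 45° of latitude, each covering 0 to 200 dbar. As a quick sanity check on that tiling, the boxes can be listed without contacting any server:

```python
# Each box is [lon_min, lon_max, lat_min, lat_max, depth_min, depth_max]
boxes = [
    [lon_min, lon_min + 60, lat_min, lat_min + 45, 0, 200]
    for lat_min in range(-90, 90, 45)     # 4 latitude bands
    for lon_min in range(-180, 180, 60)   # 6 longitude bands
]

print(len(boxes))
print(boxes[0], boxes[-1])
```

That is one request per box per month, which gives a sense of why the full loop takes hours.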

%%script false --no-raise-error
# comment out the above to run; this just prevents inadvertent running
# since this loops through the whole globe
import os
from pathlib import Path
import numpy as np
import pandas as pd

# Variables to get if avail. NaN if missing
BGC_VARS = ["CHLA"]
out_dir = Path("_temp_data/chla_profile")
out_dir.mkdir(parents=True, exist_ok=True)

# Months from 2024-03 up to the current month
start_month = "2024-03"
end_month = pd.Timestamp.today().to_period("M")
months = pd.period_range(start_month, end_month, freq="M")

for mon in months:
    month_str = mon.strftime("%Y-%m")
    out_path = out_dir / f"argo_bgc_global_{month_str}.parquet"

    # Skip if we already have this month (so reruns don't redo everything)
    if out_path.exists():
        print(f"Skipping {month_str}, already have {out_path}")
        continue

    print(f"\n=== Processing month {month_str} ===")

    dfs = []  # collect all boxes for this month

    # Latitude: 45° bands from -90 to 90 -> 4 bands
    for lat_min in range(-90, 90, 45):
        lat_max = lat_min + 45

        # Longitude: 60° bands from -180 to 180 -> 6 bands
        for lon_min in range(-180, 180, 60):
            lon_max = lon_min + 60
            region = [lon_min, lon_max, lat_min, lat_max, 0, 200]
            #print(f" -- region={region}")
            df_box, _ = get_bgc_profile(region, month_str, vars=BGC_VARS, save=False)
            if df_box is None or df_box.empty:
                continue
            dfs.append(df_box)

    # Desired column order (core metadata + all BGC vars)
    cols = [
        'profile_id', 'PLATFORM_NUMBER', 'CYCLE_NUMBER', 
        'TIME', 'LATITUDE', 'LONGITUDE',
       'CHLA_0_10', 'CHLA_10_20', 'CHLA_20_30', 'CHLA_30_40',
       'CHLA_40_50', 'CHLA_50_60', 'CHLA_60_70', 'CHLA_70_80', 'CHLA_80_90',
       'CHLA_90_100', 'CHLA_100_110', 'CHLA_110_120', 'CHLA_120_130',
       'CHLA_130_140', 'CHLA_140_150', 'CHLA_150_160', 'CHLA_160_170',
       'CHLA_170_180', 'CHLA_180_190', 'CHLA_190_200', 
    ]

    if not dfs:
        print(f"No data at all for month {month_str}, writing empty file")
        df_month = pd.DataFrame(columns=cols)
    else:
        df_month = pd.concat(dfs, ignore_index=True)

        # Create a stable profile_id: e.g. "6901234_0042"
        if "profile_id" not in df_month.columns:
            df_month["profile_id"] = (
                df_month["PLATFORM_NUMBER"].astype(int).astype(str).str.zfill(7)
                + "_"
                + df_month["CYCLE_NUMBER"].astype(int).astype(str).str.zfill(4)
            )

        # Make sure all expected columns exist; if missing, add as NaN
        for c in cols:
            if c not in df_month.columns:
                df_month[c] = np.nan

        # Now safely reorder columns
        df_month = df_month[cols]

    # Save one file per month
    df_month.to_parquet(out_path, index=False)
    print(f"Saved {len(df_month)} rows for {month_str} to {out_path}")
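The profile_id built in the loop is the platform number padded to 7 digits, an underscore, and the cycle number padded to 4 digits. A small standalone sketch of that construction, using two platform/cycle pairs that appear in the table earlier on this page:

```python
import pandas as pd

# Platform and cycle numbers may arrive as floats after concatenation
df = pd.DataFrame({
    "PLATFORM_NUMBER": [1902304.0, 1902380.0],
    "CYCLE_NUMBER": [155.0, 79.0],
})

df["profile_id"] = (
    df["PLATFORM_NUMBER"].astype(int).astype(str).str.zfill(7)
    + "_"
    + df["CYCLE_NUMBER"].astype(int).astype(str).str.zfill(4)
)
print(df["profile_id"].tolist())
```

Zero-padding keeps the ids the same width, so sorting them as strings also sorts them by platform and cycle.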

Merge all monthly files together into one parquet#

%%script false --no-raise-error
# requires above to have run
# example of one monthly file
from pathlib import Path
import numpy as np
import pandas as pd

var_dir = Path("_temp_data/chla_profile")
# these are monthly files
files = sorted(var_dir.glob("argo_bgc_global_*.parquet"))
df = pd.read_parquet(files[0])
df.head()

# function to process the monthly files
from pathlib import Path
import numpy as np
import pandas as pd

def merge_bgc_monthlies(var_dir, out_path):
    """
    Merge monthly Bio-Argo parquet files for a single variable (e.g., CHLA or BBP700).

    Parameters
    ----------
    var_dir : str or Path
        Directory containing monthly parquet files like 'argo_bgc_global_YYYY-MM.parquet'.
    out_path : str or Path
        Output parquet path for the merged dataset.
    """
    var_dir = Path(var_dir)
    files = sorted(var_dir.glob("argo_bgc_global_*.parquet"))

    if not files:
        raise FileNotFoundError(f"No parquet files found in {var_dir}")

    dfs = []
    for f in files:
        print(f"Reading {f}")
        df = pd.read_parquet(f)
        dfs.append(df)

    # Concatenate all months
    df_all = pd.concat(dfs, ignore_index=True)

    # Save merged parquet
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    df_all.to_parquet(out_path, index=False)
    print(f"Saved merged dataset with {len(df_all)} rows to {out_path}")

%%script false --no-raise-error
# requires above to have run
# %%script to prevent accidentally rerunning this code
outfile = "_temp_data/argo_bgc_global_profile_CHLA.parquet"
merge_bgc_monthlies("_temp_data/chla_profile", outfile)

Once you have the data file on your computer, you can load it as shown below.

%%script false --no-raise-error
# Example. Requires above to have run
# Load from local file
from pathlib import Path
import numpy as np
import pandas as pd

file = "_temp_data/argo_bgc_global_profile_CHLA.parquet"
df = pd.read_parquet(file)

Process the monthly files#

Now that the data are stored locally, we can easily create different metrics. Here I create a profile dataframe with depth-binned CHLA averages for each profile. To do this, I loop through the monthly files, build a list with one dataframe per month, and then concatenate the list into a single dataframe covering all months. This is the final dataframe that goes on GitHub. It is about 2.2 MB.

# Example of doing this for one month file
import xarray as xr
ds = xr.open_dataset("_temp_data/CHLA_argo_monthly_nc/argo_bgc_global_2025-07.nc")

df = ds.to_dataframe().reset_index()

# QC first
#df_qc = df[df["CHLA_QC"].isin([1, 2])]

df_points = (
    df
    .groupby(["PLATFORM_NUMBER", "CYCLE_NUMBER"], as_index=False)
    .apply(summarize_profile_binned, var="CHLA", include_meta=True, include_groups=False)   # default 0–200m in 10m bins
    .reset_index(drop=True)
)
df_points.shape
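
`summarize_profile_binned()` is defined earlier in this notebook. To make the idea of binned depth averages concrete, here is a separate minimal sketch for a single made-up profile using `pd.cut`. The helper name `bin_profile` and the toy values are assumptions for illustration, and this is not the notebook's own function. The output names follow the `CHLA_A_B` and `CHLA_A_B_N` pattern used in the final file.

```python
import numpy as np
import pandas as pd

def bin_profile(df, var="CHLA", max_depth=200, step=10):
    """Mean and count of `var` in [A, B) pressure bins for one profile."""
    edges = np.arange(0, max_depth + step, step)
    # right=False gives intervals with PRES >= A and PRES < B
    bins = pd.cut(df["PRES"], bins=edges, right=False)
    # observed=False keeps empty bins (mean NaN, count 0)
    grouped = df.groupby(bins, observed=False)[var]
    means, counts = grouped.mean(), grouped.count()
    out = {}
    for interval, m, n in zip(means.index, means.to_numpy(), counts.to_numpy()):
        a, b = int(interval.left), int(interval.right)
        out[f"{var}_{a}_{b}"] = m
        out[f"{var}_{a}_{b}_N"] = int(n)
    return pd.Series(out)

# One made-up profile: pressure in dbar and CHLA in mg m-3
profile = pd.DataFrame({
    "PRES": [2.0, 6.0, 12.0, 18.0, 25.0],
    "CHLA": [0.10, 0.20, 0.30, 0.50, 0.40],
})
summary = bin_profile(profile)
print(summary[["CHLA_0_10", "CHLA_0_10_N", "CHLA_10_20", "CHLA_10_20_N"]])
```

Bins with no measurements come out as NaN with a count of 0, so a profile is kept even when some of its binned averages are missing.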

# Run through all the months and create a list with dataframes from each month
from pathlib import Path
import xarray as xr

var_dir = Path("_temp_data/CHLA_argo_monthly_nc")
files = sorted(var_dir.glob("argo_bgc_global_*.nc"))

profile_rows = []

for f in files:
    print(f"Reading {f}")
    ds = xr.open_dataset(f)
    df = ds.to_dataframe().reset_index()

    df_profiles = (
        df
        .groupby(["PLATFORM_NUMBER", "CYCLE_NUMBER"], as_index=False)
        .apply(summarize_profile_binned, var="CHLA", include_meta=True, include_groups=False)   # default 0–200m in 10m bins
        .reset_index(drop=True)
    )
    profile_rows.append(df_profiles)

Reading data/argo_monthly_nc/argo_bgc_global_2024-03.nc
Reading data/argo_monthly_nc/argo_bgc_global_2024-04.nc
Reading data/argo_monthly_nc/argo_bgc_global_2024-05.nc
Reading data/argo_monthly_nc/argo_bgc_global_2024-06.nc
Reading data/argo_monthly_nc/argo_bgc_global_2024-07.nc
Reading data/argo_monthly_nc/argo_bgc_global_2024-08.nc
Reading data/argo_monthly_nc/argo_bgc_global_2024-09.nc
Reading data/argo_monthly_nc/argo_bgc_global_2024-10.nc
Reading data/argo_monthly_nc/argo_bgc_global_2024-11.nc
Reading data/argo_monthly_nc/argo_bgc_global_2024-12.nc
Reading data/argo_monthly_nc/argo_bgc_global_2025-01.nc
Reading data/argo_monthly_nc/argo_bgc_global_2025-02.nc
Reading data/argo_monthly_nc/argo_bgc_global_2025-03.nc
Reading data/argo_monthly_nc/argo_bgc_global_2025-04.nc
Reading data/argo_monthly_nc/argo_bgc_global_2025-05.nc
Reading data/argo_monthly_nc/argo_bgc_global_2025-06.nc
Reading data/argo_monthly_nc/argo_bgc_global_2025-07.nc
Reading data/argo_monthly_nc/argo_bgc_global_2025-08.nc
Reading data/argo_monthly_nc/argo_bgc_global_2025-09.nc
Reading data/argo_monthly_nc/argo_bgc_global_2025-10.nc
Reading data/argo_monthly_nc/argo_bgc_global_2025-11.nc

# Merge the list of dataframes
profiles = pd.concat(profile_rows, ignore_index=True)
# Create a stable profile_id: e.g. "6901234_0042"
profiles["profile_id"] = (
    profiles["PLATFORM_NUMBER"].astype(int).astype(str).str.zfill(7)
    + "_"
    + profiles["CYCLE_NUMBER"].astype(int).astype(str).str.zfill(4)
)
# move profile_id to the start instead of the end
col = "profile_id"
profiles = profiles[[col] + [c for c in profiles.columns if c != col]]
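
Since `profile_id` is meant to identify one row per profile, it can be worth checking that it is unique after concatenating the months. Here is a small sketch with made-up values in which one profile deliberately appears twice.

```python
import pandas as pd

# Toy profiles table; the last row repeats platform 1902304, cycle 155
toy = pd.DataFrame({
    "PLATFORM_NUMBER": [1902304, 1902304, 1902380, 1902304],
    "CYCLE_NUMBER": [155, 156, 79, 155],
})
toy["profile_id"] = (
    toy["PLATFORM_NUMBER"].astype(int).astype(str).str.zfill(7)
    + "_"
    + toy["CYCLE_NUMBER"].astype(int).astype(str).str.zfill(4)
)

# True only if every profile_id occurs exactly once
print(toy["profile_id"].is_unique)

# Show every row that shares its profile_id with another row
dups = toy[toy["profile_id"].duplicated(keep=False)]
print(dups)
```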

Save the final version with metadata#

Use the pyarrow package to add metadata. Adding thorough metadata at the end documents exactly how this parquet file was created.

import pyarrow as pa
import pyarrow.parquet as pq
from datetime import datetime, timezone

df = profiles

table = pa.Table.from_pandas(df)

file_meta = {
    "title": "Global Bio-Argo CHLA profile metrics (0 to 200 m, 10 m bins)",
    "creator": "Eli Holmes / NOAA https://orcid.org/0000-0001-9128-8393",
    # datetime.utcnow() is deprecated since Python 3.12; use a timezone-aware UTC time
    "created": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ"),
    "source": "BGC-Argo (via argopy). Argo (2000). Argo float data and metadata from Global Data Assembly Centre (Argo GDAC). SEANOE. https://doi.org/10.17882/42182",
    "description": (
        "All BGC-Argo data Mar 2024 to Nov 2025 with CHLA variable was downloaded."
        " Per-profile depth-binned CHLA means (0–200 m by 10 m bins) computed for each depth bin."
        " No QC filtering on the values was done using the CHLA_QC variable."
        " All profiles kept even if some binned averages were missing."
    ),
    "profile_id_definition": "profile_id = PLATFORM_NUMBER (7 digits) + '_' + CYCLE_NUMBER (4 digits)",
    "PLATFORM_NUMBER_definition": "PLATFORM_NUMBER from BGC-Argo identifying the buoy.",
    "CYCLE_NUMBER_definition": "CYCLE_NUMBER from BGC-Argo identifying the ascent/descent cycle.",
    "TIME_definition": "TIME in UTC from BGC-Argo. One time is assigned to each ascent/descent cycle.",
    "LATITUDE_definition": "LATITUDE from BGC-Argo. One is assigned to each ascent/descent cycle.",
    "LONGITUDE_definition": "LONGITUDE from BGC-Argo. One is assigned to each ascent/descent cycle.",
    "CHLA_A_B_definition": (
        "Depth binned averages of CHLA. Computed as the average of all individual CHLA measurements within the pressure interval "
        "(PRES>=A and PRES<B), where PRES is dbar and signifies depth."
        "No QC done on the CHLA data before averaging."
    ),
    "CHLA_A_B_N_definition": (
        "Number of individual CHLA measurements within the pressure interval "
        "(PRES>=A and PRES<B) filter used to compute the depth-binned mean."
    ),
    "variable_LATITUDE_standard_name": "latitude",
    "variable_LATITUDE_units": "degrees_north",
    "variable_LONGITUDE_standard_name": "longitude",
    "variable_LONGITUDE_units": "degrees_east",
    "variable_TIME_standard_name": "time",   
    "variable_TIME_units": "UTC",
    "variable_CHLA_A_B_standard_name": "mass_concentration_of_chlorophyll_a_in_sea_water",
    "variable_CHLA_A_B_units": "mg m-3",
    "variable_CHLA_A_B_N_long_name": "count of raw CHLA measurements in each depth bin",
    "variable_CHLA_A_B_N_units": "1",
    "CHLA_processing_description": (
        "CHLA values were taken directly from the BGC-Argo variable 'CHLA' "
        "(mg m-3). No additional sensor corrections or non-photochemical "
        "quenching adjustments were applied. CHLA_QC values were not used to "
        "filter measurements. CHLA measurements were aggregated into 10 m "
        "pressure bins between 0 and 200 dbar using arithmetic means."
    ),
    "CHLA_measurement_description": (
        "BGC-Argo chlorophyll-a (CHLA) is measured using a submersible "
        "chlorophyll fluorometer mounted on the float. The sensor emits blue light "
        "(~470 nm) and detects the resulting chlorophyll fluorescence near ~695 nm. "
        "Fluorescence intensity is converted onboard to chlorophyll-a concentration "
        "using factory calibration coefficients and reported in mg m-3 as the raw "
        "'CHLA' variable. Additional processing recommended by the BGC-Argo community "
        "(e.g., non-photochemical quenching correction, dark-count correction, and "
        "delayed-mode quality-control adjustments) was not applied; this dataset uses "
        "the unadjusted CHLA values provided in the core BGC-Argo data stream."
    ),
    "spatiotemporal_coverage_time_start": "2024-03-01T00:00:00Z",
    "spatiotemporal_coverage_time_end": "2025-11-30T23:59:59Z",
    "spatiotemporal_coverage_lat_min": "-90.0",
    "spatiotemporal_coverage_lat_max": "90.0",
    "spatiotemporal_coverage_lon_min": "-180.0",
    "spatiotemporal_coverage_lon_max": "180.0",
    "license": "Open access (Argo Data Policy); unrestricted use with attribution."
}
table = table.replace_schema_metadata(file_meta)

out_path = "data/CHLA_argo_profiles.parquet"
pq.write_table(table, out_path)

# Display metadata using pyarrow
import pyarrow.parquet as pq
out_path = "data/CHLA_argo_profiles.parquet"
t = pq.read_table(out_path)
t.schema.metadata

{b'title': b'Global Bio-Argo CHLA profile metrics (0 to 200 m, 10 m bins)',
 b'creator': b'Eli Holmes / NOAA https://orcid.org/0000-0001-9128-8393',
 b'created': b'2025-12-03T19:59:11.896859Z',
 b'source': b'BGC-Argo (via argopy). Argo (2000). Argo float data and metadata from Global Data Assembly Centre (Argo GDAC). SEANOE. https://doi.org/10.17882/42182',
 b'description': b'All BGC-Argo data Mar 2024 to Nov 2025 with CHLA variable was downloaded. Per-profile depth-binned CHLA means (0\xe2\x80\x93200 m by 10 m bins) computed for each depth bin. No QC filtering on the values was done using the CHLA_QC variable. All profiles kept even if some binned averages were missing.',
 b'profile_id_definition': b"profile_id = PLATFORM_NUMBER (7 digits) + '_' + CYCLE_NUMBER (4 digits)",
 b'PLATFORM_NUMBER_definition': b'PLATFORM_NUMBER from BGC-Argo identifying the buoy.',
 b'CYCLE_NUMBER_definition': b'CYCLE_NUMBER from BGC-Argo identifying the ascent/descent cycle.',
 b'TIME_definition': b'TIME in UTC from BGC-Argo. One time is assigned to each ascent/descent cycle.',
 b'LATITUDE_definition': b'LATITUDE from BGC-Argo. One is assigned to each ascent/descent cycle.',
 b'LONGITUDE_definition': b'LONGITUDE from BGC-Argo. One is assigned to each ascent/descent cycle.',
 b'CHLA_A_B_definition': b'Depth binned averages of CHLA. Computed as the average of all individual CHLA measurements within the pressure interval (PRES>=A and PRES<B), where PRES is dbar and signifies depth.No QC done on the CHLA data before averaging.',
 b'CHLA_A_B_N_definition': b'Number of individual CHLA measurements within the pressure interval (PRES>=A and PRES<B) filter used to compute the depth-binned mean.',
 b'variable_LATITUDE_standard_name': b'latitude',
 b'variable_LATITUDE_units': b'degrees_north',
 b'variable_LONGITUDE_standard_name': b'longitude',
 b'variable_LONGITUDE_units': b'degrees_east',
 b'variable_TIME_standard_name': b'time',
 b'variable_TIME_units': b'UTC',
 b'variable_CHLA_A_B_standard_name': b'mass_concentration_of_chlorophyll_a_in_sea_water',
 b'variable_CHLA_A_B_units': b'mg m-3',
 b'variable_CHLA_A_B_N_long_name': b'count of raw CHLA measurements in each depth bin',
 b'variable_CHLA_A_B_N_units': b'1',
 b'CHLA_processing_description': b"CHLA values were taken directly from the BGC-Argo variable 'CHLA' (mg m-3). No additional sensor corrections or non-photochemical quenching adjustments were applied. CHLA_QC values were not used to filter measurements. CHLA measurements were aggregated into 10 m pressure bins between 0 and 200 dbar using arithmetic means.",
 b'CHLA_measurement_description': b"BGC-Argo chlorophyll-a (CHLA) is measured using a submersible chlorophyll fluorometer mounted on the float. The sensor emits blue light (~470 nm) and detects the resulting chlorophyll fluorescence near ~695 nm. Fluorescence intensity is converted onboard to chlorophyll-a concentration using factory calibration coefficients and reported in mg m-3 as the raw 'CHLA' variable. Additional processing recommended by the BGC-Argo community (e.g., non-photochemical quenching correction, dark-count correction, and delayed-mode quality-control adjustments) was not applied; this dataset uses the unadjusted CHLA values provided in the core BGC-Argo data stream.",
 b'spatiotemporal_coverage_time_start': b'2024-03-01T00:00:00Z',
 b'spatiotemporal_coverage_time_end': b'2025-11-30T23:59:59Z',
 b'spatiotemporal_coverage_lat_min': b'-90.0',
 b'spatiotemporal_coverage_lat_max': b'90.0',
 b'spatiotemporal_coverage_lon_min': b'-180.0',
 b'spatiotemporal_coverage_lon_max': b'180.0',
 b'license': b'Open access (Argo Data Policy); unrestricted use with attribution.'}

Update our STAC json file#

A README.md should always be included so that you know what your data are without having to open the files and inspect the metadata. A STAC JSON file is a standard machine-readable description for spatiotemporal datasets. We will create one and then generate a human-readable README from it. This is a little extra work, but getting into the habit of creating clean datasets with good metadata makes it easier over time and makes it much easier for others to reuse our work.

# --- Custom python functions ---
import os, importlib
# Looks to see if you have the file already and if not, downloads from GitHub
if not os.path.exists("ml_utils.py"):
    !wget -q https://raw.githubusercontent.com/fish-pace/2025-tutorials/main/ml_utils.py

import ml_utils as mu
importlib.reload(mu)

<module 'ml_utils' from '/home/jovyan/2025-tutorials/ml_utils.py'>

# Create or update the STAC entry

collection_path = "data/tutorial_data_collection.json"

collection = mu.load_or_create_collection(collection_path)

chla_item_id = "global-bio-argo-chla-profile-metrics-0-200m-10m-bins"
chla_file_name = "CHLA_argo_profiles.parquet"
chla_href = f"https://raw.githubusercontent.com/fish-pace/2025-tutorials/main/data/{chla_file_name}"
notebook_href = "https://github.com/fish-pace/2025-tutorials/blob/main/argopy.ipynb"

collection = mu.add_or_update_item(
    collection,
    item_id=chla_item_id,
    asset_href=chla_href,
    title="Global Bio-Argo CHLA profile metrics (0–200 m, 10 m bins)",
    description=(
        "Per-profile CHLA metrics from all global BGC-Argo floats from Mar 2024 to Nov 2025, depth-binned from 0–200 m "
        "in 10 m bins using unadjusted CHLA values."
    ),
    start_datetime="2024-03-01T00:00:00Z",
    end_datetime="2025-11-30T23:59:59Z",
    extra_properties={
        "license": "Open access (Argo Data Policy); unrestricted use with attribution.",
        "variable": "CHLA",
        "platform": "BGC-Argo",
        "tutorial_notebook": notebook_href,
        "file_name": chla_file_name
    }
)

mu.save_collection(collection, collection_path)

mu.stac_to_readme(
    "data/tutorial_data_collection.json",
    readme_path="data/README.md",
    repo_raw_base="https://raw.githubusercontent.com/fish-pace/2025-tutorials/main"
)

README.md written to data/README.md
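
The `ml_utils` helpers are specific to this tutorial repository. As a rough picture of what a STAC Item for this file might contain, here is a hand-written sketch using only the standard `json` module. The values mirror those passed to `mu.add_or_update_item()` above, but the structure is my assumption based on the STAC Item specification and is not the actual output of the helper. The asset key `"data"` is likewise an illustrative choice.

```python
import json

# Hand-written sketch of a STAC Item; not generated by ml_utils
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "global-bio-argo-chla-profile-metrics-0-200m-10m-bins",
    # Global bounding box as [west, south, east, north]
    "bbox": [-180.0, -90.0, 180.0, 90.0],
    "geometry": {
        "type": "Polygon",
        "coordinates": [[
            [-180.0, -90.0], [180.0, -90.0], [180.0, 90.0],
            [-180.0, 90.0], [-180.0, -90.0],
        ]],
    },
    "properties": {
        # STAC allows a null datetime when a start/end range is given
        "datetime": None,
        "start_datetime": "2024-03-01T00:00:00Z",
        "end_datetime": "2025-11-30T23:59:59Z",
        "variable": "CHLA",
        "platform": "BGC-Argo",
    },
    "links": [],
    "assets": {
        "data": {
            "href": "https://raw.githubusercontent.com/fish-pace/2025-tutorials/main/data/CHLA_argo_profiles.parquet",
            "roles": ["data"],
        }
    },
}
print(json.dumps(item, indent=2))
```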

Summary#

Now we have the final dataframe, which is on GitHub. If I want to experiment with other summaries of the Argo data, I have the raw data stored as monthly NetCDF files.

# Load data from GitHub
import pandas as pd
url = "https://raw.githubusercontent.com/fish-pace/2025-tutorials/main/data/CHLA_argo_profiles.parquet"
df = pd.read_parquet(url)
df.head()

	profile_id	PLATFORM_NUMBER	CYCLE_NUMBER	TIME	LATITUDE	LONGITUDE	CHLA_0_10	CHLA_0_10_N	CHLA_10_20	CHLA_10_20_N	...	CHLA_150_160	CHLA_150_160_N	CHLA_160_170	CHLA_160_170_N	CHLA_170_180	CHLA_170_180_N	CHLA_180_190	CHLA_180_190_N	CHLA_190_200	CHLA_190_200_N
0	1902304_0155	1902304	155	2024-03-01 21:23:16.002000128	54.6582	-19.2434	0.150386	4	0.158043	5	...	0.157366	5	0.158550	5	0.154490	5	0.153305	5	0.154151	5
1	1902304_0156	1902304	156	2024-03-11 20:45:53.002000128	54.9187	-18.9609	0.155674	4	0.156858	5	...	0.131986	5	0.135201	5	0.135539	5	0.136385	5	0.134524	5
2	1902304_0157	1902304	157	2024-03-21 21:21:39.002000128	55.2967	-18.8331	0.195647	4	0.202542	5	...	0.202204	5	0.200004	5	0.198820	5	0.194590	5	0.198651	5
3	1902304_0158	1902304	158	2024-03-31 21:31:53.002000128	55.7268	-18.8653	0.190783	4	0.193575	5	...	0.155336	5	0.153982	5	0.143492	5	0.135201	5	0.143830	5
4	1902380_0079	1902380	79	2024-03-07 18:01:17.002000128	17.6665	-46.0155	0.008062	4	0.009940	5	...	0.105501	5	0.066193	5	0.053091	5	0.043657	5	0.031253	5

5 rows × 46 columns
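
Each row is one profile in wide format, with one column per depth bin. For plotting CHLA against depth it can be easier to reshape to long format. Here is a minimal sketch on a toy table with two bins. The `depth_mid` column (bin midpoint in dbar) and the name `long_df` are illustrative and not part of the file.

```python
import pandas as pd

# Toy wide table with the same naming pattern as the real file
wide = pd.DataFrame({
    "profile_id": ["1902304_0155", "1902380_0079"],
    "CHLA_0_10": [0.150, 0.008],
    "CHLA_0_10_N": [4, 4],
    "CHLA_10_20": [0.158, 0.010],
    "CHLA_10_20_N": [5, 5],
})

# Keep the bin-mean columns only (skip the _N count columns)
mean_cols = [c for c in wide.columns
             if c.startswith("CHLA_") and not c.endswith("_N")]

long_df = wide.melt(id_vars="profile_id", value_vars=mean_cols,
                    var_name="bin", value_name="CHLA")

# Parse "CHLA_A_B" into lower and upper bin edges, then take the midpoint
edges = long_df["bin"].str.extract(r"CHLA_(\d+)_(\d+)").astype(int)
long_df["depth_mid"] = (edges[0] + edges[1]) / 2
print(long_df.sort_values(["profile_id", "depth_mid"]))
```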

# Get metadata using pyarrow
import fsspec
import pyarrow.parquet as pq

url = "https://raw.githubusercontent.com/fish-pace/2025-tutorials/main/data/CHLA_argo_profiles.parquet"
with fsspec.open(url, "rb") as f:
    t = pq.read_table(f)
t.schema.metadata

{b'title': b'Global Bio-Argo CHLA profile metrics (0 to 200 m, 10 m bins)',
 b'creator': b'Eli Holmes / NOAA https://orcid.org/0000-0001-9128-8393',
 b'created': b'2025-12-03T19:59:11.896859Z',
 b'source': b'BGC-Argo (via argopy). Argo (2000). Argo float data and metadata from Global Data Assembly Centre (Argo GDAC). SEANOE. https://doi.org/10.17882/42182',
 b'description': b'All BGC-Argo data Mar 2024 to Nov 2025 with CHLA variable was downloaded. Per-profile depth-binned CHLA means (0\xe2\x80\x93200 m by 10 m bins) computed for each depth bin. No QC filtering on the values was done using the CHLA_QC variable. All profiles kept even if some binned averages were missing.',
 b'profile_id_definition': b"profile_id = PLATFORM_NUMBER (7 digits) + '_' + CYCLE_NUMBER (4 digits)",
 b'PLATFORM_NUMBER_definition': b'PLATFORM_NUMBER from BGC-Argo identifying the buoy.',
 b'CYCLE_NUMBER_definition': b'CYCLE_NUMBER from BGC-Argo identifying the ascent/descent cycle.',
 b'TIME_definition': b'TIME in UTC from BGC-Argo. One time is assigned to each ascent/descent cycle.',
 b'LATITUDE_definition': b'LATITUDE from BGC-Argo. One is assigned to each ascent/descent cycle.',
 b'LONGITUDE_definition': b'LONGITUDE from BGC-Argo. One is assigned to each ascent/descent cycle.',
 b'CHLA_A_B_definition': b'Depth binned averages of CHLA. Computed as the average of all individual CHLA measurements within the pressure interval (PRES>=A and PRES<B), where PRES is dbar and signifies depth.No QC done on the CHLA data before averaging.',
 b'CHLA_A_B_N_definition': b'Number of individual CHLA measurements within the pressure interval (PRES>=A and PRES<B) filter used to compute the depth-binned mean.',
 b'variable_LATITUDE_standard_name': b'latitude',
 b'variable_LATITUDE_units': b'degrees_north',
 b'variable_LONGITUDE_standard_name': b'longitude',
 b'variable_LONGITUDE_units': b'degrees_east',
 b'variable_TIME_standard_name': b'time',
 b'variable_TIME_units': b'UTC',
 b'variable_CHLA_A_B_standard_name': b'mass_concentration_of_chlorophyll_a_in_sea_water',
 b'variable_CHLA_A_B_units': b'mg m-3',
 b'variable_CHLA_A_B_N_long_name': b'count of raw CHLA measurements in each depth bin',
 b'variable_CHLA_A_B_N_units': b'1',
 b'CHLA_processing_description': b"CHLA values were taken directly from the BGC-Argo variable 'CHLA' (mg m-3). No additional sensor corrections or non-photochemical quenching adjustments were applied. CHLA_QC values were not used to filter measurements. CHLA measurements were aggregated into 10 m pressure bins between 0 and 200 dbar using arithmetic means.",
 b'CHLA_measurement_description': b"BGC-Argo chlorophyll-a (CHLA) is measured using a submersible chlorophyll fluorometer mounted on the float. The sensor emits blue light (~470 nm) and detects the resulting chlorophyll fluorescence near ~695 nm. Fluorescence intensity is converted onboard to chlorophyll-a concentration using factory calibration coefficients and reported in mg m-3 as the raw 'CHLA' variable. Additional processing recommended by the BGC-Argo community (e.g., non-photochemical quenching correction, dark-count correction, and delayed-mode quality-control adjustments) was not applied; this dataset uses the unadjusted CHLA values provided in the core BGC-Argo data stream.",
 b'spatiotemporal_coverage_time_start': b'2024-03-01T00:00:00Z',
 b'spatiotemporal_coverage_time_end': b'2025-11-30T23:59:59Z',
 b'spatiotemporal_coverage_lat_min': b'-90.0',
 b'spatiotemporal_coverage_lat_max': b'90.0',
 b'spatiotemporal_coverage_lon_min': b'-180.0',
 b'spatiotemporal_coverage_lon_max': b'180.0',
 b'license': b'Open access (Argo Data Policy); unrestricted use with attribution.'}
