Get Bio-ARGO data#
Author: Eli Holmes (NOAA) Last updated: November 21, 2025
Biogeochemical Argo (βBio-Argoβ or BGC-Argo) floats are autonomous profiling instruments that drift with ocean currents and periodically dive from the surface to depths of 1,000β2,000 meters, collecting a vertical profile of physical and biogeochemical properties. These floats are all over the worldβs oceans and the data are freely available via the Argo global data assembly centers (GDACs) Video on the Argo system.
Each float has a unique PLATFORM_NUMBER, and every time it dives and returns to the surface it produces a new profile, identified by its CYCLE_NUMBER. A single float may produce hundreds of profiles over several years. Bio-Argo floats carry optical, chemical, and physical sensors that measure variables such as chlorophyll-a (CHLA), temperature (TEMP), salinity (PSAL), pressure (PRES, which is used as depth), dissolved oxygen (DOXY), and nitrate (NITRATE). These data are distributed in a consistent, profile-based format: each profile (platform+cycle) contains measurements at multiple depths, along with time, latitude, longitude, and quality-control flags. To use these data for surface-matching with satellite products, we will extract the shallow measurements from each profile (pressure is less than 20dbar or 10dbar). We will use the argopy package to download Argo data from the Coriolis Argo Global Assembly Center.

Workflow#
Here is our basic workflow.
Use
argopyto fetch data for Bio-Argo profiles in a given region and timeFor each profile, save one point, the shallowest reading.
Process the globe in monthly chunks to not overload the ERDDAP server and save the monthly shallow points to a parquet file.
See the version 1 notebook for how I save the first files.
!pip install argopy
Collecting argopy
Using cached argopy-1.3.1-py3-none-any.whl.metadata (8.4 kB)
Collecting xarray>=2025.7 (from argopy)
Using cached xarray-2025.11.0-py3-none-any.whl.metadata (12 kB)
Requirement already satisfied: scipy>=1.5 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from argopy) (1.15.2)
Requirement already satisfied: numpy in /srv/conda/envs/notebook/lib/python3.12/site-packages (from argopy) (2.2.5)
Requirement already satisfied: erddapy>=2.2.1 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from argopy) (2.2.4)
Requirement already satisfied: netCDF4>=1.5.3 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from argopy) (1.7.2)
Requirement already satisfied: h5netcdf>=1.4.0 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from argopy) (1.6.1)
Collecting fsspec>=2025.5.1 (from argopy)
Using cached fsspec-2025.10.0-py3-none-any.whl.metadata (10 kB)
Requirement already satisfied: toolz>=0.8.2 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from argopy) (1.0.0)
Requirement already satisfied: requests>=2.28 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from argopy) (2.32.3)
Requirement already satisfied: aiohttp>=3.7 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from argopy) (3.11.18)
Requirement already satisfied: decorator>=5.1 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from argopy) (5.2.1)
Requirement already satisfied: packaging>=20.4 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from argopy) (25.0)
Requirement already satisfied: aiohappyeyeballs>=2.3.0 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from aiohttp>=3.7->argopy) (2.6.1)
Requirement already satisfied: aiosignal>=1.1.2 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from aiohttp>=3.7->argopy) (1.3.2)
Requirement already satisfied: attrs>=17.3.0 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from aiohttp>=3.7->argopy) (25.3.0)
Requirement already satisfied: frozenlist>=1.1.1 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from aiohttp>=3.7->argopy) (1.5.0)
Requirement already satisfied: multidict<7.0,>=4.5 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from aiohttp>=3.7->argopy) (6.4.3)
Requirement already satisfied: propcache>=0.2.0 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from aiohttp>=3.7->argopy) (0.3.1)
Requirement already satisfied: yarl<2.0,>=1.17.0 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from aiohttp>=3.7->argopy) (1.20.0)
Requirement already satisfied: idna>=2.0 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from yarl<2.0,>=1.17.0->aiohttp>=3.7->argopy) (3.10)
Requirement already satisfied: httpx>=0.25.0 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from erddapy>=2.2.1->argopy) (0.28.1)
Requirement already satisfied: pandas<3,>=0.25.2 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from erddapy>=2.2.1->argopy) (2.2.3)
Requirement already satisfied: pytz in /srv/conda/envs/notebook/lib/python3.12/site-packages (from erddapy>=2.2.1->argopy) (2025.2)
Requirement already satisfied: python-dateutil>=2.8.2 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from pandas<3,>=0.25.2->erddapy>=2.2.1->argopy) (2.8.2)
Requirement already satisfied: tzdata>=2022.7 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from pandas<3,>=0.25.2->erddapy>=2.2.1->argopy) (2025.2)
Requirement already satisfied: h5py in /srv/conda/envs/notebook/lib/python3.12/site-packages (from h5netcdf>=1.4.0->argopy) (3.13.0)
Requirement already satisfied: anyio in /srv/conda/envs/notebook/lib/python3.12/site-packages (from httpx>=0.25.0->erddapy>=2.2.1->argopy) (4.9.0)
Requirement already satisfied: certifi in /srv/conda/envs/notebook/lib/python3.12/site-packages (from httpx>=0.25.0->erddapy>=2.2.1->argopy) (2025.1.31)
Requirement already satisfied: httpcore==1.* in /srv/conda/envs/notebook/lib/python3.12/site-packages (from httpx>=0.25.0->erddapy>=2.2.1->argopy) (1.0.9)
Requirement already satisfied: h11>=0.16 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from httpcore==1.*->httpx>=0.25.0->erddapy>=2.2.1->argopy) (0.16.0)
Requirement already satisfied: cftime in /srv/conda/envs/notebook/lib/python3.12/site-packages (from netCDF4>=1.5.3->argopy) (1.6.4)
Requirement already satisfied: six>=1.5 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from python-dateutil>=2.8.2->pandas<3,>=0.25.2->erddapy>=2.2.1->argopy) (1.17.0)
Requirement already satisfied: charset_normalizer<4,>=2 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from requests>=2.28->argopy) (3.4.1)
Requirement already satisfied: urllib3<3,>=1.21.1 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from requests>=2.28->argopy) (1.26.19)
Requirement already satisfied: sniffio>=1.1 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from anyio->httpx>=0.25.0->erddapy>=2.2.1->argopy) (1.3.1)
Requirement already satisfied: typing_extensions>=4.5 in /srv/conda/envs/notebook/lib/python3.12/site-packages (from anyio->httpx>=0.25.0->erddapy>=2.2.1->argopy) (4.13.2)
Using cached argopy-1.3.1-py3-none-any.whl (892 kB)
Using cached fsspec-2025.10.0-py3-none-any.whl (200 kB)
Using cached xarray-2025.11.0-py3-none-any.whl (1.4 MB)
Installing collected packages: fsspec, xarray, argopy
Attempting uninstall: fsspec
Found existing installation: fsspec 2025.3.2
Uninstalling fsspec-2025.3.2:
Successfully uninstalled fsspec-2025.3.2
Attempting uninstall: xarrayββββββββββββββββββ 0/3 [fsspec]
Found existing installation: xarray 2025.3.1 0/3 [fsspec]
Uninstalling xarray-2025.3.1:βββββββββββ 0/3 [fsspec]
Successfully uninstalled xarray-2025.3.10m 0/3 [fsspec]
ββββββββββββββββββββββββββββββββββββββββ 3/3 [argopy]2m2/3 [argopy]
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
xmip 0.7.2 requires cf_xarray>=0.6.0, which is not installed.
xmip 0.7.2 requires xarrayutils, which is not installed.
xmip 0.7.2 requires xgcm<0.7.0, which is not installed.
dask-geopandas 0.4.3 requires dask[dataframe]>=2025.1.0, but you have dask 2024.11.0 which is incompatible.
gcsfs 2025.3.2 requires fsspec==2025.3.2, but you have fsspec 2025.10.0 which is incompatible.
s3fs 2025.3.2 requires fsspec==2025.3.2.*, but you have fsspec 2025.10.0 which is incompatible.
Successfully installed argopy-1.3.1 fsspec-2025.10.0 xarray-2025.11.0
Step 1. Get some data#
In this first step weβll grab some in-situ data from Bio-Argo floats. We use argopyβs ArgoDataFetcher function to set up a datafetcher object with the information on what dataset (ds arg), where (src arg) and what parameters. We can access data by a specific buoy (platform), cycle (one descending/ascending cycle of a float), or by region (any data in that region, depth, time). We will use region. Once we set up our datafetcher object, we request the data using a method like to_xarray() or to_dataframe() to get the data and return in a specific form. We will use to_xarray() because the processing by buoy/cycle to get the shallowest points will be easier using xarray.
The code below requests only chlorophyll (CHLA) and pressure (PRES) for March 2024 in our chosen region and returns the result as an xarray Dataset that we can later save as a parquet file.
# get some data
from argopy import DataFetcher as ArgoDataFetcher
# specify the data set and what source
fetcher = ArgoDataFetcher(
ds="bgc",
src="erddap",
params=["CHLA", "PRES"] # both these need to be present
)
# specify that we want to access data in a region; a NW Atlantic box
# lon_min, lon_max, lat_min, lat_max, depth_min, depth_max, time_min, time_max
region = [-70, -40, 20, 60, 0, 1000, "2024-03-01", "2024-04-01"]
fetcher = fetcher.region(region)
fetcher
<datafetcher.erddap>
β Name: Ifremer erddap Argo BGC data fetcher for a space/time region
πΊ Domain: [x=-70.00/-40.00; y=20.00/60.00; z=0.0/1000.0; t=2024-03-01/2024-04-01]
π API: https://erddap.ifremer.fr/erddap
π Parameters: ['CHLA', 'PRES', 'TEMP', 'PSAL']
π BGC 'must be measured' parameters: []
π User mode: standard
π’ Dataset: bgc-s
π₯ Performances: cache=False, parallel=False
# Get the data using the to_xarray() method
ds_na = fetcher.to_xarray()
ds_na
<xarray.Dataset> Size: 1MB
Dimensions: (N_POINTS: 9267)
Coordinates:
* N_POINTS (N_POINTS) int64 74kB 0 1 2 3 4 ... 9263 9264 9265 9266
LATITUDE (N_POINTS) float64 74kB 27.53 35.76 23.02 ... 39.93 26.89
LONGITUDE (N_POINTS) float64 74kB -50.32 -56.92 ... -68.26 -50.2
TIME (N_POINTS) datetime64[ns] 74kB 2024-03-01T05:16:51.00099...
Data variables: (12/21)
CHLA (N_POINTS) float32 37kB 0.0036 0.0036 ... 0.005341 0.0
CHLA_DATA_MODE (N_POINTS) <U1 37kB 'A' 'A' 'A' 'A' 'A' ... 'A' 'A' 'A' 'A'
CHLA_ERROR (N_POINTS) float32 37kB 0.0072 0.0072 0.02847 ... nan 0.0
CHLA_QC (N_POINTS) int64 74kB 1 1 5 5 5 5 5 5 5 ... 1 1 1 1 1 1 1 1
CYCLE_NUMBER (N_POINTS) int64 74kB 107 103 78 78 78 ... 21 21 21 21 110
DIRECTION (N_POINTS) <U1 37kB 'A' 'A' 'A' 'A' 'A' ... 'A' 'A' 'A' 'A'
... ...
PSAL_QC (N_POINTS) int64 74kB 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1
TEMP (N_POINTS) float32 37kB 7.259 7.285 24.82 ... 4.812 7.062
TEMP_DATA_MODE (N_POINTS) <U1 37kB 'D' 'D' 'A' 'A' 'A' ... 'R' 'R' 'R' 'D'
TEMP_ERROR (N_POINTS) float32 37kB 0.002 0.002 nan ... nan nan 0.002
TEMP_QC (N_POINTS) int64 74kB 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1
TIME_QC (N_POINTS) int64 74kB 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1
Attributes:
DATA_ID: ARGO-BGC
DOI: http://doi.org/10.17882/42182
Fetched_from: erddap.ifremer.fr
Fetched_by: jovyan
Fetched_date: 2025/11/22
Fetched_constraints: [x=-70.00/-40.00; y=20.00/60.00; z=0.0/1000.0; t=20...
Fetched_uri: https://erddap.ifremer.fr/erddap/tabledap/ArgoFloat...
Processing_history: [CHLA,PRES,PSAL,TEMP] real-time and adjusted/delaye...Plot the data#
# plot the data; includes all buoys and all depths during each cycle
lon_min, lon_max, lat_min, lat_max = region[:4]
ds = ds_na
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature
# Corners of the box (closed loop)
lons = [lon_min, lon_max, lon_max, lon_min, lon_min]
lats = [lat_min, lat_min, lat_max, lat_max, lat_min]
proj = ccrs.PlateCarree()
fig = plt.figure(figsize=(8, 6))
ax = plt.axes(projection=proj)
# Show a bit more context than just the box
ax.set_extent([lon_min - 20, lon_max + 20, lat_min - 10, lat_max + 10], crs=proj)
# Add coastlines and land
ax.coastlines(resolution="110m")
ax.add_feature(cfeature.LAND, facecolor="0.9")
ax.add_feature(cfeature.OCEAN, facecolor="white")
ax.gridlines(draw_labels=True, linestyle="--", alpha=0.5)
# Plot the bounding box
ax.plot(lons, lats, transform=proj, linewidth=2)
ax.scatter([lon_min, lon_max], [lat_min, lat_max], transform=proj)
# Add the Argo points from ds
ax.scatter(
ds["LONGITUDE"].values,
ds["LATITUDE"].values,
s=20,
marker="o",
transform=proj,
)
ax.set_title("Argo BGC region bounding box")
plt.show()
Profiles: a descending/ascending cycle#
The data for each buoy are organized into cycles which is data collected on one descending/ascending cycle. Depending how the buoy is set up to record data, it might record only on the descending or ascending part of the cycle. The data that the ocean color satellites measure is just the surface. Even if there is high CHLA in deep water, the satellite does not βseeβ that. Thus we need to filter our cycle data from just the upper layer (< 20 m or so).
import numpy as np
import matplotlib.pyplot as plt
ds = ds_na # just to keep the name short
# Unique platform numbers in the dataset
platforms = np.unique(ds["PLATFORM_NUMBER"].values)
plat = int(platforms[0])
# All cycles for that platform
cycles = ds["CYCLE_NUMBER"].where(ds["PLATFORM_NUMBER"] == plat, drop=True)
cycles = np.unique(cycles.values)
cyc = int(cycles[0])
# Select all points belonging to that profile
prof = ds.where(
(ds["PLATFORM_NUMBER"] == plat) & (ds["CYCLE_NUMBER"] == cyc),
drop=True
)
chl = prof["CHLA"]
pres = prof["PRES"]
direction = prof["DIRECTION"] # 'A' or 'D'
# Depth in meters ~ pressure in dbar
depth = pres
depth_plot = depth.clip(min=1) # avoid 0 for log scale
# Masks for ascending/descending
asc_mask = (direction == "A")
des_mask = (direction == "D")
fig, ax = plt.subplots(figsize=(4, 6))
# Plot descending profile
ax.scatter(
chl.where(des_mask),
depth_plot.where(des_mask),
label="Descending",
s=30,
)
# Plot ascending profile
ax.scatter(
chl.where(asc_mask),
depth_plot.where(asc_mask),
label="Ascending",
s=30,
marker="^",
)
# Add a shaded layer representing the satellite-sensitive layer (e.g. 0β20 m)
z_sat = 20 # meters; adjust to taste for your explanation
ax.axhspan(0, z_sat, color="lightgrey", alpha=0.5,
label="Approx. satellite-sensitive layer")
# Log scale on depth axis, surface at top
ax.set_yscale("log")
ax.invert_yaxis()
ax.set_xlabel("CHLA (mg m$^{-3}$)")
ax.set_ylabel("Log10 Depth (m, log scale)")
ax.set_title(f"Bio-Argo profile\nFloat {plat}, cycle {cyc}")
ax.grid(True, which="both", alpha=0.4)
ax.legend(loc="lower right")
plt.tight_layout()
plt.show()
import numpy as np
import matplotlib.pyplot as plt
ds = ds_na # just to keep the name short
# Unique platform numbers in the dataset
platforms = np.unique(ds["PLATFORM_NUMBER"].values)
plat = int(platforms[0])
# All cycles for that platform
cycles = ds["CYCLE_NUMBER"].where(ds["PLATFORM_NUMBER"] == plat, drop=True)
cycles = np.unique(cycles.values)
cyc = int(cycles[0])
# Select all points belonging to that profile
prof = ds.where(
(ds["PLATFORM_NUMBER"] == plat) & (ds["CYCLE_NUMBER"] == cyc),
drop=True
)
time = prof["TIME"]
pres = prof["PRES"]
# Depth in meters ~ pressure in dbar
depth = pres
depth_plot = depth.clip(min=1) # avoid 0 for log scale
fig, ax = plt.subplots(figsize=(4, 6))
# Plot profile
ax.scatter(time, depth_plot, s=30)
# Add a shaded layer representing the satellite-sensitive layer (e.g. 0β20 m)
z_sat = 20 # meters; adjust to taste for your explanation
ax.axhspan(0, z_sat, color="lightgrey", alpha=0.5,
label="Approx. satellite-sensitive layer")
# Log scale on depth axis, surface at top
ax.set_yscale("log")
ax.invert_yaxis()
ax.set_xlabel("time)")
ax.set_ylabel("Log10 Depth (m, log scale)")
ax.set_title(f"Bio-Argo profile\nFloat {plat}, cycle {cyc}")
ax.grid(True, which="both", alpha=0.4)
ax.legend(loc="lower right")
plt.tight_layout()
plt.show()
Step 2 Filter to only one point (average) from shallow values#
We want just a single point, from the < 20m depth, from each of these cycles. We will take the mean of all points in that depth. Our dataframe will have lots of points, but actually very few cycles and fewer cycles that reached the near surface. In this case we had over 9000 points for the whole of March 2024 in the NW Atlantic but only 30 points of near surface CHLA.
# How many row in our dataframe?
df.shape[0]
9267
# Convert to a flat table
df = ds_na.to_dataframe().reset_index()
# Keep only measurements shallower than 20 dbar
df_shallow = df[df["PRES"] < 20]
# Keep only high quality values
df_shallow = df_shallow[df_shallow["CHLA_QC"].isin([1, 2])]
# For each buoy+cycle group, compute surface-layer CHL as the mean CHLA
# and keep some representative metadata (lat/lon/time from the shallow layer)
df_surf = (
df_shallow
.groupby(["PLATFORM_NUMBER", "CYCLE_NUMBER"], as_index=False)
.agg(
CHLA=("CHLA", "mean"),
CHLA_ERROR=("CHLA_ERROR", "mean"),
PRES=("PRES", "mean"),
TEMP=("TEMP", "mean"),
LATITUDE=("LATITUDE", "first"),
LONGITUDE=("LONGITUDE", "first"),
TIME=("TIME", "first"),
CHLA_count=("CHLA", "count"), # how many points are averaged
)
)
print(df_surf.iloc[:5, :])
print("Number of cycles with surface-layer samples:", len(df_surf))
PLATFORM_NUMBER CYCLE_NUMBER CHLA CHLA_ERROR PRES TEMP \
0 1902383 78 0.017404 0.034807 19.889999 24.718000
1 1902383 79 0.013333 0.026666 13.980000 25.574713
2 1902383 80 0.023577 0.047154 15.850000 25.771799
3 1902384 79 0.027800 0.055601 10.969999 25.717001
4 1902384 80 0.017258 0.034517 10.051112 26.085335
LATITUDE LONGITUDE TIME CHLA_count
0 23.0209 -53.2220 2024-03-02 16:13:24.002000128 1
1 22.7279 -52.3932 2024-03-12 15:19:00.002000128 7
2 23.2649 -52.0005 2024-03-22 15:28:33.002000128 5
3 20.0154 -42.7101 2024-03-08 20:58:18.002000128 10
4 20.3018 -42.5525 2024-03-18 20:31:41.002000128 9
Number of cycles with surface-layer samples: 25
Put it together to process the globe#
Those are the 2 steps that we will do for each part of the globe and each month since March 2024. We create a function that combines these 2 steps to get one month of data for a region and then we use a for loop to work through the globe. We save the processed monthly data.
A function to get one month#
Get a month for a region. Return a dataframe of the shallow points only. The function allows us to get multiple possible variables like CHLA, DOXY, NITRATE, BBP770, TEMP and PSAL. However if we pass in multiple variables then all must be present, so to maximize the data only pass in one varible at a time unless you need paired measurements.
# get_bgc_surface() function
from pathlib import Path
import numpy as np
import pandas as pd
from argopy import DataFetcher as ArgoDataFetcher
def get_bgc_surface(reg, mon, data_dir="data", vars=None, save=False, pres_max=20):
"""
Fetch near-surface Bio-Argo data for a region and month and (optionally) save as parquet.
Parameters
----------
reg : sequence
[lon_min, lon_max, lat_min, lat_max, depth_min, depth_max]
(same as argopy region, but without time bounds).
mon : str or datetime-like
Month to fetch, e.g. "2024-03" or "2024-03-01".
data_dir : str, optional
Directory where the parquet file will be saved.
vars : list of str, optional
Bio-Argo variables to fetch and surface-average (e.g. ["CHLA", "BBP700", "DOXY"]).
Defaults to ["CHLA"].
save : bool, optional
If True, save a parquet file and return (df_surf, path).
If False, just return (df_surf, None).
pres_max : float, optional
Maximum pressure (dbar) for the surface layer (e.g. 20 -> 0β20 dbar).
Returns
-------
df_surf : pandas.DataFrame
Near-surface samples (one row per profile) with requested variables.
out_path : str or None
Path to the saved parquet file, or None if save=False.
"""
if vars is None:
vars = ["CHLA"] # default behaviour
# Required base columns in the output DataFrame
base_cols = [
"PLATFORM_NUMBER", "CYCLE_NUMBER",
"TIME", "LATITUDE", "LONGITUDE", "PRES",
]
# De-duplicate and drop any that are base columns (e.g. PRES)
extra_vars_raw = list(dict.fromkeys(vars))
extra_vars = [v for v in extra_vars_raw if v not in base_cols]
# Treat TEMP/PSAL as "extras" too if you want them in the schema
extras_plus = extra_vars + [v for v in ["TEMP", "PSAL"] if v not in extra_vars]
# Final schema for the DataFrame: base + extras
final_cols = base_cols + extras_plus
lon_min, lon_max, lat_min, lat_max, z_min, z_max = reg
# Compute start/end of the month
mon_start = pd.to_datetime(mon).to_period("M").start_time
mon_end = (mon_start + pd.offsets.MonthBegin(1))
region = [
lon_min, lon_max,
lat_min, lat_max,
z_min, z_max,
mon_start.strftime("%Y-%m-%d"),
mon_end.strftime("%Y-%m-%d"),
]
# STEP 1: Get data for region + month
# We always request PRES, plus user extras (not TEMP/PSAL unless user asked)
param_vars = list(dict.fromkeys(extra_vars + ["PRES"]))
fetcher = ArgoDataFetcher(
ds="bgc",
src="erddap",
params=param_vars,
)
try:
ds = fetcher.region(region).to_xarray()
except Exception as exc:
print(f"No data for region={reg}, month={mon}: {exc}")
# Return empty frame with full schema
return pd.DataFrame(columns=final_cols), None
# STEP 2: Filter to approximate surface data
ds_vars = set(ds.data_vars) | set(ds.coords)
# We only select columns that exist, but we *remember* full schema in final_cols
available_cols = [c for c in final_cols if c in ds_vars or c in base_cols]
# base_cols should all exist in ds/coords
missing_base_in_ds = [c for c in base_cols if c not in ds_vars]
if missing_base_in_ds:
raise ValueError(
f"Missing required base columns {missing_base_in_ds} in dataset "
f"for region={reg}, month={mon}"
)
df_all = ds[available_cols].to_dataframe().reset_index(drop=True)
if df_all.empty:
print(f"No data rows for region={reg}, month={mon}")
return pd.DataFrame(columns=final_cols), None
# Drop rows with missing PRES
if "PRES" not in df_all.columns:
raise ValueError(
f"'PRES' missing from data for region={reg}, month={mon} "
"after to_dataframe."
)
df_all = df_all.dropna(subset=["PRES"])
if df_all.empty:
print(f"No rows with valid PRES for region={reg}, month={mon}")
return pd.DataFrame(columns=final_cols), None
# Restrict to surface layer
df_shallow = df_all[df_all["PRES"] < pres_max].copy()
if df_shallow.empty:
print(f"No near-surface rows (PRES < {pres_max}) for region={reg}, month={mon}")
return pd.DataFrame(columns=final_cols), None
# Ensure all base columns are present before aggregating
missing_base_shallow = [c for c in base_cols if c not in df_shallow.columns]
if missing_base_shallow:
raise ValueError(
f"Missing required base columns {missing_base_shallow} in shallow data "
f"for region={reg}, month={mon}"
)
# STEP 2a: Base per-profile aggregation
agg_dict = {
"PRES": ("PRES", "mean"),
"LATITUDE": ("LATITUDE", "first"),
"LONGITUDE": ("LONGITUDE", "first"),
"TIME": ("TIME", "first"),
}
df_surf = (
df_shallow
.groupby(["PLATFORM_NUMBER", "CYCLE_NUMBER"], as_index=False)
.agg(**agg_dict)
)
# STEP 2b: per-variable surface means (using QC if available)
for v in extras_plus:
if v not in df_shallow.columns:
# We will add it later as NaN for schema consistency
continue
df_var = df_shallow.copy()
qc_col = f"{v}_QC"
# If a QC column exists, filter by good values
if qc_col in df_var.columns:
df_var = df_var[df_var[qc_col].isin([1, 2])]
if df_var.empty:
# no good data for this variable; we'll add NaNs later
continue
df_surf_var = (
df_var
.groupby(["PLATFORM_NUMBER", "CYCLE_NUMBER"], as_index=False)
.agg(**{v: (v, "mean")})
)
df_surf = df_surf.merge(
df_surf_var,
on=["PLATFORM_NUMBER", "CYCLE_NUMBER"],
how="left",
)
# STEP 2c: ensure all extras exist as columns (NaN if no data)
for v in extras_plus:
if v not in df_surf.columns:
df_surf[v] = np.nan
# Sanity check: all base_cols must exist now
missing_base_final = [c for c in base_cols if c not in df_surf.columns]
if missing_base_final:
raise ValueError(
f"After aggregation, missing base columns {missing_base_final} in df_surf "
f"for region={reg}, month={mon}"
)
# Reorder columns to match target schema exactly
df_surf = df_surf.reindex(columns=final_cols)
# STEP 3: Save to parquet in data/ dir (optional)
data_path = Path(data_dir)
data_path.mkdir(parents=True, exist_ok=True)
out_fname = (
f"argo_bgc_{lat_min}_{lat_max}_{lon_min}_{lon_max}_"
f"{mon_start.strftime('%Y%m')}.parquet"
)
out_path = data_path / out_fname
if save:
df_surf.to_parquet(out_path, index=False)
return df_surf, str(out_path)
return df_surf, None
%%time
# Example
params = ["CHLA", "BBP700", "DOXY"]
# [lon_min, lon_max, lat_min, lat_max, depth_min, depth_max]
region = [-70, -10, 0, 60, 0, 100]
month = "2024-03"
df_surf, _ = get_bgc_surface(region, month, vars=params)
df_surf.head()
| PLATFORM_NUMBER | CYCLE_NUMBER | TIME | LATITUDE | LONGITUDE | PRES | CHLA | BBP700 | DOXY | TEMP | PSAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1902304 | 155 | 2024-03-01 21:23:16.002000128 | 54.6582 | -19.2434 | 10.380000 | 0.154640 | 0.000414 | 258.693817 | 10.524889 | 35.415699 |
| 1 | 1902304 | 156 | 2024-03-11 20:45:53.002000128 | 54.9187 | -18.9609 | 10.461111 | 0.156332 | 0.000427 | 259.256744 | 10.519778 | 35.414875 |
| 2 | 1902304 | 157 | 2024-03-21 21:21:39.002000128 | 55.2967 | -18.8331 | 10.148889 | 0.199478 | 0.000464 | 261.829529 | 10.450444 | 35.413017 |
| 3 | 1902304 | 158 | 2024-03-31 21:31:53.002000128 | 55.7268 | -18.8653 | 10.332222 | 0.192334 | 0.000493 | 261.111633 | 10.431667 | 35.407547 |
| 4 | 1902380 | 79 | 2024-03-07 18:01:17.002000128 | 17.6665 | -46.0155 | 10.903333 | 0.009105 | 0.000473 | 206.667984 | 26.683889 | 36.246162 |
A for loop to work through the whole globe#
Loop through the globe and save monthly files. We will do this for all months when there is PACE data. This will take about four hours but we only have to do it once. The %%script false is added to prevent accidentally run the cell.
%%script false --no-raise-error
# comment out the above to run; this just prevent inadvertant running
# since this loops through the whole globe
import os
from pathlib import Path
import numpy as np
import pandas as pd
# Variables to get if avail. NaN if missing
BGC_VARS = ["BBP700"]
out_dir = Path("data/bbp700")
out_dir.mkdir(parents=True, exist_ok=True)
# Months from 2024-03 up to the current month
start_month = "2024-03"
end_month = pd.Timestamp.today().to_period("M")
months = pd.period_range(start_month, end_month, freq="M")
for mon in months:
month_str = mon.strftime("%Y-%m")
out_path = out_dir / f"argo_bgc_global_{month_str}.parquet"
# Skip if we already have this month (so reruns don't redo everything)
if out_path.exists():
print(f"Skipping {month_str}, already have {out_path}")
continue
print(f"\n=== Processing month {month_str} ===")
dfs = [] # collect all boxes for this month
# Latitude: 45Β° bands from -90 to 90 -> 4 bands
for lat_min in range(-90, 90, 45):
lat_max = lat_min + 45
# Longitude: 60Β° bands from -180 to 180 -> 6 bands
for lon_min in range(-180, 180, 60):
lon_max = lon_min + 60
region = [lon_min, lon_max, lat_min, lat_max, 0, 100]
df_box, _ = get_bgc_surface(region, month_str, vars=BGC_VARS, save=False, pres_max=10)
if df_box is None or df_box.empty: continue
dfs.append(df_box)
# Desired column order (core metadata + all BGC vars)
base_cols = [
"TIME", "LATITUDE", "LONGITUDE",
"PLATFORM_NUMBER", "CYCLE_NUMBER", "PRES",
]
cols = base_cols + BGC_VARS + ["TEMP", "PSAL"]
if not dfs:
print(f"No data at all for month {month_str}, writing empty file")
df_month = pd.DataFrame(columns=cols)
else:
df_month = pd.concat(dfs, ignore_index=True)
# Make sure all expected columns exist; if missing, add as NaN
for c in cols:
if c not in df_month.columns:
df_month[c] = np.nan
# Now safely reorder columns
df_month = df_month[cols]
# Save one file per month
df_month.to_parquet(out_path, index=False)
print(f"Saved {len(df_month)} rows for {month_str} to {out_path}")
Merge all together into one parquet#
# example of one file
from pathlib import Path
import numpy as np
import pandas as pd
var_dir = Path("data/chla")
files = sorted(var_dir.glob("argo_bgc_global_*.parquet"))
df = pd.read_parquet(files[0])
df
| TIME | LATITUDE | LONGITUDE | PLATFORM_NUMBER | CYCLE_NUMBER | PRES | CHLA | TEMP | PSAL | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024-03-01 15:11:04.002000128 | -49.643800 | -154.520400 | 5906311 | 115 | 11.020000 | 1.185197 | 13.093500 | 34.594406 |
| 1 | 2024-03-11 15:26:06.002000128 | -49.911800 | -153.935800 | 5906311 | 116 | 10.221111 | 1.443537 | 13.574444 | 34.602802 |
| 2 | 2024-03-21 14:29:38.002000128 | -49.907700 | -152.864700 | 5906311 | 117 | 10.880000 | 1.363146 | 13.706301 | 34.503258 |
| 3 | 2024-03-31 14:47:29.002000128 | -49.520700 | -151.631500 | 5906311 | 118 | 10.960001 | 1.156802 | 13.522700 | 34.376629 |
| 4 | 2024-03-02 07:20:27.000999936 | -50.550600 | -126.866200 | 5906444 | 47 | 10.572223 | 0.478140 | 10.031666 | 34.216404 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 488 | 2024-03-05 11:32:39.019000064 | 69.673916 | 4.204859 | 7901028 | 55 | 1.020000 | 0.021900 | 5.909500 | 35.102501 |
| 489 | 2024-03-11 05:30:41.015000064 | 69.274508 | 2.537495 | 7901028 | 56 | 1.256754 | 0.135456 | 6.698943 | 35.136322 |
| 490 | 2024-03-16 11:32:39.024000000 | 68.702584 | 3.804852 | 7901028 | 57 | 1.214967 | 0.153300 | 5.696755 | 35.107090 |
| 491 | 2024-03-21 17:40:41.012999936 | 68.602906 | 6.045098 | 7901028 | 58 | 1.500066 | 0.222650 | 5.852717 | 35.102001 |
| 492 | 2024-03-27 11:30:40.017999872 | 68.507574 | 6.783070 | 7901028 | 59 | 2.006647 | 0.156950 | 6.182973 | 35.116955 |
493 rows Γ 9 columns
# function to process the monthly files
from pathlib import Path
import numpy as np
import pandas as pd
def merge_bgc_monthlies(var_dir, out_path, var_name):
"""
Merge monthly Bio-Argo parquet files for a single variable (e.g., CHLA or BBP700).
Parameters
----------
var_dir : str or Path
Directory containing monthly parquet files like 'argo_bgc_global_YYYY-MM.parquet'.
out_path : str or Path
Output parquet path for the merged dataset.
var_name : str
Name of the BGC variable column in the files (e.g., 'CHLA', 'BBP700').
"""
var_dir = Path(var_dir)
files = sorted(var_dir.glob("argo_bgc_global_*.parquet"))
if not files:
raise FileNotFoundError(f"No parquet files found in {var_dir}")
dfs = []
for f in files:
print(f"Reading {f}")
df = pd.read_parquet(f)
dfs.append(df)
# Concatenate all months
df_all = pd.concat(dfs, ignore_index=True)
# Core columns you want
base_cols = [
"PLATFORM_NUMBER", "CYCLE_NUMBER",
"TIME", "LATITUDE", "LONGITUDE", "PRES",
]
# TEMP and PSAL if present, otherwise weβll add as NaN
extra_cols = ["TEMP", "PSAL", var_name]
# Ensure all expected columns exist; They should be there, but just in case
for c in base_cols + extra_cols:
if c not in df_all.columns:
df_all[c] = np.nan
# Final column order
cols = base_cols + extra_cols
df_all = df_all[cols]
# Save merged parquet
out_path = Path(out_path)
out_path.parent.mkdir(parents=True, exist_ok=True)
df_all.to_parquet(out_path, index=False)
print(f"Saved merged dataset with {len(df_all)} rows to {out_path}")
%%script false --no-raise-error
# %%script to prevent accidentally rerunning this code
outfile = "tutorial_data/argo_bgc_global_surface_BBP700.parquet"
merge_bgc_monthlies("data/bbp700", outfile, "BBP700")
outfile = "tutorial_data/argo_bgc_global_surface_CHLA.parquet"
merge_bgc_monthlies("data/chla", outfile, "CHLA")
Reading data/bbp700/argo_bgc_global_2024-03.parquet
Reading data/bbp700/argo_bgc_global_2024-04.parquet
Reading data/bbp700/argo_bgc_global_2024-05.parquet
Reading data/bbp700/argo_bgc_global_2024-06.parquet
Reading data/bbp700/argo_bgc_global_2024-07.parquet
Reading data/bbp700/argo_bgc_global_2024-08.parquet
Reading data/bbp700/argo_bgc_global_2024-09.parquet
Reading data/bbp700/argo_bgc_global_2024-10.parquet
Reading data/bbp700/argo_bgc_global_2024-11.parquet
Reading data/bbp700/argo_bgc_global_2024-12.parquet
Reading data/bbp700/argo_bgc_global_2025-01.parquet
Reading data/bbp700/argo_bgc_global_2025-02.parquet
Reading data/bbp700/argo_bgc_global_2025-03.parquet
Reading data/bbp700/argo_bgc_global_2025-04.parquet
Reading data/bbp700/argo_bgc_global_2025-05.parquet
Reading data/bbp700/argo_bgc_global_2025-06.parquet
Reading data/bbp700/argo_bgc_global_2025-07.parquet
Reading data/bbp700/argo_bgc_global_2025-08.parquet
Reading data/bbp700/argo_bgc_global_2025-09.parquet
Reading data/bbp700/argo_bgc_global_2025-10.parquet
Reading data/bbp700/argo_bgc_global_2025-11.parquet
Saved merged dataset with 33067 rows to tutorial_data/argo_bgc_global_surface_BBP700.parquet
Reading data/chla/argo_bgc_global_2024-03.parquet
Reading data/chla/argo_bgc_global_2024-04.parquet
Reading data/chla/argo_bgc_global_2024-05.parquet
Reading data/chla/argo_bgc_global_2024-06.parquet
Reading data/chla/argo_bgc_global_2024-07.parquet
Reading data/chla/argo_bgc_global_2024-08.parquet
Reading data/chla/argo_bgc_global_2024-09.parquet
Reading data/chla/argo_bgc_global_2024-10.parquet
Reading data/chla/argo_bgc_global_2024-11.parquet
Reading data/chla/argo_bgc_global_2024-12.parquet
Reading data/chla/argo_bgc_global_2025-01.parquet
Reading data/chla/argo_bgc_global_2025-02.parquet
Reading data/chla/argo_bgc_global_2025-03.parquet
Reading data/chla/argo_bgc_global_2025-04.parquet
Reading data/chla/argo_bgc_global_2025-05.parquet
Reading data/chla/argo_bgc_global_2025-06.parquet
Reading data/chla/argo_bgc_global_2025-07.parquet
Reading data/chla/argo_bgc_global_2025-08.parquet
Reading data/chla/argo_bgc_global_2025-09.parquet
Reading data/chla/argo_bgc_global_2025-10.parquet
Reading data/chla/argo_bgc_global_2025-11.parquet
Saved merged dataset with 13375 rows to tutorial_data/argo_bgc_global_surface_CHLA.parquet
Load data#
If you have the data file on your computer, you load as below. However the parquet files are in the tutorials data directory in our GitHub repository and you can load from that.
%%script false --no-raise-error
# Load from local file
from pathlib import Path
import numpy as np
import pandas as pd
file = "tutorial_data/argo_bgc_global_surface_CHLA.parquet"
df = pd.read_parquet(file)
df.shape
# Load data from GitHub
import pandas as pd
url = "https://raw.githubusercontent.com/fish-pace/2025-tutorials/main/tutorial_data/argo_bgc_global_surface_CHLA.parquet"
df = pd.read_parquet(url)
df.head()
Plot the data#
Some regions are better represented that others.
# plot the data
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature
proj = ccrs.PlateCarree()
fig = plt.figure(figsize=(8, 6))
ax = plt.axes(projection=proj)
# Add coastlines and land
ax.coastlines(resolution="110m")
ax.add_feature(cfeature.LAND, facecolor="0.9")
ax.add_feature(cfeature.OCEAN, facecolor="white")
ax.gridlines(draw_labels=True, linestyle="--", alpha=0.5)
# Add the Argo points from df
ax.scatter(
df["LONGITUDE"].values,
df["LATITUDE"].values,
s=1,
marker=".",
transform=proj,
)
ax.set_title("Argo CHLA data")
plt.show()
Summary#
Now we have in-situ CHLA and BBP700 data. For our machine-learning, we need to match these points to PACE data and extract the PACE variables for each lat/lon/time point.