Quickstart
point-collocation matches each lat/lon point to the pixel center closest to it, using a distance metric. For time, a buffer of 0 means the time of the point must fall within the time range of the file, while a buffer like buffer="1D" also accepts files within 1 day of the point. A buffer helps with L2 files, which cover short windows (minutes), or with collections whose files are infrequent.
- Create a plan for files to use: pc.plan()
- Print the plan to check it: plan.summary()
- Get matchups for variables: pc.matchup(plan, variables=['var'])
Note: In a virtual machine in AWS us-west-2, where the NASA cloud data lives, the point matchups are fast. In Colab, for example, your compute is in neither the same region nor the same provider as the data, and the same matchups can take 10x longer.
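The nearest-pixel selection described above reduces, on a regular lat/lon grid, to an independent argmin along each axis. A minimal numpy sketch of the idea (illustrative only; point-collocation's actual lookup and distance metric may differ):

```python
import numpy as np

# Hypothetical pixel-center coordinates of a small regular grid.
lats = np.array([27.0, 27.1, 27.2, 27.3, 27.4])
lons = np.array([-82.9, -82.8, -82.7, -82.6])

def nearest_pixel(lat, lon):
    """Indices of the pixel center closest to (lat, lon) on a regular grid."""
    i = np.abs(lats - lat).argmin()
    j = np.abs(lons - lon).argmin()
    return i, j

i, j = nearest_pixel(27.3835, -82.7375)
print(lats[i], lons[j])  # → 27.4 -82.7
```

On a regular grid the per-axis argmin picks the same pixel as a full 2-D distance search; for curvilinear swath grids a true distance metric over both coordinates is needed.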
Prerequisites
The examples here use NASA EarthData, so you need an EarthData account. Make sure you can log in (for example with earthaccess.login(), which returns the Auth object shown below).
<earthaccess.auth.Auth at 0x7f6b2c60c920>
Get some points to match up
point-collocation assumes that the point locations in your dataframe are lat/lon. It looks for lat, latitude, Latitude, or LATITUDE for the y-axis and lon, longitude, Longitude, or LONGITUDE for the x-axis. For time, it looks for time or date. You can tell it otherwise if needed; see the section on coord_spec.
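A rough sketch of that name detection with pandas (illustrative; not the package's actual code):

```python
import pandas as pd

# Candidate column names, in the order described above.
Y_NAMES = ["lat", "latitude", "Latitude", "LATITUDE"]
X_NAMES = ["lon", "longitude", "Longitude", "LONGITUDE"]
T_NAMES = ["time", "date"]

def detect_coords(df):
    """Return the (y, x, time) column names found in df, or None for a miss."""
    pick = lambda names: next((c for c in names if c in df.columns), None)
    return pick(Y_NAMES), pick(X_NAMES), pick(T_NAMES)

df = pd.DataFrame({"Latitude": [27.4], "lon": [-82.7], "date": ["2024-06-13"]})
print(detect_coords(df))  # → ('Latitude', 'lon', 'date')
```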
import pandas as pd
url = (
"https://raw.githubusercontent.com/"
"fish-pace/point-collocation/main/"
"examples/fixtures/points.csv"
)
df_points = pd.read_csv(url)
print(len(df_points))
df_points.head()
595
| | lat | lon | date |
|---|---|---|---|
| 0 | 27.3835 | -82.7375 | 2024-06-13 |
| 1 | 27.1190 | -82.7125 | 2024-06-14 |
| 2 | 26.9435 | -82.8170 | 2024-06-14 |
| 3 | 26.6875 | -82.8065 | 2024-06-14 |
| 4 | 26.6675 | -82.6455 | 2024-06-14 |
Start a plan -- Take a look at the files in a collection
Now we use the point_collocation package. First we will look at the files available and figure out which ones we want. For the earthaccess data_source, it uses CMR metadata to find the granules that overlap with the lat/lon points in the dataframe.
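To picture the spatial side of that query: the points imply a bounding box that a CMR search can use. A sketch with a couple of hypothetical points, using the (west, south, east, north) ordering earthaccess expects (whether point-collocation queries exactly this way is an assumption):

```python
import pandas as pd

# A few hypothetical points.
df_points = pd.DataFrame(
    {"lat": [27.3835, 26.6675], "lon": [-82.7375, -82.6455]}
)

# Bounding box in (west, south, east, north) order.
bbox = (
    float(df_points["lon"].min()),
    float(df_points["lat"].min()),
    float(df_points["lon"].max()),
    float(df_points["lat"].max()),
)
print(bbox)  # → (-82.7375, 26.6675, -82.6455, 27.3835)
```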
%%time
import point_collocation as pc
plan = pc.plan(
df_points,
data_source="earthaccess",
source_kwargs={
"short_name": "PACE_OCI_L3M_RRS",
}
)
CPU times: user 634 ms, sys: 88.2 ms, total: 722 ms
Wall time: 2.23 s
Plan: 595 points → 210 unique granule(s)
Points with 0 matches : 0
Points with >1 matches: 595
Time buffer: 0 days 00:00:00
First 1 point(s):
[0] lat=27.3835, lon=-82.7375, time=2024-06-13 12:00:00: 16 match(es)
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240321_20240620.L3m.SNSP.RRS.V3_1.Rrs.0p1deg.nc
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240321_20240620.L3m.SNSP.RRS.V3_1.Rrs.4km.nc
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240516_20240616.L3m.R32.RRS.V3_1.Rrs.0p1deg.nc
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240516_20240616.L3m.R32.RRS.V3_1.Rrs.4km.nc
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240524_20240624.L3m.R32.RRS.V3_1.Rrs.0p1deg.nc
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240524_20240624.L3m.R32.RRS.V3_1.Rrs.4km.nc
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240601_20240630.L3m.MO.RRS.V3_1.Rrs.0p1deg.nc
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240601_20240630.L3m.MO.RRS.V3_1.Rrs.4km.nc
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240601_20240702.L3m.R32.RRS.V3_1.Rrs.4km.nc
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240601_20240702.L3m.R32.RRS.V3_1.Rrs.0p1deg.nc
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240609_20240616.L3m.8D.RRS.V3_1.Rrs.0p1deg.nc
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240609_20240616.L3m.8D.RRS.V3_1.Rrs.4km.nc
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240609_20240710.L3m.R32.RRS.V3_1.Rrs.0p1deg.nc
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240609_20240710.L3m.R32.RRS.V3_1.Rrs.4km.nc
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240613.L3m.DAY.RRS.V3_1.Rrs.0p1deg.nc
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240613.L3m.DAY.RRS.V3_1.Rrs.4km.nc
Create a new plan with a filter on file names
We will use the monthly 4km files.
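The granule_name filter uses shell-style wildcards. Assuming fnmatch-like semantics (an assumption about the filter's behavior), you can sanity-check a pattern against candidate file names before planning:

```python
from fnmatch import fnmatch

pattern = "*.MO.*.4km.*"

# File names from the unfiltered plan above.
names = [
    "PACE_OCI.20240601_20240630.L3m.MO.RRS.V3_1.Rrs.4km.nc",
    "PACE_OCI.20240601_20240630.L3m.MO.RRS.V3_1.Rrs.0p1deg.nc",
    "PACE_OCI.20240613.L3m.DAY.RRS.V3_1.Rrs.4km.nc",
]

# Only the monthly (MO) 4km file should match.
for name in names:
    print(name, fnmatch(name, pattern))
```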
%%time
import point_collocation as pc
plan = pc.plan(
df_points,
data_source="earthaccess",
source_kwargs={
"short_name": "PACE_OCI_L3M_RRS",
"granule_name": "*.MO.*.4km.*",
}
)
CPU times: user 44.5 ms, sys: 10.9 ms, total: 55.4 ms
Wall time: 9.68 s
# check the plan and see how many files per point
# we want 1 file per point in this case
# Looks like 4 monthly files
plan.summary()
Plan: 595 points → 4 unique granule(s)
Points with 0 matches : 0
Points with >1 matches: 0
Time buffer: 0 days 00:00:00
First 5 point(s):
[0] lat=27.3835, lon=-82.7375, time=2024-06-13 12:00:00: 1 match(es)
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240601_20240630.L3m.MO.RRS.V3_1.Rrs.4km.nc
[1] lat=27.1190, lon=-82.7125, time=2024-06-14 12:00:00: 1 match(es)
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240601_20240630.L3m.MO.RRS.V3_1.Rrs.4km.nc
[2] lat=26.9435, lon=-82.8170, time=2024-06-14 12:00:00: 1 match(es)
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240601_20240630.L3m.MO.RRS.V3_1.Rrs.4km.nc
[3] lat=26.6875, lon=-82.8065, time=2024-06-14 12:00:00: 1 match(es)
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240601_20240630.L3m.MO.RRS.V3_1.Rrs.4km.nc
[4] lat=26.6675, lon=-82.6455, time=2024-06-14 12:00:00: 1 match(es)
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240601_20240630.L3m.MO.RRS.V3_1.Rrs.4km.nc
Check the variables in the files
This will open one file and show us the variables. We want 'Rrs' in this case.
open_method: {'xarray_open': 'dataset', 'open_kwargs': {'chunks': {}, 'engine': 'h5netcdf', 'decode_timedelta': False}, 'coords': 'auto', 'set_coords': True, 'dim_renames': None, 'auto_align_phony_dims': None, 'merge': None}
Dimensions: {'lat': 4320, 'lon': 8640, 'wavelength': 172, 'rgb': 3, 'eightbitcolor': 256}
Variables: ['Rrs', 'palette']
Geolocation: ('lon', 'lat') — lon dims=('lon',), lat dims=('lat',)
Get the matchups using our plan
Let's start with 100 points, since 595 might take a while. The lat, lon, and time of the matching granule are added as columns. pc_id is the point id/row from the data you are matching; it is added in case there are multiple granules (files) per data point.
CPU times: user 12.3 s, sys: 992 ms, total: 13.3 s
Wall time: 18.8 s
| | lat | lon | time | pc_id | granule_id | granule_time | granule_lat | granule_lon | Rrs_346 | Rrs_348 | ... | Rrs_706 | Rrs_707 | Rrs_708 | Rrs_709 | Rrs_711 | Rrs_712 | Rrs_713 | Rrs_714 | Rrs_717 | Rrs_719 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 27.3835 | -82.7375 | 2024-06-13 12:00:00 | 0 | https://obdaac-tea.earthdatacloud.nasa.gov/ob-... | 2024-06-15 23:59:59+00:00 | 27.395832 | -82.729164 | 0.004034 | 0.004070 | ... | 0.000224 | 0.000202 | 0.000190 | 0.000176 | 0.000168 | 0.000156 | 0.000144 | 0.000134 | 0.000158 | 0.000202 |
| 1 | 27.1190 | -82.7125 | 2024-06-14 12:00:00 | 1 | https://obdaac-tea.earthdatacloud.nasa.gov/ob-... | 2024-06-15 23:59:59+00:00 | 27.104164 | -82.729164 | 0.004562 | 0.004616 | ... | 0.000108 | 0.000094 | 0.000084 | 0.000078 | 0.000072 | 0.000066 | 0.000060 | 0.000048 | 0.000062 | 0.000098 |
| 2 | 26.9435 | -82.8170 | 2024-06-14 12:00:00 | 2 | https://obdaac-tea.earthdatacloud.nasa.gov/ob-... | 2024-06-15 23:59:59+00:00 | 26.937498 | -82.812500 | 0.005112 | 0.005282 | ... | 0.000118 | 0.000108 | 0.000102 | 0.000098 | 0.000098 | 0.000092 | 0.000086 | 0.000068 | 0.000052 | 0.000066 |
| 3 | 26.6875 | -82.8065 | 2024-06-14 12:00:00 | 3 | https://obdaac-tea.earthdatacloud.nasa.gov/ob-... | 2024-06-15 23:59:59+00:00 | 26.687498 | -82.812500 | 0.004648 | 0.004904 | ... | 0.000178 | 0.000158 | 0.000148 | 0.000138 | 0.000130 | 0.000126 | 0.000126 | 0.000120 | 0.000158 | 0.000230 |
| 4 | 26.6675 | -82.6455 | 2024-06-14 12:00:00 | 4 | https://obdaac-tea.earthdatacloud.nasa.gov/ob-... | 2024-06-15 23:59:59+00:00 | 26.687498 | -82.645828 | 0.004944 | 0.005064 | ... | 0.000094 | 0.000078 | 0.000068 | 0.000062 | 0.000058 | 0.000054 | 0.000052 | 0.000050 | 0.000106 | 0.000166 |
5 rows × 180 columns
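Because pc_id ties each result row back to a source point, ordinary pandas group-bys can count or aggregate the granules per point. A sketch on a small hypothetical result frame (column values are made up):

```python
import pandas as pd

# Hypothetical matchup result: point 0 matched two granules, point 1 matched one.
matchups = pd.DataFrame(
    {
        "pc_id": [0, 0, 1],
        "granule_id": ["granule_a", "granule_b", "granule_c"],
        "Rrs_443": [0.0040, 0.0045, 0.0031],
    }
)

# Granules per point.
counts = matchups.groupby("pc_id").size()
print(counts.tolist())  # → [2, 1]

# Average over granules when a point matched more than one file.
mean_rrs = matchups.groupby("pc_id")["Rrs_443"].mean()
```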
Open files in plan
Sometimes it is helpful to look at the granules, and there are helper functions for that. You need to specify the format of the data: "grid" for Level 3 gridded or "swath" for Level 2 swath data.
<xarray.Dataset> Size: 26GB
Dimensions: (lat: 4320, lon: 8640, wavelength: 172, rgb: 3,
eightbitcolor: 256)
Coordinates:
* lat (lat) float32 17kB 89.98 89.94 89.9 ... -89.9 -89.94 -89.98
* lon (lon) float32 35kB -180.0 -179.9 -179.9 ... 179.9 179.9 180.0
* wavelength (wavelength) float64 1kB 346.0 348.0 351.0 ... 714.0 717.0 719.0
Dimensions without coordinates: rgb, eightbitcolor
Data variables:
Rrs (lat, lon, wavelength) float32 26GB dask.array<chunksize=(16, 1024, 8), meta=np.ndarray>
palette (rgb, eightbitcolor) uint8 768B dask.array<chunksize=(3, 256), meta=np.ndarray>
Attributes: (12/64)
product_name: PACE_OCI.20240601_20240630.L3m.MO.RRS....
instrument: OCI
title: OCI Level-3 Standard Mapped Image
project: Ocean Biology Processing Group (NASA/G...
platform: PACE
source: satellite observations from OCI-PACE
... ...
identifier_product_doi: 10.5067/PACE/OCI/L3M/RRS/3.1
keywords: Earth Science > Oceans > Ocean Optics ...
keywords_vocabulary: NASA Global Change Master Directory (G...
data_bins: 16464585
data_minimum: -0.009998
data_maximum: 0.09856601
If you need to specify the coordinate names
You don't control the names of the coordinates in the source granules. If point-collocation cannot figure out the coordinates, you will have to pass coord_spec to pc.matchup() or plan.open_dataset(). The full coord_spec dictionary looks like this.
coord_spec = {
"coordinate_system": "geographic", # default; y=lat(deg), x=lon(deg); other option will be "planar" but that is not built yet
"y": {"source": "auto", "points": "auto"},
"x": {"source": "auto", "points": "auto"},
"time": {"source": "auto", "points": "auto"},
# Additional coords like depth
}
Edit with the actual coordinate names. If there are additional coordinates in the dataset that you want to match on, you can add them; the data variable you are matching must have that coordinate, e.g. be (lat, lon, z). source is what the coordinate is called in the granule file and points is what it is called in your dataframe.
coord_spec = {
"coordinate_system": "geographic", # default; y=lat(deg), x=lon(deg); other option will be "planar" but that is not built yet
"y": {"source": "grid_lat", "points": "lat"},
"x": {"source": "grid_lon", "points": "lon"},
"time": {"source": "time", "points": "TOD"},
"depth": {"source": "z", "points": "depth"}, # optional
}
You can also pass lat/lon coordinate names to open_method().