Quickstart
point-collocation matches each lat/lon point to the pixel center closest to it, using a distance metric. For time, a buffer of 0 means the time of the point must fall within the time range of the file, while a buffer like buffer="1D" also accepts files within 1 day of the point. A buffer helps with L2 files, which cover short windows (minutes), or with collections whose files are infrequent.
- Create a plan for files to use: pc.plan()
- Print the plan to check it: plan.summary()
- Get matchups for variables: pc.matchup(plan, variables=['var'])
Note: In a virtual machine in AWS us-west-2, where the NASA cloud data lives, the point matchups are fast. In Colab, for example, your compute is in neither the same region nor the same provider as the data, and the same matchups can take 10x longer.
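The nearest-pixel selection described above reduces, on a regular lat/lon grid, to an independent argmin along each axis. A minimal numpy sketch of the idea (illustrative only; point-collocation's actual lookup and distance metric may differ):

```python
import numpy as np

# Hypothetical pixel-center coordinates of a small regular grid.
lats = np.array([27.0, 27.1, 27.2, 27.3, 27.4])
lons = np.array([-82.9, -82.8, -82.7, -82.6])

def nearest_pixel(lat, lon):
    """Indices of the pixel center closest to (lat, lon) on a regular grid."""
    i = np.abs(lats - lat).argmin()
    j = np.abs(lons - lon).argmin()
    return i, j

i, j = nearest_pixel(27.3835, -82.7375)
print(lats[i], lons[j])  # → 27.4 -82.7
```

On a regular grid the per-axis argmin picks the same pixel as a full 2-D distance search; for curvilinear swath grids a true distance metric over both coordinates is needed.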
Prerequisites
The examples here use NASA EarthData, so you need an EarthData account. Make sure you can log in (for example with earthaccess.login(), which returns the Auth object shown below).
<earthaccess.auth.Auth at 0x7f6b2c60c920>
Get some points to match up
point-collocation assumes that the point locations in your dataframe are lat/lon. It looks for lat, latitude, Latitude, or LATITUDE for the y-axis and lon, longitude, Longitude, or LONGITUDE for the x-axis. For time, it looks for time or date. You can tell it otherwise if needed; see the section on coord_spec.
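A rough sketch of that name detection with pandas (illustrative; not the package's actual code):

```python
import pandas as pd

# Candidate column names, in the order described above.
Y_NAMES = ["lat", "latitude", "Latitude", "LATITUDE"]
X_NAMES = ["lon", "longitude", "Longitude", "LONGITUDE"]
T_NAMES = ["time", "date"]

def detect_coords(df):
    """Return the (y, x, time) column names found in df, or None for a miss."""
    pick = lambda names: next((c for c in names if c in df.columns), None)
    return pick(Y_NAMES), pick(X_NAMES), pick(T_NAMES)

df = pd.DataFrame({"Latitude": [27.4], "lon": [-82.7], "date": ["2024-06-13"]})
print(detect_coords(df))  # → ('Latitude', 'lon', 'date')
```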
import pandas as pd
url = (
"https://raw.githubusercontent.com/"
"fish-pace/point-collocation/main/"
"examples/fixtures/points.csv"
)
df_points = pd.read_csv(url)
print(len(df_points))
df_points.head()
595
| | lat | lon | date |
|---|---|---|---|
| 0 | 27.3835 | -82.7375 | 2024-06-13 |
| 1 | 27.1190 | -82.7125 | 2024-06-14 |
| 2 | 26.9435 | -82.8170 | 2024-06-14 |
| 3 | 26.6875 | -82.8065 | 2024-06-14 |
| 4 | 26.6675 | -82.6455 | 2024-06-14 |
Start a plan -- Take a look at the files in a collection
Now we use the point_collocation package. First we will look at the files available and figure out which ones we want. For the earthaccess data_source, it uses CMR metadata to find the granules that overlap with the lat/lon points in the dataframe.
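To picture the spatial side of that query: the points imply a bounding box that a CMR search can use. A sketch with a couple of hypothetical points, using the (west, south, east, north) ordering earthaccess expects (whether point-collocation queries exactly this way is an assumption):

```python
import pandas as pd

# A few hypothetical points.
df_points = pd.DataFrame(
    {"lat": [27.3835, 26.6675], "lon": [-82.7375, -82.6455]}
)

# Bounding box in (west, south, east, north) order.
bbox = (
    float(df_points["lon"].min()),
    float(df_points["lat"].min()),
    float(df_points["lon"].max()),
    float(df_points["lat"].max()),
)
print(bbox)  # → (-82.7375, 26.6675, -82.6455, 27.3835)
```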
%%time
import point_collocation as pc
plan = pc.plan(
df_points,
data_source="earthaccess",
source_kwargs={
"short_name": "PACE_OCI_L3M_RRS",
}
)
CPU times: user 634 ms, sys: 88.2 ms, total: 722 ms
Wall time: 2.23 s
Plan: 595 points → 210 unique granule(s)
Points with 0 matches : 0
Points with >1 matches: 595
Time buffer: 0 days 00:00:00
First 1 point(s):
[0] lat=27.3835, lon=-82.7375, time=2024-06-13 12:00:00: 16 match(es)
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240321_20240620.L3m.SNSP.RRS.V3_1.Rrs.0p1deg.nc
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240321_20240620.L3m.SNSP.RRS.V3_1.Rrs.4km.nc
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240516_20240616.L3m.R32.RRS.V3_1.Rrs.0p1deg.nc
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240516_20240616.L3m.R32.RRS.V3_1.Rrs.4km.nc
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240524_20240624.L3m.R32.RRS.V3_1.Rrs.0p1deg.nc
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240524_20240624.L3m.R32.RRS.V3_1.Rrs.4km.nc
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240601_20240630.L3m.MO.RRS.V3_1.Rrs.0p1deg.nc
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240601_20240630.L3m.MO.RRS.V3_1.Rrs.4km.nc
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240601_20240702.L3m.R32.RRS.V3_1.Rrs.4km.nc
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240601_20240702.L3m.R32.RRS.V3_1.Rrs.0p1deg.nc
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240609_20240616.L3m.8D.RRS.V3_1.Rrs.0p1deg.nc
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240609_20240616.L3m.8D.RRS.V3_1.Rrs.4km.nc
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240609_20240710.L3m.R32.RRS.V3_1.Rrs.0p1deg.nc
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240609_20240710.L3m.R32.RRS.V3_1.Rrs.4km.nc
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240613.L3m.DAY.RRS.V3_1.Rrs.0p1deg.nc
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240613.L3m.DAY.RRS.V3_1.Rrs.4km.nc
Create a new plan with a filter on file names
We will use the monthly 4km files.
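The granule_name filter uses shell-style wildcards. Assuming fnmatch-like semantics (an assumption about the filter's behavior), you can sanity-check a pattern against candidate file names before planning:

```python
from fnmatch import fnmatch

pattern = "*.MO.*.4km.*"

# File names from the unfiltered plan above.
names = [
    "PACE_OCI.20240601_20240630.L3m.MO.RRS.V3_1.Rrs.4km.nc",
    "PACE_OCI.20240601_20240630.L3m.MO.RRS.V3_1.Rrs.0p1deg.nc",
    "PACE_OCI.20240613.L3m.DAY.RRS.V3_1.Rrs.4km.nc",
]

# Only the monthly (MO) 4km file should match.
for name in names:
    print(name, fnmatch(name, pattern))
```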
%%time
import point_collocation as pc
plan = pc.plan(
df_points,
data_source="earthaccess",
source_kwargs={
"short_name": "PACE_OCI_L3M_RRS",
"granule_name": "*.MO.*.4km.*",
}
)
CPU times: user 44.5 ms, sys: 10.9 ms, total: 55.4 ms
Wall time: 9.68 s
# check the plan and see how many files per point
# we want 1 file per point in this case
# Looks like 4 monthly files
plan.summary()
Plan: 595 points → 4 unique granule(s)
Points with 0 matches : 0
Points with >1 matches: 0
Time buffer: 0 days 00:00:00
First 5 point(s):
[0] lat=27.3835, lon=-82.7375, time=2024-06-13 12:00:00: 1 match(es)
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240601_20240630.L3m.MO.RRS.V3_1.Rrs.4km.nc
[1] lat=27.1190, lon=-82.7125, time=2024-06-14 12:00:00: 1 match(es)
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240601_20240630.L3m.MO.RRS.V3_1.Rrs.4km.nc
[2] lat=26.9435, lon=-82.8170, time=2024-06-14 12:00:00: 1 match(es)
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240601_20240630.L3m.MO.RRS.V3_1.Rrs.4km.nc
[3] lat=26.6875, lon=-82.8065, time=2024-06-14 12:00:00: 1 match(es)
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240601_20240630.L3m.MO.RRS.V3_1.Rrs.4km.nc
[4] lat=26.6675, lon=-82.6455, time=2024-06-14 12:00:00: 1 match(es)
→ https://obdaac-tea.earthdatacloud.nasa.gov/ob-cumulus-prod-public/PACE_OCI.20240601_20240630.L3m.MO.RRS.V3_1.Rrs.4km.nc
Check the variables in the files
This will open one file and show us the variables. We want 'Rrs' in this case.
open_method: {'xarray_open': 'dataset', 'open_kwargs': {'chunks': {}, 'engine': 'h5netcdf', 'decode_timedelta': False}, 'coords': 'auto', 'set_coords': True, 'dim_renames': None, 'auto_align_phony_dims': None, 'merge': None}
Dimensions: {'lat': 4320, 'lon': 8640, 'wavelength': 172, 'rgb': 3, 'eightbitcolor': 256}
Variables: ['Rrs', 'palette']
Geolocation: ('lon', 'lat') — lon dims=('lon',), lat dims=('lat',)
Get the matchups using our plan
Let's start with 100 points, since 595 might take a while. The lat, lon, and time of the matching granule are added as columns. pc_id is the point id/row from the data you are matching; it is added in case there are multiple granules (files) per data point.
CPU times: user 12.3 s, sys: 992 ms, total: 13.3 s
Wall time: 18.8 s
| | lat | lon | time | pc_id | granule_id | granule_time | granule_lat | granule_lon | Rrs_346 | Rrs_348 | ... | Rrs_706 | Rrs_707 | Rrs_708 | Rrs_709 | Rrs_711 | Rrs_712 | Rrs_713 | Rrs_714 | Rrs_717 | Rrs_719 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 27.3835 | -82.7375 | 2024-06-13 12:00:00 | 0 | https://obdaac-tea.earthdatacloud.nasa.gov/ob-... | 2024-06-15 23:59:59+00:00 | 27.395832 | -82.729164 | 0.004034 | 0.004070 | ... | 0.000224 | 0.000202 | 0.000190 | 0.000176 | 0.000168 | 0.000156 | 0.000144 | 0.000134 | 0.000158 | 0.000202 |
| 1 | 27.1190 | -82.7125 | 2024-06-14 12:00:00 | 1 | https://obdaac-tea.earthdatacloud.nasa.gov/ob-... | 2024-06-15 23:59:59+00:00 | 27.104164 | -82.729164 | 0.004562 | 0.004616 | ... | 0.000108 | 0.000094 | 0.000084 | 0.000078 | 0.000072 | 0.000066 | 0.000060 | 0.000048 | 0.000062 | 0.000098 |
| 2 | 26.9435 | -82.8170 | 2024-06-14 12:00:00 | 2 | https://obdaac-tea.earthdatacloud.nasa.gov/ob-... | 2024-06-15 23:59:59+00:00 | 26.937498 | -82.812500 | 0.005112 | 0.005282 | ... | 0.000118 | 0.000108 | 0.000102 | 0.000098 | 0.000098 | 0.000092 | 0.000086 | 0.000068 | 0.000052 | 0.000066 |
| 3 | 26.6875 | -82.8065 | 2024-06-14 12:00:00 | 3 | https://obdaac-tea.earthdatacloud.nasa.gov/ob-... | 2024-06-15 23:59:59+00:00 | 26.687498 | -82.812500 | 0.004648 | 0.004904 | ... | 0.000178 | 0.000158 | 0.000148 | 0.000138 | 0.000130 | 0.000126 | 0.000126 | 0.000120 | 0.000158 | 0.000230 |
| 4 | 26.6675 | -82.6455 | 2024-06-14 12:00:00 | 4 | https://obdaac-tea.earthdatacloud.nasa.gov/ob-... | 2024-06-15 23:59:59+00:00 | 26.687498 | -82.645828 | 0.004944 | 0.005064 | ... | 0.000094 | 0.000078 | 0.000068 | 0.000062 | 0.000058 | 0.000054 | 0.000052 | 0.000050 | 0.000106 | 0.000166 |
5 rows × 180 columns
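Because pc_id ties each result row back to a source point, ordinary pandas group-bys can count or aggregate the granules per point. A sketch on a small hypothetical result frame (column values are made up):

```python
import pandas as pd

# Hypothetical matchup result: point 0 matched two granules, point 1 matched one.
matchups = pd.DataFrame(
    {
        "pc_id": [0, 0, 1],
        "granule_id": ["granule_a", "granule_b", "granule_c"],
        "Rrs_443": [0.0040, 0.0045, 0.0031],
    }
)

# Granules per point.
counts = matchups.groupby("pc_id").size()
print(counts.tolist())  # → [2, 1]

# Average over granules when a point matched more than one file.
mean_rrs = matchups.groupby("pc_id")["Rrs_443"].mean()
```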
Open files in plan
Sometimes it is helpful to look at the granules, and there are helper functions for that. You need to specify the format of the data: "grid" for Level 3 gridded or "swath" for Level 2 swath data.
<xarray.Dataset> Size: 26GB
Dimensions: (lat: 4320, lon: 8640, wavelength: 172, rgb: 3,
eightbitcolor: 256)
Coordinates:
* lat (lat) float32 17kB 89.98 89.94 89.9 ... -89.9 -89.94 -89.98
* lon (lon) float32 35kB -180.0 -179.9 -179.9 ... 179.9 179.9 180.0
* wavelength (wavelength) float64 1kB 346.0 348.0 351.0 ... 714.0 717.0 719.0
Dimensions without coordinates: rgb, eightbitcolor
Data variables:
Rrs (lat, lon, wavelength) float32 26GB dask.array<chunksize=(16, 1024, 8), meta=np.ndarray>
palette (rgb, eightbitcolor) uint8 768B dask.array<chunksize=(3, 256), meta=np.ndarray>
Attributes: (12/64)
product_name: PACE_OCI.20240601_20240630.L3m.MO.RRS....
instrument: OCI
title: OCI Level-3 Standard Mapped Image
project: Ocean Biology Processing Group (NASA/G...
platform: PACE
source: satellite observations from OCI-PACE
... ...
identifier_product_doi: 10.5067/PACE/OCI/L3M/RRS/3.1
keywords: Earth Science > Oceans > Ocean Optics ...
keywords_vocabulary: NASA Global Change Master Directory (G...
data_bins: 16464585
data_minimum: -0.009998
data_maximum: 0.09856601
If you need to specify the coordinate names
You don't control the names of the coordinates in the source granules. If point-collocation cannot figure out the coordinates, you will have to pass coord_spec to pc.matchup() or plan.open_dataset(). The full coord_spec dictionary looks like this.
coord_spec = {
"coordinate_system": "geographic", # default; y=lat(deg), x=lon(deg); other option will be "planar" but that is not built yet
"y": {"source": "auto", "points": "auto"},
"x": {"source": "auto", "points": "auto"},
"time": {"source": "auto", "points": "auto"},
# Additional coords like depth
}
Edit with the actual coordinate names. If there are additional coordinates in the dataset that you want to match on, you can add them; the data variable you are matching must have that coordinate, e.g. be (lat, lon, z). source is what the coordinate is called in the granule file and points is what it is called in your dataframe.
coord_spec = {
"coordinate_system": "geographic", # default; y=lat(deg), x=lon(deg); other option will be "planar" but that is not built yet
"y": {"source": "grid_lat", "points": "lat"},
"x": {"source": "grid_lon", "points": "lon"},
"time": {"source": "time", "points": "TOD"},
"depth": {"source": "z", "points": "depth"}, # optional
}
You can also pass lat/lon coordinate names to open_method().