clouddrift.ragged.subset


clouddrift.ragged.subset(ds: Dataset, criteria: dict, id_var_name: str = 'id', rowsize_var_name: str = 'rowsize', row_dim_name: str = 'rows', obs_dim_name: str = 'obs', full_rows=False) -> Dataset

Subset a ragged array xarray dataset as a function of one or more criteria. The criteria are passed with a dictionary, where a dictionary key is a variable to subset and the associated dictionary value is either a range (valuemin, valuemax), a list [value1, value2, valueN], a single value, or a masking function applied to any variable of the dataset.
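
The four kinds of criterion values can be pictured as four ways of building a boolean mask. The following is a minimal NumPy sketch, not clouddrift's implementation: criterion_mask is a hypothetical helper, and treating the range bounds as inclusive is an assumption made for illustration.

```python
import numpy as np

def criterion_mask(values, criterion):
    """Hypothetical helper: turn one criterion into a boolean mask."""
    if callable(criterion):                  # masking function
        return np.asarray(criterion(values), dtype=bool)
    if isinstance(criterion, tuple):         # range (valuemin, valuemax), assumed inclusive
        valuemin, valuemax = criterion
        return (values >= valuemin) & (values <= valuemax)
    if isinstance(criterion, list):          # list [value1, value2, valueN]
        return np.isin(values, criterion)
    return values == criterion               # single value

values = np.array([1, 2, 3, 4, 5])
range_mask = criterion_mask(values, (2, 4))
list_mask = criterion_mask(values, [1, 5])
single_mask = criterion_mask(values, 3)
func_mask = criterion_mask(values, lambda v: v % 2 == 0)
```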

This function needs to know the names of the dimensions of the ragged array dataset (row_dim_name and obs_dim_name) and the name of the rowsize variable (rowsize_var_name). The default values correspond to the clouddrift convention (“rows”, “obs”, and “rowsize”) and should be changed as needed.
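
To make the layout concrete, here is a minimal NumPy sketch (not clouddrift code) of how a rowsize variable partitions a flat observation array into rows. The array contents and the local variable names are made up for illustration.

```python
import numpy as np

# Hypothetical ragged array: 3 rows holding 2, 3, and 1 observations.
rowsize = np.array([2, 3, 1])                          # dimension "rows"
lat = np.array([10.0, 11.0, 20.0, 21.0, 22.0, 30.0])   # dimension "obs"

# Start index of each row in the flat observation array.
starts = np.concatenate(([0], np.cumsum(rowsize)[:-1]))

# Row number of every observation, useful for mapping "obs" back to "rows".
row_of_obs = np.repeat(np.arange(rowsize.size), rowsize)

# Split the flat array back into one chunk per row.
rows = np.split(lat, np.cumsum(rowsize)[:-1])
```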

Parameters

ds : xr.Dataset

Xarray dataset composed of ragged arrays.

criteria : dict

Dictionary containing the variables (as keys) and the ranges/values/functions (as values) to subset.

id_var_name : str, optional

Name of the variable with dimension row_dim_name containing the identification number of the rows (default is “id”).

rowsize_var_name : str, optional

Name of the variable containing the number of observations per row (default is “rowsize”).

row_dim_name : str, optional

Name of the row dimension (default is “rows”).

obs_dim_name : str, optional

Name of the observation dimension (default is “obs”).

full_rows : bool, optional

If True, the function returns complete rows for which the criteria are matched at least once. The default is False, which means that only the segments matching the criteria are returned when filtering along the observation dimension.
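
The difference between the two full_rows settings can be illustrated on plain arrays. This is a sketch of the described behavior under made-up data, not the library's internal code; all variable names are local to the example.

```python
import numpy as np

rowsize = np.array([2, 3, 1])                            # 3 hypothetical rows
lat = np.array([10.0, 25.0, 40.0, 41.0, 42.0, 22.0])     # flat observations

# Observation-level mask for a criterion such as {"lat": (21, 31)}.
obs_mask = (lat >= 21) & (lat <= 31)

# full_rows=False: keep only the matching observations (segments).
segments = lat[obs_mask]

# full_rows=True: keep every observation of any row with at least one match.
row_of_obs = np.repeat(np.arange(rowsize.size), rowsize)
matches_per_row = np.bincount(
    row_of_obs, weights=obs_mask.astype(float), minlength=rowsize.size
)
row_has_match = matches_per_row > 0
full = lat[row_has_match[row_of_obs]]
```

Here the first and last rows each contain one matching observation, so the segment result holds two values while the full-row result holds all three observations of those two rows.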

Returns

xr.Dataset

Subset of the xarray dataset matching the criteria.

Examples

Criteria can be combined on any data variables (with dimension “obs”) or metadata variables (with dimension “rows”) that are part of the Dataset. The following examples are based on the NOAA GDP datasets, which can be accessed with the clouddrift.datasets module. In these datasets, each row of the ragged arrays corresponds to the data from a single drifter trajectory, the row_dim_name is “traj”, and the obs_dim_name is “obs”.

Retrieve a region, like the Gulf of Mexico, using ranges of latitude and longitude:

>>> subset(ds, {"lat": (21, 31), "lon": (-98, -78)}, row_dim_name="traj")

The parameter full_rows can be used to retrieve trajectories passing through a region, for example all trajectories passing through the Gulf of Mexico:

>>> subset(ds, {"lat": (21, 31), "lon": (-98, -78)}, full_rows=True, row_dim_name="traj")

Retrieve drogued trajectory segments:

>>> subset(ds, {"drogue_status": True}, row_dim_name="traj")

Retrieve trajectory segments with temperature higher than 30°C (303.15 K):

>>> subset(ds, {"sst": (303.15, np.inf)}, row_dim_name="traj")

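The lower bound in this example is expressed in kelvin, and 303.15 K is 30 °C. A small conversion helper avoids getting the offset wrong when building the range. The helper name celsius_to_kelvin is hypothetical, and the sketch assumes that the sst variable is stored in kelvin, as the example value suggests.

```python
import math

KELVIN_OFFSET = 273.15  # 0 degrees Celsius expressed in kelvin

def celsius_to_kelvin(t_celsius: float) -> float:
    """Hypothetical helper: convert a temperature from Celsius to kelvin."""
    return t_celsius + KELVIN_OFFSET

# Open-ended range: everything warmer than 30 degrees Celsius.
# math.inf is the same float value as np.inf used in the example.
threshold_k = celsius_to_kelvin(30.0)
sst_criterion = (threshold_k, math.inf)
```
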
You can use the same approach to return only the trajectories that are shorter than some number of observations (similar to prune() but for the entire dataset):

>>> subset(ds, {"rowsize": (0, 1000)}, row_dim_name="traj")

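A criterion on a row-level variable such as rowsize selects whole rows, and the selection then has to be carried over to the observation dimension. The NumPy sketch below illustrates that idea with made-up data and a small threshold; it is not the code path clouddrift uses.

```python
import numpy as np

rowsize = np.array([2, 3, 1])                      # hypothetical rows
sst = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])     # flat observations

# Row-level criterion, analogous to {"rowsize": (0, 2)}.
row_mask = (rowsize >= 0) & (rowsize <= 2)

# Expand the row selection to the observation dimension.
obs_mask = np.repeat(row_mask, rowsize)

kept_rowsize = rowsize[row_mask]
kept_sst = sst[obs_mask]
```
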
Retrieve specific drifters using their IDs:

>>> subset(ds, {"id": [2578, 2582, 2583]}, row_dim_name="traj")

Sometimes, you may want to retrieve specific rows of a ragged array. You can do that by filtering along the trajectory dimension directly, since its values correspond to row numbers:

>>> rows = [5, 6, 7]
>>> subset(ds, {"traj": rows}, row_dim_name="traj")

Retrieve a specific time period:

>>> subset(ds, {"time": (np.datetime64("2000-01-01"), np.datetime64("2020-01-31"))}, row_dim_name="traj")

Note that to subset a time variable, the range has to be defined with the same type as the variable. By default, xarray uses np.datetime64 to represent datetime data. If the datetime data is a datetime.datetime or pd.Timestamp, the range has to be defined accordingly.
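
As an illustration of matching types, the sketch below compares a np.datetime64 array against np.datetime64 bounds and shows one way to convert a datetime.datetime bound before use. The sample timestamps are invented, and the comparison stands in for the range test rather than reproducing clouddrift's code.

```python
from datetime import datetime

import numpy as np

# Hypothetical time variable stored as np.datetime64, as xarray does by default.
time = np.array(["1999-12-31", "2005-06-15", "2021-03-01"], dtype="datetime64[ns]")

# Bounds built with the same type as the data.
start, end = np.datetime64("2000-01-01"), np.datetime64("2020-01-31")
in_period = (time >= start) & (time <= end)

# A datetime.datetime bound can be converted to np.datetime64 first.
start_from_py = np.datetime64(datetime(2000, 1, 1))
```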

Those criteria can also be combined:

>>> subset(ds, {"lat": (21, 31), "lon": (-98, -78), "drogue_status": True, "sst": (303.15, np.inf), "time": (np.datetime64("2000-01-01"), np.datetime64("2020-01-31"))}, row_dim_name="traj")

You can also use a function to filter the data. For example, retrieve every other observation of each trajectory:

>>> func = (lambda arr: ((arr - arr[0]) % 2) == 0)
>>> subset(ds, {"time": func}, row_dim_name="traj")

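To see what this masking function returns, it can be applied to a small array. It keeps the elements whose offset from the first element is even, so it selects every other element only when consecutive values differ by exactly one unit. The integer array below is an illustrative stand-in, not GDP data.

```python
import numpy as np

# Same masking function as in the example above.
func = lambda arr: ((arr - arr[0]) % 2) == 0

# Consecutive values one unit apart: every other element is kept.
regular = np.array([100, 101, 102, 103, 104])
regular_mask = func(regular)

# Values two units apart: every offset is even, so everything is kept.
sparse = np.array([100, 102, 104])
sparse_mask = func(sparse)
```
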
A filtering function can take several input variables when the criteria key is a tuple of variable names. For example, retrieve drifters released in the Mediterranean Sea, but exclude those released in the Bay of Biscay and the Black Sea:

>>> def mediterranean_mask(lon: xr.DataArray, lat: xr.DataArray) -> xr.DataArray:
...     # Mediterranean Sea bounding box
...     in_med = np.logical_and(-6.0327 <= lon, np.logical_and(lon <= 36.2173,
...                                                            np.logical_and(30.2639 <= lat, lat <= 45.7833)))
...     # Bay of Biscay
...     in_biscay = np.logical_and(lon <= -0.1462, lat >= 43.2744)
...     # Black Sea
...     in_blacksea = np.logical_and(lon >= 27.4437, lat >= 40.9088)
...     return np.logical_and(in_med, np.logical_not(np.logical_or(in_biscay, in_blacksea)))
>>> subset(ds, {("start_lon", "start_lat"): mediterranean_mask}, row_dim_name="traj")

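The mask logic can be checked on a few hand-picked release positions. The sketch below restates the function with plain NumPy arrays in place of xr.DataArray and uses the same bounding-box values; the four test coordinates are invented for illustration.

```python
import numpy as np

def mediterranean_mask(lon, lat):
    """Condensed restatement of the mask above, for plain NumPy arrays."""
    in_med = (-6.0327 <= lon) & (lon <= 36.2173) & (30.2639 <= lat) & (lat <= 45.7833)
    in_biscay = (lon <= -0.1462) & (lat >= 43.2744)
    in_blacksea = (lon >= 27.4437) & (lat >= 40.9088)
    return in_med & ~(in_biscay | in_blacksea)

# Hypothetical release positions: central Mediterranean, Bay of Biscay,
# Black Sea, and open Atlantic.
start_lon = np.array([15.0, -3.0, 32.0, -30.0])
start_lat = np.array([38.0, 44.5, 43.0, 35.0])

mask = mediterranean_mask(start_lon, start_lat)
```

Only the first position passes: the second and third fall inside the bounding box but are removed by the two exclusion zones, and the fourth lies outside the box altogether.
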
Raises

ValueError

If one of the variables in a criterion is not found in the Dataset.

TypeError

If one of the criteria keys is a tuple while its associated value is not a Callable criterion.

TypeError

If the variables of a criterion key associated with a Callable do not share the same dimension.

See Also

apply_ragged()