clouddrift.ragged#

Transformational and inquiry functions for ragged arrays.

Functions

`apply_ragged`(func, arrays, rowsize, *args[, ...])	Apply a function to a ragged array.
`chunk`(x, length[, overlap, align])	Divide an array `x` into equal chunks of length `length`.
`obs_index_to_row`(index, rowsize)	Obtain a list of row indices from a list of observation indices of a ragged array.
`prune`(ragged, rowsize, min_rowsize)	Within a ragged array, remove rows shorter than a specified row size.
`ragged_to_regular`(ragged, rowsize[, fill_value])	Convert a ragged array to a two-dimensional array such that each contiguous segment of a ragged array is a row in the two-dimensional array.
`regular_to_ragged`(array[, fill_value])	Convert a two-dimensional array to a ragged array.
`rowsize_to_index`(rowsize)	Convert a list of row sizes to a list of indices.
`segment`(x, tolerance[, rowsize])	Divide an array into segments based on a tolerance value.
`subset`(ds, criteria[, id_var_name, ...])	Subset a ragged array xarray dataset as a function of one or more criteria.
`unpack`(ragged_array, rowsize[, rows, axis])	Unpack a ragged array into a list of regular arrays.

clouddrift.ragged.apply_ragged(func: callable, arrays: list[numpy.ndarray | xarray.core.dataarray.DataArray] | numpy.ndarray | xarray.core.dataarray.DataArray, rowsize: list[int] | numpy.ndarray[int] | xarray.core.dataarray.DataArray, *args: tuple, rows: int | collections.abc.Iterable[int] = None, axis: int = 0, executor: concurrent.futures._base.Executor = <concurrent.futures.thread.ThreadPoolExecutor object>, **kwargs: dict) → tuple[ndarray] | ndarray#

Apply a function to a ragged array.

The function func will be applied to each contiguous row of arrays as indicated by row sizes rowsize. The output of func will be concatenated into a single ragged array.

You can pass arrays as NumPy arrays or xarray DataArrays; the result, however, is always a NumPy array. Passing rows as an integer or a sequence of integers makes apply_ragged process and return only those rows; otherwise, all rows in the input ragged array are processed. You can also use the axis parameter to specify the ragged axis of the input array(s) (default is 0).

By default this function uses concurrent.futures.ThreadPoolExecutor to run func in multiple threads. The number of threads can be controlled by passing the max_workers argument to the executor instance passed to apply_ragged. Alternatively, you can pass the concurrent.futures.ProcessPoolExecutor instance to use processes instead. Passing alternative (3rd party library) concurrent executors may work if they follow the same executor interface as that of concurrent.futures, however this has not been tested yet.
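
As a rough illustration of the row-wise dispatch described above, the following standard-library sketch splits a flat sequence into rows, runs a function on each row through an executor, and concatenates the results. The helper name `apply_rows` is hypothetical and is not clouddrift's implementation; it only shows how an executor created with `max_workers` could be handed to such a routine.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from itertools import accumulate


def apply_rows(func, values, rowsize, executor: Executor):
    """Hypothetical helper: apply func to each contiguous row and concatenate."""
    if sum(rowsize) != len(values):
        raise ValueError("sum of rowsize must equal the length of values")
    # Row boundaries: row i spans values[bounds[i]:bounds[i + 1]].
    bounds = list(accumulate(rowsize, initial=0))
    rows = [values[a:b] for a, b in zip(bounds[:-1], bounds[1:])]
    # Executor.map preserves the input order of the rows.
    results = executor.map(func, rows)
    return [item for row_result in results for item in row_result]


def cumulative_sum(row):
    """Example per-row function: running total within one row."""
    return list(accumulate(row))


# Limit the pool to two worker threads via max_workers.
with ThreadPoolExecutor(max_workers=2) as pool:
    out = apply_rows(cumulative_sum, [1, 2, 10, 12, 14], [2, 3], pool)
print(out)  # [1, 3, 10, 22, 36]
```

When a process-based executor is used instead of threads, the per-row function has to be picklable, so a module-level function such as `cumulative_sum` is a safer choice than a lambda.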

Parameters#

func (callable): Function to apply to each row of each ragged array in arrays.
arrays (list[np.ndarray] or np.ndarray or xr.DataArray): An array or a list of arrays to apply func to.
rowsize (list[int] or np.ndarray[int] or xr.DataArray[int]): List of integers specifying the number of data points in each row.
*args (tuple): Additional arguments to pass to func.
rows (int or Iterable[int], optional): The row(s) of the ragged array to apply func to. If rows is None (default), then func will be applied to all rows.
axis (int, optional): The ragged axis of the input arrays. Default is 0.
executor (concurrent.futures.Executor, optional): Executor to use for concurrent execution. Default is ThreadPoolExecutor with the default number of max_workers. Another supported option is ProcessPoolExecutor.
**kwargs (dict): Additional keyword arguments to pass to func.

Returns#

out (tuple[np.ndarray] or np.ndarray): Output array(s) from func.

Examples#

Using velocity_from_position with apply_ragged, calculate the velocities of multiple particles, the coordinates of which are found in the ragged arrays x, y, and t that share row sizes 2, 3, and 4:

>>> import numpy as np
>>> from clouddrift.kinematics import velocity_from_position
>>> from clouddrift.ragged import apply_ragged
>>> rowsize = [2, 3, 4]
>>> x = np.array([1, 2, 10, 12, 14, 30, 33, 36, 39])
>>> y = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])
>>> t = np.array([1, 2, 1, 2, 3, 1, 2, 3, 4])
>>> u1, v1 = apply_ragged(velocity_from_position, [x, y, t], rowsize, coord_system="cartesian")
>>> u1
array([1., 1., 2., 2., 2., 3., 3., 3., 3.])
>>> v1
array([1., 1., 1., 1., 1., 1., 1., 1., 1.])

To apply func to only a subset of rows, use the rows argument:

>>> u1, v1 = apply_ragged(velocity_from_position, [x, y, t], rowsize, rows=0, coord_system="cartesian")
>>> u1
array([1., 1.])
>>> v1
array([1., 1.])
>>> u1, v1 = apply_ragged(velocity_from_position, [x, y, t], rowsize, rows=[0, 1], coord_system="cartesian")
>>> u1
array([1., 1., 2., 2., 2.])
>>> v1
array([1., 1., 1., 1., 1.])

Raises#

ValueError: If the sum of rowsize does not equal the length of arrays along the ragged axis.
IndexError: If arrays is empty.

clouddrift.ragged.chunk(x: list | ndarray | DataArray | Series, length: int, overlap: int = 0, align: str = 'start') → ndarray#

Divide an array x into equal chunks of length length. The result is a 2-dimensional NumPy array of shape (num_chunks, length). The resulting number of chunks is determined based on the length of x, length, and overlap.

chunk can be combined with apply_ragged() to chunk a ragged array.

Parameters#

x (list or array-like): Array to divide into chunks.
length (int): The length of each chunk.
overlap (int, optional): The number of overlapping array elements across chunks. The default is 0. Must be smaller than length. For example, if length is 4 and overlap is 2, the chunks of [0, 1, 2, 3, 4, 5] will be np.array([[0, 1, 2, 3], [2, 3, 4, 5]]). Negative overlap can be used to offset chunks by some number of elements. For example, if length is 2 and overlap is -1, the chunks of [0, 1, 2, 3, 4, 5] will be np.array([[0, 1], [3, 4]]).
align (str, optional; one of "start", "middle", "end"): If the remainder of the length of x divided by the chunk length is a nonzero number N, this parameter controls which part of the array is kept in the chunks. If align="start", the elements at the beginning of the array are kept and N elements are discarded at the end. If align="middle", floor(N/2) and ceil(N/2) elements are discarded from the beginning and the end of the array, respectively. If align="end", the elements at the end of the array are kept and the first N elements are discarded. The default is "start".

Returns#

np.ndarray: 2-dimensional array of shape (num_chunks, length).

Examples#

Chunk a simple list; this discards the end elements that exceed the last chunk:

>>> chunk([1, 2, 3, 4, 5], 2)
array([[1, 2],
       [3, 4]])

To discard the starting elements of the array instead, use align="end":

>>> chunk([1, 2, 3, 4, 5], 2, align="end")
array([[2, 3],
       [4, 5]])

To center the chunks by discarding both ends of the array, use align="middle":

>>> chunk([1, 2, 3, 4, 5, 6, 7, 8], 3, align="middle")
array([[2, 3, 4],
       [5, 6, 7]])

Specify overlap to get overlapping chunks:

>>> chunk([1, 2, 3, 4, 5], 2, overlap=1)
array([[1, 2],
       [2, 3],
       [3, 4],
       [4, 5]])

Use apply_ragged to chunk a ragged array by providing the row sizes; notice that you must pass the array to chunk as an array-like, not a list:

>>> x = np.array([1, 2, 3, 4, 5])
>>> rowsize = [2, 1, 2]
>>> apply_ragged(chunk, x, rowsize, 2)
array([[1, 2],
       [4, 5]])
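
The chunking rules above (chunk length, overlap, and alignment) can be sketched in plain Python. This is an illustrative re-implementation under the assumption that consecutive chunks start `length - overlap` elements apart; `chunk_sketch` is a hypothetical name, returns nested lists rather than a NumPy array, and is not clouddrift's code.

```python
def chunk_sketch(x, length, overlap=0, align="start"):
    """Illustrative re-implementation of the chunking rules (not clouddrift's code)."""
    if length < 0:
        raise ValueError("length must be non-negative")
    if align not in ("start", "middle", "end"):
        raise ValueError('align must be "start", "middle", or "end"')
    step = length - overlap  # distance between the starts of consecutive chunks
    n = len(x)
    num_chunks = (n - length) // step + 1 if n >= length else 0
    # Number of elements spanned by all chunks, and what is left over.
    covered = (num_chunks - 1) * step + length if num_chunks else 0
    remainder = n - covered
    # Where the first chunk starts, depending on which end is discarded.
    start = {"start": 0, "middle": remainder // 2, "end": remainder}[align]
    return [
        list(x[start + i * step : start + i * step + length])
        for i in range(num_chunks)
    ]


print(chunk_sketch([1, 2, 3, 4, 5], 2))                              # [[1, 2], [3, 4]]
print(chunk_sketch([1, 2, 3, 4, 5], 2, align="end"))                 # [[2, 3], [4, 5]]
print(chunk_sketch([1, 2, 3, 4, 5, 6, 7, 8], 3, align="middle"))     # [[2, 3, 4], [5, 6, 7]]
print(chunk_sketch([0, 1, 2, 3, 4, 5], 2, overlap=-1))               # [[0, 1], [3, 4]]
```

A row shorter than `length` produces no chunks in this sketch, which is consistent with the ragged example above where the row of size 1 contributes nothing to the output.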

Raises#

ValueError: If length < 0.
ValueError: If align not in ["start", "middle", "end"].
ZeroDivisionError: If length == 0.

clouddrift.ragged.obs_index_to_row(index, rowsize)#

Obtain a list of row indices from a list of observation indices of a ragged array.

A ragged array consists of rows of different sizes, given by rowsize, and can also be viewed as a continuous sequence of observations with indices from 0 to its length - 1. This function returns the row index of an observation given its observation index. It answers the question: “In which row is an observation located?”

Parameters#

index (int or list or np.ndarray): An integer observation index or a list of observation indices of a ragged array.
rowsize (list or np.ndarray or xr.DataArray): A sequence of row sizes of a ragged array.

Returns#

list: A list of row indices.

Examples#

To obtain the row index of observation with index 5 within a ragged array of three consecutive rows of sizes 2, 4, and 3:

>>> obs_index_to_row(5, [2, 4, 3])
[1]

To obtain the row indices of observations with indices 0, 2, and 4 within a ragged array of three consecutive rows of sizes 2, 4, and 3:

>>> obs_index_to_row([0, 2, 4], [2, 4, 3])
[0, 1, 1]
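
One way to picture this lookup is as a search in the cumulative row sizes. The sketch below uses only the standard library; `obs_index_to_row_sketch` is a hypothetical name and the bisection approach is an assumption for illustration, not necessarily how clouddrift computes the result.

```python
from bisect import bisect_right
from itertools import accumulate


def obs_index_to_row_sketch(index, rowsize):
    """Illustrative lookup of row indices (not clouddrift's implementation)."""
    indices = [index] if isinstance(index, int) else list(index)
    # ends[i] is the index one past the last observation of row i.
    ends = list(accumulate(rowsize))
    # The row of observation k is the number of row ends that are <= k.
    return [bisect_right(ends, k) for k in indices]


print(obs_index_to_row_sketch(5, [2, 4, 3]))          # [1]
print(obs_index_to_row_sketch([0, 2, 4], [2, 4, 3]))  # [0, 1, 1]
```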

clouddrift.ragged.prune(ragged, rowsize, min_rowsize)#

Within a ragged array, remove rows shorter than a specified row size.

Parameters#

ragged (np.ndarray or pd.Series or xr.DataArray): A ragged array.
rowsize (list or np.ndarray[int] or pd.Series or xr.DataArray[int]): The size of each row in the input ragged array.
min_rowsize: The minimum row size that will be kept.

Returns#

tuple[np.ndarray, np.ndarray]: A tuple of the pruned ragged array and the size of each remaining row.

Examples#

>>> from clouddrift.ragged import prune
>>> import numpy as np
>>> prune(np.array([1, 2, 3, 0, -1, -2]), np.array([3, 1, 2]), 2)
(array([ 1,  2,  3, -1, -2]), array([3, 2]))
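
The pruning rule can be sketched with plain lists. In the example above the row of size 2 survives a min_rowsize of 2, so this sketch keeps rows whose size is greater than or equal to the threshold; `prune_sketch` is a hypothetical name, not clouddrift's implementation.

```python
def prune_sketch(ragged, rowsize, min_rowsize):
    """Illustrative pruning of short rows (not clouddrift's implementation)."""
    kept_values, kept_sizes = [], []
    start = 0
    for size in rowsize:
        row = ragged[start : start + size]
        start += size
        if size >= min_rowsize:  # rows of exactly min_rowsize are kept
            kept_values.extend(row)
            kept_sizes.append(size)
    return kept_values, kept_sizes


values, sizes = prune_sketch([1, 2, 3, 0, -1, -2], [3, 1, 2], 2)
print(values, sizes)  # [1, 2, 3, -1, -2] [3, 2]
```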

Raises#

ValueError: If the sum of rowsize does not equal the length of ragged.
IndexError: If ragged is empty.

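clouddrift.ragged.ragged_to_regular(ragged, rowsize[, fill_value])#

Convert a ragged array to a two-dimensional array such that each contiguous segment of a ragged array is a row in the two-dimensional array.

Parameters#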

ragged (np.ndarray or pd.Series or xr.DataArray): A ragged array.
rowsize (list or np.ndarray[int] or pd.Series or xr.DataArray[int]): The size of each row in the ragged array.
fill_value (float, optional): Fill value to use for the trailing elements of each row of the resulting regular array. Default is NaN.

Returns#

np.ndarray: A two-dimensional array.

Examples#

By default, the fill value used is NaN:

>>> ragged_to_regular(np.array([1, 2, 3, 4, 5]), np.array([2, 1, 2]))
array([[ 1.,  2.],
       [ 3., nan],
       [ 4.,  5.]])

You can specify an alternative fill value:

>>> ragged_to_regular(np.array([1, 2, 3, 4, 5]), np.array([2, 1, 2]), fill_value=999)
array([[  1,   2],
       [  3, 999],
       [  4,   5]])
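
The padding step can be sketched with nested lists: every row is extended to the length of the longest row with the fill value. `ragged_to_regular_sketch` is a hypothetical name used for illustration only and returns lists rather than a NumPy array.

```python
def ragged_to_regular_sketch(ragged, rowsize, fill_value=float("nan")):
    """Illustrative padding of rows to a common width (not clouddrift's code)."""
    width = max(rowsize)  # the longest row sets the number of columns
    regular, start = [], 0
    for size in rowsize:
        row = list(ragged[start : start + size])
        # Pad the trailing positions of short rows with the fill value.
        regular.append(row + [fill_value] * (width - size))
        start += size
    return regular


print(ragged_to_regular_sketch([1, 2, 3, 4, 5], [2, 1, 2], fill_value=999))
# [[1, 2], [3, 999], [4, 5]]
```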

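clouddrift.ragged.regular_to_ragged(array[, fill_value])#

Convert a two-dimensional array to a ragged array.

Parameters#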

array (np.ndarray): A two-dimensional array.
fill_value (float, optional): Fill value used to determine the bounds of contiguous segments. Default is NaN.

Returns#

tuple[np.ndarray, np.ndarray]: A tuple of the ragged array and the size of each row.

Examples#

By default, NaN values found in the input regular array are excluded from the output ragged array:

>>> regular_to_ragged(np.array([[1, 2], [3, np.nan], [4, 5]]))
(array([1., 2., 3., 4., 5.]), array([2, 1, 2]))

Alternatively, a different fill value can be specified:

>>> regular_to_ragged(np.array([[1, 2], [3, -999], [4, 5]]), fill_value=-999)
(array([1, 2, 3, 4, 5]), array([2, 1, 2]))
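
The inverse conversion can be sketched the same way. This version drops every element equal to the fill value, which reproduces the two examples above; how fill values in the interior of a row are treated by the library is not stated here, so that behavior is an assumption. `regular_to_ragged_sketch` is a hypothetical name.

```python
import math


def regular_to_ragged_sketch(array, fill_value=float("nan")):
    """Illustrative removal of fill values (not clouddrift's implementation)."""

    def is_fill(value):
        # NaN never compares equal to itself, so it needs a dedicated check.
        if isinstance(fill_value, float) and math.isnan(fill_value):
            return isinstance(value, float) and math.isnan(value)
        return value == fill_value

    ragged, rowsize = [], []
    for row in array:
        valid = [value for value in row if not is_fill(value)]
        ragged.extend(valid)
        rowsize.append(len(valid))
    return ragged, rowsize


print(regular_to_ragged_sketch([[1, 2], [3, float("nan")], [4, 5]]))
# ([1, 2, 3, 4, 5], [2, 1, 2])
print(regular_to_ragged_sketch([[1, 2], [3, -999], [4, 5]], fill_value=-999))
# ([1, 2, 3, 4, 5], [2, 1, 2])
```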

clouddrift.ragged.rowsize_to_index(rowsize)#

Convert a list of row sizes to a list of indices.

Parameters#

rowsize (list or np.ndarray or xr.DataArray): A list of row sizes.

Returns#

np.ndarray: An array of indices marking the start of each row, followed by the total number of observations.

Examples#

To obtain the indices within a ragged array of three consecutive rows of sizes 100, 202, and 53:

>>> rowsize_to_index([100, 202, 53])
array([  0, 100, 302, 355])
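
The returned indices are the cumulative sum of the row sizes with a leading zero, so consecutive pairs delimit each row. A standard-library sketch of that relationship, with made-up data, might look like this:

```python
from itertools import accumulate

rowsize = [100, 202, 53]
# Cumulative sum with a leading zero gives the start of each row plus the end.
index = list(accumulate(rowsize, initial=0))
print(index)  # [0, 100, 302, 355]

# Consecutive pairs of indices delimit each row of a flat sequence.
data = list(range(sum(rowsize)))
rows = [data[a:b] for a, b in zip(index[:-1], index[1:])]
print([len(row) for row in rows])  # [100, 202, 53]
```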

clouddrift.ragged.segment(x: ndarray, tolerance: float | timedelta64 | timedelta | Timedelta, rowsize: ndarray[int] = None) → ndarray[int]#

Divide an array into segments based on a tolerance value.

Parameters#

x (list, np.ndarray, or xr.DataArray): An array to divide into segments.
tolerance (float, np.timedelta64, timedelta, pd.Timedelta): The maximum signed difference between consecutive points in a segment. The array x will be segmented wherever differences exceed the tolerance.
rowsize (np.ndarray[int], optional): The size of rows if x is originally a ragged array. If present, x will be divided both by gaps that exceed the tolerance, and by the original rows of the ragged array.

Returns#

np.ndarray[int]: An array of row sizes that divides the input array into segments.

Examples#

The simplest use of segment is to provide a tolerance value that is used to divide an array into segments:

>>> from clouddrift.ragged import segment, subset
>>> import numpy as np

>>> x = [0, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4]
>>> segment(x, 0.5)
array([1, 3, 2, 4, 1])

If the array has already been segmented (e.g. multiple rows in a ragged array), then the rowsize argument can be used to preserve the original segments:

>>> x = [0, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4]
>>> rowsize = [3, 2, 6]
>>> segment(x, 0.5, rowsize)
array([1, 2, 1, 1, 1, 4, 1])

The tolerance can also be negative. In this case, the input array is segmented wherever the difference between consecutive points is more negative than the tolerance, i.e. where x[n+1] - x[n] < tolerance:

>>> x = [0, 1, 2, 0, 1, 2]
>>> segment(x, -0.5)
array([3, 3])

To segment an array for both positive and negative gaps, invoke the function twice, once for a positive tolerance and once for a negative tolerance. The result of the first invocation can be passed as the rowsize argument to the second segment invocation:

>>> x = [1, 1, 2, 2, 1, 1, 2, 2]
>>> segment(x, 0.5, rowsize=segment(x, -0.5))
array([2, 2, 2, 2])

If the input array contains time objects, the tolerance must be a time interval:

>>> x = np.array([np.datetime64("2023-01-01"), np.datetime64("2023-01-02"),
...               np.datetime64("2023-01-03"), np.datetime64("2023-02-01"),
...               np.datetime64("2023-02-02")])
>>> segment(x, np.timedelta64(1, "D"))
array([3, 2])
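
The segmentation rule illustrated by the examples above can be sketched in plain Python. The sketch assumes that a positive tolerance breaks on differences greater than the tolerance and a negative tolerance breaks on differences smaller than the tolerance; `segment_sketch` is a hypothetical name and returns a list rather than a NumPy array, so it is not clouddrift's implementation.

```python
from datetime import datetime, timedelta


def segment_sketch(x, tolerance, rowsize=None):
    """Illustrative segmentation by signed gaps (not clouddrift's implementation)."""
    if rowsize is not None:
        # Segment each original row separately and concatenate the row sizes.
        sizes, start = [], 0
        for size in rowsize:
            sizes.extend(segment_sketch(x[start : start + size], tolerance))
            start += size
        return sizes
    if len(x) == 0:
        return []
    zero = type(tolerance)(0)  # 0.0 for floats, timedelta(0) for timedeltas
    sizes, current = [], 1
    for previous, following in zip(x[:-1], x[1:]):
        diff = following - previous
        # Positive tolerance: break on large increases; negative: on large decreases.
        breaks = diff > tolerance if tolerance >= zero else diff < tolerance
        if breaks:
            sizes.append(current)
            current = 1
        else:
            current += 1
    sizes.append(current)
    return sizes


x = [0, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4]
print(segment_sketch(x, 0.5))             # [1, 3, 2, 4, 1]
print(segment_sketch(x, 0.5, [3, 2, 6]))  # [1, 2, 1, 1, 1, 4, 1]

# Time values work the same way when the tolerance is a time interval.
times = [datetime(2023, 1, 1), datetime(2023, 1, 2), datetime(2023, 1, 3),
         datetime(2023, 2, 1), datetime(2023, 2, 2)]
print(segment_sketch(times, timedelta(days=1)))  # [3, 2]
```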

clouddrift.ragged.subset(ds: Dataset, criteria: dict, id_var_name: str = 'id', rowsize_var_name: str = 'rowsize', row_dim_name: str = 'rows', obs_dim_name: str = 'obs', full_rows=False) → Dataset#

Subset a ragged array xarray dataset as a function of one or more criteria. The criteria are passed as a dictionary, where each key is a variable to subset and the associated value is either a range (valuemin, valuemax), a list [value1, value2, valueN], a single value, or a masking function applied to any variable of the dataset.

This function needs to know the names of the dimensions of the ragged array dataset (row_dim_name and obs_dim_name), and the name of the rowsize variable (rowsize_var_name). Default values correspond to the clouddrift convention (“rows”, “obs”, and “rowsize”) but should be changed as needed.

Parameters#

ds (xr.Dataset): Xarray dataset composed of ragged arrays.
criteria (dict): Dictionary containing the variables (as keys) and the ranges/values/functions (as values) to subset.
id_var_name (str, optional): Name of the variable with dimension row_dim_name containing the identification number of the rows (default is “id”).
rowsize_var_name (str, optional): Name of the variable containing the number of observations per row (default is “rowsize”).
row_dim_name (str, optional): Name of the row dimension (default is “rows”).
obs_dim_name (str, optional): Name of the observation dimension (default is “obs”).
full_rows (bool, optional): If True, the function returns complete rows for which the criteria are matched at least once. Default is False, which means that only segments matching the criteria are returned when filtering along the observation dimension.

Returns#

xr.Dataset: Subset xarray dataset matching the criteria.

Examples#

Criteria can be combined on any data (with dimension “obs”) or metadata (with dimension “rows”) variables that are part of the Dataset. The following examples are based on NOAA GDP datasets, which can be accessed with the clouddrift.datasets module. In these datasets, each row of the ragged arrays corresponds to the data from a single drifter trajectory, the row_dim_name is “traj”, and the obs_dim_name is “obs”.

Retrieve a region, like the Gulf of Mexico, using ranges of latitude and longitude:

>>> from clouddrift.datasets import gdp6h
>>> from clouddrift.ragged import subset
>>> import numpy as np

>>> ds = gdp6h()
...

>>> subset(ds, {"lat": (21, 31), "lon": (-98, -78)}, row_dim_name="traj")
<xarray.Dataset> ...
...

The parameter full_rows can be used to retrieve trajectories passing through a region, for example all trajectories passing through the Gulf of Mexico:

>>> subset(ds, {"lat": (21, 31), "lon": (-98, -78)}, full_rows=True, row_dim_name="traj")
<xarray.Dataset> ...
...
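
The difference between the default behavior and full_rows=True can be sketched on a single observation variable with plain lists. The helper `subset_rows_sketch` and its data are hypothetical, and treating the range bounds as inclusive is an assumption of this sketch rather than a statement about clouddrift's behavior.

```python
def subset_rows_sketch(values, rowsize, bounds, full_rows=False):
    """Illustrative range filter on one observation variable (hypothetical helper)."""
    low, high = bounds
    kept_values, kept_sizes, start = [], [], 0
    for size in rowsize:
        row = values[start : start + size]
        start += size
        # Assumption: both bounds of the range are inclusive.
        matching = [v for v in row if low <= v <= high]
        if not matching:
            continue  # rows without any match are dropped entirely
        # full_rows=True keeps the whole row once any observation matches.
        selected = list(row) if full_rows else matching
        kept_values.extend(selected)
        kept_sizes.append(len(selected))
    return kept_values, kept_sizes


lat = [10, 22, 25, 40, 41, 30, 35]
rowsize = [3, 2, 2]
print(subset_rows_sketch(lat, rowsize, (21, 31)))
# ([22, 25, 30], [2, 1])
print(subset_rows_sketch(lat, rowsize, (21, 31), full_rows=True))
# ([10, 22, 25, 30, 35], [3, 2])
```

In both cases the row sizes are recomputed so that the result remains a valid ragged array.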

Retrieve drogued trajectory segments:

>>> subset(ds, {"drogue_status": True}, row_dim_name="traj")
<xarray.Dataset> ...
Dimensions:                (traj: ..., obs: ...)
Coordinates:
    id                     (traj) int64 ...
    time                   (obs) datetime64[ns] ...
...

Retrieve trajectory segments with temperature higher than 30°C (303.15 K):

>>> subset(ds, {"temp": (303.15, np.inf)}, row_dim_name="traj")
<xarray.Dataset> ...
...

You can use the same approach to return only the trajectories that are shorter than some number of observations (similar to prune() but for the entire dataset):

>>> subset(ds, {"rowsize": (0, 1000)}, row_dim_name="traj")
<xarray.Dataset> ...
...

Retrieve specific drifters using their IDs:

>>> subset(ds, {"id": [2578, 2582, 2583]}, row_dim_name="traj")
<xarray.Dataset> ...
...

Sometimes, you may want to retrieve specific rows of a ragged array. You can do that by filtering along the trajectory dimension directly, since this one corresponds to row numbers:

>>> rows = [5, 6, 7]
>>> subset(ds, {"traj": rows}, row_dim_name="traj")
<xarray.Dataset> ...
...

Retrieve a specific time period:

>>> subset(ds, {"time": (np.datetime64("2000-01-01"), np.datetime64("2020-01-31"))}, row_dim_name="traj")
<xarray.Dataset> ...
...

Note that to subset a time variable, the range must be defined with the same type as the variable. By default, xarray uses np.datetime64 to represent datetime data. If the datetime data is stored as datetime.datetime or pd.Timestamp, the range must be defined accordingly.

These criteria can also be combined:

>>> subset(ds, {"lat": (21, 31), "lon": (-98, -78), "drogue_status": True, "temp": (303.15, np.inf), "time": (np.datetime64("2000-01-01"), np.datetime64("2020-01-31"))}, row_dim_name="traj")
<xarray.Dataset> ...
...

You can also use a function to filter the data. For example, retrieve every other observation of each trajectory:

>>> func = (lambda arr: ((arr - arr[0]) % 2) == 0)
>>> subset(ds, {"id": func}, row_dim_name="traj")
<xarray.Dataset> ...
...

The filtering function can accept several input variables passed as a tuple. For example, retrieve drifters released in the Mediterranean Sea, but exclude those released in the Bay of Biscay and the Black Sea:

>>> import xarray as xr
>>> def mediterranean_mask(lon: xr.DataArray, lat: xr.DataArray) -> xr.DataArray:
...    # Mediterranean Sea bounding box
...    in_med = np.logical_and(-6.0327 <= lon, np.logical_and(lon <= 36.2173,
...                                                           np.logical_and(30.2639 <= lat, lat <= 45.7833)))
...    # Bay of Biscay
...    in_biscay = np.logical_and(lon <= -0.1462, lat >= 43.2744)
...    # Black Sea
...    in_blacksea = np.logical_and(lon >= 27.4437, lat >= 40.9088)
...    return np.logical_and(in_med, np.logical_not(np.logical_or(in_biscay, in_blacksea)))
>>> subset(ds, {("start_lon", "start_lat"): mediterranean_mask}, row_dim_name="traj")
<xarray.Dataset> Size: ...
Dimensions:                (traj: ..., obs: ...)
Coordinates:
    id                     (traj) int64 ...
    time                   (obs) datetime64[ns] ...
...

Raises#

ValueError: If one of the variables in a criterion is not found in the Dataset.
TypeError: If a criterion key is a tuple while its associated value is not a Callable.
TypeError: If the variables of a criterion key associated with a Callable do not share the same dimension.

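clouddrift.ragged.unpack(ragged_array, rowsize[, rows, axis])#

Unpack a ragged array into a list of regular arrays.

Parameters#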

ragged_array (array-like): A ragged array to unpack.
rowsize (array-like): An array of integers whose values are the sizes of each row in the ragged array.
rows (int or Iterable[int], optional): A row or list of rows to unpack. Default is None, which unpacks all rows.
axis (int, optional): The axis along which to unpack the ragged array. Default is 0.

Returns#

list: A list of array-likes with sizes that correspond to the values in rowsize, and types that correspond to the type of ragged_array.

Examples#

Unpacking longitude arrays from a ragged Xarray Dataset:

>>> from clouddrift.ragged import unpack
>>> from clouddrift.datasets import gdp6h

>>> ds = gdp6h()

>>> lon = unpack(ds.lon, ds["rowsize"]) # return a list[xr.DataArray] (slower)
>>> lon = unpack(ds.lon.values, ds["rowsize"]) # return a list[np.ndarray] (faster)
>>> first_lon = unpack(ds.lon.values, ds["rowsize"], rows=0) # return only the first row
>>> first_two_lons = unpack(ds.lon.values, ds["rowsize"], rows=[0, 1]) # return first two rows
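
Unpacking amounts to slicing the flat array at the cumulative row boundaries. The following standard-library sketch works along the first axis only; `unpack_sketch` and the sample values are hypothetical and are not clouddrift's implementation.

```python
from itertools import accumulate


def unpack_sketch(ragged, rowsize, rows=None):
    """Illustrative unpacking along the first axis (not clouddrift's implementation)."""
    # Row r spans ragged[bounds[r]:bounds[r + 1]].
    bounds = list(accumulate(rowsize, initial=0))
    if rows is None:
        rows = range(len(rowsize))
    elif isinstance(rows, int):
        rows = [rows]
    # Slicing preserves the container type, e.g. list in, lists out.
    return [ragged[bounds[r] : bounds[r + 1]] for r in rows]


lon = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
rowsize = [2, 1, 3]
print(unpack_sketch(lon, rowsize))               # [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
print(unpack_sketch(lon, rowsize, rows=0))       # [[1.0, 2.0]]
print(unpack_sketch(lon, rowsize, rows=[0, 1]))  # [[1.0, 2.0], [3.0]]
```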

Looping over trajectories in a ragged Xarray Dataset to compute velocities for each:

>>> from clouddrift.kinematics import velocity_from_position

>>> for lon, lat, time in list(zip(
...     unpack(ds.lon.values, ds["rowsize"]),
...     unpack(ds.lat.values, ds["rowsize"]),
...     unpack(ds.time.values, ds["rowsize"])
... )):
...     u, v = velocity_from_position(lon, lat, time)
