Developer’s guide#

Here are some pointers and requirements to add new methods to this package. Don’t hesitate to look at existing algorithms, or to open an issue if you are having troubles.

Front-detection algorithms can be defined in their own module. Filters are to be defined in the filters module. Post-processing of detected fronts are to be defined in the post module.

Front-detection algorithms should not be imported in the global __init__. However, filters and post-processing functions can be added to filters.__all__ and post.__all__, if they do not have other requirements than the mandatory dependencies (numpy and numba). This means the user can do from fronts_toolbox.filters import boa_numpy.

Input types and requirements#

This library aims to facilitate implementing algorithms for different types of input arrays:

Numpy,
Dask,
Xarray,
CUDA/Cupy arrays are not supported, but it could be added without too much hassle.

Beyond numpy and numba, none of those libraries are required. As much functionality as possible should be made available even if optional dependencies are not installed. You can use lazy imports inside functions:

def my_algorithm(...):
    import dask.array as da
    ...

You can also check for availability without importing by using util.module_available(); this is a very lightweight check. Finally you can safeguard imports behind a try-except block. This can simplify things a bit, however this imports available packages even if they are not needed by the user. That can lengthen the import time quite a bit in the case of Dask and Xarray.

Important

All additional packages (required or optional) must be indicated in the ‘tests’ optional dependencies in pyproject.toml, and in their documentaion.

Dispatcher#

A function may be defined for each type of input, suffixed with the library name (_numpy, _dask, _xarray, _cuda, etc.). A function that can handle any input can also be defined (not obligatory). To help with dispatching the input to the correct function, you can define a util.Dispatcher:

my_dispatcher = Dispatcher(
    "algorithm name (for error messages)",
    numpy=my_algorithm_numpy,
    xarray=my_algorithm_xarray,
)

def my_algorithm(input_field: NDArray | xr.DataArray):
    func = my_dispatcher.get_func(input_field)
    return func(input_field)

Note

Not all dispatchers need to contain an implementation for every possible type. The dispatcher will give appropriate an message error if an input type is unsupported, or if the needed library is not installed.

Documentation#

Having a function for each input type can lead to a lot of duplication in the docstrings. To help avoid repeating yourself, use the util.doc() decorator. It takes a dictionary containing the parameters documentation, and additional keys “init”, “returns”, “rtype” and “remove”. The decorated function should have a single line docstring. You can define a single dictionary and reuse for the different functions. You can remove keys that are not used for this specific decorated function and update the dictionary by passing additional keyword arguments for the decorator:

_doc = dict(
  init="""\
  The initial paragraph. Notice the long-string syntax
  to have multiple lines.""",
  returns="For :returns: role",
  rtype="A typehint for the :rtype: role",
  param1="Help for the first parameter",
  param1_type="The typehint for 'param1', this is optional"
  param2="Help for the second parameter",
  axes=util.axes_help,
)

@doc(_doc)
def my_first_function(param1, param2, axes):
    ...

@doc(
  _doc,
  remove=["axes"],  # axes is not used in this function
  param1="The help should change slightly for this parameter",
  rtype="The return type is different",
  dims=util.dims_help,
)
def my_second_function(param1, param2, dims):
  ...

The recurring axes and dims arguments have pre-defined help strings in util.axes_help and util.dims_help.

Generalized functions#

Front detection algorithms and filters will typically work on a 2D image, but it is useful to accommodate additional dimensions (like time for instance). This necessitates dealing with looping over those dimensions and specifying axes placement.

For Numpy, if you intend to compile your function with Numba, using numba.guvectorize() gives a generalized universal function. Sadly numpy.vectorize and numpy.frompyfunc() cannot deal with generalized ufuncs. Instead you can use util.apply_vectorized(). It only works for functions that take and return a single array and preserve its shape.

Dask handles generalized ufuncs and even allows to convert a Python function to a generalized ufunc compatible with Dask arrays (and only those). Computations with moving windows will either need to avoid chunking along the core dimensions or deal with overlap.

Xarray handles any function working on Numpy arrays with xarray.apply_ufunc(). If there is a need to use a specialized Dask function − to handle overlap for instance − you can apply the function directly to the underlying data. A util.Dispatcher can help selecting the function depending on the input type:

func = my_dispatcher.get_func(input)
output = func(input, **kwargs)
arr = xr.DataArray(output, ...)

Compiling with Numba#

The goal of this library is to provide computationally efficient tools, that can easily scale on large datasets. Please write your core function to avoid pure python loops, or alternatively compile your core function with Numba.

Using numba.guvectorize() allows to easily create a generalized universal function. This ensures that your computations will be properly vectorized and that it deals nicely with broadcasting and type conversion.

Note that when using guvectorize with target="parallel" and cache=True the import is quite slow (see issue #8085). I tried to defer fetching the cache at call time, but this failed to be pickled correctly when using dask.distributed.

Fonctions compiled with numba.guvectorize only accept positional arguments, but (as I understand it) Dask treats positional arguments as chunked data. Use the util.KwargsWrap wrapper to transform kwargs into positional arguments: Dask will see keyword arguments, but the fonction will receive positional arguments.

Moving window size#

Multiple algorithms use a moving window. The user will provide the window size: the number of pixels along its sides. A window of size 3x3 contains 9 pixels. Please allow the user to input the window size as described in Moving window size.

In the implementation, it is often easier to loop over half the window size (from the central pixel). This package provides util.get_window_reach() to obtain the reach of the window. We define it as the number of pixels between the central pixel (excluding it) and the window edge (including it). A window of size 3 has a reach of 1, a window of size 5 a reach of 2, etc.

Axes management#

You will probably need to give your functions the indices of core axes it must work onto (typically the axes corresponding to latitude and longitude). When working with generalized universal function you will need to specify the axes indices to the gufunc via the “axes” keyword argument, whose syntax is not the simplest (see Universal functions (ufunc)).

I suggest here to simplify things for the user. They only have to supply a sequence of indices (or of dimensions for xarray) which is then is accommodated to the gufunc. The function util.get_axes_kwarg() will automatically try to do that from a given signature. For instance if the core axes are specified as y,x:

Numpy and Dask

def function_numpy(..., axes: Sequence[int] | None = None, **kwargs):
    """...

        Parameters
        ----------
        axes:
            Indices of the the y/lat and x/lon axes on which to work. If
            None (default), the last two axes are used.
    """
    if axes is not None:
        kwargs["axes"] = get_axes_kwarg(function.signature, axes, "y,x")

    # kwargs is then passed to the compiled gufunc

Xarray

DEFAULT_DIMS: list[Hashable] = ["lat", "lon"]
"""Default dimensions names to use if none are provided."""

def function_xarray(input_field, dims: Collection[Hashable] | None = None):
    """...

    Parameters
    ----------
    dims:
        Names of the dimensions along which to compute the index. Order
        is irrelevant, no reordering will be made between the two
        dimensions. If not specified, is taken as module-wide variable
        :data:`DEFAULT_DIMS` which defaults to ``{'lat', 'lon'}``.
    """
    if dims is None:
        dims = DEFAULT_DIMS

    axes = sorted(input_field.get_axis_num(dims))

    # axes can then be passed to the Numpy or Dask function

Masked values#

If possible, please try to make your function resilient to missing values in the input field. This may require additional care to the compiled function implementation. There are several ways to go about it.

You can require a mask argument that is obtained outside of the function (for instance with xarray.DataArray.isnull()).

You can compute the mask directly in the function, using numpy.isfinite. This has the advantage of simplifying the signature, and can give you more control over how and when the mask is computed. More importantly, it can reduce the operation count when using Dask (since you avoid a da.isfinite call outside the function).

Note

Xarray represents missing values with np.nan.

Testing and benchmark#

Added functions must be tested. Define new test functions in tests/.... Those tests only check if the function executes for different kinds of input, as well as the output metadata. They do not test for correctness, though you are welcome to write more advanced tests if your algorithm allows it.

To check the actual output of your function, please add a jupyter notebook to the documentation in doc/gallery/. The notebook is here to showcase the application of your algorithm to idealized data or real-life samples (both available in _fields).

Some benchmarks can use data samples stored on Zenodo (doi:10.5281/zenodo.15769617). Use _fields.sample() to access them in the form of Xarray datasets. Open an issue to add more data if necessary.

Documentation#

Each algorithm should have a single documentation page in doc/algorithms/. It must be indexed in the relevant toctree in doc/algorithms/index.rst.

This page should contain a brief description of the method, eventually with implementation details. The goal is to make the method understandable, reasonably easy to use, but also modifiable by savvy users. If applicable, the documentation must contain a list of reference(s) with DOI links.

If additional packages are required, they should be specified − along for what specific features they are necessary if applicable − in the “Supported input types and requirements” section.

The code itself should be properly documented as well. The module must be added in the toctree of doc/api.rst. Numpy docstring style is preferred. Type hinting is not mandatory but preferred as well.