lsdb.nested.datasets#

Classes#

BoxSearch

Perform a box search to filter the catalog.

ConeSearch

Perform a cone search to filter the catalog.

PixelSearch

Filter the catalog by HEALPix pixels.

NestedFrame

An extension for a Dask Dataframe that has Nested-Pandas functionality.

Functions#

generate_data(n_base, n_layer[, npartitions, seed, ...])

Generates a toy dataset.

_generate_cone_search(→ tuple[numpy.ndarray, ...)

_generate_box_radec(ra_range, dec_range, n_base[, seed])

Generates a random set of RA and Dec values within a given range.

_generate_pixel_search(→ tuple[numpy.ndarray, ...)

generate_catalog(n_base, n_layer[, seed, ra_range, ...])

Generates a toy catalog.

Package Contents#

class BoxSearch(ra: tuple[float, float], dec: tuple[float, float], fine: bool = True)[source]#

Bases: lsdb.core.search.abstract_search.AbstractSearch

Perform a box search to filter the catalog. This type of search selects a range of right ascension or declination, where the right ascension edges follow great-circle arcs and the declination edges follow small-circle arcs.

Filters to points within the ra / dec region, specified in degrees. Filters partitions in the catalog to those that have some overlap with the region.

filter_hc_catalog(hc_structure: lsdb.types.HCCatalogTypeVar) mocpy.MOC[source]#

Filters catalog pixels according to the box

search_points(frame: nested_pandas.NestedFrame, metadata: hats.catalog.TableProperties) nested_pandas.NestedFrame[source]#

Determine the search results within a data frame
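The point-level filter described for BoxSearch can be sketched in plain Python. This is a minimal illustration, not lsdb's implementation: the helper name in_box is hypothetical, and it assumes the right ascension edges are given in [0, 360] and that a range whose first edge exceeds its second wraps through RA = 0.

```python
def in_box(ra, dec, ra_range, dec_range):
    """Return True if the point (ra, dec), in degrees, lies inside the box."""
    ra = ra % 360.0
    ra_min, ra_max = ra_range
    dec_min, dec_max = dec_range
    if ra_min <= ra_max:
        ra_ok = ra_min <= ra <= ra_max
    else:
        # Assumed convention: the range crosses RA = 0, e.g. (350, 10).
        ra_ok = ra >= ra_min or ra <= ra_max
    return ra_ok and dec_min <= dec <= dec_max


points = [(5.0, -2.0), (355.0, 1.0), (180.0, 0.0), (5.0, 40.0)]
selected = [
    p for p in points
    if in_box(*p, ra_range=(350.0, 10.0), dec_range=(-5.0, 5.0))
]
```

Here the first two points fall inside the wrapped RA range and the declination band; the third fails the RA test and the fourth fails the declination test.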

class ConeSearch(ra: float, dec: float, radius_arcsec: float, fine: bool = True)[source]#

Bases: lsdb.core.search.abstract_search.AbstractSearch

Perform a cone search to filter the catalog.

Filters to points within a great-circle distance of radius_arcsec (in arcseconds) from the point specified by ra and dec (in degrees). Filters partitions in the catalog to those that have some overlap with the cone.

ra#
dec#
radius_arcsec#
filter_hc_catalog(hc_structure: lsdb.types.HCCatalogTypeVar) mocpy.MOC[source]#

Filters catalog pixels according to the cone

search_points(frame: nested_pandas.NestedFrame, metadata: hats.catalog.TableProperties) nested_pandas.NestedFrame[source]#

Determine the search results within a data frame

_perform_plot(ax: astropy.visualization.wcsaxes.WCSAxes, **kwargs)[source]#

Perform the plot of the search region on an initialized WCSAxes
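The great-circle test that ConeSearch describes can be illustrated with the haversine formula. This is a sketch under stated assumptions rather than the library's code: angular_separation_deg and in_cone are hypothetical helper names, and the comparison treats the cone boundary as inclusive.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two points given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (
        math.sin((dec2 - dec1) / 2) ** 2
        + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2
    )
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def in_cone(ra, dec, center_ra, center_dec, radius_arcsec):
    """Return True if the point lies within radius_arcsec of the cone center."""
    separation_arcsec = angular_separation_deg(ra, dec, center_ra, center_dec) * 3600.0
    return separation_arcsec <= radius_arcsec


points = [(10.0, 20.0), (10.0, 20.0002), (10.0, 20.01)]
inside = [
    p for p in points
    if in_cone(*p, center_ra=10.0, center_dec=20.0, radius_arcsec=1.0)
]
```

The second point is offset by about 0.72 arcseconds and is kept; the third is about 36 arcseconds away and is dropped.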

class PixelSearch(pixels: tuple[int, int] | hats.pixel_math.HealpixPixel | list[tuple[int, int] | hats.pixel_math.HealpixPixel])[source]#

Bases: lsdb.core.search.abstract_search.AbstractSearch

Filter the catalog by HEALPix pixels.

Filters partitions in the catalog to those that are in a specified pixel set. Does not filter points inside those partitions.

classmethod from_radec(ra: float | list[float], dec: float | list[float]) PixelSearch[source]#

Create a pixel search region, based on radec points.

Parameters:
  • ra (float|list[float]) – celestial coordinates, right ascension in degrees

  • dec (float|list[float]) – celestial coordinates, declination in degrees

filter_hc_catalog(hc_structure: lsdb.types.HCCatalogTypeVar) lsdb.types.HCCatalogTypeVar[source]#

Determine the target partitions for further filtering.

search_points(frame: nested_pandas.NestedFrame, _) nested_pandas.NestedFrame[source]#

Determine the search results within a data frame
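Partition-level filtering by pixel, as PixelSearch describes, can be sketched with (order, pixel) tuples. The sketch assumes HEALPix nested numbering, where each increase in order splits a pixel into four children, and it assumes a partition is kept when it overlaps any requested pixel; both the helper names and the overlap rule are illustrative assumptions, not lsdb's implementation.

```python
def contains(outer, inner):
    """True if pixel `outer` covers pixel `inner`; both are (order, pixel) tuples."""
    outer_order, outer_pixel = outer
    inner_order, inner_pixel = inner
    if inner_order < outer_order:
        return False
    # Each order step adds two bits to the nested pixel index.
    return inner_pixel >> (2 * (inner_order - outer_order)) == outer_pixel


def select_partitions(partitions, search_pixels):
    """Keep partitions that overlap any search pixel; points are not filtered."""
    return [
        part for part in partitions
        if any(contains(s, part) or contains(part, s) for s in search_pixels)
    ]


partitions = [(1, 5), (2, 20), (2, 24), (0, 3)]
kept = select_partitions(partitions, [(1, 5)])
```

Pixel (2, 20) is a child of (1, 5) because 20 >> 2 equals 5, so it is kept along with (1, 5) itself.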

class NestedFrame(expr)[source]#

Bases: _Frame, dask.dataframe.DataFrame

An extension for a Dask Dataframe that has Nested-Pandas functionality.

Examples

>>> import lsdb.nested as nd 
>>> base = nd.NestedFrame(base_data) 
>>> layer = nd.NestedFrame(layer_data) 
>>> base.add_nested(layer, "layer") 
_partition_type#
__getitem__(item)[source]#

Adds custom __getitem__ functionality for nested columns

__setitem__(key, value)[source]#

Adds custom __setitem__ behavior for nested columns

_repr_html_()[source]#
classmethod from_pandas(data, npartitions=None, chunksize=None, sort=True) NestedFrame[source]#

Returns an LSDB.nested NestedFrame constructed from a Nested-Pandas NestedFrame or Pandas DataFrame.

Parameters:
  • data (NestedFrame or DataFrame) – Nested-Pandas NestedFrame or Pandas DataFrame containing the underlying data

  • npartitions (int, optional) – The number of partitions of the index to create. Note that depending on the size and index of the dataframe, the output may have fewer partitions than requested.

  • chunksize (int, optional) – The desired number of rows per index partition to use. Note that depending on the size and index of the dataframe, actual partition sizes may vary.

  • sort (bool, optional) – Whether to sort the frame by a default index.

Returns:

result – The constructed Dask-Nested NestedFrame object.

Return type:

NestedFrame

classmethod from_dask_dataframe(df: dask.dataframe.DataFrame) NestedFrame[source]#

Converts a Dask Dataframe to a Dask-Nested NestedFrame

Parameters:

df – A Dask Dataframe to convert

Return type:

lsdb.nested.NestedFrame

classmethod from_delayed(dfs, meta=None, divisions=None, prefix='from-delayed', verify_meta=True)[source]#

Create LSDB.nested NestedFrames from many Dask Delayed objects.

Docstring is copied from dask.dataframe.from_delayed.

Parameters:
  • dfs – A dask.delayed.Delayed, a distributed.Future, or an iterable of either of these objects, e.g. returned by client.submit. These comprise the individual partitions of the resulting dataframe. If a single object is provided (not an iterable), then the resulting dataframe will have only one partition.

  • meta – An empty NestedFrame, pd.DataFrame, or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

  • divisions – Partition boundaries along the index. For a tuple, see https://docs.dask.org/en/latest/dataframe-design.html#partitions. If the string ‘sorted’ is given, the delayed values are computed to find the index values, assuming that the indexes are mutually sorted. If None, index information is not used.

  • prefix – Prefix to prepend to the keys.

  • verify_meta – If True check that the partitions have consistent metadata, defaults to True.

classmethod from_map(func, *iterables, args=None, meta=None, divisions=None, label=None, enforce_metadata=True, **kwargs)[source]#

Create a DataFrame collection from a custom function map

WARNING: The from_map API is experimental, and stability is not yet guaranteed. Use at your own risk!

Parameters:
  • func (callable) – Function used to create each partition. If func satisfies the DataFrameIOFunction protocol, column projection will be enabled.

  • *iterables (Iterable objects) – Iterable objects to map to each output partition. All iterables must be the same length. This length determines the number of partitions in the output collection (only one element of each iterable will be passed to func for each partition).

  • args (list or tuple, optional) – Positional arguments to broadcast to each output partition. Note that these arguments will always be passed to func after the iterables positional arguments.

  • meta – An empty NestedFrame, pd.DataFrame, or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

  • divisions (tuple, str, optional) – Partition boundaries along the index. For a tuple, see https://docs.dask.org/en/latest/dataframe-design.html#partitions. If the string ‘sorted’ is given, the delayed values are computed to find the index values, assuming that the indexes are mutually sorted. If None, index information is not used.

  • label (str, optional) – String to use as the function-name label in the output collection-key names.

  • enforce_metadata (bool, default True) – Whether to enforce at runtime that the structure of the DataFrame produced by func actually matches the structure of meta. This will rename and reorder columns for each partition, and will raise an error if this doesn’t work, but it won’t raise if dtypes don’t match.

  • **kwargs – Key-word arguments to broadcast to each output partition. These same arguments will be passed to func for every output partition.

classmethod from_flat(df, base_columns, nested_columns=None, on=None, name='nested')[source]#

Creates a NestedFrame with base and nested columns from a flat dataframe.

Parameters:
  • df (dd.DataFrame or nd.NestedFrame) – A flat dataframe.

  • base_columns (list-like) – The columns that should be used as base (flat) columns in the output dataframe.

  • nested_columns (list-like, or None) – The columns that should be packed into a nested column. The method attempts to pack all columns in the list into a single nested column with the name given by name. If None, defaults to all columns not in base_columns.

  • on (str or None) – The name of a column to use as the new index. Typically, the index should have a unique value per row for base columns, and should repeat for nested columns. For example, a dataframe with two columns; a=[1,1,1,2,2,2] and b=[5,10,15,20,25,30] would want an index like [0,0,0,1,1,1] if a is chosen as a base column. If not provided the current index will be used.

  • name – The name of the output column the nested_columns are packed into.

Returns:

A NestedFrame with the specified nesting structure.

Return type:

NestedFrame
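The packing that from_flat performs can be pictured with plain dictionaries, using the a/b example from the on parameter above. This is only a conceptual sketch: pack_flat is a hypothetical helper, and the real method operates on Dask partitions and produces a nested column dtype rather than Python dicts.

```python
def pack_flat(rows, index, base_columns, nested_columns, name="nested"):
    """Group flat rows by index: base columns stay scalar, nested columns become lists."""
    packed = {}
    for idx, row in zip(index, rows):
        if idx not in packed:
            # Base columns are assumed to repeat per index value; keep the first.
            packed[idx] = {col: row[col] for col in base_columns}
            packed[idx][name] = {col: [] for col in nested_columns}
        for col in nested_columns:
            packed[idx][name][col].append(row[col])
    return packed


a = [1, 1, 1, 2, 2, 2]
b = [5, 10, 15, 20, 25, 30]
rows = [{"a": x, "b": y} for x, y in zip(a, b)]
result = pack_flat(
    rows, index=[0, 0, 0, 1, 1, 1], base_columns=["a"], nested_columns=["b"]
)
```

With a as the base column, the six flat rows collapse into two top-level rows, each carrying three values of b inside the nested column.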

classmethod from_lists(df, base_columns=None, list_columns=None, name='nested')[source]#

Creates a NestedFrame with base and nested columns from a dataframe with list columns.

Parameters:
  • df (dd.DataFrame or nd.NestedFrame) – A dataframe with list columns.

  • base_columns (list-like, or None) – Any columns that have non-list values in the input df. These will simply be kept as identical columns in the result

  • list_columns (list-like, or None) – The list-value columns that should be packed into a nested column. The method attempts to pack all columns in the list into a single nested column with the name given by name. All columns in list_columns must have pyarrow list dtypes, otherwise the operation will fail. If None, defaults to all columns not in base_columns.

  • name – The name of the output column the list_columns are packed into.

Returns:

A NestedFrame with the specified nesting structure.

Return type:

NestedFrame

Note

As noted above, all columns in list_columns must have a pyarrow ListType dtype. This is needed for proper meta propagation. To convert a list column to this dtype, you can use this command structure: nf = nf.astype({"colname": pd.ArrowDtype(pa.list_(pa.int64()))})

Where pa.int64 above should be replaced with the correct dtype of the underlying data accordingly.

Additionally, it’s a known issue in Dask (dask/dask#10139) that columns with list values will by default be converted to the string type. This will interfere with the ability to recast these to pyarrow lists. We recommend setting the following dask config setting to prevent this: dask.config.set({"dataframe.convert-string": False})

compute(**kwargs)[source]#

Compute this Dask collection, returning the underlying dataframe or series.

property all_columns: dict#

Returns a dictionary of columns for each base/nested dataframe.

property nested_columns: list#

Retrieves the base column names for all nested dataframes.

_is_known_hierarchical_column(colname) bool[source]#

Determine whether a string is a known hierarchical column name

add_nested(nested, name, how='outer') NestedFrame[source]#

Packs a dataframe into a nested column

Parameters:
  • nested – A flat dataframe to pack into a nested column

  • name – The name given to the nested column

  • how ({‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘outer’) –

    How to handle the operation of the two objects.

    • left: use calling frame’s index (or column if on is specified)

    • right: use other’s index.

    • outer: form union of calling frame’s index (or column if on is specified) with other’s index, and sort it lexicographically.

    • inner: form intersection of calling frame’s index (or column if on is specified) with other’s index, preserving the order of the calling frame’s index.

    • cross: creates the cartesian product from both frames, preserves the order of the left keys.

Return type:

lsdb.nested.NestedFrame
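The effect of how on the resulting index can be sketched with plain lists of index keys. This is an illustration of the join semantics listed above, not the library's code: joined_index is a hypothetical helper, it assumes unique index keys on both sides, and it omits the cross option.

```python
def joined_index(calling, other, how="outer"):
    """Return the index keys that survive a join of two indexes."""
    if how == "left":
        return list(calling)
    if how == "right":
        return list(other)
    if how == "outer":
        # Union of both indexes, sorted.
        return sorted(set(calling) | set(other))
    if how == "inner":
        # Intersection, preserving the order of the calling frame.
        other_keys = set(other)
        return [key for key in calling if key in other_keys]
    raise ValueError(f"unsupported how: {how!r}")


base_index = [3, 1, 2]
layer_index = [2, 3, 4]
outer = joined_index(base_index, layer_index, "outer")
inner = joined_index(base_index, layer_index, "inner")
```

With the default how='outer', index 4 appears in the result even though the calling frame has no base row for it, while index 1 appears without nested data.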

query(expr) Self[source]#

Query the columns of a NestedFrame with a boolean expression. Queries can target nested columns in addition to the typical column set.

Docstring copied from nested-pandas query

Parameters:

expr (str) –

The query string to evaluate.

Access nested columns using nested_df.nested_col (where nested_df refers to a particular nested dataframe and nested_col is a column of that nested dataframe).

You can refer to variables in the environment by prefixing them with an ‘@’ character like @a + b.

You can refer to column names that are not valid Python variable names by surrounding them in backticks. Thus, column names containing spaces or punctuations (besides underscores) or starting with digits must be surrounded by backticks. (For example, a column named “Area (cm^2)” would be referenced as `Area (cm^2)`). Column names which are Python keywords (like “list”, “for”, “import”, etc) cannot be used.

For example, if one of your columns is called a a and you want to sum it with b, your query should be `a a` + b.

Returns:

DataFrame resulting from the provided query expression.

Return type:

DataFrame

Notes

Queries that target a particular nested structure return a dataframe with rows of that particular nested structure filtered. For example, querying the NestedFrame “df” with nested structure “mynested” as below will return all rows of df, but with mynested filtered by the condition:

>>> df.query("mynested.a > 2") 
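The behaviour described in the note, where base rows are kept and only the nested rows are filtered, can be pictured with plain Python containers. This is a conceptual sketch: query_nested is a hypothetical helper, and representing an emptied layer as an empty list is an assumption made for illustration.

```python
def query_nested(frame, layer, predicate):
    """Filter the rows of one nested layer while keeping every top-level row."""
    return {
        idx: {**row, layer: [r for r in row[layer] if predicate(r)]}
        for idx, row in frame.items()
    }


frame = {
    0: {"id": "x", "mynested": [{"a": 1}, {"a": 3}, {"a": 5}]},
    1: {"id": "y", "mynested": [{"a": 0}, {"a": 2}]},
}

# Conceptual equivalent of df.query("mynested.a > 2")
filtered = query_nested(frame, "mynested", lambda r: r["a"] > 2)
```

Both top-level rows remain in filtered; row 0 keeps two nested rows and row 1 keeps none.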
dropna(*, axis: pandas._typing.Axis = 0, how: str | pandas._libs.lib.NoDefault = no_default, thresh: int | pandas._libs.lib.NoDefault = no_default, on_nested: bool = False, subset: pandas._typing.IndexLabel | None = None, inplace: bool = False, ignore_index: bool = False) Self[source]#

Remove missing values for one layer of the NestedFrame.

Parameters:
  • axis ({0 or 'index', 1 or 'columns'}, default 0) –

    Determine if rows or columns which contain missing values are removed.

    • 0, or ‘index’ : Drop rows which contain missing values.

    • 1, or ‘columns’ : Drop columns which contain missing value.

    Only a single axis is allowed.

  • how ({'any', 'all'}, default 'any') –

    Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.

    • ’any’ : If any NA values are present, drop that row or column.

    • ’all’ : If all values are NA, drop that row or column.

  • thresh (int, optional) – Require that many non-NA values. Cannot be combined with how.

  • on_nested (str or bool, optional) – If not False, applies the call to the nested dataframe in the column with label equal to the provided string. If specified, the nested dataframe should align with any columns given in subset.

  • subset (column label or sequence of labels, optional) –

    Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include.

    Access nested columns using nested_df.nested_col (where nested_df refers to a particular nested dataframe and nested_col is a column of that nested dataframe).

  • inplace (bool, default False) – Whether to modify the DataFrame rather than creating a new one.

  • ignore_index (bool, default False) –

    If True, the resulting axis will be labeled 0, 1, …, n - 1.

    Added in version 2.0.0.

Returns:

DataFrame with NA entries dropped from it or None if inplace=True.

Return type:

DataFrame or None

Notes

Operations that target a particular nested structure return a dataframe with rows of that particular nested structure affected.

Values for on_nested and subset should consistently point to a single layer; multi-layer operations are not supported at this time.

sort_values(by: str | list[str], npartitions: int | None = None, ascending: bool | list[bool] = True, na_position: Literal['first'] | Literal['last'] = 'last', partition_size: float = 128000000.0, sort_function: collections.abc.Callable[[pandas.DataFrame], pandas.DataFrame] | None = None, sort_function_kwargs: collections.abc.Mapping[str, Any] | None = None, upsample: float = 1.0, ignore_index: bool | None = False, shuffle_method: str | None = None, **options) Self[source]#

Sort the dataset by one or more columns.

Sorting a parallel dataset requires expensive shuffles and is generally not recommended. See set_index for implementation details.

Parameters:

by: str or list[str]

Column(s) to sort by.

npartitions: int, None, or ‘auto’

The ideal number of output partitions. If None, use the same as the input. If ‘auto’ then decide by memory use. Not used when sorting nested layers.

ascending: bool or list[bool], optional

Sort ascending vs. descending. Defaults to True. Specify list for multiple sort orders. If this is a list of bools, must match the length of the by.

na_position: {‘last’, ‘first’}, optional

Puts NaNs at the beginning if ‘first’, puts NaN at the end if ‘last’. Defaults to ‘last’.

partition_size: float, optional

The desired size of each partition in bytes. Defaults to 128e6 (128 MB). Not used in nested sorting.

sort_function: function, optional

Sorting function to use when sorting underlying partitions. If None, defaults to M.sort_values (the partition library’s implementation of sort_values). Not used when sorting nested layers.

sort_function_kwargs: dict, optional

Additional keyword arguments to pass to the partition sorting function. By default, by, ascending, and na_position are provided.

upsample: float, optional

Used to increase the number of samples for quantiles. Not used in nested sorting.

ignore_index: bool, optional

If True, the resulting axis will be labeled 0, 1, …, n - 1. Defaults to False.

shuffle_method: str, optional

The method to use for shuffling data. Defaults to None. Not used in nested sorting.

**options: keyword arguments, optional

Additional options to pass to the sorting function.

Returns:

DataFrame

DataFrame with sorted values.

reduce(func, *args, meta=dsk_no_default, infer_nesting=True, **kwargs) NestedFrame[source]#

Takes a function and applies it to each top-level row of the NestedFrame.

Docstring copied from nested-pandas.

The user may specify which columns the function is applied to, with columns from the ‘base’ layer being passed to the function as scalars and columns from the nested layers being passed as numpy arrays.

Parameters:
  • func (callable) – Function to apply to each nested dataframe. The first arguments of func should correspond to the columns the function is applied to. See the Notes for recommendations on writing func outputs.

  • args (positional arguments) – A list of string column names to pull from the NestedFrame to pass along to the function. If the function has additional arguments, pass them as keyword arguments (e.g. arg_name=value)

  • meta (dataframe or series-like, optional) – The dask meta of the output. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended.

  • infer_nesting (bool, default True) – If True, the function will pack output columns into nested structures based on column names adhering to a nested naming scheme. E.g. “nested.b” and “nested.c” will be packed into a column called “nested” with columns “b” and “c”. If False, all outputs will be returned as base columns.

  • kwargs (keyword arguments, optional) – Keyword arguments to pass to the function.

Returns:

NestedFrame with the results of the function applied to the columns of the frame.

Return type:

NestedFrame

Notes

By default, reduce will produce a NestedFrame with enumerated column names for each returned value of the function. For more useful naming, it’s recommended to have func return a dictionary where each key is an output column of the dataframe returned by reduce.

Example User Function:

>>> def my_sum(col1, col2):
...     '''reduce will return a NestedFrame with two columns'''
...     return {"sum_col1": sum(col1), "sum_col2": sum(col2)}

When using nesting inference (infer_nesting=True), the output may contain nested columns. In such cases, the meta should be provided with the appropriate dtype for these columns. For example, the following function, which produces a nested column “lc”:

>>> import numpy as np
>>> def complex_output(flux):
...     return {"max_flux": np.max(flux),
...             "lc.flux_quantiles": np.quantile(flux, [0.1, 0.2, 0.3, 0.4, 0.5]),
...             "lc.labels": [0.1, 0.2, 0.3, 0.4, 0.5]}

Would require the following meta:

>>> # create a NestedDtype for the nested column "lc"
>>> import nested_pandas as npd
>>> import pandas as pd
>>> import pyarrow as pa
>>> from nested_pandas.series.dtype import NestedDtype
>>> lc_dtype = NestedDtype(pa.struct([
...     pa.field("flux_quantiles", pa.list_(pa.float64())),
...     pa.field("labels", pa.list_(pa.float64())),
... ]))
>>> # use the lc_dtype in meta creation
>>> result_meta = npd.NestedFrame({
...     "max_flux": pd.Series([], dtype="float"),
...     "lc": pd.Series([], dtype=lc_dtype),
... })
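The calling convention of reduce can be sketched in plain Python: the function runs once per top-level row, base columns arrive as scalars, nested columns arrive as sequences, and dictionary keys of the form "layer.field" are packed into a nested structure when nesting inference is on. This is an illustrative sketch, not the nested-pandas implementation: reduce_rows and summarize are hypothetical names, and nested columns are passed as Python lists here rather than numpy arrays.

```python
def reduce_rows(rows, func, *columns, infer_nesting=True):
    """Apply func to each top-level row, passing the requested columns in order."""
    results = []
    for row in rows:
        args = []
        for col in columns:
            if "." in col:
                # "layer.field" selects a nested column, passed as a sequence.
                layer, field = col.split(".", 1)
                args.append(row[layer][field])
            else:
                # Base columns are passed as scalars.
                args.append(row[col])
        out = func(*args)
        if infer_nesting:
            packed = {}
            for key, value in out.items():
                if "." in key:
                    layer, field = key.split(".", 1)
                    packed.setdefault(layer, {})[field] = value
                else:
                    packed[key] = value
            out = packed
        results.append(out)
    return results


def summarize(obj_id, flux):
    """Dictionary keys become the output column names."""
    return {"id": obj_id, "max_flux": max(flux), "lc.sorted_flux": sorted(flux)}


rows = [
    {"id": 1, "lc": {"flux": [1.0, 4.0, 2.0]}},
    {"id": 2, "lc": {"flux": [3.0, 0.5]}},
]
nested_result = reduce_rows(rows, summarize, "id", "lc.flux")
flat_result = reduce_rows(rows, summarize, "id", "lc.flux", infer_nesting=False)
```

With infer_nesting=False the key "lc.sorted_flux" stays a flat output column instead of being packed under "lc".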
to_parquet(path, by_layer=False, **kwargs) None[source]#

Creates parquet file(s) with the data of a NestedFrame, either as a single parquet file directory where each nested dataset is packed into its own column or as an individual parquet file directory for each layer.

Docstring copied from nested-pandas.

Note that here we always opt to use the pyarrow engine for writing parquet files.

Parameters:
  • path (str) – The path to the parquet directory to be written.

  • by_layer (bool, default False) –

    NOTE: by_layer=False does not currently preserve divisions reliably. Loading from such a dataset will likely require you to reset and then set the index to regenerate divisions information.

    If False, writes the entire NestedFrame to a single parquet directory.

    If True, writes each layer to a separate parquet sub-directory within the directory specified by path. Each output is named after its layer; for the base layer this is always “base”.

  • kwargs (keyword arguments, optional) – Keyword arguments to pass to the function.

Return type:

None
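The two output layouts selected by by_layer can be sketched as a small path computation. This only illustrates the directory naming described above and writes nothing to disk; output_directories is a hypothetical helper, and the exact on-disk layout produced by the library may differ in detail.

```python
from pathlib import PurePosixPath


def output_directories(path, nested_layers, by_layer=False):
    """Return the parquet directories a write would target under each layout."""
    root = PurePosixPath(path)
    if not by_layer:
        # Single directory; each nested layer is packed into its own column.
        return [str(root)]
    # One sub-directory per layer; the base layer is always named "base".
    return [str(root / "base")] + [str(root / layer) for layer in nested_layers]


single = output_directories("out", ["lc"])
per_layer = output_directories("out", ["lc"], by_layer=True)
```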

generate_data(n_base, n_layer, npartitions=1, seed=None, ra_range=(0.0, 360.0), dec_range=(-90, 90), search_region=None)[source]#

Generates a toy dataset.

Docstring copied from nested-pandas.

Parameters:
  • n_base (int) – The number of rows to generate for the base layer

  • n_layer (int, or dict) – The number of rows per n_base row to generate for a nested layer. Alternatively, a dictionary of layer label, layer_size pairs may be specified to create multiple nested columns with custom sizing.

  • npartitions (int) – The number of partitions to split the data into.

  • seed (int) – A seed to use for random generation of data

  • ra_range (tuple) – A tuple of the min and max values for the ra column in degrees

  • dec_range (tuple) – A tuple of the min and max values for the dec column in degrees

  • search_region (AbstractSearch) – A search region to apply to the generated data. Currently supports the ConeSearch, BoxSearch, and PixelSearch regions. Note that if provided, this will override the ra_range and dec_range parameters.

Returns:

The constructed Dask NestedFrame.

Return type:

NestedFrame

Examples

>>> from lsdb.nested.datasets import generate_data
>>> nf = generate_data(10,100)
>>> nf = generate_data(10, {"nested_a": 100, "nested_b": 200})

Constraining spatial ranges:

>>> nf = generate_data(10, 100, ra_range=(0., 10.), dec_range=(-5., 0.))

Using a search region:

>>> from lsdb.core.search import ConeSearch
>>> nf = generate_data(10, 100, search_region=ConeSearch(5, 5, 1))

_generate_box_radec(ra_range, dec_range, n_base, seed=None)[source]#

Generates a random set of RA and Dec values within a given range.

Parameters:
  • ra_range (tuple) – A tuple of the min and max values for the ra column in degrees

  • dec_range (tuple) – A tuple of the min and max values for the dec column in degrees

  • n_base (int) – The number of rows to generate for the base layer

  • seed (int) – A seed to use for random generation of data

Returns:

An array of shape (n_base, 2) containing the generated RA and Dec values.

Return type:

np.ndarray
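A seeded generator with the same inputs and output shape can be sketched with the standard library. This is an assumption-laden illustration: box_radec is a hypothetical stand-in that returns a list of [ra, dec] pairs rather than a NumPy array, and it draws each coordinate uniformly within its range, which may not match the distribution the actual helper uses (uniform declination is not uniform on the sphere).

```python
import random


def box_radec(ra_range, dec_range, n_base, seed=None):
    """Return n_base [ra, dec] pairs drawn uniformly within the given ranges."""
    rng = random.Random(seed)
    return [
        [rng.uniform(*ra_range), rng.uniform(*dec_range)]
        for _ in range(n_base)
    ]


coords = box_radec((0.0, 10.0), (-5.0, 0.0), n_base=4, seed=42)
repeat = box_radec((0.0, 10.0), (-5.0, 0.0), n_base=4, seed=42)
```

Passing the same seed reproduces the same coordinates, which is the property the seed parameter is meant to provide.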

generate_catalog(n_base, n_layer, seed=None, ra_range=(0.0, 360.0), dec_range=(-90, 90), search_region=None, **kwargs)[source]#

Generates a toy catalog.

Parameters:
  • n_base (int) – The number of rows to generate for the base layer

  • n_layer (int, or dict) – The number of rows per n_base row to generate for a nested layer. Alternatively, a dictionary of layer label, layer_size pairs may be specified to create multiple nested columns with custom sizing.

  • seed (int) – A seed to use for random generation of data

  • ra_range (tuple) – A tuple of the min and max values for the ra column in degrees

  • dec_range (tuple) – A tuple of the min and max values for the dec column in degrees

  • search_region (AbstractSearch) – A search region to apply to the generated data. Currently supports the ConeSearch and BoxSearch regions. Note that if provided, this will override the ra_range and dec_range parameters.

  • **kwargs – Additional keyword arguments to pass to lsdb.from_dataframe.

Returns:

The constructed LSDB Catalog.

Return type:

Catalog

Examples

>>> from lsdb.nested.datasets import generate_catalog
>>> gen_cat = generate_catalog(10,100)
>>> gen_cat = generate_catalog(1000, 10, ra_range=(0.,10.), dec_range=(-5.,0.))

Constraining spatial ranges:

>>> gen_cat = generate_catalog(10, 100, ra_range=(0., 10.), dec_range=(-5., 0.))

Using a search region:

>>> from lsdb.core.search import ConeSearch
>>> gen_cat = generate_catalog(10, 100, search_region=ConeSearch(5, 5, 1))