lsdb.catalog.dataset.healpix_dataset#

Classes#

HealpixDataset

LSDB Catalog DataFrame to perform analysis of sky catalogs and efficient spatial operations.

Module Contents#

class HealpixDataset(ddf: lsdb.nested.NestedFrame, ddf_pixel_map: lsdb.types.DaskDFPixelMap, hc_structure: hats.catalog.healpix_dataset.healpix_dataset.HealpixDataset)[source]#

Bases: lsdb.catalog.dataset.dataset.Dataset

LSDB Catalog DataFrame to perform analysis of sky catalogs and efficient spatial operations.

hc_structure[source]#

hats.Dataset object representing the structure and metadata of the HATS catalog

hc_structure: hats.catalog.healpix_dataset.healpix_dataset.HealpixDataset[source]#
_ddf_pixel_map[source]#
__getitem__(item)[source]#

Select a column or columns from the catalog.

__len__()[source]#

The number of rows in the catalog.

Returns:

The number of rows in the catalog, as specified in its metadata. If the catalog has been modified, this count is undetermined and an error is raised instead.

property nested_columns: list[str][source]#

The columns of the catalog that are nested.

Returns:

The list of nested columns in the catalog.

_repr_data()[source]#
property _repr_divisions[source]#
_create_modified_hc_structure(hc_structure=None, updated_schema=None, **kwargs) hats.catalog.healpix_dataset.healpix_dataset.HealpixDataset[source]#

Copy the catalog structure and override the specified catalog info parameters.

Returns:

A copy of the catalog’s structure with updated info parameters.

_create_updated_dataset(ddf: lsdb.nested.NestedFrame | None = None, ddf_pixel_map: lsdb.types.DaskDFPixelMap | None = None, hc_structure: hats.catalog.healpix_dataset.healpix_dataset.HealpixDataset | None = None, updated_catalog_info_params: dict | None = None) typing_extensions.Self[source]#

Creates a new copy of the catalog, updating any provided arguments

Shallow copies the ddf and ddf_pixel_map if not provided. Creates a new hc_structure if not provided. Updates the hc_structure with any provided catalog info parameters, resets the total rows, removes any default columns that don’t exist, and updates the pyarrow schema to reflect the new ddf.

Parameters:
  • ddf (nd.NestedFrame) – The catalog ddf to update in the new catalog

  • ddf_pixel_map (DaskDFPixelMap) – The partition to healpix pixel map to update in the new catalog

  • hc_structure (hats.HealpixDataset) – The hats HealpixDataset object to update in the new catalog

  • updated_catalog_info_params (dict) – The dictionary of updates to the parameters of the hats dataset object’s catalog_info

Returns:

A new dataset object with the arguments updated to those provided to the function, and the hc_structure metadata updated to match the new ddf

get_healpix_pixels() list[hats.pixel_math.HealpixPixel][source]#

Get all HEALPix pixels that are contained in the catalog

Returns:

List of all Healpix pixels in the catalog

get_ordered_healpix_pixels() list[hats.pixel_math.HealpixPixel][source]#

Get all HEALPix pixels that are contained in the catalog, ordered by breadth-first nested ordering.

Returns:

List of all Healpix pixels in the catalog
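The two accessors above differ only in ordering. One plausible sketch of a nested ordering key, assuming (this is an assumption, not hats' exact code) that pixels of mixed orders are compared by their position at a common reference order:

```python
# Hypothetical (order, pixel) pairs from a catalog with non-overlapping coverage.
pixels = [(0, 2), (1, 1), (2, 0), (1, 3)]

# Map each pixel to its first descendant at the deepest order present; sorting
# by that key interleaves pixels of different orders along the nested index.
max_order = max(order for order, _ in pixels)
ordered = sorted(pixels, key=lambda p: p[1] * 4 ** (max_order - p[0]))
print(ordered)  # [(2, 0), (1, 1), (1, 3), (0, 2)]
```

The actual tie-breaking and ordering convention is defined by hats; this sketch only illustrates why a lower-order pixel can sort between higher-order ones.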

aggregate_column_statistics(use_default_columns: bool = True, exclude_hats_columns: bool = True, exclude_columns: list[str] | None = None, include_columns: list[str] | None = None, include_pixels: list[hats.pixel_math.HealpixPixel] | None = None) list[hats.pixel_math.HealpixPixel][source]#

Read footer statistics in parquet metadata, and report on global min/max values.

per_pixel_statistics(use_default_columns: bool = True, exclude_hats_columns: bool = True, exclude_columns: list[str] | None = None, include_columns: list[str] | None = None, include_stats: list[str] | None = None, multi_index=False, include_pixels: list[hats.pixel_math.HealpixPixel] | None = None) list[hats.pixel_math.HealpixPixel][source]#

Read footer statistics in parquet metadata, and report on statistics for each pixel.

get_partition(order: int, pixel: int) lsdb.nested.NestedFrame[source]#

Get the dask partition for a given HEALPix pixel

Parameters:
  • order – Order of HEALPix pixel

  • pixel – HEALPix pixel number in NESTED ordering scheme

Returns:

Dask Dataframe with a single partition with data at that pixel

Raises:

ValueError – if no data exists for the specified pixel

get_partition_index(order: int, pixel: int) int[source]#

Get the index of the dask partition for a given HEALPix pixel

Parameters:
  • order – Order of HEALPix pixel

  • pixel – HEALPix pixel number in NESTED ordering scheme

Returns:

The index of the partition with data at that pixel

Raises:

ValueError – if no data exists for the specified pixel
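Both lookups go through the catalog's pixel-to-partition map. A minimal sketch of that lookup, using an illustrative plain dict rather than lsdb's actual DaskDFPixelMap type:

```python
# Illustrative map from (order, pixel) to dask partition index.
pixel_map = {(0, 4): 0, (1, 21): 1, (1, 22): 2}

def get_partition_index(order: int, pixel: int) -> int:
    # Raise, as the real method does, when no data exists for the pixel.
    if (order, pixel) not in pixel_map:
        raise ValueError(f"No data exists at order {order}, pixel {pixel}")
    return pixel_map[(order, pixel)]

print(get_partition_index(1, 22))  # 2
```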

property partitions[source]#

Returns the partitions of the catalog

property npartitions[source]#

Returns the number of partitions of the catalog

head(n: int = 5) nested_pandas.NestedFrame[source]#

Returns a few rows of initial data for previewing purposes.

Parameters:

n (int) – The number of desired rows.

Returns:

A NestedFrame with up to n rows of data.

tail(n: int = 5) nested_pandas.NestedFrame[source]#

Returns a few rows of data from the end of the catalog for previewing purposes.

Parameters:

n (int) – The number of desired rows.

Returns:

A NestedFrame with up to n rows of data.

sample(partition_id: int, n: int = 5, seed: int | None = None) nested_pandas.NestedFrame[source]#

Returns a few randomly sampled rows from a given partition.

Parameters:
  • partition_id (int) – the partition to sample.

  • n (int) – the number of desired rows.

  • seed (int) – random seed

As with NestedFrame.sample, n is an approximate number of items to return. The exact number of elements selected will depend on how your data is partitioned. (In practice, it should be pretty close.)

The seed argument is passed directly to random.seed in order to assist with creating predictable outputs when wanted, such as in unit tests.

Returns:

A NestedFrame with up to n rows of data.

random_sample(n: int = 5, seed: int | None = None) nested_pandas.NestedFrame[source]#

Returns a few randomly sampled rows, like self.sample(), except that it randomly samples all partitions in order to fulfill the rows.

Parameters:
  • n (int) – the number of desired rows.

  • seed (int) – random seed

As with .sample, n is an approximate number of items to return. The exact number of elements selected will depend on how your data is partitioned. (In practice, it should be pretty close.)

The seed argument is passed directly to random.seed in order to assist with creating predictable outputs when wanted, such as in unit tests.

Returns:

A NestedFrame with up to n rows of data.
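Conceptually, random_sample spreads the requested n across all partitions, which is why the returned count is approximate. A sketch of one proportional allocation scheme (an assumption for illustration; lsdb's actual allocation may differ):

```python
import random

# Illustrative partition row counts.
partition_sizes = [100, 300, 600]
n, seed = 6, 42

# Allocate the target n to partitions in proportion to their size, then draw
# row indices within each partition reproducibly from the seed.
rng = random.Random(seed)
total = sum(partition_sizes)
quotas = [round(n * size / total) for size in partition_sizes]
draws = [sorted(rng.sample(range(size), quota))
         for size, quota in zip(partition_sizes, quotas)]
print(quotas)  # [1, 2, 4] -- sums to roughly n due to rounding
```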

query(expr: str) typing_extensions.Self[source]#

Filters catalog using a complex query expression

Parameters:

expr (str) – Query expression to evaluate. Column names that are not valid Python variable names should be wrapped in backticks, and any variable values can be injected using f-strings. The use of '@' to reference variables is not supported. More information about pandas query strings is available in the pandas documentation.

Returns:

A catalog that contains the data from the original catalog that complies with the query expression
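Because the expression follows pandas query-string semantics, the syntax can be previewed with plain pandas (column names below are illustrative, not from any real catalog):

```python
import pandas as pd

df = pd.DataFrame({"mag_r": [20.1, 22.5, 18.9], "g-r": [0.3, 1.1, 0.7]})

limit = 21.0
# Backticks wrap the column name that is not a valid Python identifier;
# the variable value is injected with an f-string rather than '@'.
expr = f"mag_r < {limit} and `g-r` > 0.5"
print(df.query(expr))  # keeps only the row with mag_r=18.9, g-r=0.7
```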

Performs a search on the catalog over a given list of pixels.

Parameters:
  • metadata (hc.catalog.Catalog | hc.catalog.MarginCatalog) – The metadata of the hats catalog after the coarse filtering is applied. The partitions it contains are only those that overlap with the spatial region.

  • search (AbstractSearch) – Instance of AbstractSearch.

Returns:

A tuple containing a dictionary mapping pixel to partition index and a dask dataframe containing the search results

search(search: lsdb.core.search.abstract_search.AbstractSearch)[source]#

Find rows by reusable search algorithm.

Filters partitions in the catalog to those that match some rough criteria. Filters to points that match some finer criteria.

Parameters:

search (AbstractSearch) – Instance of AbstractSearch.

Returns:

A new Catalog containing the points filtered to those matching the search parameters.

map_partitions(func: Callable[Ellipsis, nested_pandas.NestedFrame], *args, meta: pandas.DataFrame | pandas.Series | dict | Iterable | tuple | None = None, include_pixel: bool = False, **kwargs) typing_extensions.Self | dask.dataframe.Series[source]#

Applies a function to each partition in the catalog.

The ra and dec of each row are assumed to remain unchanged.

Parameters:
  • func (Callable) – The function applied to each partition, which will be called with: func(partition: npd.NestedFrame, *args, **kwargs) with the additional args and kwargs passed to the map_partitions function. If the include_pixel parameter is set, the function will be called with the healpix_pixel as the second positional argument set to the healpix pixel of the partition as func(partition: npd.NestedFrame, healpix_pixel: HealpixPixel, *args, **kwargs)

  • *args – Additional positional arguments to call func with.

  • meta (pd.DataFrame | pd.Series | Dict | Iterable | Tuple | None) – An empty pandas DataFrame that has columns matching the output of the function applied to a partition. Other types are accepted to describe the output dataframe format; for full details see the dask documentation at https://blog.dask.org/2022/08/09/understanding-meta-keyword-argument. If meta is None (default), LSDB will try to work out the output schema of the function by calling the function with an empty DataFrame. If the function does not work with an empty DataFrame, this will raise an error and meta must be set. Note that some operations in LSDB will generate empty partitions, though these can be removed by calling the Catalog.prune_empty_partitions method.

  • include_pixel (bool) – Whether to pass the Healpix Pixel of the partition as a HealpixPixel object to the second positional argument of the function

  • **kwargs – Additional keyword args to pass to the function. These are passed to the Dask DataFrame dask.dataframe.map_partitions function, so any of the dask function’s keyword args such as transform_divisions will be passed through and work as described in the dask documentation https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.map_partitions.html

Returns:

A new catalog with each partition replaced with the output of the function applied to the original partition. If the function returns a non dataframe output, a dask Series will be returned.
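The meta contract can be checked without dask: meta is simply an empty frame whose columns and dtypes match the function's per-partition output. A plain-pandas sketch (function and column names are illustrative):

```python
import pandas as pd

def add_flux(partition: pd.DataFrame) -> pd.DataFrame:
    # Per-partition transform: adds a flux column derived from magnitude.
    out = partition.copy()
    out["flux"] = 10 ** (-0.4 * out["mag"])
    return out

# meta: an empty DataFrame matching the output schema of add_flux.
meta = pd.DataFrame({"mag": pd.Series(dtype="float64"),
                     "flux": pd.Series(dtype="float64")})

# Applying the function to one "partition" yields the schema meta declares.
part = pd.DataFrame({"mag": [20.0, 22.5]})
result = add_flux(part)
print(list(result.columns) == list(meta.columns))  # True
```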

prune_empty_partitions(persist: bool = False) typing_extensions.Self[source]#

Prunes the catalog of its empty partitions

Parameters:

persist (bool) – If True previous computations are saved. Defaults to False.

Returns:

A new catalog containing only its non-empty partitions

_get_non_empty_partitions() tuple[list[hats.pixel_math.HealpixPixel], numpy.ndarray][source]#

Determines which pixels and partitions of a catalog are not empty

Returns:

A tuple with the non-empty pixels and respective partitions

skymap_data(func: Callable[[nested_pandas.NestedFrame, hats.pixel_math.HealpixPixel], Any], order: int | None = None, default_value: Any = 0.0, **kwargs) dict[hats.pixel_math.HealpixPixel, dask.delayed.Delayed][source]#

Perform a function on each partition of the catalog, returning a dict of values for each pixel.

Parameters:
  • func (Callable[[npd.NestedFrame, HealpixPixel], Any]) – A function that takes a pandas DataFrame with the data in a partition, the HealpixPixel of the partition, and any other keyword arguments and returns an aggregated value

  • order (int | None) – The HEALPix order to compute the skymap at. If None (default), will compute for each partition in the catalog at their own orders. If a value other than None, each partition will be grouped by pixel number at the order specified and the function will be applied to each group.

  • default_value (Any) – The value to use at pixels that aren’t covered by the catalog (default 0)

  • **kwargs – Arguments to pass to the function

Returns:

A dict of Delayed values, one for the function applied to each partition of the catalog. If order is not None, the Delayed objects will be numpy arrays with all pixels within the partition at the specified order. Any pixels within a partition that have no coverage will have the default_value as their result, as will any pixels for which the aggregate function returns None.

skymap_histogram(func: Callable[[nested_pandas.NestedFrame, hats.pixel_math.HealpixPixel], Any], order: int | None = None, default_value: Any = 0.0, plot=False, plotting_args: dict | None = None, **kwargs) numpy.ndarray[source]#

Get a histogram with the result of a given function applied to the points in each HEALPix pixel of a given order

Parameters:
  • func (Callable[[npd.NestedFrame, HealpixPixel], Any]) – A function that takes a pandas DataFrame and the HealpixPixel the partition is from and returns a value

  • order (int | None) – The HEALPix order to compute the skymap at. If None (default), will compute for each partition in the catalog at their own orders. If a value other than None, each partition will be grouped by pixel number at the order specified and the function will be applied to each group.

  • default_value (Any) – The value to use at pixels that aren’t covered by the catalog (default 0)

  • **kwargs – Arguments to pass to the given function

Returns:

A 1-dimensional numpy array where each index i is equal to the value of the function applied to the points within the HEALPix pixel with pixel number i in NESTED ordering at a specified order. If no order is supplied, the order of the resulting histogram will be the highest order partition in the catalog, and the function will be applied to the partitions of the catalog with the result copied to all pixels if the catalog partition is at a lower order than the histogram order.

If order is specified, any pixels at the specified order not covered by the catalog or any pixels that the function returns None will use the default_value.
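The shape contract above can be sketched with numpy alone: at order k a full-sky NESTED histogram has 12 * 4**k entries, with uncovered pixels holding default_value (the covered pixels and their values below are made up for illustration):

```python
import numpy as np

order, default_value = 1, 0.0
npix = 12 * 4 ** order  # 48 pixels at order 1
hist = np.full(npix, default_value)

# Hypothetical results of `func` on three covered pixels; all other entries
# keep default_value.
hist[[3, 17, 40]] = [12.0, 7.0, 3.0]
print(hist.shape)  # (48,)
```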

plot_pixels(projection: str = 'MOL', **kwargs) tuple[matplotlib.figure.Figure, astropy.visualization.wcsaxes.WCSAxes][source]#

Create a visual map of the pixel density of the catalog.

Parameters:

  • projection (str) – The projection to use in the WCS. Available projections listed at https://docs.astropy.org/en/stable/wcs/supported_projections.html

  • kwargs (dict) – Additional keyword arguments to pass to the plotting call.

plot_coverage(**kwargs) tuple[matplotlib.figure.Figure, astropy.visualization.wcsaxes.WCSAxes][source]#

Create a visual map of the coverage of the catalog.

Parameters:

kwargs – additional keyword arguments to pass to hats.Catalog.plot_moc

to_hats(base_catalog_path: str | pathlib.Path | upath.UPath, *, catalog_name: str | None = None, default_columns: list[str] | None = None, overwrite: bool = False, **kwargs)[source]#

Saves the catalog to disk in HATS format

Parameters:
  • base_catalog_path (str) – Location where catalog is saved to

  • catalog_name (str) – The name of the catalog to be saved

  • default_columns (list[str]) – A metadata property with the list of the columns in the catalog to be loaded by default. By default, uses the default columns from the original hats catalogs if they exist.

  • overwrite (bool) – If True existing catalog is overwritten

  • **kwargs – Arguments to pass to the parquet write operations

dropna(*, axis: pandas._typing.Axis = 0, how: pandas._typing.AnyAll | pandas._libs.lib.NoDefault = no_default, thresh: int | pandas._libs.lib.NoDefault = no_default, on_nested: bool = False, subset: pandas._typing.IndexLabel | None = None, ignore_index: bool = False) typing_extensions.Self[source]#

Remove missing values for one layer of nested columns in the catalog.

Parameters:
  • axis ({0 or 'index', 1 or 'columns'}, default 0) –

    Determine if rows or columns which contain missing values are removed.

    • 0, or ‘index’ : Drop rows which contain missing values.

    • 1, or ‘columns’ : Drop columns which contain missing value.

    Only a single axis is allowed.

  • how ({'any', 'all'}, default 'any') –

    Determine if row or column is removed from catalog, when we have at least one NA or all NA.

    • ’any’ : If any NA values are present, drop that row or column.

    • ’all’ : If all values are NA, drop that row or column.

  • thresh (int, optional) – Require that many non-NA values. Cannot be combined with how.

  • on_nested (str or bool, optional) – If not False, applies the call to the nested dataframe in the column with label equal to the provided string. If specified, the nested dataframe should align with any columns given in subset.

  • subset (column label or sequence of labels, optional) –

    Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include.

    Access nested columns using nested_df.nested_col (where nested_df refers to a particular nested dataframe and nested_col is a column of that nested dataframe).

  • ignore_index (bool, default False) –

    If True, the resulting axis will be labeled 0, 1, …, n - 1.

    Added in version 2.0.0.

Returns:

Catalog with NA entries dropped from it.

Return type:

Catalog

Notes

Operations that target a particular nested structure return a dataframe with rows of that particular nested structure affected.

Values for on_nested and subset should be consistent in pointing to a single layer, multi-layer operations are not supported at this time.
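The base-layer keywords behave exactly as in pandas, so they can be previewed on a flat frame (the nested on_nested/subset handling is not shown; column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ra": [10.0, 11.0, 12.0],
                   "dec": [-1.0, np.nan, 1.0],
                   "mag": [21.0, 22.0, np.nan]})

print(len(df.dropna()))                # 1: only the first row is complete
print(len(df.dropna(subset=["dec"])))  # 2: rows with a valid dec
print(len(df.dropna(thresh=3)))        # 1: rows with at least 3 non-NA values
```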

nest_lists(base_columns: list[str] | None = None, list_columns: list[str] | None = None, name: str = 'nested') typing_extensions.Self[source]#

Creates a new catalog with a set of list columns packed into a nested column.

Parameters:
  • base_columns (list-like or None) – Any columns that have non-list values in the input catalog. These will simply be kept as identical columns in the result. If None, is inferred to be all columns in the input catalog that are not considered list-value columns.

  • list_columns (list-like or None) – The list-value columns that should be packed into a nested column. All columns in the list will attempt to be packed into a single nested column with the name provided in the name argument. All columns in list_columns must have pyarrow list dtypes, otherwise the operation will fail. If None, is defined as all columns not in base_columns.

  • name (str) – The name of the output column the list columns are packed into.

Returns:

A new catalog with specified list columns nested into a new nested column.

Note

As noted above, all columns in list_columns must have a pyarrow ListType dtype. This is needed for proper meta propagation. To convert a list column to this dtype, you can use this command structure: nf = nf.astype({"colname": pd.ArrowDtype(pa.list_(pa.int64()))}), where pa.int64 should be replaced with the correct dtype of the underlying data. Additionally, it's a known issue in Dask (dask/dask#10139) that columns with list values will by default be converted to the string type. This will interfere with the ability to recast these to pyarrow lists. We recommend setting the following dask config setting to prevent this: dask.config.set({"dataframe.convert-string": False})

reduce(func, *args, meta=None, append_columns=False, infer_nesting=True, **kwargs) typing_extensions.Self[source]#

Takes a function and applies it to each top-level row of the Catalog.

docstring copied from nested-pandas

The user may specify which columns the function is applied to, with columns from the ‘base’ layer being passed to the function as scalars and columns from the nested layers being passed as numpy arrays.

Parameters:
  • func (callable) – Function to apply to each nested dataframe. The first arguments to func should be which columns to apply the function to. See the Notes for recommendations on writing func outputs.

  • args (positional arguments) – A list of string column names to pull from the NestedFrame to pass along to the function. If the function has additional arguments, pass them as keyword arguments (e.g. arg_name=value)

  • meta (dataframe or series-like, optional) – The dask meta of the output. If append_columns is True, the meta should specify just the additional columns output by func.

  • append_columns (bool) – Whether the output columns should be appended to the original dataframe.

  • infer_nesting (bool) – If True, the function will pack output columns into nested structures based on column names adhering to a nested naming scheme. E.g. “nested.b” and “nested.c” will be packed into a column called “nested” with columns “b” and “c”. If False, all outputs will be returned as base columns.

  • kwargs (keyword arguments, optional) – Keyword arguments to pass to the function.

Returns:

HealpixDataset with the results of the function applied to the columns of the frame.

Return type:

HealpixDataset

Notes

By default, reduce will produce a NestedFrame with enumerated column names for each returned value of the function. For more useful naming, it’s recommended to have func return a dictionary where each key is an output column of the dataframe returned by reduce.

Example User Function:

>>> import numpy as np
>>> import pandas as pd
>>> import lsdb
>>> catalog = lsdb.from_dataframe(pd.DataFrame({"ra":[0, 10], "dec":[5, 15], "mag":[21, 22], "mag_err":[.1, .2]}))
>>> def my_sigma(col1, col2):
...    '''reduce will return a NestedFrame with two columns'''
...    return {"plus_one": col1+col2, "minus_one": col1-col2}
>>> meta = {"plus_one": np.float64, "minus_one": np.float64}
>>> catalog.reduce(my_sigma, 'mag', 'mag_err', meta=meta).compute().reset_index()
           _healpix_29  plus_one  minus_one
0  1372475556631677955      21.1       20.9
1  1389879706834706546      22.2       21.8
plot_points(*, ra_column: str | None = None, dec_column: str | None = None, color_col: str | None = None, projection: str = 'MOL', title: str | None = None, fov: astropy.units.Quantity | tuple[astropy.units.Quantity, astropy.units.Quantity] | None = None, center: astropy.coordinates.SkyCoord | None = None, wcs: astropy.wcs.WCS | None = None, frame_class: Type[astropy.visualization.wcsaxes.frame.BaseFrame] | None = None, ax: astropy.visualization.wcsaxes.WCSAxes | None = None, fig: matplotlib.figure.Figure | None = None, **kwargs)[source]#

Plots the points in the catalog as a scatter plot

Performs a scatter plot on a WCSAxes after computing the points of the catalog. This will perform compute on the catalog, and so may be slow or resource intensive. If the fov or wcs args are set, only the partitions in the catalog visible to the plot will be computed. The scatter points can be colored by a column of the catalog using the color_col kwarg.

Parameters:
  • ra_column (str | None) – The column to use as the RA of the points to plot. Defaults to the catalog’s default RA column. Useful for plotting joined or cross-matched points

  • dec_column (str | None) – The column to use as the Declination of the points to plot. Defaults to the catalog’s default Declination column. Useful for plotting joined or cross-matched points

  • color_col (str | None) – The column to use as the color array for the scatter plot. Allows coloring of the points by the values of a given column.

  • projection (str) – The projection to use in the WCS. Available projections listed at https://docs.astropy.org/en/stable/wcs/supported_projections.html

  • title (str) – The title of the plot

  • fov (Quantity or Sequence[Quantity, Quantity] | None) – The Field of View of the WCS. Must be an astropy Quantity with an angular unit, or a tuple of quantities for different longitude and latitude FOVs (Default covers the full sky)

  • center (SkyCoord | None) – The center of the projection in the WCS (Default: SkyCoord(0, 0))

  • wcs (WCS | None) – The WCS to specify the projection of the plot. If used, all other WCS parameters are ignored and the parameters from the WCS object is used.

  • frame_class (Type[BaseFrame] | None) – The class of the frame for the WCSAxes to be initialized with. If the ax kwarg is used, this value is ignored. (By default uses EllipticalFrame for a full sky projection; if FOV is set, RectangularFrame is used)

  • ax (WCSAxes | None) – The matplotlib axes to plot onto. If None, an axes will be created to be used. If specified, the axes must be an astropy WCSAxes, and the wcs parameter must be set with the WCS object used in the axes. (Default: None)

  • fig (Figure | None) – The matplotlib figure to add the axes to. If None, one will be created, unless ax is specified (Default: None)

  • **kwargs – Additional kwargs to pass to the matplotlib scatter call. These include c for color, s for the size of the points, marker for the marker type, and cmap and norm if color_col is used

Returns:

Tuple[Figure, WCSAxes] - The figure and axes used for the plot

sort_nested_values(by: str | list[str], ascending: bool | list[bool] = True, na_position: Literal['first'] | Literal['last'] = 'last', ignore_index: bool | None = False, **options) typing_extensions.Self[source]#

Sort nested columns for each row in the catalog.

Parameters:
  • by – str or list[str] Column(s) to sort by.

  • ascending – bool or list[bool], optional Sort ascending vs. descending. Defaults to True. Specify a list for multiple sort orders; if this is a list of bools, its length must match the length of by.

  • na_position – {‘last’, ‘first’}, optional Puts NaNs at the beginning if ‘first’, puts NaN at the end if ‘last’. Defaults to ‘last’.

  • ignore_index – bool, optional If True, the resulting axis will be labeled 0, 1, …, n - 1. Defaults to False.

  • **options – keyword arguments, optional Additional options to pass to the sorting function.

Returns:

A new catalog where the specified nested columns are sorted.
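The sort keywords follow pandas conventions, so they can be previewed on a flat frame (in the catalog, by would name columns inside a nested column; the columns below are illustrative):

```python
import numpy as np
import pandas as pd

# na_position='last' keeps NaNs at the end, matching the default above.
df = pd.DataFrame({"mjd": [3.0, np.nan, 1.0], "flux": [9.0, 8.0, 7.0]})
out = df.sort_values(by="mjd", ascending=True, na_position="last")
print(out["mjd"].tolist())  # [1.0, 3.0, nan]
```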