lsdb#

Submodules#

Classes#

Catalog

LSDB Catalog DataFrame to perform analysis of sky catalogs and efficient spatial operations.

MarginCatalog

LSDB Catalog DataFrame to contain the "margin" of another HATS catalog.

BoxSearch

Perform a box search to filter the catalog. This type of search is used for a range of right ascension or declination.

ConeSearch

Perform a cone search to filter the catalog

PolygonSearch

Perform a polygonal search to filter the catalog.

Functions#

crossmatch(→ lsdb.catalog.Catalog)

Perform a cross-match between two frames, two catalogs, a catalog and a frame, or a frame and a catalog.

from_dataframe(→ lsdb.catalog.Catalog)

Load a catalog from a Pandas Dataframe.

read_hats(→ lsdb.types.CatalogTypeVar | None)

Load a catalog from a HATS formatted catalog.

Package Contents#

class Catalog(ddf: nested_dask.NestedFrame, ddf_pixel_map: lsdb.types.DaskDFPixelMap, hc_structure: hats.catalog.Catalog, margin: lsdb.catalog.margin_catalog.MarginCatalog | None = None)[source]#

Bases: lsdb.catalog.dataset.healpix_dataset.HealpixDataset

LSDB Catalog DataFrame to perform analysis of sky catalogs and efficient spatial operations.

hc_structure#

hats.Catalog object representing the structure and metadata of the HATS catalog

hc_structure: hats.catalog.Catalog#
margin = None#
_create_updated_dataset(ddf: nested_dask.NestedFrame | None = None, ddf_pixel_map: lsdb.types.DaskDFPixelMap | None = None, hc_structure: hats.catalog.healpix_dataset.healpix_dataset.HealpixDataset | None = None, updated_catalog_info_params: dict | None = None, margin: lsdb.catalog.margin_catalog.MarginCatalog | None = None) typing_extensions.Self[source]#

Creates a new copy of the catalog, updating any provided arguments

Shallow copies the ddf and ddf_pixel_map if not provided. Creates a new hc_structure if not provided. Updates the hc_structure with any provided catalog info parameters, resets the total rows, removes any default columns that don’t exist, and updates the pyarrow schema to reflect the new ddf.

Parameters:
  • ddf (nd.NestedFrame) – The catalog ddf to update in the new catalog

  • ddf_pixel_map (DaskDFPixelMap) – The partition to healpix pixel map to update in the new catalog

  • hc_structure (hats.HealpixDataset) – The hats HealpixDataset object to update in the new catalog

  • updated_catalog_info_params (dict) – The dictionary of updates to the parameters of the hats dataset object’s catalog_info

Returns:

A new dataset object with the arguments updated to those provided to the function, and the hc_structure metadata updated to match the new ddf

query(expr: str) Catalog[source]#

Filters catalog and respective margin, if it exists, using a complex query expression

Parameters:

expr (str) – Query expression to evaluate. Column names that are not valid Python variable names should be wrapped in backticks, and variable values can be injected using f-strings. The use of ‘@’ to reference variables is not supported. More information about pandas query strings is available in the pandas documentation.

Returns:

A catalog that contains the data from the original catalog that complies with the query expression. If a margin exists, it is filtered according to the same query expression.
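
Illustrative example (the catalog path and the column names mag_r and “parallax error” are assumptions, not part of this API):

import lsdb

catalog = lsdb.read_hats("./my_catalog_dir")
max_mag = 20.5
# Plain column names can be used directly; names that are not valid Python
# identifiers must be wrapped in backticks, and values are injected via f-strings.
bright = catalog.query(f"mag_r < {max_mag} and `parallax error` > 0")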

assign(**kwargs) Catalog[source]#

Assigns new columns to a catalog

Parameters:

**kwargs – Arguments to pass to the assign method. This dictionary should contain the column names as keys and either a function or a 1-D Dask array as their corresponding value.

Returns:

The catalog containing both the old columns and the newly created columns

Examples

Create a new column using a function:

catalog = Catalog(...)
catalog = catalog.assign(new_col=lambda df: df['existing_col'] * 2)

Add a column from a 1-D Dask array:

import dask.array as da
new_data = da.arange(...)
catalog = catalog.assign(new_col=new_data)
crossmatch(other: Catalog, suffixes: tuple[str, str] | None = None, algorithm: Type[lsdb.core.crossmatch.abstract_crossmatch_algorithm.AbstractCrossmatchAlgorithm] | lsdb.core.crossmatch.crossmatch_algorithms.BuiltInCrossmatchAlgorithm = BuiltInCrossmatchAlgorithm.KD_TREE, output_catalog_name: str | None = None, require_right_margin: bool = False, **kwargs) Catalog[source]#

Perform a cross-match between two catalogs

The pixels from each catalog are aligned via a PixelAlignment, and cross-matching is performed on each pair of overlapping pixels. The resulting catalog will have partitions matching an inner pixel alignment - using pixels that have overlap in both input catalogs and taking the smallest of any overlapping pixels.

The resulting catalog will be partitioned using the left catalog’s ra and dec, and each row will keep the index of the corresponding row in the left catalog.

Parameters:
  • other (Catalog) – The right catalog to cross-match against

  • suffixes (Tuple[str, str]) – A pair of suffixes to be appended to the end of each column name when they are joined. Default: uses the name of the catalog for the suffix

  • algorithm (BuiltInCrossmatchAlgorithm | Type[AbstractCrossmatchAlgorithm]) –

    The algorithm to use to perform the crossmatch. Can be either a string to specify one of the built-in cross-matching methods, or a custom method defined by subclassing AbstractCrossmatchAlgorithm.

    Built-in methods:
    • kd_tree: find the k-nearest neighbors using a kd_tree

    Custom function:

    To specify a custom function, write a class that subclasses the AbstractCrossmatchAlgorithm class, and overwrite the perform_crossmatch function.

    The function should be able to perform a crossmatch on two pandas DataFrames from a partition from each catalog. It should return two 1d numpy arrays of equal lengths with the indices of the matching rows from the left and right dataframes, and a dataframe with any extra columns generated by the crossmatch algorithm, also with the same length. These columns are specified in {AbstractCrossmatchAlgorithm.extra_columns}, with their respective data types, by means of an empty pandas dataframe. As an example, the KdTreeCrossmatch algorithm outputs a “_dist_arcsec” column with the distance between data points. Its extra_columns attribute is specified as follows:

    pd.DataFrame({"_dist_arcsec": pd.Series(dtype=np.dtype("float64"))})
    

    The class will have been initialized with the following parameters, which the crossmatch function should use:

    • left: npd.NestedFrame,

    • right: npd.NestedFrame,

    • left_order: int,

    • left_pixel: int,

    • right_order: int,

    • right_pixel: int,

    • left_metadata: hc.catalog.Catalog,

    • right_metadata: hc.catalog.Catalog,

    • right_margin_hc_structure: hc.margin.MarginCatalog,

    • suffixes: Tuple[str, str]

    You may add any additional keyword argument parameters to the crossmatch function definition, and the user will be able to pass them in as kwargs in the Catalog.crossmatch method. Any additional keyword arguments must also be added to the CrossmatchAlgorithm.validate classmethod by overwriting the method.

  • output_catalog_name (str) – The name of the resulting catalog. Default: {left_name}_x_{right_name}

  • require_right_margin (bool) – If true, raises an error if the right margin is missing which could lead to incomplete crossmatches. Default: False

Returns:

A Catalog with the data from the left and right catalogs merged with one row for each pair of neighbors found from cross-matching.

The resulting table contains all columns from the left and right catalogs with their respective suffixes and, whenever specified, a set of extra columns generated by the crossmatch algorithm.
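
Below is a hedged sketch of both usages: the default kd-tree crossmatch, and the skeleton of a custom algorithm following the interface described above. The catalog paths and the FirstRowCrossmatch class are illustrative assumptions, and the attribute names used in the sketch follow the parameter list given above:

import numpy as np
import pandas as pd
import lsdb
from lsdb.core.crossmatch.abstract_crossmatch_algorithm import AbstractCrossmatchAlgorithm

gaia = lsdb.read_hats("./gaia_catalog_dir")
ztf = lsdb.read_hats("./ztf_catalog_dir")

# Built-in kd-tree crossmatch (the default algorithm).
matched = gaia.crossmatch(ztf, suffixes=("_gaia", "_ztf"))

class FirstRowCrossmatch(AbstractCrossmatchAlgorithm):
    """Structural sketch only: match every left row to the first right row."""

    extra_columns = pd.DataFrame({"_dist_arcsec": pd.Series(dtype=np.dtype("float64"))})

    def perform_crossmatch(self):
        if len(self.right) == 0:
            # Nothing to match against in this partition.
            empty = np.array([], dtype=np.int64)
            return empty, empty, self.extra_columns.copy()
        left_idx = np.arange(len(self.left))
        right_idx = np.zeros(len(self.left), dtype=np.int64)
        extra = pd.DataFrame({"_dist_arcsec": np.zeros(len(self.left))})
        return left_idx, right_idx, extra

matched_custom = gaia.crossmatch(ztf, algorithm=FirstRowCrossmatch)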

merge_map(map_catalog: lsdb.catalog.map_catalog.MapCatalog, func: Callable[Ellipsis, nested_pandas.NestedFrame], *args, meta: nested_pandas.NestedFrame | None = None, **kwargs) Catalog[source]#

Applies a function to each pair of partitions in this catalog and the map catalog.

The pixels from each catalog are aligned via a PixelAlignment, and the respective dataframes are passed to the function. The resulting catalog will have the same partitions as the point source catalog.

Parameters:
  • map_catalog (MapCatalog) – The continuous map to merge.

  • func (Callable) – The function applied to each catalog partition, which will be called with: func(catalog_partition: npd.NestedFrame, map_partition: npd.NestedFrame, healpix_pixel: HealpixPixel, *args, **kwargs) with the additional args and kwargs passed to the merge_map function.

  • *args – Additional positional arguments to call func with.

  • meta (pd.DataFrame | pd.Series | Dict | Iterable | Tuple | None) – An empty pandas DataFrame that has columns matching the output of the function applied to the catalog partition. Other types are accepted to describe the output dataframe format, for full details see the dask documentation https://blog.dask.org/2022/08/09/understanding-meta-keyword-argument If meta is None (default), LSDB will try to work out the output schema of the function by calling the function with an empty DataFrame. If the function does not work with an empty DataFrame, this will raise an error and meta must be set. Note that some operations in LSDB will generate empty partitions, though these can be removed by calling the Catalog.prune_empty_partitions method.

  • **kwargs – Additional keyword args to pass to the function. These are passed to the Dask DataFrame dask.dataframe.map_partitions function, so any of the dask function’s keyword args such as transform_divisions will be passed through and work as described in the dask documentation https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.map_partitions.html

Returns:

A Catalog with the same partitions as this (point source) catalog, where each partition contains the output of the function applied to that partition and the aligned partition of the map catalog.
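
A hedged sketch of the calling pattern; the map column ebv and the catalog paths are hypothetical:

import lsdb

catalog = lsdb.read_hats("./my_catalog_dir")
dust_map = lsdb.read_hats("./my_map_catalog_dir")

def attach_extinction(catalog_partition, map_partition, healpix_pixel):
    # Attach the mean of a (hypothetical) map column to every row of the partition.
    return catalog_partition.assign(ebv=map_partition["ebv"].mean())

with_extinction = catalog.merge_map(dust_map, attach_extinction)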

cone_search(ra: float, dec: float, radius_arcsec: float, fine: bool = True) Catalog[source]#

Perform a cone search to filter the catalog

Filters to points within radius great circle distance to the point specified by ra and dec in degrees. Filters partitions in the catalog to those that have some overlap with the cone.

Parameters:
  • ra (float) – Right Ascension of the center of the cone in degrees

  • dec (float) – Declination of the center of the cone in degrees

  • radius_arcsec (float) – Radius of the cone in arcseconds

  • fine (bool) – True if points are to be filtered, False if not. Defaults to True.

Returns:

A new Catalog containing the points filtered to those within the cone, and the partitions that overlap the cone.
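
For example, assuming the cone search is exposed on the catalog as cone_search, and with a hypothetical catalog path:

import lsdb

catalog = lsdb.read_hats("./my_catalog_dir")
# 1-degree cone centered on (ra, dec) = (45.0, -30.0) degrees.
cone = catalog.cone_search(ra=45.0, dec=-30.0, radius_arcsec=3600.0)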

box_search(ra: tuple[float, float], dec: tuple[float, float], fine: bool = True) Catalog[source]#

Performs filtering according to right ascension and declination ranges. The right ascension edges follow great circle arcs and the declination edges follow small circle arcs.

Filters to points within the region specified in degrees. Filters partitions in the catalog to those that have some overlap with the region.

Parameters:
  • ra (Tuple[float, float]) – The right ascension minimum and maximum values.

  • dec (Tuple[float, float]) – The declination minimum and maximum values.

  • fine (bool) – True if points are to be filtered, False if not. Defaults to True.

Returns:

A new catalog containing the points filtered to those within the region, and the partitions that have some overlap with it.
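
For example, assuming the box search is exposed on the catalog as box_search, and with a hypothetical catalog path:

import lsdb

catalog = lsdb.read_hats("./my_catalog_dir")
# Keep points with 40 <= ra <= 50 and -35 <= dec <= -25 (degrees).
box = catalog.box_search(ra=(40.0, 50.0), dec=(-35.0, -25.0))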

polygon_search(vertices: list[tuple[float, float]], fine: bool = True) Catalog[source]#

Perform a polygonal search to filter the catalog.

Filters to points within the polygonal region specified in ra and dec, in degrees. Filters partitions in the catalog to those that have some overlap with the region.

Parameters:
  • vertices (list[tuple[float, float]]) – The list of vertices of the polygon to filter pixels with, as a list of (ra,dec) coordinates, in degrees.

  • fine (bool) – True if points are to be filtered, False if not. Defaults to True.

Returns:

A new catalog containing the points filtered to those within the polygonal region, and the partitions that have some overlap with it.
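
For example, assuming the polygon search is exposed on the catalog as polygon_search, and with a hypothetical catalog path:

import lsdb

catalog = lsdb.read_hats("./my_catalog_dir")
# Triangular region given as (ra, dec) vertices in degrees.
triangle = catalog.polygon_search(vertices=[(42.0, -28.0), (47.0, -28.0), (44.5, -32.0)])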

id_search(ids, catalog_index: HCIndexCatalog, fine: bool = True) Catalog[source]#

Find rows by ids (or other values indexed by a catalog index).

Filters partitions in the catalog to those that could contain the ids requested. Filters to points that have matching values in the id field.

NB: This requires a previously-computed catalog index table.

Parameters:
  • ids – Values to search for.

  • catalog_index (HCIndexCatalog) – A pre-computed hats index catalog.

  • fine (bool) – True if points are to be filtered, False if not. Defaults to True.

Returns:

A new Catalog containing the points filtered to those matching the ids.

order_search(min_order: int = 0, max_order: int | None = None) Catalog[source]#

Filter the catalog by HEALPix order.

Parameters:
  • min_order (int) – Minimum HEALPix order to select. Defaults to 0.

  • max_order (int) – Maximum HEALPix order to select. Defaults to maximum catalog order.

Returns:

A new Catalog containing only the pixels of orders specified (inclusive).

pixel_search(pixels: list[tuple[int, int]]) Catalog[source]#

Finds all catalog pixels that overlap with the requested pixel set.

Parameters:

pixels (List[Tuple[int, int]]) – The list of HEALPix tuples (order, pixel) that define the region for the search.

Returns:

A new Catalog containing only the pixels that overlap with the requested pixel set.
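
For example, assuming these filters are exposed on the catalog as order_search and pixel_search, and with a hypothetical catalog path:

import lsdb

catalog = lsdb.read_hats("./my_catalog_dir")
# Keep only partitions at HEALPix orders 3 through 5 (inclusive).
low_orders = catalog.order_search(min_order=3, max_order=5)
# Keep partitions overlapping HEALPix (order, pixel) = (4, 100) and (4, 101).
selected = catalog.pixel_search([(4, 100), (4, 101)])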

moc_search(moc: mocpy.MOC, fine: bool = True) Catalog[source]#

Finds all catalog points that are contained within a MOC.

Parameters:
  • moc (mocpy.MOC) – The MOC that defines the region for the search.

  • fine (bool) – True if points are to be filtered, False if only partitions. Defaults to True.

Returns:

A new Catalog containing only the points that are within the moc.

search(search: lsdb.core.search.abstract_search.AbstractSearch)[source]#

Find rows by reusable search algorithm.

Filters partitions in the catalog to those that match some rough criteria. Filters to points that match some finer criteria.

Parameters:

search (AbstractSearch) – Instance of AbstractSearch.

Returns:

A new Catalog containing the points filtered to those matching the search parameters.
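
For example, using the ConeSearch class documented below (the catalog path is hypothetical):

import lsdb
from lsdb.core.search import ConeSearch

catalog = lsdb.read_hats("./my_catalog_dir")
cone = catalog.search(ConeSearch(ra=45.0, dec=-30.0, radius_arcsec=3600.0))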

map_partitions(func: Callable[Ellipsis, nested_pandas.NestedFrame], *args, meta: pandas.DataFrame | pandas.Series | dict | Iterable | tuple | None = None, include_pixel: bool = False, **kwargs) Catalog | dask.dataframe.Series[source]#

Applies a function to each partition in the catalog and respective margin.

The ra and dec of each row is assumed to remain unchanged. If the function returns a DataFrame, an LSDB Catalog is constructed and its respective margin is updated accordingly, if it exists. Otherwise, only the main catalog Dask object is returned.

Parameters:
  • func (Callable) – The function applied to each partition, which will be called with: func(partition: npd.NestedFrame, *args, **kwargs) with the additional args and kwargs passed to the map_partitions function. If the include_pixel parameter is set, the function will be called with the healpix_pixel as the second positional argument set to the healpix pixel of the partition as func(partition: npd.NestedFrame, healpix_pixel: HealpixPixel, *args, **kwargs)

  • *args – Additional positional arguments to call func with.

  • meta (pd.DataFrame | pd.Series | Dict | Iterable | Tuple | None) – An empty pandas DataFrame that has columns matching the output of the function applied to a partition. Other types are accepted to describe the output dataframe format, for full details see the dask documentation https://blog.dask.org/2022/08/09/understanding-meta-keyword-argument If meta is None (default), LSDB will try to work out the output schema of the function by calling the function with an empty DataFrame. If the function does not work with an empty DataFrame, this will raise an error and meta must be set. Note that some operations in LSDB will generate empty partitions, though these can be removed by calling the Catalog.prune_empty_partitions method.

  • include_pixel (bool) – Whether to pass the Healpix Pixel of the partition as a HealpixPixel object to the second positional argument of the function

  • **kwargs – Additional keyword args to pass to the function. These are passed to the Dask DataFrame dask.dataframe.map_partitions function, so any of the dask function’s keyword args such as transform_divisions will be passed through and work as described in the dask documentation https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.map_partitions.html

Returns:

A new catalog with each partition replaced with the output of the function applied to the original partition. If the function returns a non-DataFrame output, a Dask Series will be returned.
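
A hedged sketch; the magnitude columns are hypothetical. Because the function below also works on an empty partition, meta can be left as None and inferred:

import lsdb

catalog = lsdb.read_hats("./my_catalog_dir")

def add_color(partition):
    # Add a g - r color column computed from two (hypothetical) magnitude columns.
    return partition.assign(gr_color=partition["mag_g"] - partition["mag_r"])

with_color = catalog.map_partitions(add_color)

If the partition’s HEALPix pixel is needed, pass include_pixel=True and accept it as the second positional argument of the function.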

merge(other: Catalog, how: str = 'inner', on: str | list | None = None, left_on: str | list | None = None, right_on: str | list | None = None, left_index: bool = False, right_index: bool = False, suffixes: tuple[str, str] | None = None) nested_dask.NestedFrame[source]#

Performs the merge of two catalog Dataframes

More information about pandas merge is available in the pandas documentation.

Parameters:
  • other (Catalog) – The right catalog to merge with.

  • how (str) – How to handle the merge of the two catalogs. One of {‘left’, ‘right’, ‘outer’, ‘inner’}, defaults to ‘inner’.

  • on (str | List) – Column or index names to join on. Defaults to the intersection of columns in both Dataframes if on is None and not merging on indexes.

  • left_on (str | List) – Column to join on the left Dataframe. Lists are supported if their length is one.

  • right_on (str | List) – Column to join on the right Dataframe. Lists are supported if their length is one.

  • left_index (bool) – Use the index of the left Dataframe as the join key. Defaults to False.

  • right_index (bool) – Use the index of the right Dataframe as the join key. Defaults to False.

  • suffixes (Tuple[str, str]) – A pair of suffixes to be appended to the end of each column name when they are joined. Defaults to using the name of the catalog for the suffix.

Returns:

A new Dask Dataframe containing the data points that result from the merge of the two catalogs.

merge_asof(other: Catalog, direction: str = 'backward', suffixes: tuple[str, str] | None = None, output_catalog_name: str | None = None)[source]#

Uses the pandas merge_asof function to merge two catalogs on their indices by distance of keys

The merge must be along catalog indices and does not include margin caches, so results may be incomplete for points near partition boundaries.

This function is intended for use in special cases such as Dust Map Catalogs; for general merges, the crossmatch and join functions should be used.

Parameters:
  • other (lsdb.Catalog) – the right catalog to merge to

  • suffixes (Tuple[str,str]) – the suffixes to apply to each partition’s column names

  • direction (str) – the direction to perform the merge_asof

Returns:

A new catalog with the columns from each of the input catalogs with their respective suffixes added, and the rows merged using merge_asof on the specified columns.

join(other: Catalog, left_on: str | None = None, right_on: str | None = None, through: lsdb.catalog.association_catalog.AssociationCatalog | None = None, suffixes: tuple[str, str] | None = None, output_catalog_name: str | None = None) Catalog[source]#

Perform a spatial join to another catalog

Joins two catalogs together on a shared column value, merging rows where they match. The operation only joins data from matching partitions, and does not join rows that have a matching column value but are in separate partitions in the sky. For a more general join, see the merge function.

Parameters:
  • other (Catalog) – the right catalog to join to

  • left_on (str) – the name of the column in the left catalog to join on

  • right_on (str) – the name of the column in the right catalog to join on

  • through (AssociationCatalog) – an association catalog that provides the alignment between pixels and individual rows.

  • suffixes (Tuple[str,str]) – suffixes to apply to the columns of each table

  • output_catalog_name (str) – The name of the resulting catalog to be stored in metadata

Returns:

A new catalog with the columns from each of the input catalogs with their respective suffixes added, and the rows merged on the specified columns.
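
A hedged sketch; the catalog paths and the objectid column are hypothetical:

import lsdb

objects = lsdb.read_hats("./object_catalog_dir")
detections = lsdb.read_hats("./detection_catalog_dir")
joined = objects.join(detections, left_on="objectid", right_on="objectid", suffixes=("_obj", "_det"))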

join_nested(other: Catalog, left_on: str | None = None, right_on: str | None = None, nested_column_name: str | None = None, output_catalog_name: str | None = None) Catalog[source]#

Perform a spatial join to another catalog by adding the other catalog as a nested column

Joins two catalogs together on a shared column value, merging rows where they match.

The result is added as a nested dataframe column using nested-dask, where the right catalog’s columns are encoded within a column in the resulting dataframe. For more information, view the nested-dask documentation.

The operation only joins data from matching partitions and their margin caches, and does not join rows that have a matching column value but are in separate partitions in the sky. For a more general join, see the merge function.

Parameters:
  • other (Catalog) – the right catalog to join to

  • left_on (str) – the name of the column in the left catalog to join on

  • right_on (str) – the name of the column in the right catalog to join on

  • nested_column_name (str) – the name of the nested column in the resulting dataframe storing the joined columns in the right catalog. (Default: name of right catalog)

  • output_catalog_name (str) – The name of the resulting catalog to be stored in metadata

Returns:

A new catalog with the columns from each of the input catalogs with their respective suffixes added, and the rows merged on the specified columns.
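
A hedged sketch of the nested variant; the catalog paths and column names are hypothetical:

import lsdb

objects = lsdb.read_hats("./object_catalog_dir")
detections = lsdb.read_hats("./detection_catalog_dir")
# Matching detection rows are packed into a nested "detections" column.
with_detections = objects.join_nested(
    detections,
    left_on="objectid",
    right_on="objectid",
    nested_column_name="detections",
)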

nest_lists(base_columns: list[str] | None, list_columns: list[str] | None = None, name: str = 'nested') Catalog[source]#

Creates a new catalog with a set of list columns packed into a nested column.

Parameters:
  • base_columns (list-like or None) – Any columns that have non-list values in the input catalog. These will simply be kept as identical columns in the result

  • list_columns (list-like or None) – The list-valued columns that should be packed into a nested column. All columns in the list will be packed into a single nested column with the provided name. All columns in list_columns must have pyarrow list dtypes, otherwise the operation will fail. If None, this is defined as all columns not in base_columns.

  • name (str) – The name of the output column that the list columns are packed into.

Returns:

A new catalog with specified list columns nested into a new nested column.

Note

As noted above, all columns in list_columns must have a pyarrow ListType dtype; this is needed for proper meta propagation. To convert a list column to this dtype, you can use: nf = nf.astype({"colname": pd.ArrowDtype(pa.list_(pa.int64()))}), where pa.int64 should be replaced with the correct dtype of the underlying data. Additionally, it is a known issue in Dask (dask/dask#10139) that columns with list values are by default converted to the string type, which interferes with the ability to recast them to pyarrow lists. We recommend setting the following Dask config option to prevent this: dask.config.set({"dataframe.convert-string": False})
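
A hedged sketch of the conversion and packing described in the note above; the column names (ra, dec, mjd, flux) and the use of map_partitions for the dtype cast are illustrative assumptions:

import dask
import pandas as pd
import pyarrow as pa
import lsdb

# Prevent Dask from converting list columns to strings (see the note above).
dask.config.set({"dataframe.convert-string": False})

catalog = lsdb.read_hats("./my_catalog_dir")
# Ensure the list-valued columns carry pyarrow list dtypes before nesting.
catalog = catalog.map_partitions(
    lambda df: df.astype({
        "mjd": pd.ArrowDtype(pa.list_(pa.float64())),
        "flux": pd.ArrowDtype(pa.list_(pa.float64())),
    })
)
nested = catalog.nest_lists(base_columns=["ra", "dec"], list_columns=["mjd", "flux"], name="lightcurve")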

dropna(*, axis: pandas._typing.Axis = 0, how: pandas._typing.AnyAll | pandas._libs.lib.NoDefault = no_default, thresh: int | pandas._libs.lib.NoDefault = no_default, on_nested: bool = False, subset: pandas._typing.IndexLabel | None = None, ignore_index: bool = False) Catalog[source]#

Remove missing values for one layer of nested columns in the catalog.

Parameters:
  • axis ({0 or 'index', 1 or 'columns'}, default 0) –

    Determine if rows or columns which contain missing values are removed.

    • 0, or ‘index’ : Drop rows which contain missing values.

    • 1, or ‘columns’ : Drop columns which contain missing value.

    Only a single axis is allowed.

  • how ({'any', 'all'}, default 'any') –

    Determine if row or column is removed from catalog, when we have at least one NA or all NA.

    • ’any’ : If any NA values are present, drop that row or column.

    • ’all’ : If all values are NA, drop that row or column.

  • thresh (int, optional) – Require that many non-NA values. Cannot be combined with how.

  • on_nested (str or bool, optional) – If not False, applies the call to the nested dataframe in the column with label equal to the provided string. If specified, the nested dataframe should align with any columns given in subset.

  • subset (column label or sequence of labels, optional) –

    Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include.

    Access nested columns using nested_df.nested_col (where nested_df refers to a particular nested dataframe and nested_col is a column of that nested dataframe).

  • ignore_index (bool, default False) – If True, the resulting axis will be labeled 0, 1, …, n - 1.

Returns:

Catalog with NA entries dropped from it.

Return type:

Catalog

Notes

Operations that target a particular nested structure return a dataframe with rows of that particular nested structure affected.

Values for on_nested and subset should be consistent in pointing to a single layer, multi-layer operations are not supported at this time.
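
A hedged sketch; the nested column lightcurve and its flux field are hypothetical:

import lsdb

catalog = lsdb.read_hats("./my_catalog_dir")
# Drop base-layer rows that contain any missing values.
clean = catalog.dropna()
# Drop nested rows of the "lightcurve" column where flux is missing.
clean_nested = catalog.dropna(subset=["lightcurve.flux"])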

reduce(func, *args, meta=None, **kwargs) Catalog[source]#

Takes a function and applies it to each top-level row of the Catalog.

docstring copied from nested-pandas

The user may specify which columns the function is applied to, with columns from the ‘base’ layer being passed to the function as scalars and columns from the nested layers being passed as numpy arrays.

Parameters:
  • func (callable) – Function to apply to each nested dataframe. The first arguments to func should be which columns to apply the function to. See the Notes for recommendations on writing func outputs.

  • args (positional arguments) – Positional arguments to pass to the function, the first *args should be the names of the columns to apply the function to.

  • meta (dataframe or series-like, optional) – The dask meta of the output. If append_columns is True, the meta should specify just the additional columns output by func.

  • append_columns (bool) – If the output columns should be appended to the original dataframe.

  • kwargs (keyword arguments, optional) – Keyword arguments to pass to the function.

Returns:

HealpixDataset with the results of the function applied to the columns of the frame.

Return type:

HealpixDataset

Notes

By default, reduce will produce a NestedFrame with enumerated column names for each returned value of the function. For more useful naming, it’s recommended to have func return a dictionary where each key is an output column of the dataframe returned by reduce.

Example User Function:

>>> def my_sum(col1, col2):
...     '''reduce will return a NestedFrame with two columns'''
...     return {"sum_col1": sum(col1), "sum_col2": sum(col2)}
>>>
>>> catalog.reduce(my_sum, 'sources.col1', 'sources.col2')

sort_nested_values(by: str | list[str], ascending: bool | list[bool] = True, na_position: Literal['first'] | Literal['last'] = 'last', ignore_index: bool | None = False, **options) Catalog[source]#

Sort nested columns for each row in the catalog.

Note that this does NOT sort rows, only nested values within rows.

Parameters:
  • by – str or list[str] Column(s) to sort by.

  • ascending – bool or list[bool], optional Sort ascending vs. descending. Defaults to True. Specify list for multiple sort orders. If this is a list of bools, must match the length of the by.

  • na_position – {‘last’, ‘first’}, optional Puts NaNs at the beginning if ‘first’, puts NaN at the end if ‘last’. Defaults to ‘last’.

  • ignore_index – bool, optional If True, the resulting axis will be labeled 0, 1, …, n - 1. Defaults to False.

  • **options – keyword arguments, optional Additional options to pass to the sorting function.

Returns:

A new catalog where the specified nested columns are sorted.

class MarginCatalog(ddf: nested_dask.NestedFrame, ddf_pixel_map: lsdb.types.DaskDFPixelMap, hc_structure: hats.catalog.MarginCatalog)[source]#

Bases: lsdb.catalog.dataset.healpix_dataset.HealpixDataset

LSDB Catalog DataFrame to contain the “margin” of another HATS catalog.

hc_structure#

hats.MarginCatalog object representing the structure and metadata of the HATS catalog

hc_structure: hats.catalog.MarginCatalog#
crossmatch(left: lsdb.catalog.Catalog | nested_pandas.NestedFrame | pandas.DataFrame, right: lsdb.catalog.Catalog | nested_pandas.NestedFrame | pandas.DataFrame, ra_column: str | None = None, dec_column: str | None = None, suffixes: tuple[str, str] | None = None, algorithm: Type[lsdb.core.crossmatch.abstract_crossmatch_algorithm.AbstractCrossmatchAlgorithm] | lsdb.core.crossmatch.crossmatch_algorithms.BuiltInCrossmatchAlgorithm = BuiltInCrossmatchAlgorithm.KD_TREE, output_catalog_name: str | None = None, require_right_margin: bool = False, left_args: dict | None = None, right_args: dict | None = None, **kwargs) lsdb.catalog.Catalog[source]#

Perform a cross-match between two frames, two catalogs, a catalog and a frame, or a frame and a catalog.

See Catalog.crossmatch for more information on cross-matching.

Parameters:
  • left (Catalog | NestedFrame) – The left catalog or frame to crossmatch.

  • right (Catalog | NestedFrame) – The right catalog or frame to crossmatch.

  • ra_column (str, optional) – The name of the right ascension column for both catalogs, if passing dataframes. Defaults to None.

  • dec_column (str, optional) – The name of the declination column for both catalogs, if passing dataframes. Defaults to None.

  • suffixes (tuple[str, str], optional) – Suffixes to append to overlapping column names. Defaults to None.

  • algorithm (Type[AbstractCrossmatchAlgorithm] | BuiltInCrossmatchAlgorithm, optional) – The crossmatch algorithm to use. Defaults to BuiltInCrossmatchAlgorithm.KD_TREE.

  • output_catalog_name (str, optional) – The name of the output catalog. Defaults to None.

  • require_right_margin (bool, optional) – Whether to require a right margin. Defaults to False.

  • left_args (dict, optional) – Keyword arguments to pass to from_dataframe for the left catalog. Defaults to None.

  • right_args (dict, optional) – Keyword arguments to pass to from_dataframe for the right catalog. Defaults to None.

  • **kwargs – Additional keyword arguments to pass to Catalog.crossmatch.

Returns:

The crossmatched catalog.

Return type:

Catalog
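
A hedged sketch of crossmatching two plain DataFrames; the coordinates and id columns are synthetic:

import pandas as pd
import lsdb

left = pd.DataFrame({"ra": [45.01, 46.20], "dec": [-30.02, -29.85], "id_a": [1, 2]})
right = pd.DataFrame({"ra": [45.012, 46.21], "dec": [-30.019, -29.851], "id_b": [10, 20]})

matched = lsdb.crossmatch(left, right, ra_column="ra", dec_column="dec", suffixes=("_left", "_right"))
result = matched.compute()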

class BoxSearch(ra: tuple[float, float], dec: tuple[float, float], fine: bool = True)[source]#

Bases: lsdb.core.search.abstract_search.AbstractSearch

Perform a box search to filter the catalog. This type of search is used for a range of right ascension or declination, where the right ascension edges follow great circle arcs and the declination edges follow small circle arcs.

Filters to points within the ra / dec region, specified in degrees. Filters partitions in the catalog to those that have some overlap with the region.

filter_hc_catalog(hc_structure: lsdb.types.HCCatalogTypeVar) mocpy.MOC[source]#

Filters catalog pixels according to the box

search_points(frame: nested_pandas.NestedFrame, metadata: hats.catalog.TableProperties) nested_pandas.NestedFrame[source]#

Determine the search results within a data frame

class ConeSearch(ra: float, dec: float, radius_arcsec: float, fine: bool = True)[source]#

Bases: lsdb.core.search.abstract_search.AbstractSearch

Perform a cone search to filter the catalog

Filters to points within radius great circle distance to the point specified by ra and dec in degrees. Filters partitions in the catalog to those that have some overlap with the cone.

ra#
dec#
radius_arcsec#
filter_hc_catalog(hc_structure: lsdb.types.HCCatalogTypeVar) mocpy.MOC[source]#

Filters catalog pixels according to the cone

search_points(frame: nested_pandas.NestedFrame, metadata: hats.catalog.TableProperties) nested_pandas.NestedFrame[source]#

Determine the search results within a data frame

_perform_plot(ax: astropy.visualization.wcsaxes.WCSAxes, **kwargs)[source]#

Perform the plot of the search region on an initialized WCSAxes

class PolygonSearch(vertices: list[tuple[float, float]], fine: bool = True)[source]#

Bases: lsdb.core.search.abstract_search.AbstractSearch

Perform a polygonal search to filter the catalog.

Filters to points within the polygonal region specified in ra and dec, in degrees. Filters partitions in the catalog to those that have some overlap with the region.

vertices#
polygon#
filter_hc_catalog(hc_structure: lsdb.types.HCCatalogTypeVar) lsdb.types.HCCatalogTypeVar[source]#

Filters catalog pixels according to the polygon

search_points(frame: nested_pandas.NestedFrame, metadata: hats.catalog.TableProperties) nested_pandas.NestedFrame[source]#

Determine the search results within a data frame

from_dataframe(dataframe: pandas.DataFrame, *, ra_column: str = 'ra', dec_column: str = 'dec', lowest_order: int = 0, highest_order: int = 7, drop_empty_siblings: bool = True, partition_size: int | None = None, threshold: int | None = None, margin_order: int = -1, margin_threshold: float | None = 5.0, should_generate_moc: bool = True, moc_max_order: int = 10, use_pyarrow_types: bool = True, schema: pyarrow.Schema | None = None, **kwargs) lsdb.catalog.Catalog[source]#

Load a catalog from a Pandas Dataframe.

Note that this is only suitable for small datasets (< 1 million rows and < 1 GB DataFrame in memory). If you need to deal with large datasets, consider using the hats-import package: https://hats-import.readthedocs.io/

Parameters:
  • dataframe (pd.Dataframe) – The catalog Pandas Dataframe.

  • ra_column (str) – The name of the right ascension column. Defaults to ra.

  • dec_column (str) – The name of the declination column. Defaults to dec.

  • lowest_order (int) – The lowest partition order. Defaults to 0.

  • highest_order (int) – The highest partition order. Defaults to 7.

  • drop_empty_siblings (bool) – When determining the final partitioning, if 3 of 4 pixels are empty, keep only the non-empty pixel.

  • partition_size (int) – The desired partition size, in number of bytes in-memory.

  • threshold (int) – The maximum number of data points per pixel.

  • margin_order (int) – The order at which to generate the margin cache.

  • margin_threshold (float) – The size of the margin cache boundary, in arcseconds. If None, and margin order is not specified, the margin cache is not generated. Defaults to 5 arcseconds.

  • should_generate_moc (bool) – Should we generate a MOC (multi-order coverage map) of the data? This can improve performance when joining/crossmatching to other HATS-sharded datasets.

  • moc_max_order (int) – if generating a MOC, what to use as the max order. Defaults to 10.

  • use_pyarrow_types (bool) – If True, the data is backed by pyarrow, otherwise we keep the original data types. Defaults to True.

  • schema (pa.Schema) – the arrow schema to create the catalog with. If None, the schema is automatically inferred from the provided DataFrame using pa.Schema.from_pandas.

  • **kwargs – Arguments to pass to the creation of the catalog info.

Returns:

Catalog object loaded from the given parameters
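
A hedged example with a small synthetic DataFrame:

import pandas as pd
import lsdb

df = pd.DataFrame({
    "id": [1, 2, 3],
    "ra": [45.0, 45.1, 45.2],      # degrees
    "dec": [-30.0, -30.1, -30.2],  # degrees
})
catalog = lsdb.from_dataframe(df, ra_column="ra", dec_column="dec")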

read_hats(path: str | pathlib.Path | upath.UPath, search_filter: lsdb.core.search.abstract_search.AbstractSearch | None = None, columns: list[str] | str | None = None, margin_cache: str | pathlib.Path | upath.UPath | None = None, dtype_backend: str | None = 'pyarrow', **kwargs) lsdb.types.CatalogTypeVar | None[source]#

Load a catalog from a HATS formatted catalog.

Typical usage example, where we load a catalog with a subset of columns:

lsdb.read_hats(path="./my_catalog_dir", columns=["ra","dec"])

Typical usage example, where we load a catalog from a cone search:

lsdb.read_hats(
    path="./my_catalog_dir",
    columns=["ra","dec"],
    search_filter=lsdb.core.search.ConeSearch(ra, dec, radius_arcsec),
)
Parameters:
  • path (UPath | Path) – The path that locates the root of the HATS catalog

  • search_filter (Type[AbstractSearch]) – Default None. The filter method to be applied.

  • columns (List[str]) – Default None. The set of columns to filter the catalog on. If None, the catalog’s default columns will be loaded. To load all catalog columns, use columns="all"

  • margin_cache (path-like) – Default None. The margin for the main catalog, provided as a path.

  • dtype_backend (str) – Backend data type to apply to the catalog. Defaults to “pyarrow”. If None, no type conversion is performed.

  • **kwargs – Arguments to pass to the pandas parquet file reader

Returns:

Catalog object loaded from the given parameters

Examples

To read a catalog from a public S3 bucket, call it as follows:

from upath import UPath
catalog = lsdb.read_hats(UPath(..., anon=True))