lsdb.loaders.dataframe

Functions

from_dataframe(→ lsdb.catalog.Catalog)

Load a catalog from a pandas DataFrame.

Package Contents

from_dataframe(dataframe: pandas.DataFrame, *, ra_column: str = 'ra', dec_column: str = 'dec', lowest_order: int = 0, highest_order: int = 7, drop_empty_siblings: bool = True, partition_size: int | None = None, threshold: int | None = None, margin_order: int = -1, margin_threshold: float | None = 5.0, should_generate_moc: bool = True, moc_max_order: int = 10, use_pyarrow_types: bool = True, schema: pyarrow.Schema | None = None, **kwargs) → lsdb.catalog.Catalog

Load a catalog from a pandas DataFrame.

Note that this is only suitable for small datasets (fewer than 1 million rows and less than 1 GB as an in-memory DataFrame). For larger datasets, consider using the hats-import package: https://hats-import.readthedocs.io/
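The two limits in the note can be checked before calling from_dataframe. The following is a minimal sketch and not part of lsdb: the helper name fits_from_dataframe is hypothetical, and 1 GB is taken here as 10**9 bytes, which is an assumption about how the note should be read.

```python
import pandas as pd

MAX_ROWS = 1_000_000       # row limit quoted in the note
MAX_BYTES = 1_000_000_000  # 1 GB, assumed here to mean 10**9 bytes


def fits_from_dataframe(df: pd.DataFrame) -> bool:
    """Return True if df is under both size limits quoted in the note."""
    # deep=True also measures the contents of object (e.g. string) columns
    n_bytes = int(df.memory_usage(index=True, deep=True).sum())
    return len(df) < MAX_ROWS and n_bytes < MAX_BYTES


df = pd.DataFrame({"ra": [10.0, 20.0], "dec": [-5.0, 5.0]})
print(fits_from_dataframe(df))
```

If the check fails, the note's recommendation applies: import the data with hats-import instead.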

Parameters:
  • dataframe (pd.DataFrame) – The pandas DataFrame containing the catalog data.

  • ra_column (str) – The name of the right ascension column. Defaults to 'ra'.

  • dec_column (str) – The name of the declination column. Defaults to 'dec'.

  • lowest_order (int) – The lowest partition order. Defaults to 0.

  • highest_order (int) – The highest partition order. Defaults to 7.

  • drop_empty_siblings (bool) – When determining the final partitioning, if 3 of 4 sibling pixels are empty, keep only the non-empty pixel. Defaults to True.

  • partition_size (int | None) – The desired partition size, in bytes of in-memory data. Defaults to None.

  • threshold (int | None) – The maximum number of data points per pixel. Defaults to None.

  • margin_order (int) – The order at which to generate the margin cache. Defaults to -1.

  • margin_threshold (float | None) – The size of the margin cache boundary, in arcseconds. If None and margin_order is not specified, the margin cache is not generated. Defaults to 5 arcseconds.

  • should_generate_moc (bool) – Whether to generate a MOC (multi-order coverage map) of the data. A MOC can improve performance when joining or crossmatching with other HATS-sharded datasets. Defaults to True.

  • moc_max_order (int) – The maximum order to use when generating the MOC. Defaults to 10.

  • use_pyarrow_types (bool) – If True, the data is backed by pyarrow, otherwise we keep the original data types. Defaults to True.

  • schema (pa.Schema | None) – The Arrow schema to create the catalog with. If None, the schema is inferred automatically from the provided DataFrame using pa.Schema.from_pandas. Defaults to None.

  • **kwargs – Arguments to pass to the creation of the catalog info.
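To give a feel for the range spanned by lowest_order and highest_order, the sketch below computes how many pixels tile the sky at a given order and how large each one is. It assumes that "order" here means the HEALPix order used by HATS partitioning, where order k has 12 * 4**k equal-area pixels. The helper names are hypothetical and are not part of lsdb.

```python
import math

# Area of the full sky in square degrees: 4*pi steradians converted to deg^2
FULL_SKY_DEG2 = 4 * math.pi * (180 / math.pi) ** 2


def healpix_pixel_count(order: int) -> int:
    """Number of HEALPix pixels at the given order: 12 * 4**order."""
    if order < 0:
        raise ValueError("order must be non-negative")
    return 12 * 4**order


def healpix_pixel_area_deg2(order: int) -> float:
    """Area of one (equal-area) HEALPix pixel at the given order, in deg^2."""
    return FULL_SKY_DEG2 / healpix_pixel_count(order)


# Compare the default lowest_order (0) and highest_order (7)
for order in (0, 7):
    print(order, healpix_pixel_count(order), healpix_pixel_area_deg2(order))
```

With the defaults, partitions can therefore range from 12 coarse pixels at order 0 down to 196,608 fine pixels at order 7.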

Returns:

Catalog object loaded from the given parameters.