lsdb.nested#

Submodules#

Classes#

NestedFrame

An extension of the Dask DataFrame that adds Nested-Pandas functionality.

Functions#

generate_catalog(n_base, n_layer[, seed, ra_range, ...])

Generates a toy catalog.

generate_data(n_base, n_layer[, npartitions, seed, ...])

Generates a toy dataset.

read_parquet(path[, columns, filters, ...])

Read a Parquet file into a Dask DataFrame

count_nested(df, nested[, by, join])

Counts the number of rows of a nested dataframe.

Package Contents#

class NestedFrame(expr)[source]#

Bases: _Frame, dask.dataframe.DataFrame

An extension of the Dask DataFrame that adds Nested-Pandas functionality.

Examples

>>> import lsdb.nested as nd 
>>> base = nd.NestedFrame(base_data) 
>>> layer = nd.NestedFrame(layer_data) 
>>> base.add_nested(layer, "layer") 
_partition_type#
__getitem__(item)[source]#

Adds custom __getitem__ functionality for nested columns

__setitem__(key, value)[source]#

Adds custom __setitem__ behavior for nested columns

_repr_html_()[source]#
classmethod from_pandas(data, npartitions=None, chunksize=None, sort=True) → NestedFrame[source]#

Returns an LSDB.nested NestedFrame constructed from a Nested-Pandas NestedFrame or Pandas DataFrame.

Parameters:
  • data (NestedFrame or DataFrame) – A Nested-Pandas NestedFrame or Pandas DataFrame containing the underlying data

  • npartitions (int, optional) – The number of partitions of the index to create. Note that depending on the size and index of the dataframe, the output may have fewer partitions than requested.

  • chunksize (int, optional) – The desired number of rows per index partition to use. Note that depending on the size and index of the dataframe, actual partition sizes may vary.

  • sort (bool, optional) – Whether to sort the frame by a default index.

Returns:

result – The constructed Dask-Nested NestedFrame object.

Return type:

NestedFrame
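
A minimal usage sketch (the column names and partition count here are illustrative, not part of the API):

>>> import pandas as pd
>>> import lsdb.nested as nd
>>> flat = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
>>> ndf = nd.NestedFrame.from_pandas(flat, npartitions=2)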

classmethod from_dask_dataframe(df: dask.dataframe.DataFrame) → NestedFrame[source]#

Converts a Dask DataFrame to a Dask-Nested NestedFrame.

Parameters:

df – A Dask Dataframe to convert

Return type:

lsdb.nested.NestedFrame
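
A minimal conversion sketch (the toy DataFrame is illustrative):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> import lsdb.nested as nd
>>> ddf = dd.from_pandas(pd.DataFrame({"a": [1, 2, 3]}), npartitions=1)
>>> ndf = nd.NestedFrame.from_dask_dataframe(ddf)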

classmethod from_delayed(dfs, meta=None, divisions=None, prefix='from-delayed', verify_meta=True)[source]#

Create LSDB.nested NestedFrames from many Dask Delayed objects.

Docstring is copied from dask.dataframe.from_delayed.

Parameters:
  • dfs – A dask.delayed.Delayed, a distributed.Future, or an iterable of either of these objects, e.g. returned by client.submit. These comprise the individual partitions of the resulting dataframe. If a single object is provided (not an iterable), then the resulting dataframe will have only one partition.

  • meta – An empty NestedFrame, pd.DataFrame, or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

  • divisions – Partition boundaries along the index. For tuple, see https://docs.dask.org/en/latest/dataframe-design.html#partitions For string ‘sorted’ will compute the delayed values to find index values. Assumes that the indexes are mutually sorted. If None, then won’t use index information

  • prefix – Prefix to prepend to the keys.

  • verify_meta – If True check that the partitions have consistent metadata, defaults to True.
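
A minimal sketch of building a frame from delayed partitions (the column name and meta used here are illustrative):

>>> import pandas as pd
>>> from dask import delayed
>>> import lsdb.nested as nd
>>> parts = [delayed(pd.DataFrame)({"x": [i, i + 1]}) for i in range(3)]
>>> meta = pd.DataFrame({"x": pd.Series([], dtype="int64")})
>>> ndf = nd.NestedFrame.from_delayed(parts, meta=meta)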

classmethod from_map(func, *iterables, args=None, meta=None, divisions=None, label=None, enforce_metadata=True, **kwargs)[source]#

Create a DataFrame collection from a custom function map

WARNING: The from_map API is experimental, and stability is not yet guaranteed. Use at your own risk!

Parameters:
  • func (callable) – Function used to create each partition. If func satisfies the DataFrameIOFunction protocol, column projection will be enabled.

  • *iterables (Iterable objects) – Iterable objects to map to each output partition. All iterables must be the same length. This length determines the number of partitions in the output collection (only one element of each iterable will be passed to func for each partition).

  • args (list or tuple, optional) – Positional arguments to broadcast to each output partition. Note that these arguments will always be passed to func after the iterables positional arguments.

  • meta – An empty NestedFrame, pd.DataFrame, or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.

  • divisions (tuple, str, optional) – Partition boundaries along the index. For tuple, see https://docs.dask.org/en/latest/dataframe-design.html#partitions For string ‘sorted’ will compute the delayed values to find index values. Assumes that the indexes are mutually sorted. If None, then won’t use index information

  • label (str, optional) – String to use as the function-name label in the output collection-key names.

  • enforce_metadata (bool, default True) – Whether to enforce at runtime that the structure of the DataFrame produced by func actually matches the structure of meta. This will rename and reorder columns for each partition, and will raise an error if this doesn’t work, but it won’t raise if dtypes don’t match.

  • **kwargs – Key-word arguments to broadcast to each output partition. These same arguments will be passed to func for every output partition.
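
A minimal sketch (load_part and its output column are illustrative; providing meta avoids metadata inference):

>>> import pandas as pd
>>> import lsdb.nested as nd
>>> def load_part(i):
...     return pd.DataFrame({"x": [i, i + 1]})
>>> meta = pd.DataFrame({"x": pd.Series([], dtype="int64")})
>>> ndf = nd.NestedFrame.from_map(load_part, range(4), meta=meta)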

classmethod from_flat(df, base_columns, nested_columns=None, on=None, name='nested')[source]#

Creates a NestedFrame with base and nested columns from a flat dataframe.

Parameters:
  • df (dd.DataFrame or nd.NestedFrame) – A flat dataframe.

  • base_columns (list-like) – The columns that should be used as base (flat) columns in the output dataframe.

  • nested_columns (list-like, or None) – The columns that should be packed into a nested column. All columns in the list will be packed into a single nested column with the name provided by the name argument. If None, all columns not in base_columns are used.

  • on (str or None) – The name of a column to use as the new index. Typically, the index should have a unique value per row for base columns and should repeat for nested columns. For example, a dataframe with two columns, a=[1,1,1,2,2,2] and b=[5,10,15,20,25,30], would want an index like [0,0,0,1,1,1] if a is chosen as a base column. If not provided, the current index will be used.

  • name – The name of the output column the nested_columns are packed into.

Returns:

A NestedFrame with the specified nesting structure.

Return type:

NestedFrame
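
A minimal sketch of packing a flat frame (the column names and repeated index are illustrative):

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> import lsdb.nested as nd
>>> flat = pd.DataFrame({"a": [1, 1, 2, 2], "b": [5, 6, 7, 8], "c": [9, 10, 11, 12]},
...                     index=[0, 0, 1, 1])
>>> ddf = dd.from_pandas(flat, npartitions=1)
>>> ndf = nd.NestedFrame.from_flat(ddf, base_columns=["a"], name="nested")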

classmethod from_lists(df, base_columns=None, list_columns=None, name='nested')[source]#

Creates a NestedFrame with base and nested columns from a dataframe with list columns.

Parameters:
  • df (dd.DataFrame or nd.NestedFrame) – A dataframe with list columns.

  • base_columns (list-like, or None) – Any columns that have non-list values in the input df. These will simply be kept as identical columns in the result.

  • list_columns (list-like, or None) – The list-value columns that should be packed into a nested column. All columns in the list will be packed into a single nested column with the name provided by the name argument. All columns in list_columns must have pyarrow list dtypes, otherwise the operation will fail. If None, all columns not in base_columns are used.

  • name – The name of the output column the nested_columns are packed into.

Returns:

A NestedFrame with the specified nesting structure.

Return type:

NestedFrame

Note

As noted above, all columns in list_columns must have a pyarrow ListType dtype. This is needed for proper meta propagation. To convert a list column to this dtype, you can use a command like: nf = nf.astype({"colname": pd.ArrowDtype(pa.list_(pa.int64()))}), where pa.int64 should be replaced with the correct dtype of the underlying data.

Additionally, it’s a known issue in Dask (dask/dask#10139) that columns with list values will by default be converted to the string type. This will interfere with the ability to recast these to pyarrow lists. We recommend setting the following Dask config option to prevent this: dask.config.set({"dataframe.convert-string": False})
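
A sketch of the full recipe described above (column names are illustrative, and the astype cast assumes the underlying list values are int64):

>>> import pandas as pd
>>> import pyarrow as pa
>>> import dask
>>> import dask.dataframe as dd
>>> import lsdb.nested as nd
>>> dask.config.set({"dataframe.convert-string": False})
>>> flat = pd.DataFrame({"a": [1, 2], "b": [[1, 2, 3], [4, 5]]})
>>> ddf = dd.from_pandas(flat, npartitions=1)
>>> ddf = ddf.astype({"b": pd.ArrowDtype(pa.list_(pa.int64()))})
>>> ndf = nd.NestedFrame.from_lists(ddf, base_columns=["a"], name="nested")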

compute(**kwargs)[source]#

Compute this Dask collection, returning the underlying dataframe or series.

property all_columns: dict#

Returns a dictionary of columns for each base/nested dataframe.

property nested_columns: list#

Retrieves the base column names for all nested dataframes.

_is_known_hierarchical_column(colname) → bool[source]#

Determine whether a string is a known hierarchical column name

add_nested(nested, name, how='outer') → NestedFrame[source]#

Packs a dataframe into a nested column

Parameters:
  • nested – A flat dataframe to pack into a nested column

  • name – The name given to the nested column

  • how ({‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘outer’) –

    How to handle the operation of the two objects.

    • left: use the calling frame’s index (or column if on is specified).

    • right: use other’s index.

    • outer: form the union of the calling frame’s index (or column if on is specified) with other’s index, and sort it lexicographically.

    • inner: form the intersection of the calling frame’s index (or column if on is specified) with other’s index, preserving the order of the calling frame’s index.

    • cross: creates the cartesian product from both frames, preserving the order of the left keys.

Return type:

lsdb.nested.NestedFrame
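
A short sketch of packing a layer into an existing base frame (the toy frames and the "lightcurve" name are illustrative):

>>> import pandas as pd
>>> import lsdb.nested as nd
>>> base = nd.NestedFrame.from_pandas(
...     pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 2]), npartitions=1)
>>> layer = nd.NestedFrame.from_pandas(
...     pd.DataFrame({"t": [5, 6, 7, 8]}, index=[0, 0, 1, 2]), npartitions=1)
>>> base = base.add_nested(layer, "lightcurve", how="left")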

query(expr) → Self[source]#

Query the columns of a NestedFrame with a boolean expression. Specified queries can target nested columns in addition to the typical column set

Docstring copied from nested-pandas query

Parameters:

expr (str) –

The query string to evaluate.

Access nested columns using nested_df.nested_col (where nested_df refers to a particular nested dataframe and nested_col is a column of that nested dataframe).

You can refer to variables in the environment by prefixing them with an ‘@’ character like @a + b.

You can refer to column names that are not valid Python variable names by surrounding them in backticks. Thus, column names containing spaces or punctuations (besides underscores) or starting with digits must be surrounded by backticks. (For example, a column named “Area (cm^2)” would be referenced as `Area (cm^2)`). Column names which are Python keywords (like “list”, “for”, “import”, etc) cannot be used.

For example, if one of your columns is called a a and you want to sum it with b, your query should be `a a` + b.

Returns:

DataFrame resulting from the provided query expression.

Return type:

DataFrame

Notes

Queries that target a particular nested structure return a dataframe with rows of that particular nested structure filtered. For example, querying the NestedFrame “df” with nested structure “mynested” as below will return all rows of df, but with mynested filtered by the condition:

>>> df.query("mynested.a > 2") 
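
A sketch using the toy dataset (this assumes generate_data produces a nested column named "nested" with a "flux" sub-column):

>>> from lsdb.nested.datasets import generate_data
>>> nf = generate_data(10, 100)
>>> nf = nf.query("nested.flux > 75")  # keeps all base rows, filters each nested dataframe
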
dropna(*, axis: pandas._typing.Axis = 0, how: str | pandas._libs.lib.NoDefault = no_default, thresh: int | pandas._libs.lib.NoDefault = no_default, on_nested: bool = False, subset: pandas._typing.IndexLabel | None = None, inplace: bool = False, ignore_index: bool = False) → Self[source]#

Remove missing values for one layer of the NestedFrame.

Parameters:
  • axis ({0 or 'index', 1 or 'columns'}, default 0) –

    Determine if rows or columns which contain missing values are removed.

    • 0, or ‘index’ : Drop rows which contain missing values.

    • 1, or ‘columns’ : Drop columns which contain missing values.

    Only a single axis is allowed.

  • how ({'any', 'all'}, default 'any') –

    Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.

    • ’any’ : If any NA values are present, drop that row or column.

    • ’all’ : If all values are NA, drop that row or column.

  • thresh (int, optional) – Require that many non-NA values. Cannot be combined with how.

  • on_nested (str or bool, optional) – If not False, applies the call to the nested dataframe in the column with label equal to the provided string. If specified, the nested dataframe should align with any columns given in subset.

  • subset (column label or sequence of labels, optional) –

    Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include.

    Access nested columns using nested_df.nested_col (where nested_df refers to a particular nested dataframe and nested_col is a column of that nested dataframe).

  • inplace (bool, default False) – Whether to modify the DataFrame rather than creating a new one.

  • ignore_index (bool, default False) –

    If True, the resulting axis will be labeled 0, 1, …, n - 1.

    Added in version 2.0.0.

Returns:

DataFrame with NA entries dropped from it or None if inplace=True.

Return type:

DataFrame or None

Notes

Operations that target a particular nested structure return a dataframe with rows of that particular nested structure affected.

Values for on_nested and subset should be consistent in pointing to a single layer; multi-layer operations are not supported at this time.
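
A sketch of dropping missing values within a nested layer (assuming a nested column named "nested" with a "flux" sub-column, as in the toy dataset):

>>> from lsdb.nested.datasets import generate_data
>>> nf = generate_data(10, 100)
>>> nf = nf.dropna(subset=["nested.flux"])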

sort_values(by: str | list[str], npartitions: int | None = None, ascending: bool | list[bool] = True, na_position: Literal['first'] | Literal['last'] = 'last', partition_size: float = 128000000.0, sort_function: collections.abc.Callable[[pandas.DataFrame], pandas.DataFrame] | None = None, sort_function_kwargs: collections.abc.Mapping[str, Any] | None = None, upsample: float = 1.0, ignore_index: bool | None = False, shuffle_method: str | None = None, **options) → Self[source]#

Sort the dataset by a single column.

Sorting a parallel dataset requires expensive shuffles and is generally not recommended. See set_index for implementation details.

Parameters:
  • by (str or list[str]) – Column(s) to sort by.

  • npartitions (int, None, or ‘auto’) – The ideal number of output partitions. If None, use the same as the input. If ‘auto’ then decide by memory use. Not used when sorting nested layers.

  • ascending (bool or list[bool], optional) – Sort ascending vs. descending. Defaults to True. Specify a list for multiple sort orders; if a list of bools is given, it must match the length of by.

  • na_position ({‘last’, ‘first’}, optional) – Puts NaNs at the beginning if ‘first’, at the end if ‘last’. Defaults to ‘last’.

  • partition_size (float, optional) – The desired size of each partition in bytes. Defaults to 128e6 (128 MB). Not used in nested sorting.

  • sort_function (function, optional) – Sorting function to use when sorting underlying partitions. If None, defaults to M.sort_values (the partition library’s implementation of sort_values). Not used when sorting nested layers.

  • sort_function_kwargs (dict, optional) – Additional keyword arguments to pass to the partition sorting function. By default, by, ascending, and na_position are provided.

  • upsample (float, optional) – Used to increase the number of samples for quantiles. Not used in nested sorting.

  • ignore_index (bool, optional) – If True, the resulting axis will be labeled 0, 1, …, n - 1. Defaults to False.

  • shuffle_method (str, optional) – The method to use for shuffling data. Defaults to None. Not used in nested sorting.

  • **options (keyword arguments, optional) – Additional options to pass to the sorting function.

Returns:

DataFrame with sorted values.

Return type:

DataFrame
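
A sketch of sorting by a base column and by a nested sub-column (the names "a" and "nested.t" assume the toy dataset's layout):

>>> from lsdb.nested.datasets import generate_data
>>> nf = generate_data(10, 100)
>>> nf = nf.sort_values(by="a")         # sort base rows
>>> nf = nf.sort_values(by="nested.t")  # sort rows within each nested dataframe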

reduce(func, *args, meta=dsk_no_default, infer_nesting=True, **kwargs) → NestedFrame[source]#

Takes a function and applies it to each top-level row of the NestedFrame.

Docstring copied from nested-pandas.

The user may specify which columns the function is applied to, with columns from the ‘base’ layer being passed to the function as scalars and columns from the nested layers being passed as numpy arrays.

Parameters:
  • func (callable) – Function to apply to each nested dataframe. The first arguments to func should be which columns to apply the function to. See the Notes for recommendations on writing func outputs.

  • args (positional arguments) – A list of string column names to pull from the NestedFrame to pass along to the function. If the function has additional arguments, pass them as keyword arguments (e.g. arg_name=value)

  • meta (dataframe or series-like, optional) – The dask meta of the output. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended.

  • infer_nesting (bool, default True) – If True, the function will pack output columns into nested structures based on column names adhering to a nested naming scheme. E.g. “nested.b” and “nested.c” will be packed into a column called “nested” with columns “b” and “c”. If False, all outputs will be returned as base columns.

  • kwargs (keyword arguments, optional) – Keyword arguments to pass to the function.

Returns:

NestedFrame with the results of the function applied to the columns of the frame.

Return type:

NestedFrame

Notes

By default, reduce will produce a NestedFrame with enumerated column names for each returned value of the function. For more useful naming, it’s recommended to have func return a dictionary where each key is an output column of the dataframe returned by reduce.

Example User Function:

>>> def my_sum(col1, col2):
...     '''reduce will return a NestedFrame with two columns'''
...     return {"sum_col1": sum(col1), "sum_col2": sum(col2)}

When using nesting inference (infer_nesting=True), the output may contain nested columns. In such cases, the meta should be provided with the appropriate dtype for these columns. For example, the following function, which produces a nested column “lc”:

>>> import numpy as np
>>> def complex_output(flux):
...     return {"max_flux": np.max(flux),
...             "lc.flux_quantiles": np.quantile(flux, [0.1, 0.2, 0.3, 0.4, 0.5]),
...             "lc.labels": [0.1, 0.2, 0.3, 0.4, 0.5]}

Would require the following meta:

>>> import pandas as pd
>>> import pyarrow as pa
>>> import nested_pandas as npd
>>> from nested_pandas.series.dtype import NestedDtype
>>> # create a NestedDtype for the nested column "lc"
>>> lc_dtype = NestedDtype(pa.struct([pa.field("flux_quantiles",
...                                            pa.list_(pa.float64())),
...                                   pa.field("labels", pa.list_(pa.float64()))]))
>>> # use the lc_dtype in meta creation
>>> result_meta = npd.NestedFrame({'max_flux': pd.Series([], dtype='float'),
...                                'lc': pd.Series([], dtype=lc_dtype)})
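
With the meta defined above, the reduce call itself might look like the following sketch (the "nested.flux" column assumes the toy dataset's layout):

>>> from lsdb.nested.datasets import generate_data
>>> nf = generate_data(10, 100)
>>> result = nf.reduce(complex_output, "nested.flux", meta=result_meta)
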
to_parquet(path, by_layer=False, **kwargs) → None[source]#

Creates parquet file(s) with the data of a NestedFrame, either as a single parquet file directory where each nested dataset is packed into its own column or as an individual parquet file directory for each layer.

Docstring copied from nested-pandas.

Note that here we always opt to use the pyarrow engine for writing parquet files.

Parameters:
  • path (str) – The path to the parquet directory to be written.

  • by_layer (bool, default False) –

    NOTE: by_layer=False will not reliably preserve divisions currently; be warned that loading from such a dataset will likely require you to reset and set the index to regenerate the divisions information.

    If False, writes the entire NestedFrame to a single parquet directory.

    If True, writes each layer to a separate parquet sub-directory within the directory specified by path. Each output file is named after its layer; for example, the base layer is always written as “base”.

  • kwargs (keyword arguments, optional) – Keyword arguments to pass to the function.

Return type:

None
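
A short sketch of both modes (the output paths are illustrative):

>>> from lsdb.nested.datasets import generate_data
>>> nf = generate_data(10, 100)
>>> nf.to_parquet("my_output_dir", by_layer=False)     # single parquet directory
>>> nf.to_parquet("my_output_layers", by_layer=True)   # one sub-directory per layer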

generate_catalog(n_base, n_layer, seed=None, ra_range=(0.0, 360.0), dec_range=(-90, 90), search_region=None, **kwargs)[source]#

Generates a toy catalog.

Parameters:
  • n_base (int) – The number of rows to generate for the base layer

  • n_layer (int, or dict) – The number of rows per n_base row to generate for a nested layer. Alternatively, a dictionary of layer label, layer_size pairs may be specified to create multiple nested columns with custom sizing.

  • seed (int) – A seed to use for random generation of data

  • ra_range (tuple) – A tuple of the min and max values for the ra column in degrees

  • dec_range (tuple) – A tuple of the min and max values for the dec column in degrees

  • search_region (AbstractSearch) – A search region to apply to the generated data. Currently supports the ConeSearch and BoxSearch regions. Note that if provided, this will override the ra_range and dec_range parameters.

  • **kwargs – Additional keyword arguments to pass to lsdb.from_dataframe.

Returns:

The constructed LSDB Catalog.

Return type:

Catalog

Examples

>>> from lsdb.nested.datasets import generate_catalog
>>> gen_cat = generate_catalog(10,100)
>>> gen_cat = generate_catalog(1000, 10, ra_range=(0.,10.), dec_range=(-5.,0.))

Constraining spatial ranges:

>>> gen_cat = generate_catalog(10, 100, ra_range=(0., 10.), dec_range=(-5., 0.))

Using a search region:

>>> from lsdb.core.search import ConeSearch  # doctest: +SKIP
>>> gen_cat = generate_catalog(10, 100, search_region=ConeSearch(5, 5, 1))

generate_data(n_base, n_layer, npartitions=1, seed=None, ra_range=(0.0, 360.0), dec_range=(-90, 90), search_region=None)[source]#

Generates a toy dataset.

Docstring copied from nested-pandas.

Parameters:
  • n_base (int) – The number of rows to generate for the base layer

  • n_layer (int, or dict) – The number of rows per n_base row to generate for a nested layer. Alternatively, a dictionary of layer label, layer_size pairs may be specified to create multiple nested columns with custom sizing.

  • npartitions (int) – The number of partitions to split the data into.

  • seed (int) – A seed to use for random generation of data

  • ra_range (tuple) – A tuple of the min and max values for the ra column in degrees

  • dec_range (tuple) – A tuple of the min and max values for the dec column in degrees

  • search_region (AbstractSearch) – A search region to apply to the generated data. Currently supports the ConeSearch, BoxSearch, and PixelSearch regions. Note that if provided, this will override the ra_range and dec_range parameters.

Returns:

The constructed Dask NestedFrame.

Return type:

NestedFrame

Examples

>>> from lsdb.nested.datasets import generate_data
>>> nf = generate_data(10,100)
>>> nf = generate_data(10, {"nested_a": 100, "nested_b": 200})

Constraining spatial ranges:

>>> nf = generate_data(10, 100, ra_range=(0., 10.), dec_range=(-5., 0.))

Using a search region:

>>> from lsdb.core.search import ConeSearch
>>> nf = generate_data(10, 100, search_region=ConeSearch(5, 5, 1))

read_parquet(path, columns=None, filters=None, categories=None, index=None, storage_options=None, engine='auto', use_nullable_dtypes: bool | None = None, dtype_backend=None, calculate_divisions=None, ignore_metadata_file=False, metadata_task_size=None, split_row_groups='infer', blocksize='default', aggregate_files=None, parquet_file_extension=('.parq', '.parquet', '.pq'), filesystem=None, **kwargs) → lsdb.nested.core.NestedFrame[source]#

Read a Parquet file into a Dask DataFrame

This reads a directory of Parquet data into a Dask.dataframe, one file per partition. It selects the index among the sorted columns if any exist.

Docstring copied from dask.dataframe.read_parquet

Parameters:
  • path (str or list) – Source directory for data, or path(s) to individual parquet files. Prefix with a protocol like s3:// to read from alternative filesystems. To read from multiple files you can pass a globstring or a list of paths, with the caveat that they must all have the same protocol.

  • columns (str or list, default None) – Field name(s) to read in as columns in the output. By default all non-index fields will be read (as determined by the pandas parquet metadata, if present). Provide a single field name instead of a list to read in the data as a Series.

  • filters (Union[List[Tuple[str, str, Any]], List[List[Tuple[str, str, Any]]]], default None) –

    List of filters to apply, like [[('col1', '==', 0), ...], ...]. Using this argument will result in row-wise filtering of the final partitions.

    Predicates can be expressed in disjunctive normal form (DNF). This means that the inner-most tuple describes a single column predicate. These inner predicates are combined with an AND conjunction into a larger predicate. The outer-most list then combines all of the combined filters with an OR disjunction.

    Predicates can also be expressed as a List[Tuple]. These are evaluated as an AND conjunction. To express OR in predicates, one must use the (preferred for “pyarrow”) List[List[Tuple]] notation.

  • index (str, list or False, default None) – Field name(s) to use as the output frame index. By default will be inferred from the pandas parquet file metadata, if present. Use False to read all fields as columns.

  • categories (list or dict, default None) – For any fields listed here, if the parquet encoding is Dictionary, the column will be created with dtype category. Use only if it is guaranteed that the column is encoded as dictionary in all row-groups. If a list, assumes up to 2**16-1 labels; if a dict, specify the number of labels expected; if None, will load categories automatically for data written by dask, not otherwise.

  • storage_options (dict, default None) – Key/value pairs to be passed on to the file-system backend, if any. Note that the default file-system backend can be configured with the filesystem argument, described below.

  • open_file_options (dict, default None) – Key/value arguments to be passed along to AbstractFileSystem.open when each parquet data file is open for reading. Experimental (optimized) “precaching” for remote file systems (e.g. S3, GCS) can be enabled by adding {"method": "parquet"} under the "precache_options" key. Also, a custom file-open function can be used (instead of AbstractFileSystem.open), by specifying the desired function under the "open_file_func" key.

  • engine ({'auto', 'pyarrow'}) – Parquet library to use. Defaults to ‘auto’, which uses pyarrow if it is installed, and falls back to the deprecated fastparquet otherwise. Note that fastparquet does not support all functionality offered by pyarrow. This is also used by third-party packages (e.g. CuDF) to inject bespoke engines.

  • use_nullable_dtypes ({False, True}) –

    Whether to use extension dtypes for the resulting DataFrame.

    Note

    This option is deprecated. Use “dtype_backend” instead.

  • dtype_backend ({'numpy_nullable', 'pyarrow'}, defaults to NumPy backed DataFrames) – Which dtype_backend to use, e.g. whether a DataFrame should have NumPy arrays, nullable dtypes are used for all dtypes that have a nullable implementation when ‘numpy_nullable’ is set, pyarrow is used for all dtypes if ‘pyarrow’ is set. dtype_backend="pyarrow" requires pandas 1.5+.

  • calculate_divisions (bool, default False) – Whether to use min/max statistics from the footer metadata (or global _metadata file) to calculate divisions for the output DataFrame collection. Divisions will not be calculated if statistics are missing. This option will be ignored if index is not specified and there is no physical index column specified in the custom “pandas” Parquet metadata. Note that calculate_divisions=True may be extremely slow when no global _metadata file is present, especially when reading from remote storage. Set this to True only when known divisions are needed for your workload (see dataframe-design-partitions).

  • ignore_metadata_file (bool, default False) – Whether to ignore the global _metadata file (when one is present). If True, or if the global _metadata file is missing, the parquet metadata may be gathered and processed in parallel. Parallel metadata processing is currently supported for ArrowDatasetEngine only.

  • metadata_task_size (int, default configurable) – If parquet metadata is processed in parallel (see ignore_metadata_file description above), this argument can be used to specify the number of dataset files to be processed by each task in the Dask graph. If this argument is set to 0, parallel metadata processing will be disabled. The default values for local and remote filesystems can be specified with the “metadata-task-size-local” and “metadata-task-size-remote” config fields, respectively (see “dataframe.parquet”).

  • split_row_groups ('infer', 'adaptive', bool, or int, default 'infer') – If True, then each output dataframe partition will correspond to a single parquet-file row-group. If False, each partition will correspond to a complete file. If a positive integer value is given, each dataframe partition will correspond to that number of parquet row-groups (or fewer). If ‘adaptive’, the metadata of each file will be used to ensure that every partition satisfies blocksize. If ‘infer’ (the default), the uncompressed storage-size metadata in the first file will be used to automatically set split_row_groups to either ‘adaptive’ or False.

  • blocksize (int or str, default 'default') – The desired size of each output DataFrame partition in terms of total (uncompressed) parquet storage space. This argument is currently used to set the default value of split_row_groups (using row-group metadata from a single file), and will be ignored if split_row_groups is not set to ‘infer’ or ‘adaptive’. Default is 256 MiB.

  • aggregate_files (bool or str, default None) –

    WARNING: Passing a string argument to aggregate_files will result in experimental behavior. This behavior may change in the future.

    Whether distinct file paths may be aggregated into the same output partition. This parameter is only used when split_row_groups is set to ‘infer’, ‘adaptive’ or to an integer >1. A setting of True means that any two file paths may be aggregated into the same output partition, while False means that inter-file aggregation is prohibited.

    For “hive-partitioned” datasets, a “partition”-column name can also be specified. In this case, we allow the aggregation of any two files sharing a file path up to, and including, the corresponding directory name. For example, if aggregate_files is set to "section" for the directory structure below, 03.parquet and 04.parquet may be aggregated together, but 01.parquet and 02.parquet cannot be. If, however, aggregate_files is set to "region", 01.parquet may be aggregated with 02.parquet, and 03.parquet may be aggregated with 04.parquet:

    dataset-path/
    ├── region=1/
    │   ├── section=a/
    │   │   └── 01.parquet
    │   └── section=b/
    │       └── 02.parquet
    └── region=2/
        └── section=a/
            ├── 03.parquet
            └── 04.parquet

    Note that the default behavior of aggregate_files is False.

  • parquet_file_extension (str, tuple[str], or None, default (".parq", ".parquet", ".pq")) –

    A file extension or an iterable of extensions to use when discovering parquet files in a directory. Files that don’t match these extensions will be ignored. This argument only applies when paths corresponds to a directory and no _metadata file is present (or ignore_metadata_file=True). Passing in parquet_file_extension=None will treat all files in the directory as parquet files.

    The purpose of this argument is to ensure that the engine will ignore unsupported metadata files (like Spark’s ‘_SUCCESS’ and ‘crc’ files). It may be necessary to change this argument if the data files in your parquet dataset do not end in “.parq”, “.parquet”, or “.pq”.

  • filesystem ("fsspec", "arrow", or fsspec.AbstractFileSystem backend to use.) – Specifies the backend to use.

  • dataset (dict, default None) –

    Dictionary of options to use when creating a pyarrow.dataset.Dataset object. These options may include a “filesystem” key to configure the desired file-system backend. However, the top-level filesystem argument will always take precedence.

    Note: The dataset options may include a “partitioning” key. However, since pyarrow.dataset.Partitioning objects cannot be serialized, the value can be a dict of key-word arguments for the pyarrow.dataset.partitioning API (e.g. dataset={"partitioning": {"flavor": "hive", "schema": ...}}). Note that partitioned columns will not be converted to categorical dtypes when a custom partitioning schema is specified in this way.

  • read (dict, default None) – Dictionary of options to pass through to engine.read_partitions using the read key-word argument.

  • arrow_to_pandas (dict, default None) – Dictionary of options to use when converting from pyarrow.Table to a pandas DataFrame object. Only used by the “arrow” engine.

  • **kwargs (dict (of dicts)) – Options to pass through to engine.read_partitions as stand-alone key-word arguments. Note that these options will be ignored by the engines defined in dask.dataframe, but may be used by other custom implementations.

Examples

>>> df = dd.read_parquet('s3://bucket/my-parquet-data')  
count_nested(df, nested, by=None, join=True) → lsdb.nested.core.NestedFrame[source]#

Counts the number of rows of a nested dataframe.

Wraps Nested-Pandas count_nested.

Parameters:
  • df (NestedFrame) – A NestedFrame that contains the desired nested series to count.

  • nested (str) – The label of the nested series to count.

  • by (str, optional) – Specifies a column within nested to count by, returning a count for each unique value in by.

  • join (bool, optional) – If True, join the output count columns to df and return df; otherwise, return a NestedFrame containing only the count columns.

Return type:

NestedFrame
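
A usage sketch with the toy dataset (the "band" sub-column is assumed from the toy dataset's layout):

>>> from lsdb.nested import count_nested
>>> from lsdb.nested.datasets import generate_data
>>> nf = generate_data(10, 100)
>>> counts = count_nested(nf, "nested", by="band", join=True)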