lsdb.nested#
Submodules#
Classes#
NestedFrame – An extension for a Dask Dataframe that has Nested-Pandas functionality.
Functions#
generate_catalog – Generates a toy catalog.
generate_data – Generates a toy dataset.
read_parquet – Read a Parquet file into a Dask DataFrame.
count_nested – Counts the number of rows of a nested dataframe.
Package Contents#
- class NestedFrame(expr)[source]#
Bases: _Frame, dask.dataframe.DataFrame
An extension for a Dask Dataframe that has Nested-Pandas functionality.
Examples
>>> import lsdb.nested as nd
>>> base = nd.NestedFrame(base_data)
>>> layer = nd.NestedFrame(layer_data)
>>> base.add_nested(layer, "layer")
- _partition_type#
- classmethod from_pandas(data, npartitions=None, chunksize=None, sort=True) NestedFrame [source]#
Returns an LSDB.nested NestedFrame constructed from a Nested-Pandas NestedFrame or Pandas DataFrame.
- Parameters:
data (NestedFrame or DataFrame) – Nested-Pandas NestedFrame containing the underlying data
npartitions (int, optional) – The number of partitions of the index to create. Note that depending on the size and index of the dataframe, the output may have fewer partitions than requested.
chunksize (int, optional) – The desired number of rows per index partition to use. Note that depending on the size and index of the dataframe, actual partition sizes may vary.
sort (bool, optional) – Whether to sort the frame by a default index.
- Returns:
result – The constructed Dask-Nested NestedFrame object.
- Return type:
NestedFrame
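A minimal usage sketch (the toy columns "a" and "b" are illustrative, not part of the API):
>>> import pandas as pd
>>> import lsdb.nested as nd
>>> flat = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
>>> ndf = nd.NestedFrame.from_pandas(flat, npartitions=2)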
- classmethod from_dask_dataframe(df: dask.dataframe.DataFrame) NestedFrame [source]#
Converts a Dask Dataframe to a Dask-Nested NestedFrame
- Parameters:
df – A Dask Dataframe to convert
- Return type:
lsdb.nested.NestedFrame
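A minimal usage sketch (the column name is illustrative):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> import lsdb.nested as nd
>>> ddf = dd.from_pandas(pd.DataFrame({"a": [1, 2, 3]}), npartitions=1)
>>> ndf = nd.NestedFrame.from_dask_dataframe(ddf)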
- classmethod from_delayed(dfs, meta=None, divisions=None, prefix='from-delayed', verify_meta=True)[source]#
Create LSDB.nested NestedFrames from many Dask Delayed objects.
Docstring is copied from dask.dataframe.from_delayed.
- Parameters:
dfs – A dask.delayed.Delayed, a distributed.Future, or an iterable of either of these objects (e.g. returned by client.submit). These comprise the individual partitions of the resulting dataframe. If a single object is provided (not an iterable), then the resulting dataframe will have only one partition.
meta – An empty NestedFrame, pd.DataFrame, or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
divisions – Partition boundaries along the index. For tuple, see https://docs.dask.org/en/latest/dataframe-design.html#partitions. For string 'sorted' will compute the delayed values to find index values. Assumes that the indexes are mutually sorted. If None, then won't use index information.
prefix – Prefix to prepend to the keys.
verify_meta – If True, check that the partitions have consistent metadata. Defaults to True.
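A minimal sketch, assuming each delayed object returns a nested-pandas NestedFrame matching the meta (the helper make_partition and its column are illustrative, and nested_pandas is assumed importable as npd):
>>> import dask
>>> import pandas as pd
>>> import nested_pandas as npd
>>> import lsdb.nested as nd
>>> @dask.delayed
... def make_partition(start):
...     # each call builds one partition of the final frame
...     return npd.NestedFrame({"a": [start, start + 1]}, index=[start, start + 1])
>>> meta = npd.NestedFrame({"a": pd.Series([], dtype="int64")})
>>> ndf = nd.NestedFrame.from_delayed(
...     [make_partition(0), make_partition(2)], meta=meta)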
- classmethod from_map(func, *iterables, args=None, meta=None, divisions=None, label=None, enforce_metadata=True, **kwargs)[source]#
Create a DataFrame collection from a custom function map
WARNING: The from_map API is experimental, and stability is not yet guaranteed. Use at your own risk!
- Parameters:
func (callable) – Function used to create each partition. If func satisfies the DataFrameIOFunction protocol, column projection will be enabled.
*iterables (Iterable objects) – Iterable objects to map to each output partition. All iterables must be the same length. This length determines the number of partitions in the output collection (only one element of each iterable will be passed to func for each partition).
args (list or tuple, optional) – Positional arguments to broadcast to each output partition. Note that these arguments will always be passed to func after the iterables positional arguments.
meta – An empty NestedFrame, pd.DataFrame, or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
divisions (tuple, str, optional) – Partition boundaries along the index. For tuple, see https://docs.dask.org/en/latest/dataframe-design.html#partitions. For string 'sorted' will compute the delayed values to find index values. Assumes that the indexes are mutually sorted. If None, then won't use index information.
label (str, optional) – String to use as the function-name label in the output collection-key names.
enforce_metadata (bool, default True) – Whether to enforce at runtime that the structure of the DataFrame produced by func actually matches the structure of meta. This will rename and reorder columns for each partition, and will raise an error if this doesn't work, but it won't raise if dtypes don't match.
**kwargs – Key-word arguments to broadcast to each output partition. These same arguments will be passed to func for every output partition.
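A minimal sketch, assuming each element of the iterable maps to one output partition (the helper load_block and its column are illustrative):
>>> import pandas as pd
>>> import nested_pandas as npd
>>> import lsdb.nested as nd
>>> def load_block(i):
...     # one output partition per element of the iterable
...     return npd.NestedFrame({"a": [i, i + 10]}, index=[2 * i, 2 * i + 1])
>>> meta = npd.NestedFrame({"a": pd.Series([], dtype="int64")})
>>> ndf = nd.NestedFrame.from_map(load_block, [0, 1, 2], meta=meta)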
- classmethod from_flat(df, base_columns, nested_columns=None, on=None, name='nested')[source]#
Creates a NestedFrame with base and nested columns from a flat dataframe.
- Parameters:
df (dd.DataFrame or nd.NestedFrame) – A flat dataframe.
base_columns (list-like) – The columns that should be used as base (flat) columns in the output dataframe.
nested_columns (list-like, or None) – The columns that should be packed into a nested column. All columns in the list will attempt to be packed into a single nested column with the name provided in nested_name. If None, is defined as all columns not in base_columns.
on (str or None) – The name of a column to use as the new index. Typically, the index should have a unique value per row for base columns, and should repeat for nested columns. For example, a dataframe with two columns; a=[1,1,1,2,2,2] and b=[5,10,15,20,25,30] would want an index like [0,0,0,1,1,1] if a is chosen as a base column. If not provided the current index will be used.
name – The name of the output column the nested_columns are packed into.
- Returns:
A NestedFrame with the specified nesting structure.
- Return type:
NestedFrame
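A minimal sketch following the index example above (the columns "a" and "b" and the repeated index are illustrative):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> import lsdb.nested as nd
>>> flat = dd.from_pandas(
...     pd.DataFrame({"a": [1, 1, 1, 2, 2, 2],
...                   "b": [5, 10, 15, 20, 25, 30]},
...                  index=[0, 0, 0, 1, 1, 1]),
...     npartitions=1)
>>> ndf = nd.NestedFrame.from_flat(flat, base_columns=["a"], name="nested")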
- classmethod from_lists(df, base_columns=None, list_columns=None, name='nested')[source]#
Creates a NestedFrame with base and nested columns from a dataframe with list columns.
- Parameters:
df (dd.DataFrame or nd.NestedFrame) – A dataframe with list columns.
base_columns (list-like, or None) – Any columns that have non-list values in the input df. These will simply be kept as identical columns in the result
list_columns (list-like, or None) – The list-value columns that should be packed into a nested column. All columns in the list will attempt to be packed into a single nested column with the name provided in nested_name. All columns in list_columns must have pyarrow list dtypes, otherwise the operation will fail. If None, is defined as all columns not in base_columns.
name – The name of the output column the nested_columns are packed into.
- Returns:
A NestedFrame with the specified nesting structure.
- Return type:
NestedFrame
Note
As noted above, all columns in list_columns must have a pyarrow ListType dtype. This is needed for proper meta propagation. To convert a list column to this dtype, you can use this command structure: nf = nf.astype({"colname": pd.ArrowDtype(pa.list_(pa.int64()))})
Where pa.int64 above should be replaced with the correct dtype of the underlying data accordingly.
Additionally, it's a known issue in Dask (dask/dask#10139) that columns with list values will by default be converted to the string type. This will interfere with the ability to recast these to pyarrow lists. We recommend setting the following dask config setting to prevent this: dask.config.set({"dataframe.convert-string": False})
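A minimal sketch putting the note above into practice (the column names and the float list dtype are illustrative):
>>> import pandas as pd
>>> import pyarrow as pa
>>> import dask
>>> import dask.dataframe as dd
>>> import lsdb.nested as nd
>>> dask.config.set({"dataframe.convert-string": False})  # doctest: +SKIP
>>> flat = dd.from_pandas(
...     pd.DataFrame({"a": [1, 2], "fluxes": [[1.0, 2.0], [3.0, 4.0]]}),
...     npartitions=1)
>>> # recast the list column to a pyarrow list dtype before packing
>>> flat = flat.astype({"fluxes": pd.ArrowDtype(pa.list_(pa.float64()))})
>>> ndf = nd.NestedFrame.from_lists(flat, base_columns=["a"], name="nested")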
- compute(**kwargs)[source]#
Compute this Dask collection, returning the underlying dataframe or series.
- property all_columns: dict#
Returns a dictionary of the column names for the base layer and each nested dataframe.
- property nested_columns: list#
Retrieves the base-level column names of all nested dataframes (i.e. the names of the nested columns).
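A small sketch of these properties using the toy dataset (the layer name "nested" and its sub-columns are assumptions about generate_data's default output):
>>> from lsdb.nested.datasets import generate_data
>>> nf = generate_data(5, 10)
>>> cols = nf.all_columns       # e.g. {"base": [...], "nested": [...]}
>>> layers = nf.nested_columns  # e.g. ["nested"]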
- _is_known_hierarchical_column(colname) bool [source]#
Determine whether a string is a known hierarchical column name
- add_nested(nested, name, how='outer') NestedFrame [source]#
Packs a dataframe into a nested column
- Parameters:
nested – A flat dataframe to pack into a nested column
name – The name given to the nested column
how ({‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘outer’) –
How to handle the operation of the two objects.
left: use the calling frame's index (or column if on is specified).
right: use other's index.
outer: form union of the calling frame's index (or column if on is specified) with other's index, and sort it lexicographically.
inner: form intersection of the calling frame's index (or column if on is specified) with other's index, preserving the order of the calling frame's index.
cross: creates the cartesian product from both frames, preserving the order of the left keys.
- Return type:
lsdb.nested.NestedFrame
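A minimal sketch of packing a layer into a base frame (the column names and repeated index are illustrative):
>>> import pandas as pd
>>> import lsdb.nested as nd
>>> base = nd.NestedFrame.from_pandas(
...     pd.DataFrame({"a": [1, 2]}, index=[0, 1]), npartitions=1)
>>> layer = nd.NestedFrame.from_pandas(
...     pd.DataFrame({"t": [1, 2, 3, 4]}, index=[0, 0, 1, 1]), npartitions=1)
>>> joined = base.add_nested(layer, "layer", how="left")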
- query(expr) Self [source]#
Query the columns of a NestedFrame with a boolean expression. Specified queries can target nested columns in addition to the typical column set
Docstring copied from nested-pandas query
- Parameters:
expr (str) –
The query string to evaluate.
Access nested columns using nested_df.nested_col (where nested_df refers to a particular nested dataframe and nested_col is a column of that nested dataframe).
You can refer to variables in the environment by prefixing them with an '@' character like @a + b.
You can refer to column names that are not valid Python variable names by surrounding them in backticks. Thus, column names containing spaces or punctuation (besides underscores) or starting with digits must be surrounded by backticks. (For example, a column named "Area (cm^2)" would be referenced as `Area (cm^2)`.) Column names which are Python keywords (like "list", "for", "import", etc.) cannot be used.
For example, if one of your columns is called a a and you want to sum it with b, your query should be `a a` + b.
- Returns:
DataFrame resulting from the provided query expression.
- Return type:
DataFrame
Notes
Queries that target a particular nested structure return a dataframe with rows of that particular nested structure filtered. For example, querying the NestedFrame "df" with nested structure "my_nested" as below will return all rows of df, but with my_nested filtered by the condition:
>>> df.query("my_nested.a > 2")
- dropna(*, axis: pandas._typing.Axis = 0, how: str | pandas._libs.lib.NoDefault = no_default, thresh: int | pandas._libs.lib.NoDefault = no_default, on_nested: bool = False, subset: pandas._typing.IndexLabel | None = None, inplace: bool = False, ignore_index: bool = False) Self [source]#
Remove missing values for one layer of the NestedFrame.
- Parameters:
axis ({0 or 'index', 1 or 'columns'}, default 0) –
Determine if rows or columns which contain missing values are removed.
0, or ‘index’ : Drop rows which contain missing values.
1, or ‘columns’ : Drop columns which contain missing values.
Only a single axis is allowed.
how ({'any', 'all'}, default 'any') –
Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.
’any’ : If any NA values are present, drop that row or column.
’all’ : If all values are NA, drop that row or column.
thresh (int, optional) – Require that many non-NA values. Cannot be combined with how.
on_nested (str or bool, optional) – If not False, applies the call to the nested dataframe in the column with label equal to the provided string. If specified, the nested dataframe should align with any columns given in subset.
subset (column label or sequence of labels, optional) –
Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include.
Access nested columns using nested_df.nested_col (where nested_df refers to a particular nested dataframe and nested_col is a column of that nested dataframe).
inplace (bool, default False) – Whether to modify the DataFrame rather than creating a new one.
ignore_index (bool, default False) – If True, the resulting axis will be labeled 0, 1, …, n - 1. Added in version 2.0.0.
- Returns:
DataFrame with NA entries dropped from it or None if inplace=True.
- Return type:
DataFrame or None
Notes
Operations that target a particular nested structure return a dataframe with rows of that particular nested structure affected.
Values for on_nested and subset should be consistent in pointing to a single layer, multi-layer operations are not supported at this time.
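A usage sketch dropping NA rows within a single nested layer (the "nested.flux" label assumes the toy dataset's default layer and sub-column names):
>>> from lsdb.nested.datasets import generate_data
>>> nf = generate_data(5, 10)
>>> cleaned = nf.dropna(subset=["nested.flux"])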
- sort_values(by: str | list[str], npartitions: int | None = None, ascending: bool | list[bool] = True, na_position: Literal['first'] | Literal['last'] = 'last', partition_size: float = 128000000.0, sort_function: collections.abc.Callable[[pandas.DataFrame], pandas.DataFrame] | None = None, sort_function_kwargs: collections.abc.Mapping[str, Any] | None = None, upsample: float = 1.0, ignore_index: bool | None = False, shuffle_method: str | None = None, **options) Self [source]#
Sort the dataset by a single column.
Sorting a parallel dataset requires expensive shuffles and is generally not recommended. See set_index for implementation details.
Parameters:#
- by: str or list[str]
Column(s) to sort by.
- npartitions: int, None, or ‘auto’
The ideal number of output partitions. If None, use the same as the input. If ‘auto’ then decide by memory use. Not used when sorting nested layers.
- ascending: bool or list[bool], optional
Sort ascending vs. descending. Defaults to True. Specify a list for multiple sort orders. If this is a list of bools, it must match the length of by.
- na_position: {‘last’, ‘first’}, optional
Puts NaNs at the beginning if ‘first’, puts NaN at the end if ‘last’. Defaults to ‘last’.
- partition_size: float, optional
The desired size of each partition in bytes. Defaults to 128e6 (128 MB). Not used in nested sorting.
- sort_function: function, optional
Sorting function to use when sorting underlying partitions. If None, defaults to M.sort_values (the partition library’s implementation of sort_values). Not used when sorting nested layers.
- sort_function_kwargs: dict, optional
Additional keyword arguments to pass to the partition sorting function. By default, by, ascending, and na_position are provided.
- upsample: float, optional
Used to increase the number of samples for quantiles. Not used in nested sorting
- ignore_index: bool, optional
If True, the resulting axis will be labeled 0, 1, …, n - 1. Defaults to False.
- shuffle_method: str, optional
The method to use for shuffling data. Defaults to None. Not used in nested sorting
- **options: keyword arguments, optional
Additional options to pass to the sorting function.
Returns:#
- DataFrame
DataFrame with sorted values.
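A minimal sketch sorting by a base column (the column "a" assumes the toy dataset's default layout):
>>> from lsdb.nested.datasets import generate_data
>>> nf = generate_data(5, 10)
>>> nf_sorted = nf.sort_values(by="a", ascending=False)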
- reduce(func, *args, meta=dsk_no_default, infer_nesting=True, **kwargs) NestedFrame [source]#
Takes a function and applies it to each top-level row of the NestedFrame.
Docstring copied from nested-pandas.
The user may specify which columns the function is applied to, with columns from the 'base' layer being passed to the function as scalars and columns from the nested layers being passed as numpy arrays.
- Parameters:
func (callable) – Function to apply to each nested dataframe. The first arguments to func should be which columns to apply the function to. See the Notes for recommendations on writing func outputs.
args (positional arguments) – A list of string column names to pull from the NestedFrame to pass along to the function. If the function has additional arguments, pass them as keyword arguments (e.g. arg_name=value)
meta (dataframe or series-like, optional) – The dask meta of the output. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended.
infer_nesting (bool, default True) – If True, the function will pack output columns into nested structures based on column names adhering to a nested naming scheme. E.g. “nested.b” and “nested.c” will be packed into a column called “nested” with columns “b” and “c”. If False, all outputs will be returned as base columns.
kwargs (keyword arguments, optional) – Keyword arguments to pass to the function.
- Returns:
NestedFrame with the results of the function applied to the columns of the frame.
- Return type:
NestedFrame
Notes
By default, reduce will produce a NestedFrame with enumerated column names for each returned value of the function. For more useful naming, it’s recommended to have func return a dictionary where each key is an output column of the dataframe returned by reduce.
Example User Function:
>>> def my_sum(col1, col2):
...     '''reduce will return a NestedFrame with two columns'''
...     return {"sum_col1": sum(col1), "sum_col2": sum(col2)}
When using nesting inference (infer_nesting=True), the output may contain nested columns. In such cases, the meta should be provided with the appropriate dtype for these columns. For example, the following function, which produces a nested column “lc”:
>>> def complex_output(flux):
...     return {"max_flux": np.max(flux),
...             "lc.flux_quantiles": np.quantile(flux, [0.1, 0.2, 0.3, 0.4, 0.5]),
...             "lc.labels": [0.1, 0.2, 0.3, 0.4, 0.5]}
Would require the following meta:
>>> import pandas as pd
>>> import pyarrow as pa
>>> import nested_pandas as npd
>>> # create a NestedDtype for the nested column "lc"
>>> from nested_pandas.series.dtype import NestedDtype
>>> lc_dtype = NestedDtype(pa.struct([pa.field("flux_quantiles",
...                                             pa.list_(pa.float64())),
...                                   pa.field("labels", pa.list_(pa.float64()))]))
>>> # use the lc_dtype in meta creation
>>> result_meta = npd.NestedFrame({'max_flux': pd.Series([], dtype='float'),
...                                'lc': pd.Series([], dtype=lc_dtype)})
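A usage sketch applying a simple reduction over a nested column of the toy dataset (the "nested.flux" column and the dict-style meta are assumptions about generate_data's default layout and dask's meta handling):
>>> import numpy as np
>>> from lsdb.nested.datasets import generate_data
>>> nf = generate_data(5, 10)
>>> def mean_flux(flux):
...     # reduce passes the nested "flux" values of each row as a numpy array
...     return {"mean_flux": np.mean(flux)}
>>> result = nf.reduce(mean_flux, "nested.flux", meta={"mean_flux": float})
>>> df = result.compute()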
- to_parquet(path, by_layer=False, **kwargs) None [source]#
Creates parquet file(s) with the data of a NestedFrame, either as a single parquet file directory where each nested dataset is packed into its own column or as an individual parquet file directory for each layer.
Docstring copied from nested-pandas.
Note that here we always opt to use the pyarrow engine for writing parquet files.
- Parameters:
path (str) – The path to the parquet directory to be written.
by_layer (bool, default False) –
If False, writes the entire NestedFrame to a single parquet directory.
If True, writes each layer to a separate parquet sub-directory within the directory specified by path. The output file for each layer is named after that layer; for the base layer this is always "base".
NOTE: by_layer=False will not reliably preserve divisions currently; be warned that loading from such a dataset will likely require you to reset and set the index to regenerate the divisions information.
kwargs (keyword arguments, optional) – Keyword arguments to pass to the function.
- Return type:
None
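A minimal sketch (the output path is illustrative):
>>> from lsdb.nested.datasets import generate_data
>>> nf = generate_data(5, 10)
>>> nf.to_parquet("toy_nf", by_layer=True)  # one sub-directory per layer, including "base"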
- generate_catalog(n_base, n_layer, seed=None, ra_range=(0.0, 360.0), dec_range=(-90, 90), search_region=None, **kwargs)[source]#
Generates a toy catalog.
- Parameters:
n_base (int) – The number of rows to generate for the base layer
n_layer (int, or dict) – The number of rows per n_base row to generate for a nested layer. Alternatively, a dictionary of layer label, layer_size pairs may be specified to create multiple nested columns with custom sizing.
seed (int) – A seed to use for random generation of data
ra_range (tuple) – A tuple of the min and max values for the ra column in degrees
dec_range (tuple) – A tuple of the min and max values for the dec column in degrees
search_region (AbstractSearch) – A search region to apply to the generated data. Currently supports the ConeSearch and BoxSearch regions. Note that if provided, this will override the ra_range and dec_range parameters.
**kwargs – Additional keyword arguments to pass to lsdb.from_dataframe.
- Returns:
The constructed LSDB Catalog.
- Return type:
Catalog
Examples
>>> from lsdb.nested.datasets import generate_catalog
>>> gen_cat = generate_catalog(10, 100)
>>> gen_cat = generate_catalog(1000, 10, ra_range=(0., 10.), dec_range=(-5., 0.))
Constraining spatial ranges:
>>> gen_cat = generate_catalog(10, 100, ra_range=(0., 10.), dec_range=(-5., 0.))
Using a search region:
>>> from lsdb.core.search import ConeSearch  # doctest: +SKIP
>>> gen_cat = generate_catalog(10, 100, search_region=ConeSearch(5, 5, 1))
- generate_data(n_base, n_layer, npartitions=1, seed=None, ra_range=(0.0, 360.0), dec_range=(-90, 90), search_region=None)[source]#
Generates a toy dataset.
Docstring copied from nested-pandas.
- Parameters:
n_base (int) – The number of rows to generate for the base layer
n_layer (int, or dict) – The number of rows per n_base row to generate for a nested layer. Alternatively, a dictionary of layer label, layer_size pairs may be specified to create multiple nested columns with custom sizing.
npartitions (int) – The number of partitions to split the data into.
seed (int) – A seed to use for random generation of data
ra_range (tuple) – A tuple of the min and max values for the ra column in degrees
dec_range (tuple) – A tuple of the min and max values for the dec column in degrees
search_region (AbstractSearch) – A search region to apply to the generated data. Currently supports the ConeSearch, BoxSearch, and PixelSearch regions. Note that if provided, this will override the ra_range and dec_range parameters.
- Returns:
The constructed Dask NestedFrame.
- Return type:
NestedFrame
Examples
>>> from lsdb.nested.datasets import generate_data
>>> nf = generate_data(10, 100)
>>> nf = generate_data(10, {"nested_a": 100, "nested_b": 200})
Constraining spatial ranges:
>>> nf = generate_data(10, 100, ra_range=(0., 10.), dec_range=(-5., 0.))
Using a search region:
>>> from lsdb.core.search import ConeSearch
>>> nf = generate_data(10, 100, search_region=ConeSearch(5, 5, 1))
- read_parquet(path, columns=None, filters=None, categories=None, index=None, storage_options=None, engine='auto', use_nullable_dtypes: bool | None = None, dtype_backend=None, calculate_divisions=None, ignore_metadata_file=False, metadata_task_size=None, split_row_groups='infer', blocksize='default', aggregate_files=None, parquet_file_extension=('.parq', '.parquet', '.pq'), filesystem=None, **kwargs) lsdb.nested.core.NestedFrame [source]#
Read a Parquet file into a Dask DataFrame
This reads a directory of Parquet data into a Dask.dataframe, one file per partition. It selects the index among the sorted columns if any exist.
Docstring copied from dask.dataframe.read_parquet
- Parameters:
path (str or list) – Source directory for data, or path(s) to individual parquet files. Prefix with a protocol like s3:// to read from alternative filesystems. To read from multiple files you can pass a globstring or a list of paths, with the caveat that they must all have the same protocol.
columns (str or list, default None) – Field name(s) to read in as columns in the output. By default all non-index fields will be read (as determined by the pandas parquet metadata, if present). Provide a single field name instead of a list to read in the data as a Series.
filters (Union[List[Tuple[str, str, Any]], List[List[Tuple[str, str, Any]]]], default None) –
List of filters to apply, like [[('col1', '==', 0), ...], ...]. Using this argument will result in row-wise filtering of the final partitions.
Predicates can be expressed in disjunctive normal form (DNF). This means that the inner-most tuple describes a single column predicate. These inner predicates are combined with an AND conjunction into a larger predicate. The outer-most list then combines all of the combined filters with an OR disjunction.
Predicates can also be expressed as a List[Tuple]. These are evaluated as an AND conjunction. To express OR in predicates, one must use the (preferred for "pyarrow") List[List[Tuple]] notation.
index (str, list or False, default None) – Field name(s) to use as the output frame index. By default will be inferred from the pandas parquet file metadata, if present. Use False to read all fields as columns.
categories (list or dict, default None) – For any fields listed here, if the parquet encoding is Dictionary, the column will be created with dtype category. Use only if it is guaranteed that the column is encoded as dictionary in all row-groups. If a list, assumes up to 2**16-1 labels; if a dict, specify the number of labels expected; if None, will load categories automatically for data written by dask, not otherwise.
storage_options (dict, default None) – Key/value pairs to be passed on to the file-system backend, if any. Note that the default file-system backend can be configured with the filesystem argument, described below.
open_file_options (dict, default None) – Key/value arguments to be passed along to AbstractFileSystem.open when each parquet data file is open for reading. Experimental (optimized) "precaching" for remote file systems (e.g. S3, GCS) can be enabled by adding {"method": "parquet"} under the "precache_options" key. Also, a custom file-open function can be used (instead of AbstractFileSystem.open), by specifying the desired function under the "open_file_func" key.
engine ({'auto', 'pyarrow'}) – Parquet library to use. Defaults to 'auto', which uses pyarrow if it is installed, and falls back to the deprecated fastparquet otherwise. Note that fastparquet does not support all functionality offered by pyarrow. This is also used by third-party packages (e.g. CuDF) to inject bespoke engines.
use_nullable_dtypes ({False, True}) –
Whether to use extension dtypes for the resulting DataFrame.
Note
This option is deprecated. Use "dtype_backend" instead.
dtype_backend ({'numpy_nullable', 'pyarrow'}, defaults to NumPy backed DataFrames) – Which dtype_backend to use, e.g. whether a DataFrame should have NumPy arrays; nullable dtypes are used for all dtypes that have a nullable implementation when 'numpy_nullable' is set, and pyarrow is used for all dtypes if 'pyarrow' is set. dtype_backend="pyarrow" requires pandas 1.5+.
calculate_divisions (bool, default False) – Whether to use min/max statistics from the footer metadata (or global _metadata file) to calculate divisions for the output DataFrame collection. Divisions will not be calculated if statistics are missing. This option will be ignored if index is not specified and there is no physical index column specified in the custom "pandas" Parquet metadata. Note that calculate_divisions=True may be extremely slow when no global _metadata file is present, especially when reading from remote storage. Set this to True only when known divisions are needed for your workload (see dataframe-design-partitions).
ignore_metadata_file (bool, default False) – Whether to ignore the global _metadata file (when one is present). If True, or if the global _metadata file is missing, the parquet metadata may be gathered and processed in parallel. Parallel metadata processing is currently supported for ArrowDatasetEngine only.
metadata_task_size (int, default configurable) – If parquet metadata is processed in parallel (see ignore_metadata_file description above), this argument can be used to specify the number of dataset files to be processed by each task in the Dask graph. If this argument is set to 0, parallel metadata processing will be disabled. The default values for local and remote filesystems can be specified with the "metadata-task-size-local" and "metadata-task-size-remote" config fields, respectively (see "dataframe.parquet").
split_row_groups ('infer', 'adaptive', bool, or int, default 'infer') – If True, then each output dataframe partition will correspond to a single parquet-file row-group. If False, each partition will correspond to a complete file. If a positive integer value is given, each dataframe partition will correspond to that number of parquet row-groups (or fewer). If 'adaptive', the metadata of each file will be used to ensure that every partition satisfies blocksize. If 'infer' (the default), the uncompressed storage-size metadata in the first file will be used to automatically set split_row_groups to either 'adaptive' or False.
blocksize (int or str, default 'default') – The desired size of each output DataFrame partition in terms of total (uncompressed) parquet storage space. This argument is currently used to set the default value of split_row_groups (using row-group metadata from a single file), and will be ignored if split_row_groups is not set to 'infer' or 'adaptive'. Default is 256 MiB.
aggregate_files (bool or str, default None) –
WARNING: Passing a string argument to aggregate_files will result in experimental behavior. This behavior may change in the future.
Whether distinct file paths may be aggregated into the same output partition. This parameter is only used when split_row_groups is set to 'infer', 'adaptive' or to an integer >1. A setting of True means that any two file paths may be aggregated into the same output partition, while False means that inter-file aggregation is prohibited.
For "hive-partitioned" datasets, a "partition"-column name can also be specified. In this case, we allow the aggregation of any two files sharing a file path up to, and including, the corresponding directory name. For example, if aggregate_files is set to "section" for the directory structure below, 03.parquet and 04.parquet may be aggregated together, but 01.parquet and 02.parquet cannot be. If, however, aggregate_files is set to "region", 01.parquet may be aggregated with 02.parquet, and 03.parquet may be aggregated with 04.parquet:
dataset-path/
├── region=1/
│   ├── section=a/
│   │   └── 01.parquet
│   ├── section=b/
│   └── └── 02.parquet
└── region=2/
    ├── section=a/
    │   ├── 03.parquet
    └── └── 04.parquet
Note that the default behavior of aggregate_files is False.
parquet_file_extension (str, tuple[str], or None, default (".parq", ".parquet", ".pq")) –
A file extension or an iterable of extensions to use when discovering parquet files in a directory. Files that don't match these extensions will be ignored. This argument only applies when paths corresponds to a directory and no _metadata file is present (or ignore_metadata_file=True). Passing in parquet_file_extension=None will treat all files in the directory as parquet files.
The purpose of this argument is to ensure that the engine will ignore unsupported metadata files (like Spark's '_SUCCESS' and 'crc' files). It may be necessary to change this argument if the data files in your parquet dataset do not end in ".parq", ".parquet", or ".pq".
filesystem ("fsspec", "arrow", or fsspec.AbstractFileSystem backend to use.) – Specifies the backend to use.
dataset (dict, default None) –
Dictionary of options to use when creating a pyarrow.dataset.Dataset object. These options may include a "filesystem" key to configure the desired file-system backend. However, the top-level filesystem argument will always take precedence.
Note: The dataset options may include a "partitioning" key. However, since pyarrow.dataset.Partitioning objects cannot be serialized, the value can be a dict of key-word arguments for the pyarrow.dataset.partitioning API (e.g. dataset={"partitioning": {"flavor": "hive", "schema": ...}}). Note that partitioned columns will not be converted to categorical dtypes when a custom partitioning schema is specified in this way.
read (dict, default None) – Dictionary of options to pass through to engine.read_partitions using the read key-word argument.
arrow_to_pandas (dict, default None) – Dictionary of options to use when converting from pyarrow.Table to a pandas DataFrame object. Only used by the "arrow" engine.
**kwargs (dict (of dicts)) – Options to pass through to engine.read_partitions as stand-alone key-word arguments. Note that these options will be ignored by the engines defined in dask.dataframe, but may be used by other custom implementations.
Examples
>>> df = dd.read_parquet('s3://bucket/my-parquet-data')
- count_nested(df, nested, by=None, join=True) lsdb.nested.core.NestedFrame [source]#
Counts the number of rows of a nested dataframe.
Wraps Nested-Pandas count_nested.
- Parameters:
df (NestedFrame) – A NestedFrame that contains the desired nested series to count.
nested ('str') – The label of the nested series to count.
by ('str', optional) – Specifies a column within nested to count by, returning a count for each unique value in by.
join (bool, optional) – Join the output count columns to df and return df, otherwise just return a NestedFrame containing only the count columns.
- Return type:
NestedFrame
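A usage sketch on the toy dataset (counting by the "band" sub-column assumes generate_data's default nested layout):
>>> from lsdb.nested import count_nested
>>> from lsdb.nested.datasets import generate_data
>>> nf = generate_data(5, 10)
>>> counts = count_nested(nf, "nested", by="band", join=False)
>>> result = counts.compute()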