Small-Scale Analysis#

In this tutorial, we will cover strategies for working with a small slice of a large catalog before committing to a full-scale computation:

narrow the sky area with a spatial filter (e.g., cone_search)
inspect a single partition with .partitions[i]
filter rows to a manageable subset (e.g., bright stars)
peek at the first few rows of every partition with map_partitions
draw a random sample with .random_sample()

Introduction#

Large astronomical catalogs can contain billions of rows spread across thousands of partitions. Running a pipeline on the full dataset is expensive, and it is easy to waste hours on a bug that could have been caught in seconds on a small slice.

A good workflow starts small:

Narrow the sky: work only in the patch of sky you actually care about.
Inspect one partition: confirm the data looks right before processing everything.
Filter aggressively: drop rows you do not need as early as possible.
Peek at multiple partitions: cheaply verify your function behaves correctly across partition boundaries.
Draw a random sample: get a statistically representative preview without a full compute.

Each technique in this tutorial reduces the amount of data you touch, so you can iterate quickly and scale up only once you are confident in the result.

[1]:

import pandas as pd

import lsdb
from dask.distributed import Client

1. Open a catalog#

Additional Help

For more information on creating a Dask client, refer to the official Dask documentation and our Dask cluster configuration page for LSDB-specific tips. Dask also publishes its own best practices, which are worth consulting.

For tips on accessing remote data, see our Accessing remote data guide.

[2]:

client = Client(n_workers=4, memory_limit="auto")

We open the Pan-STARRS1 (PS1) object catalog. The catalog is loaded lazily: no row data is read yet.

[3]:

ps1_object = lsdb.open_catalog("s3://stpubdata/panstarrs/ps1/public/hats/otmo")
ps1_object

[3]:

lsdb Catalog otmo:

	decMean	decMeanErr	epochMean	gFlags	gMeanPSFMag	gMeanPSFMagErr	iFlags	iMeanPSFMag	iMeanPSFMagErr	nDetections	ng	ni	nr	ny	nz	objID	objInfoFlag	qualityFlag	raMean	raMeanErr	rFlags	rMeanPSFMag	rMeanPSFMagErr	surveyID	yFlags	yMeanPSFMag	yMeanPSFMagErr	zFlags	zMeanPSFMag	zMeanPSFMagErr
npartitions=9577
Order: 5, Pixel: 0	double[pyarrow]	double[pyarrow]	double[pyarrow]	int32[pyarrow]	double[pyarrow]	double[pyarrow]	int32[pyarrow]	double[pyarrow]	double[pyarrow]	int16[pyarrow]	int16[pyarrow]	int16[pyarrow]	int16[pyarrow]	int16[pyarrow]	int16[pyarrow]	int64[pyarrow]	int32[pyarrow]	int16[pyarrow]	double[pyarrow]	double[pyarrow]	int32[pyarrow]	double[pyarrow]	double[pyarrow]	int16[pyarrow]	int32[pyarrow]	double[pyarrow]	double[pyarrow]	int32[pyarrow]	double[pyarrow]	double[pyarrow]
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
Order: 5, Pixel: 12286	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
Order: 5, Pixel: 12287	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...

30 out of 131 available columns in the catalog have been loaded lazily, meaning no data has been read, only the catalog schema

This catalog has an estimated size of 1.9 TB

Note the number of columns shown: only 30 of the 131 available columns were loaded.

Some catalogs come with a pre-specified set of “default columns” that will be loaded automatically (unless the columns='all' keyword is specified). We can also always manually specify which columns we’d like to load.

Let’s reduce the amount of data we’ll handle as we work with this catalog.

[4]:

ps1_object = lsdb.open_catalog(
    "s3://stpubdata/panstarrs/ps1/public/hats/otmo",
    columns=["objID", "raMean", "decMean", "gMeanPSFMag", "rMeanPSFMag", "iMeanPSFMag", "nDetections"],
)
ps1_object

[4]:

lsdb Catalog otmo:

	objID	raMean	decMean	gMeanPSFMag	rMeanPSFMag	iMeanPSFMag	nDetections
npartitions=9577
Order: 5, Pixel: 0	int64[pyarrow]	double[pyarrow]	double[pyarrow]	double[pyarrow]	double[pyarrow]	double[pyarrow]	int16[pyarrow]
...	...	...	...	...	...	...	...
Order: 5, Pixel: 12286	...	...	...	...	...	...	...
Order: 5, Pixel: 12287	...	...	...	...	...	...	...

7 out of 131 available columns in the catalog have been loaded lazily, meaning no data has been read, only the catalog schema

This catalog has an estimated size of 612.5 GB

2. Region selection#

The simplest way to reduce the amount of data you work with is to restrict the sky area. A cone_search keeps only the partitions that overlap a circle defined by a center (ra, dec, both in degrees) and a radius_arcsec (in arcseconds, so 1 * 3600 is one degree).

Starting with a small cone lets you develop and test your pipeline on a tiny fraction of the catalog. Once the pipeline is correct, you can widen the cone or remove it entirely.

[5]:

ps1_cone = ps1_object.cone_search(ra=200.0, dec=30.0, radius_arcsec=1 * 3600)
ps1_cone

[5]:

lsdb Catalog otmo:

	objID	raMean	decMean	gMeanPSFMag	rMeanPSFMag	iMeanPSFMag	nDetections
npartitions=4
Order: 5, Pixel: 2596	int64[pyarrow]	double[pyarrow]	double[pyarrow]	double[pyarrow]	double[pyarrow]	double[pyarrow]	int16[pyarrow]
Order: 5, Pixel: 2597	...	...	...	...	...	...	...
Order: 5, Pixel: 2598	...	...	...	...	...	...	...
Order: 5, Pixel: 2599	...	...	...	...	...	...	...

7 out of 131 available columns in the catalog have been loaded lazily, meaning no data has been read, only the catalog schema

This catalog has an estimated size of 255.8 MB

The number of partitions (npartitions) has dropped from 9577 to 4, so every subsequent step is much cheaper.

We can also use a pre-built ConeSearch object, which lets us reuse the same region across multiple catalogs.

[6]:

from lsdb import ConeSearch

cone = ConeSearch(ra=200.0, dec=30.0, radius_arcsec=1 * 3600)
ps1_cone = ps1_object.search(cone)
ps1_cone

[6]:

lsdb Catalog otmo:

	objID	raMean	decMean	gMeanPSFMag	rMeanPSFMag	iMeanPSFMag	nDetections
npartitions=4
Order: 5, Pixel: 2596	int64[pyarrow]	double[pyarrow]	double[pyarrow]	double[pyarrow]	double[pyarrow]	double[pyarrow]	int16[pyarrow]
Order: 5, Pixel: 2597	...	...	...	...	...	...	...
Order: 5, Pixel: 2598	...	...	...	...	...	...	...
Order: 5, Pixel: 2599	...	...	...	...	...	...	...

7 out of 131 available columns in the catalog have been loaded lazily, meaning no data has been read, only the catalog schema

This catalog has an estimated size of 255.8 MB
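
To make the cone geometry concrete, here is a minimal sketch of the test that a center and a radius in arcseconds imply for a single sky position. It is plain Python using the haversine formula, not LSDB's implementation; the helper names angular_separation_arcsec and in_cone are hypothetical and introduced only for illustration.

```python
import math


def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation between two sky positions, in arcseconds.

    All inputs are in degrees. The haversine formula stays accurate for the
    small separations typical of cone searches.
    """
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    hav = (
        math.sin((dec2 - dec1) / 2) ** 2
        + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2
    )
    # Clamp to 1.0 to guard against rounding just above the valid asin domain
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(hav)))) * 3600


def in_cone(ra, dec, center_ra, center_dec, radius_arcsec):
    """Return True if (ra, dec) lies within radius_arcsec of the center."""
    return angular_separation_arcsec(ra, dec, center_ra, center_dec) <= radius_arcsec


center_ra, center_dec = 200.0, 30.0
radius_arcsec = 1 * 3600  # one degree, as in the cells above

# A position close to the center, and one two degrees north of it
print(in_cone(199.7, 29.8, center_ra, center_dec, radius_arcsec))
print(in_cone(200.0, 32.0, center_ra, center_dec, radius_arcsec))
```

The first position falls inside the cone and the second does not. Because right ascension circles shrink toward the poles, a simple box cut on ra and dec would not be equivalent to this test.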

3. Partition selection#

Even within a small region you may want to look at a single partition in isolation. Use .partitions[i] to index into the catalog by partition number. The result is a lazy Dask DataFrame for that one partition.

This is useful when you want to call .compute() on just one chunk to inspect values or test a function without touching the rest of the catalog.

[7]:

# Look at the first partition
first_partition = ps1_cone.partitions[0]
first_partition.compute()

[7]:

	objID	raMean	decMean	gMeanPSFMag	rMeanPSFMag	iMeanPSFMag	nDetections
_healpix_29
730840911935789259	142942005432773466	200.543267	29.119078	-999.0	-999.0	22.1049	1
730840958034592453	142942005448135918	200.544851	29.121114	-999.0	-999.0	-999.0	1
...	...	...	...	...	...	...	...
730990514047383426	143991996861207234	199.686118	29.99723	-999.0	-999.0	-999.0	1
730990514483634856	144001996877830045	199.687771	29.999566	-999.0	-999.0	22.3608	1

181808 rows × 7 columns

You can find the HEALPix pixel that corresponds to a given partition index using get_healpix_pixels().

[8]:

pixels = ps1_cone.get_healpix_pixels()

print(f"All pixels covered: {pixels}")
print(f"Partition 0 covers: {pixels[0]}")

All pixels covered: [Order: 5, Pixel: 2596, Order: 5, Pixel: 2597, Order: 5, Pixel: 2598, Order: 5, Pixel: 2599]
Partition 0 covers: Order: 5, Pixel: 2596
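
The "Order" and "Pixel" labels follow the HEALPix scheme, in which the sky is divided into 12 × 4^order equal-area pixels. The sketch below, which uses only the standard library and hypothetical helper names, shows why the last partition label in the full catalog display is Pixel 12287 and roughly how much sky one order-5 pixel covers.

```python
import math

# Area of the full sky in square degrees (about 41253)
FULL_SKY_DEG2 = 4 * math.pi * (180 / math.pi) ** 2


def healpix_pixel_count(order):
    """Number of equal-area HEALPix pixels covering the sky at this order."""
    return 12 * 4**order


def healpix_pixel_area_deg2(order):
    """Area of a single HEALPix pixel at this order, in square degrees."""
    return FULL_SKY_DEG2 / healpix_pixel_count(order)


order = 5
print(healpix_pixel_count(order))                # 12288, so pixels run from 0 to 12287
print(round(healpix_pixel_area_deg2(order), 3))  # 3.357
```

A cone with a one-degree radius covers about π ≈ 3.14 square degrees, which is comparable to a single order-5 pixel. That is consistent with the cone above touching only four partitions.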

4. Sub-filtering#

Row filters let you trim the data further before any expensive computation. For example, selecting only bright stars reduces the number of rows dramatically and gives you a representative but manageable subset to work with.

[9]:

ps1_cone_and_bright = ps1_cone.query("0 < gMeanPSFMag < 16")
ps1_cone_and_bright

[9]:

lsdb Catalog otmo:

	objID	raMean	decMean	gMeanPSFMag	rMeanPSFMag	iMeanPSFMag	nDetections
npartitions=4
Order: 5, Pixel: 2596	int64[pyarrow]	double[pyarrow]	double[pyarrow]	double[pyarrow]	double[pyarrow]	double[pyarrow]	int16[pyarrow]
Order: 5, Pixel: 2597	...	...	...	...	...	...	...
Order: 5, Pixel: 2598	...	...	...	...	...	...	...
Order: 5, Pixel: 2599	...	...	...	...	...	...	...

7 out of 131 available columns in the catalog have been loaded lazily, meaning no data has been read, only the catalog schema

This catalog has an estimated size of 255.8 MB

[10]:

ps1_cone_and_bright.head(5)

[10]:

	objID	raMean	decMean	gMeanPSFMag	rMeanPSFMag	iMeanPSFMag	nDetections
_healpix_29
730844140320140186	142832001473917106	200.147265	29.030416	14.7094	14.0439	13.8106	55
730844900543284476	142842001903509791	200.190449	29.04093	15.1536	14.6864	14.5175	55
730845199124677148	142912002434617897	200.243435	29.097786	15.3492	14.9563	14.8328	69
730845991608593391	142852003372855574	200.337294	29.045833	15.9104	15.4282	15.2398	68
730846292154371814	142912004225284029	200.422788	29.09478	12.0536	11.207	10.9045	49

5 rows × 7 columns

Filters compose: you can chain a spatial filter with a row filter, and the result is still a lazy catalog. No data is read until you ask for it, for example with .head() or .compute().
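
The query string uses the same expression syntax as pandas' DataFrame.query, so you can try an expression on a tiny hand-built frame before running it on the catalog. The sketch below uses hypothetical rows loosely modeled on the outputs above. In those outputs, -999.0 appears where a magnitude is missing, and the lower bound of 0 in the expression is what excludes such rows.

```python
import pandas as pd

# Hypothetical rows, loosely modeled on the outputs above
toy = pd.DataFrame(
    {
        "objID": [1, 2, 3, 4],
        "gMeanPSFMag": [14.7094, -999.0, 22.1401, 15.9104],
        "nDetections": [55, 1, 1, 68],
    }
)

# Same expression as in the tutorial cell
bright = toy.query("0 < gMeanPSFMag < 16")

# Equivalent boolean mask, written out explicitly
same_rows = toy[(toy["gMeanPSFMag"] > 0) & (toy["gMeanPSFMag"] < 16)]

# Conditions can be combined with "and" / "or"
bright_and_well_observed = toy.query("0 < gMeanPSFMag < 16 and nDetections > 60")

print(bright["objID"].tolist())
print(bright_and_well_observed["objID"].tolist())
```

Checking the expression on a few rows first makes it easy to confirm that placeholder values are handled the way you intend before the filter runs across every partition.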

5. Peeking at every partition#

map_partitions applies a function to each partition individually. Passing pd.DataFrame.head (or a small wrapper around it) is a cheap way to fetch the first few rows of every partition without loading the full catalog into memory.

This is especially useful for checking that a transformation produces the expected columns and values across all partition boundaries.

[11]:

# Grab the first 3 rows from each partition, then compute
sample_per_partition = ps1_cone.map_partitions(lambda df: df.head(3))
sample_per_partition.compute()

[11]:

	objID	raMean	decMean	gMeanPSFMag	rMeanPSFMag	iMeanPSFMag	nDetections
_healpix_29
730840911935789259	142942005432773466	200.543267	29.119078	-999.0	-999.0	22.1049	1
730840958034592453	142942005448135918	200.544851	29.121114	-999.0	-999.0	-999.0	1
730840958090282340	142942005456926339	200.545683	29.121461	-999.0	-999.0	-999.0	1
731028614286063577	142972005930353041	200.593027	29.143732	-999.0	21.5769	-999.0	1
731028614545892412	142972005940344460	200.594009	29.144912	22.1401	-999.0	-999.0	1
731028614842509175	142972005928714895	200.59283	29.145228	-999.0	21.4786	-999.0	2
731343459159795912	143281990771334860	199.077096	29.40356	-999.0	-999.0	-999.0	1
731343462332759277	143281990795017790	199.079454	29.406004	18.150801	-999.0	-999.0	1
731343462646587519	143281990763998560	199.076392	29.406652	-999.0	21.8297	-999.0	1
731553465279730385	144001996876174113	199.687558	30.002921	22.3853	21.153799	20.791201	50
731553465557381929	144001996902904284	199.690246	30.003063	21.1381	-999.0	-999.0	1
731553465612402991	144001996909224697	199.690881	30.00342	20.8773	-999.0	-999.0	1

12 rows × 7 columns

You can pass pd.DataFrame.head directly as the function, along with n as an extra keyword argument.

[12]:

sample_per_partition = ps1_cone.map_partitions(pd.DataFrame.head, n=3)
sample_per_partition.compute()

[12]:

	objID	raMean	decMean	gMeanPSFMag	rMeanPSFMag	iMeanPSFMag	nDetections
_healpix_29
730840911935789259	142942005432773466	200.543267	29.119078	-999.0	-999.0	22.1049	1
730840958034592453	142942005448135918	200.544851	29.121114	-999.0	-999.0	-999.0	1
730840958090282340	142942005456926339	200.545683	29.121461	-999.0	-999.0	-999.0	1
731028614286063577	142972005930353041	200.593027	29.143732	-999.0	21.5769	-999.0	1
731028614545892412	142972005940344460	200.594009	29.144912	22.1401	-999.0	-999.0	1
731028614842509175	142972005928714895	200.59283	29.145228	-999.0	21.4786	-999.0	2
731343459159795912	143281990771334860	199.077096	29.40356	-999.0	-999.0	-999.0	1
731343462332759277	143281990795017790	199.079454	29.406004	18.150801	-999.0	-999.0	1
731343462646587519	143281990763998560	199.076392	29.406652	-999.0	21.8297	-999.0	1
731553465279730385	144001996876174113	199.687558	30.002921	22.3853	21.153799	20.791201	50
731553465557381929	144001996902904284	199.690246	30.003063	21.1381	-999.0	-999.0	1
731553465612402991	144001996909224697	199.690881	30.00342	20.8773	-999.0	-999.0	1

12 rows × 7 columns
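
The same peek pattern is a cheap way to test a transformation of your own. The sketch below imitates it without a cluster: two small pandas DataFrames stand in for partitions, and a hypothetical add_color function is applied to the first few rows of each. The data values and the function are illustrative assumptions, not part of the catalog or of LSDB.

```python
import pandas as pd

# Hypothetical stand-ins for two catalog partitions of different sizes
partitions = [
    pd.DataFrame(
        {
            "gMeanPSFMag": [14.7094, 15.1536, 15.3492, 15.9104],
            "rMeanPSFMag": [14.0439, 14.6864, 14.9563, 15.4282],
        }
    ),
    pd.DataFrame(
        {
            "gMeanPSFMag": [12.0536, 18.1252],
            "rMeanPSFMag": [11.2070, 17.5000],
        }
    ),
]


def add_color(df):
    """Hypothetical transformation under test: add a g - r color column."""
    out = df.copy()
    out["g_minus_r"] = out["gMeanPSFMag"] - out["rMeanPSFMag"]
    return out


# Apply the function to at most 3 rows of each partition, then stack the results
preview = pd.concat([add_color(part.head(3)) for part in partitions])

print(list(preview.columns))
print(len(preview))
```

The preview has five rows rather than six because the second stand-in partition holds only two rows: head(n) returns at most n rows per partition. Once the function behaves on the preview, the same function could be passed to map_partitions on the real catalog.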

6. Random sample#

.random_sample(n) draws approximately n rows distributed proportionally across all partitions. Unlike .head(), which always returns rows from the first partitions, a random sample is representative of the whole catalog.

Use .random_sample() when you need a statistical cross-section of the data, for example to estimate a distribution or spot-check the output of a filter.

Pass a seed for reproducible results.

[13]:

sample = ps1_cone.random_sample(n=20, seed=42)
sample

[13]:

	objID	raMean	decMean	gMeanPSFMag	rMeanPSFMag	iMeanPSFMag	nDetections
_healpix_29
730846834852851572	142962005381462924	200.538174	29.135284	-999.0	20.593599	-999.0	1
730987099608181596	143731996756349535	199.675627	29.782447	-999.0	21.301701	-999.0	1
730972800582093751	143531993104260117	199.3104	29.60791	-999.0	-999.0	-999.0	1
730845191136739853	142912002503771422	200.250355	29.092376	21.448799	-999.0	-999.0	1
730844262768288949	142842001284136813	200.128398	29.03857	-999.0	-999.0	-999.0	0
731033253320242313	143242008458133971	200.845798	29.369533	-999.0	22.0828	-999.0	1
731164861405496795	144192006407713307	200.640793	30.160596	-999.0	-999.0	21.563601	1
731052895685407195	143512008742236596	200.874272	29.596678	-999.0	-999.0	-999.0	1
731247326923041918	144812006378319344	200.637802	30.682334	-999.0	21.3312	-999.0	1
731199240090795328	144672005150456062	200.515049	30.562902	-999.0	-999.0	20.426001	1
731160483139110590	144002004724484111	200.472519	30.003003	21.132401	-999.0	-999.0	2
731164100175811611	144232008082666490	200.808296	30.196522	-999.0	-999.0	-999.0	1
731204073610000516	144182009959715130	200.996018	30.153758	-999.0	-999.0	16.9126	1
731369658435367847	143941991966921581	199.196698	29.950813	-999.0	-999.0	-999.0	1
731352796218232834	143621989370894194	198.937077	29.686341	-999.0	-999.0	-999.0	1
731352620672331015	143581989793747530	198.979361	29.655764	18.1252	-999.0	-999.0	1
731590636429043291	144661994195353054	199.419555	30.552076	-999.0	22.1066	-999.0	1
731567077974230502	144501997115684917	199.711568	30.420258	-999.0	-999.0	-999.0	1
731564414720532922	144491996135171955	199.613466	30.409478	-999.0	21.867201	-999.0	1
731592397818371499	144721993458371967	199.345856	30.601159	-999.0	-999.0	-999.0	1

20 rows × 7 columns
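
To illustrate what "distributed proportionally across all partitions" can mean, here is a minimal standard-library sketch that splits n draws across partitions according to their row counts. It is not LSDB's implementation. The allocate_sample helper is hypothetical, and apart from the first partition's 181808 rows (shown earlier), the partition sizes are made up.

```python
import random


def allocate_sample(partition_sizes, n, seed=None):
    """Split a sample of n rows across partitions in proportion to their size.

    Each of the n draws picks a partition with probability proportional to its
    row count, so larger partitions contribute more rows on average.
    """
    rng = random.Random(seed)
    picks = rng.choices(range(len(partition_sizes)), weights=partition_sizes, k=n)
    return [picks.count(i) for i in range(len(partition_sizes))]


# First size is from the output above; the other three are hypothetical
sizes = [181808, 90000, 120000, 60000]

first = allocate_sample(sizes, n=20, seed=42)
second = allocate_sample(sizes, n=20, seed=42)

print(first)
print(first == second)  # the same seed gives the same allocation
```

The per-partition counts always add up to n, and fixing the seed makes the allocation repeatable, which is the behavior the seed argument is meant to provide.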

If you only want to sample from a single partition, use .sample(partition_id, n) instead. This avoids touching any other partition.

[14]:

single_partition_sample = ps1_cone.sample(partition_id=0, n=5, seed=42)
single_partition_sample

[14]:

	objID	raMean	decMean	gMeanPSFMag	rMeanPSFMag	iMeanPSFMag	nDetections
_healpix_29
730938408875144709	142872000179693610	200.017971	29.060879	-999.0	-999.0	-999.0	1
730948534679374151	143211999779607302	199.977993	29.347258	-999.0	-999.0	-999.0	1
730948735523264553	143071997742759545	199.774245	29.232474	21.907801	-999.0	-999.0	1
730967984242281525	143231991132008467	199.113143	29.364896	-999.0	-999.0	22.0298	2
730936578823873019	143031995950572695	199.595035	29.19342	-999.0	-999.0	21.2794	1

5 rows × 7 columns

Closing the Dask client#

[15]:

client.close()

About#

Authors: Olivia Lynn

Last updated on: May 18, 2026

If you use lsdb for published research, please cite it following the citation instructions.
