Small-Scale Analysis#

In this tutorial, we will cover strategies for working with a small slice of a large catalog before committing to a full-scale computation:

narrow the sky area with a spatial filter (e.g., cone_search)
inspect a single partition with .partitions[i]
filter rows to a manageable subset (e.g., bright stars)
peek at the first few rows of every partition with map_partitions
draw a random sample with .random_sample()

Introduction#

Large astronomical catalogs can contain billions of rows spread across thousands of partitions. Running a pipeline on the full dataset is expensive, and it is easy to waste hours on a bug that could have been caught in seconds on a small slice.

A good workflow starts small:

Narrow the sky: work only in the patch of sky you actually care about.
Inspect one partition: confirm the data looks right before processing everything.
Filter aggressively: drop rows you do not need as early as possible.
Peek at multiple partitions: cheaply verify your function behaves correctly across partition boundaries.
Draw a random sample: get a statistically representative preview without a full compute.

Each technique in this tutorial reduces the amount of data you touch, so you can iterate quickly and scale up only once you are confident in the result.

[1]:

import pandas as pd

import lsdb
from dask.distributed import Client

1. Open a catalog#

Additional Help

For more on creating a Dask client, see the official Dask documentation and our Dask cluster configuration page for LSDB-specific tips. Dask also publishes its own best practices, which are worth consulting.

For tips on accessing remote data, see our Accessing remote data guide.

[2]:

client = Client(n_workers=4, memory_limit="auto")

We open the Pan-STARRS1 (PS1) object catalog. The catalog is loaded lazily: no row data is read yet.

[3]:

ps1_object = lsdb.open_catalog("s3://stpubdata/panstarrs/ps1/public/hats/otmo")
ps1_object

[3]:

lsdb Catalog otmo:

	decMean	decMeanErr	epochMean	gFlags	gMeanPSFMag	gMeanPSFMagErr	iFlags	iMeanPSFMag	iMeanPSFMagErr	nDetections	ng	ni	nr	ny	nz	objID	objInfoFlag	qualityFlag	raMean	raMeanErr	rFlags	rMeanPSFMag	rMeanPSFMagErr	surveyID	yFlags	yMeanPSFMag	yMeanPSFMagErr	zFlags	zMeanPSFMag	zMeanPSFMagErr
npartitions=9577
Order: 5, Pixel: 0	double[pyarrow]	double[pyarrow]	double[pyarrow]	int32[pyarrow]	double[pyarrow]	double[pyarrow]	int32[pyarrow]	double[pyarrow]	double[pyarrow]	int16[pyarrow]	int16[pyarrow]	int16[pyarrow]	int16[pyarrow]	int16[pyarrow]	int16[pyarrow]	int64[pyarrow]	int32[pyarrow]	int16[pyarrow]	double[pyarrow]	double[pyarrow]	int32[pyarrow]	double[pyarrow]	double[pyarrow]	int16[pyarrow]	int32[pyarrow]	double[pyarrow]	double[pyarrow]	int32[pyarrow]	double[pyarrow]	double[pyarrow]
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
Order: 5, Pixel: 12286	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
Order: 5, Pixel: 12287	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...

30 out of 131 available columns in the catalog have been loaded lazily, meaning no data has been read, only the catalog schema

This catalog has an estimated size of 1.9 TB

Note that only 30 of the 131 available columns are shown.

Some catalogs come with a pre-specified set of “default columns” that will be loaded automatically (unless the columns='all' keyword is specified). We can also always manually specify which columns we’d like to load.

Let’s reduce the amount of data we’ll handle as we work with this catalog.

[4]:

ps1_object = lsdb.open_catalog(
    "s3://stpubdata/panstarrs/ps1/public/hats/otmo",
    columns=["objID", "raMean", "decMean", "gMeanPSFMag", "rMeanPSFMag", "iMeanPSFMag", "nDetections"],
)
ps1_object

[4]:

lsdb Catalog otmo:

	objID	raMean	decMean	gMeanPSFMag	rMeanPSFMag	iMeanPSFMag	nDetections
npartitions=9577
Order: 5, Pixel: 0	int64[pyarrow]	double[pyarrow]	double[pyarrow]	double[pyarrow]	double[pyarrow]	double[pyarrow]	int16[pyarrow]
...	...	...	...	...	...	...	...
Order: 5, Pixel: 12286	...	...	...	...	...	...	...
Order: 5, Pixel: 12287	...	...	...	...	...	...	...

7 out of 131 available columns in the catalog have been loaded lazily, meaning no data has been read, only the catalog schema

This catalog has an estimated size of 612.5 GB
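To see roughly why dropping columns shrinks the estimate, here is a back-of-the-envelope sketch that compares the uncompressed row width of the two schemas printed above. The dtype counts are read off those printouts, and the byte widths are the standard sizes of the fixed-width types. This is not how LSDB computes its size estimate, only an illustration of the effect.

```python
# Byte width of each fixed-size dtype that appears in the schemas above.
BYTES_PER_DTYPE = {"int16": 2, "int32": 4, "int64": 8, "double": 8}

# Number of columns of each dtype, counted from the two schema printouts.
default_columns = {"double": 15, "int32": 6, "int16": 8, "int64": 1}  # 30 columns
selected_columns = {"double": 5, "int16": 1, "int64": 1}  # 7 columns


def row_width_bytes(dtype_counts):
    """Uncompressed bytes per row for a given mix of column dtypes."""
    return sum(BYTES_PER_DTYPE[dtype] * count for dtype, count in dtype_counts.items())


default_width = row_width_bytes(default_columns)
selected_width = row_width_bytes(selected_columns)
width_ratio = selected_width / default_width

print(f"default: {default_width} B/row, selected: {selected_width} B/row")
print(f"selected / default = {width_ratio:.2f}")
```

The ratio of raw row widths comes out in the same range as the ratio of the two reported estimates (612.5 GB against 1.9 TB, roughly a third). Exact agreement is not expected, since the reported figure may account for things this sketch ignores, such as the index, nulls, or compression.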

2. Region selection#

The simplest way to reduce the amount of data you work with is to restrict the sky area. A cone_search keeps only the partitions that overlap a circle defined by a center (ra, dec) and radius_arcsec and, by default, only the rows that fall inside that circle.

Starting with a small cone lets you develop and test your pipeline on a tiny fraction of the catalog. Once the pipeline is correct, you can widen the cone or remove it entirely.

[5]:

ps1_cone = ps1_object.cone_search(ra=200.0, dec=30.0, radius_arcsec=1 * 3600)
ps1_cone

[5]:

lsdb Catalog otmo:

	objID	raMean	decMean	gMeanPSFMag	rMeanPSFMag	iMeanPSFMag	nDetections
npartitions=4
Order: 5, Pixel: 2596	int64[pyarrow]	double[pyarrow]	double[pyarrow]	double[pyarrow]	double[pyarrow]	double[pyarrow]	int16[pyarrow]
Order: 5, Pixel: 2597	...	...	...	...	...	...	...
Order: 5, Pixel: 2598	...	...	...	...	...	...	...
Order: 5, Pixel: 2599	...	...	...	...	...	...	...

7 out of 131 available columns in the catalog have been loaded lazily, meaning no data has been read, only the catalog schema

This catalog has an estimated size of 255.8 MB

The number of partitions has dropped from 9577 to 4, and the estimated size from 612.5 GB to 255.8 MB, so every subsequent step is much cheaper.
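To make the geometry of the cone concrete, the following is a minimal sketch of the test "is this position within radius_arcsec of the center?", written with the standard haversine formula. The function names are hypothetical and this is not LSDB's implementation; it only illustrates what a cone of 3600 arcseconds around (200.0, 30.0) means.

```python
import math


def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation between two sky positions, in arcseconds.

    All inputs are in degrees. Uses the haversine formula.
    """
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    half_ddec = (dec2 - dec1) / 2.0
    half_dra = (ra2 - ra1) / 2.0
    a = math.sin(half_ddec) ** 2 + math.cos(dec1) * math.cos(dec2) * math.sin(half_dra) ** 2
    return math.degrees(2.0 * math.asin(math.sqrt(a))) * 3600.0


def in_cone(ra, dec, center_ra=200.0, center_dec=30.0, radius_arcsec=3600.0):
    """True if (ra, dec) lies within the cone used in this tutorial."""
    return angular_separation_arcsec(center_ra, center_dec, ra, dec) <= radius_arcsec


print(in_cone(199.26058, 29.501658))  # a position less than one degree from the center
print(in_cone(200.0, 32.0))  # a position two degrees north of the center
```

Note that a difference in right ascension counts for less at higher declination (the cos(dec) factors), which is why a simple box cut in ra and dec is not equivalent to a cone.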

We can also use a pre-built ConeSearch object, which lets us reuse the same region across multiple catalogs.

[6]:

from lsdb import ConeSearch

cone = ConeSearch(ra=200.0, dec=30.0, radius_arcsec=1 * 3600)
ps1_cone = ps1_object.search(cone)
ps1_cone

[6]:

lsdb Catalog otmo:

	objID	raMean	decMean	gMeanPSFMag	rMeanPSFMag	iMeanPSFMag	nDetections
npartitions=4
Order: 5, Pixel: 2596	int64[pyarrow]	double[pyarrow]	double[pyarrow]	double[pyarrow]	double[pyarrow]	double[pyarrow]	int16[pyarrow]
Order: 5, Pixel: 2597	...	...	...	...	...	...	...
Order: 5, Pixel: 2598	...	...	...	...	...	...	...
Order: 5, Pixel: 2599	...	...	...	...	...	...	...

7 out of 131 available columns in the catalog have been loaded lazily, meaning no data has been read, only the catalog schema

This catalog has an estimated size of 255.8 MB

3. Partition selection#

Even within a small region you may want to look at a single partition in isolation. Use .partitions[i] to index into the catalog by partition number. The result is a lazy Dask DataFrame for that one partition.

This is useful when you want to call .compute() on just one chunk to inspect values or test a function without touching the rest of the catalog.

[7]:

# Look at the first partition
first_partition = ps1_cone.partitions[0]
first_partition.compute()

[7]:

	objID	raMean	decMean	gMeanPSFMag	rMeanPSFMag	iMeanPSFMag	nDetections
_healpix_29
730840911935789259	142942005432773466	200.543267	29.119078	-999.0	-999.0	22.1049	1
730840958034592453	142942005448135918	200.544851	29.121114	-999.0	-999.0	-999.0	1
...	...	...	...	...	...	...	...
730990514047383426	143991996861207234	199.686118	29.99723	-999.0	-999.0	-999.0	1
730990514483634856	144001996877830045	199.687771	29.999566	-999.0	-999.0	22.3608	1

181808 rows × 7 columns

You can find the HEALPix pixel that corresponds to a given partition index using get_healpix_pixels().

[8]:

pixels = ps1_cone.get_healpix_pixels()

print(f"All pixels covered: {pixels}")
print(f"Partition 0 covers: {pixels[0]}")

All pixels covered: [Order: 5, Pixel: 2596, Order: 5, Pixel: 2597, Order: 5, Pixel: 2598, Order: 5, Pixel: 2599]
Partition 0 covers: Order: 5, Pixel: 2596
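The pixel numbers in this output can be sanity-checked with a little HEALPix arithmetic. The sketch below assumes NESTED pixel numbering (in which the four children of pixel p at the next order are 4p to 4p + 3); the helper names are hypothetical and nothing here calls LSDB.

```python
import math

ORDER = 5  # HEALPix order of the partitions listed above


def n_pixels(order):
    """Total number of HEALPix pixels on the sky at a given order."""
    return 12 * 4**order


def pixel_area_sq_deg(order):
    """Area of one HEALPix pixel in square degrees (all pixels have equal area)."""
    full_sky_sq_deg = 4.0 * math.pi * math.degrees(1.0) ** 2
    return full_sky_sq_deg / n_pixels(order)


def parent_pixel(pixel, levels=1):
    """Pixel number of the ancestor `levels` orders up, assuming NESTED numbering."""
    return pixel >> (2 * levels)


cone_pixels = [2596, 2597, 2598, 2599]
print(n_pixels(ORDER))  # pixel numbers run from 0 to this value minus 1
print(round(pixel_area_sq_deg(ORDER), 2))  # area of one order-5 pixel, in square degrees
print({parent_pixel(p) for p in cone_pixels})  # order-4 parent(s) of the four partitions
```

Under that assumption, the four partitions returned by the cone search are the four children of a single order-4 pixel, and the pixel count at order 5 agrees with the highest pixel number (12287) in the full catalog printout.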

4. Sub-filtering#

Row filters let you trim the data further before any expensive computation. For example, selecting only bright stars reduces the number of rows dramatically and gives you a representative but manageable subset to work with.

[9]:

ps1_cone_and_bright = ps1_cone.query("0 < gMeanPSFMag < 16")
ps1_cone_and_bright

[9]:

lsdb Catalog otmo:

	objID	raMean	decMean	gMeanPSFMag	rMeanPSFMag	iMeanPSFMag	nDetections
npartitions=4
Order: 5, Pixel: 2596	int64[pyarrow]	double[pyarrow]	double[pyarrow]	double[pyarrow]	double[pyarrow]	double[pyarrow]	int16[pyarrow]
Order: 5, Pixel: 2597	...	...	...	...	...	...	...
Order: 5, Pixel: 2598	...	...	...	...	...	...	...
Order: 5, Pixel: 2599	...	...	...	...	...	...	...

7 out of 131 available columns in the catalog have been loaded lazily, meaning no data has been read, only the catalog schema

This catalog has an estimated size of 255.8 MB

[10]:

ps1_cone_and_bright.head(5)

[10]:

	objID	raMean	decMean	gMeanPSFMag	rMeanPSFMag	iMeanPSFMag	nDetections
_healpix_29
730844140320140186	142832001473917106	200.147265	29.030416	14.7094	14.0439	13.8106	55
730844900543284476	142842001903509791	200.190449	29.04093	15.1536	14.6864	14.5175	55
730845199124677148	142912002434617897	200.243435	29.097786	15.3492	14.9563	14.8328	69
730845991608593391	142852003372855574	200.337294	29.045833	15.9104	15.4282	15.2398	68
730846292154371814	142912004225284029	200.422788	29.09478	12.0536	11.207	10.9045	49

5 rows × 7 columns

Filters compose: you can chain a spatial filter with a row filter, and LSDB applies both in the same lazy pipeline, so nothing is computed until you ask for results.
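The lower bound in the query above is doing real work: the outputs in this tutorial show -999.0 wherever a magnitude is not available, and that placeholder would pass a plain "brighter than 16" cut. The sketch below reproduces the effect on a toy pandas DataFrame; the values are illustrative, in the style of the outputs above, not rows read from the catalog.

```python
import pandas as pd

# Toy g-band magnitudes; -999.0 stands for a missing measurement.
toy = pd.DataFrame({"gMeanPSFMag": [-999.0, 22.1401, 14.7094, 12.0536, 18.150801]})

# Same expression as the tutorial: the lower bound drops the placeholder.
bright = toy.query("0 < gMeanPSFMag < 16")

# Upper bound only: the placeholder slips through as if it were very bright.
upper_bound_only = toy.query("gMeanPSFMag < 16")

print(len(bright), len(upper_bound_only))
```

Whenever a column uses a sentinel value for missing data, it is worth checking on a small slice like this that the filter excludes it.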

5. Peeking at every partition#

map_partitions applies a function to each partition individually. Passing pd.DataFrame.head (or a small wrapper around it) is a cheap way to fetch the first few rows of every partition without loading the full catalog into memory.

This is especially useful for checking that a transformation produces the expected columns and values across all partition boundaries.

[11]:

# Grab the first 3 rows from each partition, then compute
sample_per_partition = ps1_cone.map_partitions(lambda df: df.head(3))
sample_per_partition.compute()

[11]:

	objID	raMean	decMean	gMeanPSFMag	rMeanPSFMag	iMeanPSFMag	nDetections
_healpix_29
730840911935789259	142942005432773466	200.543267	29.119078	-999.0	-999.0	22.1049	1
730840958034592453	142942005448135918	200.544851	29.121114	-999.0	-999.0	-999.0	1
730840958090282340	142942005456926339	200.545683	29.121461	-999.0	-999.0	-999.0	1
731028614286063577	142972005930353041	200.593027	29.143732	-999.0	21.5769	-999.0	1
731028614545892412	142972005940344460	200.594009	29.144912	22.1401	-999.0	-999.0	1
731028614842509175	142972005928714895	200.59283	29.145228	-999.0	21.4786	-999.0	2
731343459159795912	143281990771334860	199.077096	29.40356	-999.0	-999.0	-999.0	1
731343462332759277	143281990795017790	199.079454	29.406004	18.150801	-999.0	-999.0	1
731343462646587519	143281990763998560	199.076392	29.406652	-999.0	21.8297	-999.0	1
731553465279730385	144001996876174113	199.687558	30.002921	22.3853	21.153799	20.791201	50
731553465557381929	144001996902904284	199.690246	30.003063	21.1381	-999.0	-999.0	1
731553465612402991	144001996909224697	199.690881	30.00342	20.8773	-999.0	-999.0	1

12 rows × 7 columns

You can pass pd.DataFrame.head directly as the function, along with n as an extra keyword argument.

[12]:

sample_per_partition = ps1_cone.map_partitions(pd.DataFrame.head, n=3)
sample_per_partition.compute()

[12]:

	objID	raMean	decMean	gMeanPSFMag	rMeanPSFMag	iMeanPSFMag	nDetections
_healpix_29
730840911935789259	142942005432773466	200.543267	29.119078	-999.0	-999.0	22.1049	1
730840958034592453	142942005448135918	200.544851	29.121114	-999.0	-999.0	-999.0	1
730840958090282340	142942005456926339	200.545683	29.121461	-999.0	-999.0	-999.0	1
731028614286063577	142972005930353041	200.593027	29.143732	-999.0	21.5769	-999.0	1
731028614545892412	142972005940344460	200.594009	29.144912	22.1401	-999.0	-999.0	1
731028614842509175	142972005928714895	200.59283	29.145228	-999.0	21.4786	-999.0	2
731343459159795912	143281990771334860	199.077096	29.40356	-999.0	-999.0	-999.0	1
731343462332759277	143281990795017790	199.079454	29.406004	18.150801	-999.0	-999.0	1
731343462646587519	143281990763998560	199.076392	29.406652	-999.0	21.8297	-999.0	1
731553465279730385	144001996876174113	199.687558	30.002921	22.3853	21.153799	20.791201	50
731553465557381929	144001996902904284	199.690246	30.003063	21.1381	-999.0	-999.0	1
731553465612402991	144001996909224697	199.690881	30.00342	20.8773	-999.0	-999.0	1

12 rows × 7 columns
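Conceptually, the two cells above call the function once per partition and stack the results. The following plain-pandas stand-in, with a hypothetical helper apply_per_partition and toy partitions, shows that pattern and how the extra keyword argument is forwarded. It is a sketch of the idea, not of how map_partitions is implemented.

```python
import pandas as pd

# Three toy "partitions" of different lengths, standing in for catalog partitions.
partitions = [
    pd.DataFrame({"objID": range(0, 5)}),
    pd.DataFrame({"objID": range(100, 102)}),  # shorter than n: head returns every row
    pd.DataFrame({"objID": range(200, 210)}),
]


def apply_per_partition(parts, func, **kwargs):
    """Call func on every partition, forwarding kwargs, and concatenate the results."""
    return pd.concat([func(part, **kwargs) for part in parts])


peek = apply_per_partition(partitions, pd.DataFrame.head, n=3)
print(peek["objID"].tolist())
```

The result has at most n rows per partition, which is why the output above has 12 rows: 3 rows from each of the 4 partitions.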

6. Random sample#

.random_sample(n) draws approximately n rows distributed proportionally across all partitions. Unlike .head(), which always returns rows from the first partitions, a random sample is representative of the whole catalog.

Use .random_sample() when you need a statistical cross-section of the data, for example to estimate a distribution or spot-check the output of a filter.

Pass a seed for reproducible results.

[13]:

sample = ps1_cone.random_sample(n=20, seed=42)
sample

[13]:

	objID	raMean	decMean	gMeanPSFMag	rMeanPSFMag	iMeanPSFMag	nDetections
_healpix_29
730938404934408931	142872000078103734	200.00781	29.060969	22.1145	-999.0	-999.0	1
730951609767451340	143292000219073004	200.021908	29.410356	-999.0	-999.0	-999.0	0
730989736627872106	143871997131643638	199.713138	29.894212	-999.0	21.0954	-999.0	1
730954465485313004	143452000401119829	200.040097	29.549376	-999.0	22.328501	-999.0	1
730971566992324533	143401992605722578	199.26058	29.501658	-999.0	-999.0	-999.0	0
731135133699300681	143462003580377415	200.35807	29.555692	-999.0	20.7812	-999.0	1
731203989392608232	144162009973134398	200.997327	30.136522	-999.0	-999.0	22.111	1
731189428360399371	144362005389671965	200.538968	30.301161	-999.0	-999.0	-999.0	0
731055181736796326	143612008486707542	200.848729	29.680786	20.2108	19.357599	18.9604	85
731169607151316760	143831999264840022	199.926533	29.857879	-999.0	-999.0	21.718	1
731243544436168318	144762008146912939	200.814668	30.635349	-999.0	21.233801	-999.0	1
731157044077607572	144122009772547734	200.977252	30.105966	-999.0	-999.0	-999.0	1
731053486830327918	143562010648924439	201.064892	29.636554	-999.0	-999.0	-999.0	0
731167330146024164	143712000614839654	200.061535	29.765822	-999.0	21.8267	21.0362	30
731191974749863641	144492005516170195	200.551599	30.408011	-999.0	-999.0	-999.0	0
731132685965390440	143372005288103087	200.528768	29.477088	-999.0	-999.0	22.2656	1
731139989125891498	143752005642423563	200.564231	29.794141	-999.0	-999.0	-999.0	1
731365479980035113	143931996133740553	199.613369	29.941676	-999.0	21.4872	-999.0	1
731556270631596121	144191995653174817	199.56531	30.161887	21.5448	-999.0	-999.0	1
731573874711976339	144591999280778663	199.928066	30.498454	-999.0	-999.0	-999.0	1

20 rows × 7 columns
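To illustrate what "distributed proportionally across all partitions" can mean, here is a small sketch that splits a sample of n rows across partitions according to their sizes, using the largest-remainder method. This is an assumption made for illustration: LSDB's own sampling may work differently (the text above says approximately n rows). Only the first size is taken from this tutorial; the other three are made up.

```python
def proportional_allocation(partition_sizes, n):
    """Split n sample rows across partitions in proportion to their sizes.

    Uses the largest-remainder method so the counts add up to exactly n.
    """
    total = sum(partition_sizes)
    exact = [n * size / total for size in partition_sizes]
    counts = [int(share) for share in exact]
    # Hand out the rows lost to rounding down, largest fractional part first.
    by_remainder = sorted(range(len(exact)), key=lambda i: exact[i] - counts[i], reverse=True)
    for i in by_remainder[: n - sum(counts)]:
        counts[i] += 1
    return counts


# 181808 is the size of partition 0 shown earlier; the other three sizes are made up.
sizes = [181808, 90000, 60000, 30000]
allocation = proportional_allocation(sizes, n=20)
print(allocation)
```

Larger partitions contribute more rows, which is what keeps the sample representative of the whole region rather than of whichever partition happens to come first.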

If you only want to sample from a single partition, use .sample(partition_id, n) instead. This avoids touching any other partition.

[14]:

single_partition_sample = ps1_cone.sample(partition_id=0, n=5, seed=42)
single_partition_sample

[14]:

	objID	raMean	decMean	gMeanPSFMag	rMeanPSFMag	iMeanPSFMag	nDetections
_healpix_29
730987288049027358	143721997982785268	199.798242	29.770542	-999.0	-999.0	21.0397	1
730981551741026122	143731998796059763	199.879675	29.782665	21.371401	-999.0	-999.0	1
730985541414127631	143711995397904131	199.539772	29.761256	20.6035	19.4016	17.886	55
730981042109190604	143671998765036022	199.876517	29.729563	20.7929	-999.0	-999.0	1
730990099999889665	143911996683998793	199.668342	29.931828	-999.0	18.621201	-999.0	1

5 rows × 7 columns
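The role of the seed can be seen without any catalog at all. The sketch below uses Python's standard random module as a stand-in (it is not what LSDB calls internally) and picks row positions from a partition with 181808 rows, the size of partition 0 shown earlier.

```python
import random

N_ROWS = 181808  # rows in partition 0, from the first-partition output above


def sample_row_positions(n_rows, n, seed):
    """Pick n distinct row positions from a partition, reproducibly."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    return rng.sample(range(n_rows), n)


first = sample_row_positions(N_ROWS, n=5, seed=42)
second = sample_row_positions(N_ROWS, n=5, seed=42)
print(first == second)  # compares two draws made with the same seed
```

Fixing the seed while you debug means that a surprising row in the sample will still be there the next time you run the cell.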

Closing the Dask client#

[15]:

client.close()

About#

Authors: Olivia Lynn

Last updated on: May 18, 2026

If you use lsdb for published research, please cite it following the project's citation instructions.
