Small-Scale Analysis#

In this tutorial, we will cover strategies for working with a small slice of a large catalog before committing to a full-scale computation:

  • narrow the sky area with a spatial filter (e.g., cone_search)

  • inspect a single partition with .partitions[i]

  • filter rows to a manageable subset (e.g., bright stars)

  • peek at the first few rows of every partition with map_partitions

  • draw a random sample with .random_sample()

Introduction#

Large astronomical catalogs can contain billions of rows spread across thousands of partitions. Running a pipeline on the full dataset is expensive, and it is easy to waste hours on a bug that could have been caught in seconds on a small slice.

A good workflow starts small:

  1. Narrow the sky: work only in the patch of sky you actually care about.

  2. Inspect one partition: confirm the data looks right before processing everything.

  3. Filter aggressively: drop rows you do not need as early as possible.

  4. Peek at multiple partitions: cheaply verify your function behaves correctly across partition boundaries.

  5. Draw a random sample: get a statistically representative preview without a full compute.

Each technique in this tutorial reduces the amount of data you touch, so you can iterate quickly and scale up only once you are confident in the result.

[1]:
import pandas as pd

import lsdb
from dask.distributed import Client

1. Open a catalog#

Additional Help

For additional information on Dask client creation, please refer to the official Dask documentation and our Dask cluster configuration page for LSDB-specific tips. Dask also provides its own best practices, which may be useful to consult.

For tips on accessing remote data, see our Accessing remote data guide.

[2]:
client = Client(n_workers=4, memory_limit="auto")

We open the Pan-STARRS1 (PS1) object catalog. The catalog is loaded lazily: no row data is read yet.

[3]:
ps1_object = lsdb.open_catalog("s3://stpubdata/panstarrs/ps1/public/hats/otmo")
ps1_object
[3]:
lsdb Catalog otmo:
decMean decMeanErr epochMean gFlags gMeanPSFMag gMeanPSFMagErr iFlags iMeanPSFMag iMeanPSFMagErr nDetections ng ni nr ny nz objID objInfoFlag qualityFlag raMean raMeanErr rFlags rMeanPSFMag rMeanPSFMagErr surveyID yFlags yMeanPSFMag yMeanPSFMagErr zFlags zMeanPSFMag zMeanPSFMagErr
npartitions=9577
Order: 5, Pixel: 0 double[pyarrow] double[pyarrow] double[pyarrow] int32[pyarrow] double[pyarrow] double[pyarrow] int32[pyarrow] double[pyarrow] double[pyarrow] int16[pyarrow] int16[pyarrow] int16[pyarrow] int16[pyarrow] int16[pyarrow] int16[pyarrow] int64[pyarrow] int32[pyarrow] int16[pyarrow] double[pyarrow] double[pyarrow] int32[pyarrow] double[pyarrow] double[pyarrow] int16[pyarrow] int32[pyarrow] double[pyarrow] double[pyarrow] int32[pyarrow] double[pyarrow] double[pyarrow]
Order: 5, Pixel: 1 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Order: 5, Pixel: 12286 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Order: 5, Pixel: 12287 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
30 out of 131 available columns in the catalog have been loaded lazily, meaning no data has been read, only the catalog schema
This catalog has an estimated size of 1.9 TB

Note the number of columns shown: only 30 of the 131 available columns were loaded.

Some catalogs come with a pre-specified set of “default columns” that will be loaded automatically (unless the columns='all' keyword is specified). We can also always manually specify which columns we’d like to load.

Let’s reduce the amount of data we’ll handle as we work with this catalog.

[4]:
ps1_object = lsdb.open_catalog(
    "s3://stpubdata/panstarrs/ps1/public/hats/otmo",
    columns=["objID", "raMean", "decMean", "gMeanPSFMag", "rMeanPSFMag", "iMeanPSFMag", "nDetections"],
)
ps1_object
[4]:
lsdb Catalog otmo:
objID raMean decMean gMeanPSFMag rMeanPSFMag iMeanPSFMag nDetections
npartitions=9577
Order: 5, Pixel: 0 int64[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] int16[pyarrow]
Order: 5, Pixel: 1 ... ... ... ... ... ... ...
... ... ... ... ... ... ... ...
Order: 5, Pixel: 12286 ... ... ... ... ... ... ...
Order: 5, Pixel: 12287 ... ... ... ... ... ... ...
7 out of 131 available columns in the catalog have been loaded lazily, meaning no data has been read, only the catalog schema
This catalog has an estimated size of 612.5 GB
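
As a rough check on how much the column selection saves, the two size estimates reported above can be compared directly. This is only back-of-the-envelope arithmetic: it assumes decimal units (1 TB = 1000 GB), which the tutorial does not state, and the variable names are illustrative.

```python
# Size estimates copied from the two catalog summaries above.
full_size_gb = 1.9 * 1000  # 30 default columns, 1.9 TB (assuming 1 TB = 1000 GB)
trimmed_size_gb = 612.5    # 7 selected columns

fraction_kept = trimmed_size_gb / full_size_gb
columns_kept = 7 / 30

print(f"Columns kept: {columns_kept:.0%}")
print(f"Estimated size kept: {fraction_kept:.0%}")
```

The size does not shrink in exact proportion to the column count, presumably because the columns have different widths (for example int64 versus int16).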

2. Region selection#

The simplest way to reduce the amount of data you work with is to restrict the sky area. A cone_search keeps only the partitions that overlap a circle defined by a center (ra, dec), given in degrees, and a radius_arcsec, given in arcseconds.

Starting with a small cone lets you develop and test your pipeline on a tiny fraction of the catalog. Once the pipeline is correct, you can widen the cone or remove it entirely.

[5]:
ps1_cone = ps1_object.cone_search(ra=200.0, dec=30.0, radius_arcsec=1 * 3600)
ps1_cone
[5]:
lsdb Catalog otmo:
objID raMean decMean gMeanPSFMag rMeanPSFMag iMeanPSFMag nDetections
npartitions=4
Order: 5, Pixel: 2596 int64[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] int16[pyarrow]
Order: 5, Pixel: 2597 ... ... ... ... ... ... ...
Order: 5, Pixel: 2598 ... ... ... ... ... ... ...
Order: 5, Pixel: 2599 ... ... ... ... ... ... ...
7 out of 131 available columns in the catalog have been loaded lazily, meaning no data has been read, only the catalog schema
This catalog has an estimated size of 255.8 MB

npartitions has dropped from 9577 to 4, and the estimated size from 612.5 GB to 255.8 MB, so every subsequent step is much cheaper.
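
To make the cone criterion concrete, the sketch below computes the angular separation between the cone center and a point using the haversine formula, then compares it with the radius. It is a plain-Python illustration of the geometry, not LSDB's implementation; the function name angular_separation_arcsec is hypothetical.

```python
import math


def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation between two sky positions, all inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine formula: numerically stable for small separations.
    a = (
        math.sin((dec2 - dec1) / 2) ** 2
        + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2
    )
    return math.degrees(2 * math.asin(math.sqrt(a))) * 3600


center_ra, center_dec = 200.0, 30.0
radius_arcsec = 1 * 3600  # same 1-degree cone as above

# One position close to the center, and one 5 degrees away in right ascension.
near = angular_separation_arcsec(center_ra, center_dec, 199.686118, 29.99723)
far = angular_separation_arcsec(center_ra, center_dec, 205.0, 30.0)

print(near <= radius_arcsec)  # the nearby position is inside the cone
print(far <= radius_arcsec)   # the distant position is outside the cone
```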

We can also use a pre-built ConeSearch object, which lets us reuse the same region across multiple catalogs.

[6]:
from lsdb import ConeSearch

cone = ConeSearch(ra=200.0, dec=30.0, radius_arcsec=1 * 3600)
ps1_cone = ps1_object.search(cone)
ps1_cone
[6]:
lsdb Catalog otmo:
objID raMean decMean gMeanPSFMag rMeanPSFMag iMeanPSFMag nDetections
npartitions=4
Order: 5, Pixel: 2596 int64[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] int16[pyarrow]
Order: 5, Pixel: 2597 ... ... ... ... ... ... ...
Order: 5, Pixel: 2598 ... ... ... ... ... ... ...
Order: 5, Pixel: 2599 ... ... ... ... ... ... ...
7 out of 131 available columns in the catalog have been loaded lazily, meaning no data has been read, only the catalog schema
This catalog has an estimated size of 255.8 MB

3. Partition selection#

Even within a small region you may want to look at a single partition in isolation. Use .partitions[i] to index into the catalog by partition number. The result is a lazy Dask DataFrame for that one partition.

This is useful when you want to call .compute() on just one chunk to inspect values or test a function without touching the rest of the catalog.

[7]:
# Look at the first partition
first_partition = ps1_cone.partitions[0]
first_partition.compute()
[7]:
objID raMean decMean gMeanPSFMag rMeanPSFMag iMeanPSFMag nDetections
_healpix_29
730840911935789259 142942005432773466 200.543267 29.119078 -999.0 -999.0 22.1049 1
730840958034592453 142942005448135918 200.544851 29.121114 -999.0 -999.0 -999.0 1
... ... ... ... ... ... ... ...
730990514047383426 143991996861207234 199.686118 29.99723 -999.0 -999.0 -999.0 1
730990514483634856 144001996877830045 199.687771 29.999566 -999.0 -999.0 22.3608 1

181808 rows × 7 columns

You can find the HEALPix pixel that corresponds to a given partition index using get_healpix_pixels().

[8]:
pixels = ps1_cone.get_healpix_pixels()

print(f"All pixels covered: {pixels}")
print(f"Partition 0 covers: {pixels[0]}")
All pixels covered: [Order: 5, Pixel: 2596, Order: 5, Pixel: 2597, Order: 5, Pixel: 2598, Order: 5, Pixel: 2599]
Partition 0 covers: Order: 5, Pixel: 2596
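
The pixel numbers above can be related to the _healpix_29 index values shown in the row outputs. The sketch below assumes that _healpix_29 is a HEALPix pixel number at order 29 in the nested numbering scheme, in which each pixel at one order splits into four children at the next order. Under that assumption, dropping two bits per order recovers the coarser pixel. The helper name pixel_at_order is hypothetical.

```python
HEALPIX_29_ORDER = 29


def pixel_at_order(healpix_29, order):
    """Coarsen an order-29 nested HEALPix index to a lower order."""
    # Each order subdivides every pixel into 4, i.e. 2 bits per order.
    return healpix_29 >> (2 * (HEALPIX_29_ORDER - order))


# _healpix_29 values copied from row outputs in this tutorial.
index_values = [
    730840911935789259,
    731028614286063577,
    731343459159795912,
    731553465279730385,
]
order5_pixels = [pixel_at_order(v, 5) for v in index_values]
print(order5_pixels)

# Number of pixels on the whole sky at a given order: 12 * 4**order.
n_pixels_order5 = 12 * 4**5
print(n_pixels_order5)
```

The four results match the four partitions of the cone, and the all-sky count of 12288 is consistent with the highest pixel label in the catalog summary (Pixel: 12287, counting from 0).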

4. Sub-filtering#

Row filters let you trim the data further before any expensive computation. For example, selecting only bright stars reduces the number of rows dramatically and gives you a representative but manageable subset to work with.

[9]:
ps1_cone_and_bright = ps1_cone.query("0 < gMeanPSFMag < 16")
ps1_cone_and_bright
[9]:
lsdb Catalog otmo:
objID raMean decMean gMeanPSFMag rMeanPSFMag iMeanPSFMag nDetections
npartitions=4
Order: 5, Pixel: 2596 int64[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] int16[pyarrow]
Order: 5, Pixel: 2597 ... ... ... ... ... ... ...
Order: 5, Pixel: 2598 ... ... ... ... ... ... ...
Order: 5, Pixel: 2599 ... ... ... ... ... ... ...
7 out of 131 available columns in the catalog have been loaded lazily, meaning no data has been read, only the catalog schema
This catalog has an estimated size of 255.8 MB
[10]:
ps1_cone_and_bright.head(5)
[10]:
objID raMean decMean gMeanPSFMag rMeanPSFMag iMeanPSFMag nDetections
_healpix_29
730844140320140186 142832001473917106 200.147265 29.030416 14.7094 14.0439 13.8106 55
730844900543284476 142842001903509791 200.190449 29.04093 15.1536 14.6864 14.5175 55
730845199124677148 142912002434617897 200.243435 29.097786 15.3492 14.9563 14.8328 69
730845991608593391 142852003372855574 200.337294 29.045833 15.9104 15.4282 15.2398 68
730846292154371814 142912004225284029 200.422788 29.09478 12.0536 11.207 10.9045 49

5 rows × 7 columns

Filters compose: you can chain a spatial filter with a row filter and LSDB will push both into the same pipeline.
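
The lower bound in the query above matters. Several magnitude values in this tutorial's outputs are -999.0, which appears to be a placeholder for a missing measurement (an assumption; the tutorial does not say so). A sketch in plain Python, using gMeanPSFMag values copied from the outputs above, shows what a one-sided cut would do:

```python
# gMeanPSFMag values copied from the partition and head() outputs above.
g_mags = [-999.0, -999.0, -999.0, -999.0, 14.7094, 15.1536, 15.3492, 15.9104, 12.0536]

# One-sided cut: the -999.0 placeholders also satisfy "g < 16".
one_sided = [g for g in g_mags if g < 16]

# Two-sided cut, mirroring the query string "0 < gMeanPSFMag < 16".
two_sided = [g for g in g_mags if 0 < g < 16]

print(len(one_sided), len(two_sided))
```

Only the two-sided cut leaves the five genuinely bright objects; the one-sided cut keeps every placeholder row as well.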

5. Peeking at every partition#

map_partitions applies a function to each partition individually. Passing pd.DataFrame.head (or a small wrapper around it) is a cheap way to fetch the first few rows of every partition without loading the full catalog into memory.

This is especially useful for checking that a transformation produces the expected columns and values across all partition boundaries.

[11]:
# Grab the first 3 rows from each partition, then compute
sample_per_partition = ps1_cone.map_partitions(lambda df: df.head(3))
sample_per_partition.compute()
[11]:
objID raMean decMean gMeanPSFMag rMeanPSFMag iMeanPSFMag nDetections
_healpix_29
730840911935789259 142942005432773466 200.543267 29.119078 -999.0 -999.0 22.1049 1
730840958034592453 142942005448135918 200.544851 29.121114 -999.0 -999.0 -999.0 1
730840958090282340 142942005456926339 200.545683 29.121461 -999.0 -999.0 -999.0 1
731028614286063577 142972005930353041 200.593027 29.143732 -999.0 21.5769 -999.0 1
731028614545892412 142972005940344460 200.594009 29.144912 22.1401 -999.0 -999.0 1
731028614842509175 142972005928714895 200.59283 29.145228 -999.0 21.4786 -999.0 2
731343459159795912 143281990771334860 199.077096 29.40356 -999.0 -999.0 -999.0 1
731343462332759277 143281990795017790 199.079454 29.406004 18.150801 -999.0 -999.0 1
731343462646587519 143281990763998560 199.076392 29.406652 -999.0 21.8297 -999.0 1
731553465279730385 144001996876174113 199.687558 30.002921 22.3853 21.153799 20.791201 50
731553465557381929 144001996902904284 199.690246 30.003063 21.1381 -999.0 -999.0 1
731553465612402991 144001996909224697 199.690881 30.00342 20.8773 -999.0 -999.0 1

12 rows × 7 columns

You can pass pd.DataFrame.head directly as the function, along with n as an extra keyword argument.

[12]:
sample_per_partition = ps1_cone.map_partitions(pd.DataFrame.head, n=3)
sample_per_partition.compute()
[12]:
objID raMean decMean gMeanPSFMag rMeanPSFMag iMeanPSFMag nDetections
_healpix_29
730840911935789259 142942005432773466 200.543267 29.119078 -999.0 -999.0 22.1049 1
730840958034592453 142942005448135918 200.544851 29.121114 -999.0 -999.0 -999.0 1
730840958090282340 142942005456926339 200.545683 29.121461 -999.0 -999.0 -999.0 1
731028614286063577 142972005930353041 200.593027 29.143732 -999.0 21.5769 -999.0 1
731028614545892412 142972005940344460 200.594009 29.144912 22.1401 -999.0 -999.0 1
731028614842509175 142972005928714895 200.59283 29.145228 -999.0 21.4786 -999.0 2
731343459159795912 143281990771334860 199.077096 29.40356 -999.0 -999.0 -999.0 1
731343462332759277 143281990795017790 199.079454 29.406004 18.150801 -999.0 -999.0 1
731343462646587519 143281990763998560 199.076392 29.406652 -999.0 21.8297 -999.0 1
731553465279730385 144001996876174113 199.687558 30.002921 22.3853 21.153799 20.791201 50
731553465557381929 144001996902904284 199.690246 30.003063 21.1381 -999.0 -999.0 1
731553465612402991 144001996909224697 199.690881 30.00342 20.8773 -999.0 -999.0 1

12 rows × 7 columns
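
Conceptually, map_partitions runs the same function on each partition independently and concatenates the results. The sketch below mimics that with plain Python lists standing in for partitions. It is not how LSDB or Dask implement it, and map_partitions_sketch, head, and the toy data are all hypothetical.

```python
def map_partitions_sketch(partitions, func, **kwargs):
    """Apply func to each partition independently, then concatenate the results."""
    result = []
    for part in partitions:
        result.extend(func(part, **kwargs))
    return result


def head(rows, n=5):
    """Return the first n rows of one partition."""
    return rows[:n]


# Four toy partitions of different sizes (row values are just labels).
partitions = [
    [f"p{p}-row{r}" for r in range(size)]
    for p, size in enumerate([10, 2, 7, 4])
]

# Extra keyword arguments are forwarded to the function, as with n=3 above.
peek = map_partitions_sketch(partitions, head, n=3)
print(len(peek))
print(peek[:4])
```

A partition with fewer than n rows contributes only what it has, so the total is at most n times the number of partitions (12 rows for the 4 partitions in the cone).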

6. Random sample#

.random_sample(n) draws approximately n rows distributed proportionally across all partitions. Unlike .head(), which always returns rows from the first partitions, a random sample is representative of the whole catalog.

Use .random_sample() when you need a statistical cross-section of the data, for example to estimate a distribution or spot-check the output of a filter.

Pass a seed for reproducible results.

[13]:
sample = ps1_cone.random_sample(n=20, seed=42)
sample
[13]:
objID raMean decMean gMeanPSFMag rMeanPSFMag iMeanPSFMag nDetections
_healpix_29
730844268050948686 142852001373816342 200.137386 29.046464 -999.0 -999.0 -999.0 1
730989043331112969 143791995315125687 199.531497 29.82925 -999.0 -999.0 18.3598 1
730926707248634957 142851999598341249 199.959858 29.042176 -999.0 20.9802 -999.0 1
730982536008697912 143521994657873757 199.465796 29.602795 -999.0 -999.0 -999.0 2
730963334950567483 143261995665830974 199.566563 29.383677 -999.0 -999.0 -999.0 0
731175403661086290 143851998527645790 199.852767 29.879339 -999.0 -999.0 -999.0 1
731180320653276898 144081999691175722 199.969165 30.07097 -999.0 -999.0 -999.0 1
731155474866300796 143932008420672300 200.842092 29.943096 22.135799 -999.0 -999.0 1
731197537852881040 144492004307101432 200.430716 30.409051 -999.0 22.1796 -999.0 1
731170999643731892 143842002019457849 200.201928 29.872711 -999.0 -999.0 -999.0 1
731055202718820346 143632008710849180 200.871113 29.698824 -999.0 -999.0 22.003201 1
731134015411514532 143352002608060333 200.260821 29.458154 -999.0 -999.0 22.2115 1
731376270565617841 144261992830328522 199.283076 30.223271 -999.0 -999.0 21.1803 1
731364398762073972 143821994989397857 199.498968 29.856063 -999.0 -999.0 -999.0 1
731358378870915554 143771988697657383 198.86976 29.813933 21.095699 -999.0 -999.0 1
731356009656165493 143691989932453073 198.993257 29.743755 22.583599 -999.0 -999.0 1
731358123962301091 143791989650376474 198.965047 29.829922 -999.0 21.5166 -999.0 1
731583062024405850 144881997428337812 199.742829 30.739355 -999.0 -999.0 -999.0 1
731556046543862818 144201996584775344 199.658467 30.170666 -999.0 -999.0 -999.0 1
731591348034351183 144641993066913148 199.306691 30.535476 -999.0 -999.0 -999.0 0

20 rows × 7 columns
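
One way to picture "approximately n rows distributed proportionally" is to give each partition a quota based on its share of the rows. The sketch below is an illustration only and not LSDB's algorithm. Apart from the 181808 rows reported earlier for the first partition, the partition sizes are made up, and proportional_quotas and sample_row_positions are hypothetical helpers.

```python
import random


def proportional_quotas(partition_sizes, n):
    """Split a sample of about n rows across partitions by their share of rows."""
    total = sum(partition_sizes)
    return [round(n * size / total) for size in partition_sizes]


def sample_row_positions(partition_sizes, n, seed):
    """Pick row positions within each partition; the same seed gives the same picks."""
    rng = random.Random(seed)
    quotas = proportional_quotas(partition_sizes, n)
    return [
        sorted(rng.sample(range(size), k))
        for size, k in zip(partition_sizes, quotas)
    ]


# The first size is from the partition output earlier; the other three are invented.
sizes = [181808, 150000, 120000, 90000]

quotas = proportional_quotas(sizes, n=20)
print(quotas, sum(quotas))

picks_a = sample_row_positions(sizes, n=20, seed=42)
picks_b = sample_row_positions(sizes, n=20, seed=42)
print(picks_a == picks_b)
```

Because each quota is rounded separately, the total can in general differ slightly from n, which is one plausible reading of "approximately". Fixing the seed makes the picks repeatable from run to run.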

If you only want to sample from a single partition, use .sample(partition_id, n) instead. This avoids touching any other partition.

[14]:
single_partition_sample = ps1_cone.sample(partition_id=0, n=5, seed=42)
single_partition_sample
[14]:
objID raMean decMean gMeanPSFMag rMeanPSFMag iMeanPSFMag nDetections
_healpix_29
730986934050957759 143721996413759898 199.641316 29.77443 -999.0 17.7243 -999.0 1
730962212147684672 143131994672218264 199.467215 29.281401 18.767401 -999.0 -999.0 1
730979986766383795 143591997990286372 199.799022 29.663177 21.001101 -999.0 -999.0 1
730976899401309098 143501996705113072 199.670464 29.585438 -999.0 -999.0 21.7351 2
730937483769059421 143091997002705155 199.70026 29.2455 -999.0 -999.0 21.798201 1

5 rows × 7 columns

Closing the Dask client#

[15]:
client.close()

About#

Authors: Olivia Lynn

Last updated on: May 18, 2026

If you use lsdb for published research, please cite it by following the citation instructions.