Small-Scale Analysis#
In this tutorial, we will cover strategies for working with a small slice of a large catalog before committing to a full-scale computation:
narrow the sky area with a spatial filter (e.g., cone_search)
inspect a single partition with .partitions[i]
filter rows to a manageable subset (e.g., bright stars)
peek at the first few rows of every partition with map_partitions
draw a random sample with .random_sample()
Introduction#
Large astronomical catalogs can contain billions of rows spread across thousands of partitions. Running a pipeline on the full dataset is expensive, and it is easy to waste hours on a bug that could have been caught in seconds on a small slice.
A good workflow starts small:
Narrow the sky: work only in the patch of sky you actually care about.
Inspect one partition: confirm the data looks right before processing everything.
Filter aggressively: drop rows you do not need as early as possible.
Peek at multiple partitions: cheaply verify your function behaves correctly across partition boundaries.
Draw a random sample: get a statistically representative preview without a full compute.
Each technique in this tutorial reduces the amount of data you touch, so you can iterate quickly and scale up only once you are confident in the result.
[1]:
import pandas as pd
import lsdb
from dask.distributed import Client
1. Open a catalog#
Additional Help
For additional information on Dask client creation, please refer to the official Dask documentation and our Dask cluster configuration page for LSDB-specific tips. Dask also provides its own best practices, which are worth consulting.
For tips on accessing remote data, see our Accessing remote data guide.
[2]:
client = Client(n_workers=4, memory_limit="auto")
We open the Pan-STARRS1 (PS1) object catalog. The catalog is loaded lazily: no row data is read yet.
[3]:
ps1_object = lsdb.open_catalog("s3://stpubdata/panstarrs/ps1/public/hats/otmo")
ps1_object
[3]:
| | decMean | decMeanErr | epochMean | gFlags | gMeanPSFMag | gMeanPSFMagErr | iFlags | iMeanPSFMag | iMeanPSFMagErr | nDetections | ng | ni | nr | ny | nz | objID | objInfoFlag | qualityFlag | raMean | raMeanErr | rFlags | rMeanPSFMag | rMeanPSFMagErr | surveyID | yFlags | yMeanPSFMag | yMeanPSFMagErr | zFlags | zMeanPSFMag | zMeanPSFMagErr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| npartitions=9577 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
| Order: 5, Pixel: 0 | double[pyarrow] | double[pyarrow] | double[pyarrow] | int32[pyarrow] | double[pyarrow] | double[pyarrow] | int32[pyarrow] | double[pyarrow] | double[pyarrow] | int16[pyarrow] | int16[pyarrow] | int16[pyarrow] | int16[pyarrow] | int16[pyarrow] | int16[pyarrow] | int64[pyarrow] | int32[pyarrow] | int16[pyarrow] | double[pyarrow] | double[pyarrow] | int32[pyarrow] | double[pyarrow] | double[pyarrow] | int16[pyarrow] | int32[pyarrow] | double[pyarrow] | double[pyarrow] | int32[pyarrow] | double[pyarrow] | double[pyarrow] |
| Order: 5, Pixel: 1 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Order: 5, Pixel: 12286 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Order: 5, Pixel: 12287 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Note the number of columns shown.
Some catalogs come with a pre-specified set of “default columns” that will be loaded automatically (unless the columns='all' keyword is specified). We can also always manually specify which columns we’d like to load.
Let’s reduce the amount of data we’ll handle as we work with this catalog.
[4]:
ps1_object = lsdb.open_catalog(
"s3://stpubdata/panstarrs/ps1/public/hats/otmo",
columns=["objID", "raMean", "decMean", "gMeanPSFMag", "rMeanPSFMag", "iMeanPSFMag", "nDetections"],
)
ps1_object
[4]:
| | objID | raMean | decMean | gMeanPSFMag | rMeanPSFMag | iMeanPSFMag | nDetections |
|---|---|---|---|---|---|---|---|
| npartitions=9577 | | | | | | | |
| Order: 5, Pixel: 0 | int64[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | int16[pyarrow] |
| Order: 5, Pixel: 1 | ... | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| Order: 5, Pixel: 12286 | ... | ... | ... | ... | ... | ... | ... |
| Order: 5, Pixel: 12287 | ... | ... | ... | ... | ... | ... | ... |
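To get a rough feel for how much the column selection saves, the sketch below adds up the fixed-width sizes of the dtypes listed in the two outputs above. It is a back-of-the-envelope estimate only: it assumes 8 bytes per double and int64, 4 per int32, and 2 per int16, and it ignores the index, null bitmaps, and on-disk compression. The names BYTES_PER_VALUE, ALL_COLUMNS, SELECTED, and bytes_per_row are illustrative and not part of LSDB.

```python
# Assumed in-memory width of each dtype shown in the catalog repr.
BYTES_PER_VALUE = {"double": 8, "int64": 8, "int32": 4, "int16": 2}

# Column -> dtype, copied from the full catalog output above.
ALL_COLUMNS = {
    "decMean": "double", "decMeanErr": "double", "epochMean": "double",
    "gFlags": "int32", "gMeanPSFMag": "double", "gMeanPSFMagErr": "double",
    "iFlags": "int32", "iMeanPSFMag": "double", "iMeanPSFMagErr": "double",
    "nDetections": "int16", "ng": "int16", "ni": "int16", "nr": "int16",
    "ny": "int16", "nz": "int16", "objID": "int64", "objInfoFlag": "int32",
    "qualityFlag": "int16", "raMean": "double", "raMeanErr": "double",
    "rFlags": "int32", "rMeanPSFMag": "double", "rMeanPSFMagErr": "double",
    "surveyID": "int16", "yFlags": "int32", "yMeanPSFMag": "double",
    "yMeanPSFMagErr": "double", "zFlags": "int32", "zMeanPSFMag": "double",
    "zMeanPSFMagErr": "double",
}

# The seven columns requested in the second open_catalog call.
SELECTED = ["objID", "raMean", "decMean", "gMeanPSFMag", "rMeanPSFMag", "iMeanPSFMag", "nDetections"]


def bytes_per_row(columns):
    """Sum the assumed per-value widths for the given column names."""
    return sum(BYTES_PER_VALUE[ALL_COLUMNS[name]] for name in columns)


full_bytes = bytes_per_row(ALL_COLUMNS)
slim_bytes = bytes_per_row(SELECTED)
print(f"all columns: {full_bytes} bytes/row, selected: {slim_bytes} bytes/row")
print(f"selected columns carry about {slim_bytes / full_bytes:.0%} of the per-row payload")
```

Under these assumptions, the seven selected columns carry roughly a third of the per-row payload, before any spatial or row filtering is applied.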
2. Region selection#
The simplest way to reduce the amount of data you work with is to restrict the sky area. A cone_search keeps only the partitions that overlap a circle defined by a center (ra, dec) and radius_arcsec, and then keeps only the rows that fall inside that circle.
Starting with a small cone lets you develop and test your pipeline on a tiny fraction of the catalog. Once the pipeline is correct, you can widen the cone or remove it entirely.
[5]:
ps1_cone = ps1_object.cone_search(ra=200.0, dec=30.0, radius_arcsec=1 * 3600)
ps1_cone
[5]:
| | objID | raMean | decMean | gMeanPSFMag | rMeanPSFMag | iMeanPSFMag | nDetections |
|---|---|---|---|---|---|---|---|
| npartitions=4 | | | | | | | |
| Order: 5, Pixel: 2596 | int64[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | int16[pyarrow] |
| Order: 5, Pixel: 2597 | ... | ... | ... | ... | ... | ... | ... |
| Order: 5, Pixel: 2598 | ... | ... | ... | ... | ... | ... | ... |
| Order: 5, Pixel: 2599 | ... | ... | ... | ... | ... | ... | ... |
The number of partitions (npartitions) has dropped from 9577 to 4, so every subsequent step is much cheaper.
We can also use a pre-built ConeSearch object, which lets us reuse the same region across multiple catalogs.
[6]:
from lsdb import ConeSearch
cone = ConeSearch(ra=200.0, dec=30.0, radius_arcsec=1 * 3600)
ps1_cone = ps1_object.search(cone)
ps1_cone
[6]:
| | objID | raMean | decMean | gMeanPSFMag | rMeanPSFMag | iMeanPSFMag | nDetections |
|---|---|---|---|---|---|---|---|
| npartitions=4 | | | | | | | |
| Order: 5, Pixel: 2596 | int64[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | int16[pyarrow] |
| Order: 5, Pixel: 2597 | ... | ... | ... | ... | ... | ... | ... |
| Order: 5, Pixel: 2598 | ... | ... | ... | ... | ... | ... | ... |
| Order: 5, Pixel: 2599 | ... | ... | ... | ... | ... | ... | ... |
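To make the cone criterion concrete, here is a minimal standalone sketch of the row-level test "is this point within radius_arcsec of the center?", using the haversine formula for angular separation. It is not LSDB's implementation; the function names angular_separation_arcsec and in_cone are hypothetical helpers written for this illustration.

```python
import math


def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation between two sky positions (degrees in, arcsec out)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine formula: numerically stable for small separations.
    a = (
        math.sin((dec2 - dec1) / 2) ** 2
        + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2
    )
    separation_rad = 2 * math.asin(min(1.0, math.sqrt(a)))
    return math.degrees(separation_rad) * 3600


def in_cone(ra, dec, center_ra, center_dec, radius_arcsec):
    """Return True when (ra, dec) lies inside the cone."""
    return angular_separation_arcsec(ra, dec, center_ra, center_dec) <= radius_arcsec


# Same cone as in the tutorial: center (200, 30), radius 1 degree.
center_ra, center_dec, radius_arcsec = 200.0, 30.0, 1 * 3600

# A position close to the center and one two degrees north of it.
print(in_cone(199.687771, 29.999566, center_ra, center_dec, radius_arcsec))
print(in_cone(200.0, 32.0, center_ra, center_dec, radius_arcsec))
```

The first position is well inside the cone and the second lies outside it. Note that a difference in right ascension shrinks by cos(dec) on the sky, which is why a plain Euclidean distance in (ra, dec) would be the wrong test away from the equator.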
3. Partition selection#
Even within a small region you may want to look at a single partition in isolation. Use .partitions[i] to index into the catalog by partition number. The result is a lazy Dask DataFrame for that one partition.
This is useful when you want to call .compute() on just one chunk to inspect values or test a function without touching the rest of the catalog.
[7]:
# Look at the first partition
first_partition = ps1_cone.partitions[0]
first_partition.compute()
[7]:
| _healpix_29 | objID | raMean | decMean | gMeanPSFMag | rMeanPSFMag | iMeanPSFMag | nDetections |
|---|---|---|---|---|---|---|---|
| 730840911935789259 | 142942005432773466 | 200.543267 | 29.119078 | -999.0 | -999.0 | 22.1049 | 1 |
| 730840958034592453 | 142942005448135918 | 200.544851 | 29.121114 | -999.0 | -999.0 | -999.0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 730990514047383426 | 143991996861207234 | 199.686118 | 29.99723 | -999.0 | -999.0 | -999.0 | 1 |
| 730990514483634856 | 144001996877830045 | 199.687771 | 29.999566 | -999.0 | -999.0 | 22.3608 | 1 |
181808 rows × 7 columns
You can find the HEALPix pixel that corresponds to a given partition index using get_healpix_pixels().
[8]:
pixels = ps1_cone.get_healpix_pixels()
print(f"All pixels covered: {pixels}")
print(f"Partition 0 covers: {pixels[0]}")
All pixels covered: [Order: 5, Pixel: 2596, Order: 5, Pixel: 2597, Order: 5, Pixel: 2598, Order: 5, Pixel: 2599]
Partition 0 covers: Order: 5, Pixel: 2596
4. Sub-filtering#
Row filters let you trim the data further before any expensive computation. For example, selecting only bright stars reduces the number of rows dramatically and gives you a representative but manageable subset to work with.
[9]:
ps1_cone_and_bright = ps1_cone.query("0 < gMeanPSFMag < 16")
ps1_cone_and_bright
[9]:
| | objID | raMean | decMean | gMeanPSFMag | rMeanPSFMag | iMeanPSFMag | nDetections |
|---|---|---|---|---|---|---|---|
| npartitions=4 | | | | | | | |
| Order: 5, Pixel: 2596 | int64[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | int16[pyarrow] |
| Order: 5, Pixel: 2597 | ... | ... | ... | ... | ... | ... | ... |
| Order: 5, Pixel: 2598 | ... | ... | ... | ... | ... | ... | ... |
| Order: 5, Pixel: 2599 | ... | ... | ... | ... | ... | ... | ... |
[10]:
ps1_cone_and_bright.head(5)
[10]:
| _healpix_29 | objID | raMean | decMean | gMeanPSFMag | rMeanPSFMag | iMeanPSFMag | nDetections |
|---|---|---|---|---|---|---|---|
| 730844140320140186 | 142832001473917106 | 200.147265 | 29.030416 | 14.7094 | 14.0439 | 13.8106 | 55 |
| 730844900543284476 | 142842001903509791 | 200.190449 | 29.04093 | 15.1536 | 14.6864 | 14.5175 | 55 |
| 730845199124677148 | 142912002434617897 | 200.243435 | 29.097786 | 15.3492 | 14.9563 | 14.8328 | 69 |
| 730845991608593391 | 142852003372855574 | 200.337294 | 29.045833 | 15.9104 | 15.4282 | 15.2398 | 68 |
| 730846292154371814 | 142912004225284029 | 200.422788 | 29.09478 | 12.0536 | 11.207 | 10.9045 | 49 |
5 rows × 7 columns
Filters compose: you can chain a spatial filter with a row filter and LSDB will push both into the same pipeline.
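The lower bound in the query above matters. Many magnitudes in the earlier outputs are -999.0, which we assume here is the PS1 placeholder for a band with no measurement. The plain-pandas sketch below, built from a few gMeanPSFMag values copied from the outputs in this tutorial, shows how an upper bound alone would let those placeholder rows through. The variable names are illustrative.

```python
import pandas as pd

# Four rows copied from the outputs above (objID and g-band magnitude only).
rows = pd.DataFrame(
    {
        "objID": [
            142942005432773466,
            142832001473917106,
            142912004225284029,
            143281990795017790,
        ],
        "gMeanPSFMag": [-999.0, 14.7094, 12.0536, 18.150801],
    }
)

# Upper bound only: the -999.0 placeholder also satisfies "< 16".
upper_only = rows.query("gMeanPSFMag < 16")

# Both bounds, as in the tutorial: the placeholder row is excluded.
both_bounds = rows.query("0 < gMeanPSFMag < 16")

print(f"upper bound only: {len(upper_only)} rows")
print(f"both bounds:      {len(both_bounds)} rows")
```

With only the upper bound, the row with the -999.0 placeholder is counted as a bright star; adding the lower bound keeps just the two rows with real bright magnitudes.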
5. Peeking at every partition#
map_partitions applies a function to each partition individually. Passing pd.DataFrame.head (or a small wrapper around it) is a cheap way to fetch the first few rows of every partition without loading the full catalog into memory.
This is especially useful for checking that a transformation produces the expected columns and values across all partition boundaries.
[11]:
# Grab the first 3 rows from each partition, then compute
sample_per_partition = ps1_cone.map_partitions(lambda df: df.head(3))
sample_per_partition.compute()
[11]:
| _healpix_29 | objID | raMean | decMean | gMeanPSFMag | rMeanPSFMag | iMeanPSFMag | nDetections |
|---|---|---|---|---|---|---|---|
| 730840911935789259 | 142942005432773466 | 200.543267 | 29.119078 | -999.0 | -999.0 | 22.1049 | 1 |
| 730840958034592453 | 142942005448135918 | 200.544851 | 29.121114 | -999.0 | -999.0 | -999.0 | 1 |
| 730840958090282340 | 142942005456926339 | 200.545683 | 29.121461 | -999.0 | -999.0 | -999.0 | 1 |
| 731028614286063577 | 142972005930353041 | 200.593027 | 29.143732 | -999.0 | 21.5769 | -999.0 | 1 |
| 731028614545892412 | 142972005940344460 | 200.594009 | 29.144912 | 22.1401 | -999.0 | -999.0 | 1 |
| 731028614842509175 | 142972005928714895 | 200.59283 | 29.145228 | -999.0 | 21.4786 | -999.0 | 2 |
| 731343459159795912 | 143281990771334860 | 199.077096 | 29.40356 | -999.0 | -999.0 | -999.0 | 1 |
| 731343462332759277 | 143281990795017790 | 199.079454 | 29.406004 | 18.150801 | -999.0 | -999.0 | 1 |
| 731343462646587519 | 143281990763998560 | 199.076392 | 29.406652 | -999.0 | 21.8297 | -999.0 | 1 |
| 731553465279730385 | 144001996876174113 | 199.687558 | 30.002921 | 22.3853 | 21.153799 | 20.791201 | 50 |
| 731553465557381929 | 144001996902904284 | 199.690246 | 30.003063 | 21.1381 | -999.0 | -999.0 | 1 |
| 731553465612402991 | 144001996909224697 | 199.690881 | 30.00342 | 20.8773 | -999.0 | -999.0 | 1 |
12 rows × 7 columns
You can pass pd.DataFrame.head directly as the function, along with n as an extra keyword argument.
[12]:
sample_per_partition = ps1_cone.map_partitions(pd.DataFrame.head, n=3)
sample_per_partition.compute()
[12]:
| _healpix_29 | objID | raMean | decMean | gMeanPSFMag | rMeanPSFMag | iMeanPSFMag | nDetections |
|---|---|---|---|---|---|---|---|
| 730840911935789259 | 142942005432773466 | 200.543267 | 29.119078 | -999.0 | -999.0 | 22.1049 | 1 |
| 730840958034592453 | 142942005448135918 | 200.544851 | 29.121114 | -999.0 | -999.0 | -999.0 | 1 |
| 730840958090282340 | 142942005456926339 | 200.545683 | 29.121461 | -999.0 | -999.0 | -999.0 | 1 |
| 731028614286063577 | 142972005930353041 | 200.593027 | 29.143732 | -999.0 | 21.5769 | -999.0 | 1 |
| 731028614545892412 | 142972005940344460 | 200.594009 | 29.144912 | 22.1401 | -999.0 | -999.0 | 1 |
| 731028614842509175 | 142972005928714895 | 200.59283 | 29.145228 | -999.0 | 21.4786 | -999.0 | 2 |
| 731343459159795912 | 143281990771334860 | 199.077096 | 29.40356 | -999.0 | -999.0 | -999.0 | 1 |
| 731343462332759277 | 143281990795017790 | 199.079454 | 29.406004 | 18.150801 | -999.0 | -999.0 | 1 |
| 731343462646587519 | 143281990763998560 | 199.076392 | 29.406652 | -999.0 | 21.8297 | -999.0 | 1 |
| 731553465279730385 | 144001996876174113 | 199.687558 | 30.002921 | 22.3853 | 21.153799 | 20.791201 | 50 |
| 731553465557381929 | 144001996902904284 | 199.690246 | 30.003063 | 21.1381 | -999.0 | -999.0 | 1 |
| 731553465612402991 | 144001996909224697 | 199.690881 | 30.00342 | 20.8773 | -999.0 | -999.0 | 1 |
12 rows × 7 columns
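Conceptually, the two calls above do the same thing: run a function on each partition's DataFrame and stack the results. The eager, pandas-only sketch below mimics that behavior on two small made-up partitions. It is only a mental model, not how LSDB or Dask schedule the work, and map_partitions_sketch is a hypothetical helper.

```python
import pandas as pd


def map_partitions_sketch(partitions, func, **kwargs):
    """Apply func to every partition and concatenate the results (eagerly)."""
    return pd.concat([func(part, **kwargs) for part in partitions])


# Two made-up partitions with different row counts and index ranges.
partitions = [
    pd.DataFrame({"raMean": [200.1, 200.2, 200.3, 200.4]}, index=[10, 11, 12, 13]),
    pd.DataFrame({"raMean": [199.1, 199.2, 199.3, 199.4, 199.5]}, index=[20, 21, 22, 23, 24]),
]

# A lambda wrapper and the unbound method with a keyword argument are equivalent.
via_lambda = map_partitions_sketch(partitions, lambda df: df.head(3))
via_method = map_partitions_sketch(partitions, pd.DataFrame.head, n=3)

print(via_method)
```

Each partition contributes its own first three rows, which is why the real outputs above contain 12 rows for 4 partitions rather than the first 12 rows of the catalog.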
6. Random sample#
.random_sample(n) draws approximately n rows distributed proportionally across all partitions. Unlike .head(), which always returns rows from the first partitions, a random sample is representative of the whole catalog.
Use .random_sample() when you need a statistical cross-section of the data, for example to estimate a distribution or spot-check the output of a filter.
Pass a seed for reproducible results.
[13]:
sample = ps1_cone.random_sample(n=20, seed=42)
sample
[13]:
| _healpix_29 | objID | raMean | decMean | gMeanPSFMag | rMeanPSFMag | iMeanPSFMag | nDetections |
|---|---|---|---|---|---|---|---|
| 730844268050948686 | 142852001373816342 | 200.137386 | 29.046464 | -999.0 | -999.0 | -999.0 | 1 |
| 730989043331112969 | 143791995315125687 | 199.531497 | 29.82925 | -999.0 | -999.0 | 18.3598 | 1 |
| 730926707248634957 | 142851999598341249 | 199.959858 | 29.042176 | -999.0 | 20.9802 | -999.0 | 1 |
| 730982536008697912 | 143521994657873757 | 199.465796 | 29.602795 | -999.0 | -999.0 | -999.0 | 2 |
| 730963334950567483 | 143261995665830974 | 199.566563 | 29.383677 | -999.0 | -999.0 | -999.0 | 0 |
| 731175403661086290 | 143851998527645790 | 199.852767 | 29.879339 | -999.0 | -999.0 | -999.0 | 1 |
| 731180320653276898 | 144081999691175722 | 199.969165 | 30.07097 | -999.0 | -999.0 | -999.0 | 1 |
| 731155474866300796 | 143932008420672300 | 200.842092 | 29.943096 | 22.135799 | -999.0 | -999.0 | 1 |
| 731197537852881040 | 144492004307101432 | 200.430716 | 30.409051 | -999.0 | 22.1796 | -999.0 | 1 |
| 731170999643731892 | 143842002019457849 | 200.201928 | 29.872711 | -999.0 | -999.0 | -999.0 | 1 |
| 731055202718820346 | 143632008710849180 | 200.871113 | 29.698824 | -999.0 | -999.0 | 22.003201 | 1 |
| 731134015411514532 | 143352002608060333 | 200.260821 | 29.458154 | -999.0 | -999.0 | 22.2115 | 1 |
| 731376270565617841 | 144261992830328522 | 199.283076 | 30.223271 | -999.0 | -999.0 | 21.1803 | 1 |
| 731364398762073972 | 143821994989397857 | 199.498968 | 29.856063 | -999.0 | -999.0 | -999.0 | 1 |
| 731358378870915554 | 143771988697657383 | 198.86976 | 29.813933 | 21.095699 | -999.0 | -999.0 | 1 |
| 731356009656165493 | 143691989932453073 | 198.993257 | 29.743755 | 22.583599 | -999.0 | -999.0 | 1 |
| 731358123962301091 | 143791989650376474 | 198.965047 | 29.829922 | -999.0 | 21.5166 | -999.0 | 1 |
| 731583062024405850 | 144881997428337812 | 199.742829 | 30.739355 | -999.0 | -999.0 | -999.0 | 1 |
| 731556046543862818 | 144201996584775344 | 199.658467 | 30.170666 | -999.0 | -999.0 | -999.0 | 1 |
| 731591348034351183 | 144641993066913148 | 199.306691 | 30.535476 | -999.0 | -999.0 | -999.0 | 0 |
20 rows × 7 columns
If you only want to sample from a single partition, use .sample(partition_id, n) instead. This avoids touching any other partition.
[14]:
single_partition_sample = ps1_cone.sample(partition_id=0, n=5, seed=42)
single_partition_sample
[14]:
| _healpix_29 | objID | raMean | decMean | gMeanPSFMag | rMeanPSFMag | iMeanPSFMag | nDetections |
|---|---|---|---|---|---|---|---|
| 730986934050957759 | 143721996413759898 | 199.641316 | 29.77443 | -999.0 | 17.7243 | -999.0 | 1 |
| 730962212147684672 | 143131994672218264 | 199.467215 | 29.281401 | 18.767401 | -999.0 | -999.0 | 1 |
| 730979986766383795 | 143591997990286372 | 199.799022 | 29.663177 | 21.001101 | -999.0 | -999.0 | 1 |
| 730976899401309098 | 143501996705113072 | 199.670464 | 29.585438 | -999.0 | -999.0 | 21.7351 | 2 |
| 730937483769059421 | 143091997002705155 | 199.70026 | 29.2455 | -999.0 | -999.0 | 21.798201 | 1 |
5 rows × 7 columns
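One way to picture "distributed proportionally across all partitions" is to choose, for each of the n draws, which partition it comes from with probability proportional to that partition's row count. The standard-library sketch below does exactly that. It is an illustration of the idea under that assumption, not LSDB's actual sampling code; allocate_sample is a hypothetical helper, and every partition size except the first (181808 rows, from the partition output earlier) is made up.

```python
import random
from collections import Counter


def allocate_sample(partition_sizes, n, seed=None):
    """Decide how many of the n sampled rows come from each partition."""
    rng = random.Random(seed)  # a fixed seed makes the allocation reproducible
    picks = rng.choices(range(len(partition_sizes)), weights=partition_sizes, k=n)
    counts = Counter(picks)
    return [counts.get(i, 0) for i in range(len(partition_sizes))]


# First size is from the tutorial output; the other three are hypothetical.
partition_sizes = [181808, 60000, 150000, 90000]

allocation = allocate_sample(partition_sizes, n=20, seed=42)
repeat = allocate_sample(partition_sizes, n=20, seed=42)

print(f"rows per partition: {allocation}")
print(f"same seed gives the same allocation: {allocation == repeat}")
```

Larger partitions tend to receive more of the draws, and reusing the seed reproduces the same split, which mirrors the role of the seed argument in the calls above.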
Closing the Dask client#
[15]:
client.close()
About#
Authors: Olivia Lynn
Last updated on: May 18, 2026
If you use lsdb for published research, please cite it following the project's citation instructions.