Small-Scale Analysis#
In this tutorial, we will cover strategies for working with a small slice of a large catalog before committing to a full-scale computation:
narrow the sky area with a spatial filter (e.g., cone_search)
inspect a single partition with .partitions[i]
filter rows to a manageable subset (e.g., bright stars)
peek at the first few rows of every partition with map_partitions
draw a random sample with .random_sample()
Introduction#
Large astronomical catalogs can contain billions of rows spread across thousands of partitions. Running a pipeline on the full dataset is expensive, and it is easy to waste hours on a bug that could have been caught in seconds on a small slice.
A good workflow starts small:
Narrow the sky: work only in the patch of sky you actually care about.
Inspect one partition: confirm the data looks right before processing everything.
Filter aggressively: drop rows you do not need as early as possible.
Peek at multiple partitions: cheaply verify your function behaves correctly across partition boundaries.
Draw a random sample: get a statistically representative preview without a full compute.
Each technique in this tutorial reduces the amount of data you touch, so you can iterate quickly and scale up only once you are confident in the result.
[1]:
import pandas as pd
import lsdb
from dask.distributed import Client
1. Open a catalog#
Additional Help
For additional information on dask client creation, please refer to the official Dask documentation and our Dask cluster configuration page for LSDB-specific tips. Note that dask also provides its own best practices, which may also be useful to consult.
For tips on accessing remote data, see our Accessing remote data guide.
[2]:
client = Client(n_workers=4, memory_limit="auto")
We open the Pan-STARRS1 (PS1) object catalog. The catalog is loaded lazily: no row data is read yet.
[3]:
ps1_object = lsdb.open_catalog("s3://stpubdata/panstarrs/ps1/public/hats/otmo")
ps1_object
[3]:
| | decMean | decMeanErr | epochMean | gFlags | gMeanPSFMag | gMeanPSFMagErr | iFlags | iMeanPSFMag | iMeanPSFMagErr | nDetections | ng | ni | nr | ny | nz | objID | objInfoFlag | qualityFlag | raMean | raMeanErr | rFlags | rMeanPSFMag | rMeanPSFMagErr | surveyID | yFlags | yMeanPSFMag | yMeanPSFMagErr | zFlags | zMeanPSFMag | zMeanPSFMagErr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| npartitions=9577 | ||||||||||||||||||||||||||||||
| Order: 5, Pixel: 0 | double[pyarrow] | double[pyarrow] | double[pyarrow] | int32[pyarrow] | double[pyarrow] | double[pyarrow] | int32[pyarrow] | double[pyarrow] | double[pyarrow] | int16[pyarrow] | int16[pyarrow] | int16[pyarrow] | int16[pyarrow] | int16[pyarrow] | int16[pyarrow] | int64[pyarrow] | int32[pyarrow] | int16[pyarrow] | double[pyarrow] | double[pyarrow] | int32[pyarrow] | double[pyarrow] | double[pyarrow] | int16[pyarrow] | int32[pyarrow] | double[pyarrow] | double[pyarrow] | int32[pyarrow] | double[pyarrow] | double[pyarrow] |
| Order: 5, Pixel: 1 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Order: 5, Pixel: 12286 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Order: 5, Pixel: 12287 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Note the number of columns shown.
Some catalogs come with a pre-specified set of “default columns” that will be loaded automatically (unless the columns='all' keyword is specified). We can also always manually specify which columns we’d like to load.
Let’s reduce the amount of data we’ll handle as we work with this catalog.
[4]:
ps1_object = lsdb.open_catalog(
"s3://stpubdata/panstarrs/ps1/public/hats/otmo",
columns=["objID", "raMean", "decMean", "gMeanPSFMag", "rMeanPSFMag", "iMeanPSFMag", "nDetections"],
)
ps1_object
[4]:
| | objID | raMean | decMean | gMeanPSFMag | rMeanPSFMag | iMeanPSFMag | nDetections |
|---|---|---|---|---|---|---|---|
| npartitions=9577 | |||||||
| Order: 5, Pixel: 0 | int64[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | int16[pyarrow] |
| Order: 5, Pixel: 1 | ... | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| Order: 5, Pixel: 12286 | ... | ... | ... | ... | ... | ... | ... |
| Order: 5, Pixel: 12287 | ... | ... | ... | ... | ... | ... | ... |
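For a rough sense of what the column selection saves, the sketch below adds up fixed-width byte sizes for the dtypes listed in the two previews above. It is only a back-of-the-envelope estimate: it assumes 8 bytes per double or int64, 4 per int32, and 2 per int16, and it ignores the index, null bitmaps, and on-disk compression. The variable names are illustrative and not part of LSDB.

```python
# Fixed-width byte sizes for the Arrow-backed dtypes shown in the previews.
BYTES = {"double": 8, "int64": 8, "int32": 4, "int16": 2}

# Column counts per dtype, read off the 30-column preview.
all_columns = {"double": 15, "int64": 1, "int32": 6, "int16": 8}

# The seven columns kept in the reduced catalog, with their dtypes.
selected = {
    "objID": "int64",
    "raMean": "double",
    "decMean": "double",
    "gMeanPSFMag": "double",
    "rMeanPSFMag": "double",
    "iMeanPSFMag": "double",
    "nDetections": "int16",
}

bytes_all = sum(BYTES[dtype] * count for dtype, count in all_columns.items())
bytes_selected = sum(BYTES[dtype] for dtype in selected.values())

print(f"all columns:      {bytes_all} bytes/row")
print(f"selected columns: {bytes_selected} bytes/row")
print(f"fraction kept:    {bytes_selected / bytes_all:.2f}")
```

Under these assumptions, the seven selected columns account for roughly a third of the per-row footprint of the full column set.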
2. Region selection#
The simplest way to reduce the amount of data you work with is to restrict the sky area. cone_search keeps only the partitions that overlap a circle defined by a center (ra, dec) and a radius in arcseconds (radius_arcsec).
Starting with a small cone lets you develop and test your pipeline on a tiny fraction of the catalog. Once the pipeline is correct, you can widen the cone or remove it entirely.
[5]:
ps1_cone = ps1_object.cone_search(ra=200.0, dec=30.0, radius_arcsec=1 * 3600)
ps1_cone
[5]:
| | objID | raMean | decMean | gMeanPSFMag | rMeanPSFMag | iMeanPSFMag | nDetections |
|---|---|---|---|---|---|---|---|
| npartitions=4 | |||||||
| Order: 5, Pixel: 2596 | int64[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | int16[pyarrow] |
| Order: 5, Pixel: 2597 | ... | ... | ... | ... | ... | ... | ... |
| Order: 5, Pixel: 2598 | ... | ... | ... | ... | ... | ... | ... |
| Order: 5, Pixel: 2599 | ... | ... | ... | ... | ... | ... | ... |
npartitions has dropped from 9577 to 4, so every subsequent step is much cheaper.
We can also use a pre-built ConeSearch object, which lets us reuse the same region across multiple catalogs.
[6]:
from lsdb import ConeSearch
cone = ConeSearch(ra=200.0, dec=30.0, radius_arcsec=1 * 3600)
ps1_cone = ps1_object.search(cone)
ps1_cone
[6]:
| | objID | raMean | decMean | gMeanPSFMag | rMeanPSFMag | iMeanPSFMag | nDetections |
|---|---|---|---|---|---|---|---|
| npartitions=4 | |||||||
| Order: 5, Pixel: 2596 | int64[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | int16[pyarrow] |
| Order: 5, Pixel: 2597 | ... | ... | ... | ... | ... | ... | ... |
| Order: 5, Pixel: 2598 | ... | ... | ... | ... | ... | ... | ... |
| Order: 5, Pixel: 2599 | ... | ... | ... | ... | ... | ... | ... |
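For intuition about what "inside the cone" means for an individual row, the following sketch computes the great-circle separation between a cone center and a few points using the haversine formula. This is a standalone NumPy illustration, not LSDB's implementation; the function name and the test coordinates are hypothetical.

```python
import numpy as np


def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds (haversine); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(np.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = np.sin((dec2 - dec1) / 2.0)
    sin_dra = np.sin((ra2 - ra1) / 2.0)
    hav = sin_ddec**2 + np.cos(dec1) * np.cos(dec2) * sin_dra**2
    # Clip guards against tiny floating-point excursions outside [0, 1].
    hav = np.clip(hav, 0.0, 1.0)
    return np.degrees(2.0 * np.arcsin(np.sqrt(hav))) * 3600.0


# Hypothetical points around the cone center used in this section.
ra = np.array([200.0, 200.5, 202.0])
dec = np.array([30.0, 30.5, 30.0])

sep = angular_separation_arcsec(200.0, 30.0, ra, dec)
inside = sep <= 1 * 3600  # same 1-degree radius as above, in arcseconds

print(inside)
```

Note that a 2-degree offset in right ascension at a declination of 30 degrees corresponds to less than 2 degrees on the sky, because lines of constant right ascension converge toward the poles; the third point is still outside the 1-degree cone.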
3. Partition selection#
Even within a small region you may want to look at a single partition in isolation. Use .partitions[i] to index into the catalog by partition number. The result is a lazy Dask DataFrame for that one partition.
This is useful when you want to call .compute() on just one chunk to inspect values or test a function without touching the rest of the catalog.
[7]:
# Look at the first partition
first_partition = ps1_cone.partitions[0]
first_partition.compute()
[7]:
| _healpix_29 | objID | raMean | decMean | gMeanPSFMag | rMeanPSFMag | iMeanPSFMag | nDetections |
|---|---|---|---|---|---|---|---|
| 730840911935789259 | 142942005432773466 | 200.543267 | 29.119078 | -999.0 | -999.0 | 22.1049 | 1 |
| 730840958034592453 | 142942005448135918 | 200.544851 | 29.121114 | -999.0 | -999.0 | -999.0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 730990514047383426 | 143991996861207234 | 199.686118 | 29.99723 | -999.0 | -999.0 | -999.0 | 1 |
| 730990514483634856 | 144001996877830045 | 199.687771 | 29.999566 | -999.0 | -999.0 | 22.3608 | 1 |
181808 rows × 7 columns
You can find the HEALPix pixel that corresponds to a given partition index using get_healpix_pixels().
[8]:
pixels = ps1_cone.get_healpix_pixels()
print(f"All pixels covered: {pixels}")
print(f"Partition 0 covers: {pixels[0]}")
All pixels covered: [Order: 5, Pixel: 2596, Order: 5, Pixel: 2597, Order: 5, Pixel: 2598, Order: 5, Pixel: 2599]
Partition 0 covers: Order: 5, Pixel: 2596
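As a quick sanity check on the pixel numbers shown in these outputs, the sketch below uses the standard HEALPix relation that the sky is divided into 12 base pixels, each split into 4 at every additional order. The helper name is illustrative and not part of LSDB, and the calculation assumes every pixel is at the same order.

```python
import math


def n_healpix_pixels(order):
    """Number of HEALPix pixels covering the whole sky at a given order."""
    return 12 * 4**order


order = 5
n_pixels = n_healpix_pixels(order)

# The full sky is 4*pi steradians; convert to square degrees.
full_sky_deg2 = 4 * math.pi * (180 / math.pi) ** 2
pixel_area_deg2 = full_sky_deg2 / n_pixels

print(n_pixels)  # 12288, so valid order-5 pixel indices run from 0 to 12287
print(f"{pixel_area_deg2:.2f} square degrees per order-5 pixel")
```

This matches the last partition in the catalog preview, "Order: 5, Pixel: 12287", which is the highest pixel index available at order 5.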
4. Sub-filtering#
Row filters let you trim the data further before any expensive computation. For example, selecting only bright stars reduces the number of rows dramatically and gives you a representative but manageable subset to work with.
[9]:
ps1_cone_and_bright = ps1_cone.query("0 < gMeanPSFMag < 16")
ps1_cone_and_bright
[9]:
| | objID | raMean | decMean | gMeanPSFMag | rMeanPSFMag | iMeanPSFMag | nDetections |
|---|---|---|---|---|---|---|---|
| npartitions=4 | |||||||
| Order: 5, Pixel: 2596 | int64[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | int16[pyarrow] |
| Order: 5, Pixel: 2597 | ... | ... | ... | ... | ... | ... | ... |
| Order: 5, Pixel: 2598 | ... | ... | ... | ... | ... | ... | ... |
| Order: 5, Pixel: 2599 | ... | ... | ... | ... | ... | ... | ... |
[10]:
ps1_cone_and_bright.head(5)
[10]:
| _healpix_29 | objID | raMean | decMean | gMeanPSFMag | rMeanPSFMag | iMeanPSFMag | nDetections |
|---|---|---|---|---|---|---|---|
| 730844140320140186 | 142832001473917106 | 200.147265 | 29.030416 | 14.7094 | 14.0439 | 13.8106 | 55 |
| 730844900543284476 | 142842001903509791 | 200.190449 | 29.04093 | 15.1536 | 14.6864 | 14.5175 | 55 |
| 730845199124677148 | 142912002434617897 | 200.243435 | 29.097786 | 15.3492 | 14.9563 | 14.8328 | 69 |
| 730845991608593391 | 142852003372855574 | 200.337294 | 29.045833 | 15.9104 | 15.4282 | 15.2398 | 68 |
| 730846292154371814 | 142912004225284029 | 200.422788 | 29.09478 | 12.0536 | 11.207 | 10.9045 | 49 |
5 rows × 7 columns
Filters compose: you can chain a spatial filter with a row filter, and LSDB will apply both as part of the same lazy pipeline.
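The lower bound in the query above matters because the magnitude columns in the outputs contain -999.0 values, which appear to be placeholders for missing measurements. The plain-pandas sketch below, using a toy DataFrame built from a few values seen in this tutorial, shows how an upper bound alone would let those placeholders through.

```python
import pandas as pd

# Toy rows; -999.0 is assumed here to mark a missing measurement.
df = pd.DataFrame(
    {
        "gMeanPSFMag": [14.7094, -999.0, 22.1401, 12.0536],
        "rMeanPSFMag": [14.0439, 21.5769, -999.0, 11.207],
    }
)

# An upper bound alone keeps the -999.0 row, as if it were extremely bright.
upper_only = df.query("gMeanPSFMag < 16")

# Adding the lower bound drops the placeholder row.
bright = df.query("0 < gMeanPSFMag < 16")

print(len(upper_only), len(bright))
```

The same reasoning applies to any other magnitude column you filter on.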
5. Peeking at every partition#
map_partitions applies a function to each partition individually. Passing pd.DataFrame.head (or a small wrapper around it) is a cheap way to fetch the first few rows of every partition without loading the full catalog into memory.
This is especially useful for checking that a transformation produces the expected columns and values across all partition boundaries.
[11]:
# Grab the first 3 rows from each partition, then compute
sample_per_partition = ps1_cone.map_partitions(lambda df: df.head(3))
sample_per_partition.compute()
[11]:
| _healpix_29 | objID | raMean | decMean | gMeanPSFMag | rMeanPSFMag | iMeanPSFMag | nDetections |
|---|---|---|---|---|---|---|---|
| 730840911935789259 | 142942005432773466 | 200.543267 | 29.119078 | -999.0 | -999.0 | 22.1049 | 1 |
| 730840958034592453 | 142942005448135918 | 200.544851 | 29.121114 | -999.0 | -999.0 | -999.0 | 1 |
| 730840958090282340 | 142942005456926339 | 200.545683 | 29.121461 | -999.0 | -999.0 | -999.0 | 1 |
| 731028614286063577 | 142972005930353041 | 200.593027 | 29.143732 | -999.0 | 21.5769 | -999.0 | 1 |
| 731028614545892412 | 142972005940344460 | 200.594009 | 29.144912 | 22.1401 | -999.0 | -999.0 | 1 |
| 731028614842509175 | 142972005928714895 | 200.59283 | 29.145228 | -999.0 | 21.4786 | -999.0 | 2 |
| 731343459159795912 | 143281990771334860 | 199.077096 | 29.40356 | -999.0 | -999.0 | -999.0 | 1 |
| 731343462332759277 | 143281990795017790 | 199.079454 | 29.406004 | 18.150801 | -999.0 | -999.0 | 1 |
| 731343462646587519 | 143281990763998560 | 199.076392 | 29.406652 | -999.0 | 21.8297 | -999.0 | 1 |
| 731553465279730385 | 144001996876174113 | 199.687558 | 30.002921 | 22.3853 | 21.153799 | 20.791201 | 50 |
| 731553465557381929 | 144001996902904284 | 199.690246 | 30.003063 | 21.1381 | -999.0 | -999.0 | 1 |
| 731553465612402991 | 144001996909224697 | 199.690881 | 30.00342 | 20.8773 | -999.0 | -999.0 | 1 |
12 rows × 7 columns
You can pass pd.DataFrame.head directly as the function, along with n as an extra keyword argument.
[12]:
sample_per_partition = ps1_cone.map_partitions(pd.DataFrame.head, n=3)
sample_per_partition.compute()
[12]:
| _healpix_29 | objID | raMean | decMean | gMeanPSFMag | rMeanPSFMag | iMeanPSFMag | nDetections |
|---|---|---|---|---|---|---|---|
| 730840911935789259 | 142942005432773466 | 200.543267 | 29.119078 | -999.0 | -999.0 | 22.1049 | 1 |
| 730840958034592453 | 142942005448135918 | 200.544851 | 29.121114 | -999.0 | -999.0 | -999.0 | 1 |
| 730840958090282340 | 142942005456926339 | 200.545683 | 29.121461 | -999.0 | -999.0 | -999.0 | 1 |
| 731028614286063577 | 142972005930353041 | 200.593027 | 29.143732 | -999.0 | 21.5769 | -999.0 | 1 |
| 731028614545892412 | 142972005940344460 | 200.594009 | 29.144912 | 22.1401 | -999.0 | -999.0 | 1 |
| 731028614842509175 | 142972005928714895 | 200.59283 | 29.145228 | -999.0 | 21.4786 | -999.0 | 2 |
| 731343459159795912 | 143281990771334860 | 199.077096 | 29.40356 | -999.0 | -999.0 | -999.0 | 1 |
| 731343462332759277 | 143281990795017790 | 199.079454 | 29.406004 | 18.150801 | -999.0 | -999.0 | 1 |
| 731343462646587519 | 143281990763998560 | 199.076392 | 29.406652 | -999.0 | 21.8297 | -999.0 | 1 |
| 731553465279730385 | 144001996876174113 | 199.687558 | 30.002921 | 22.3853 | 21.153799 | 20.791201 | 50 |
| 731553465557381929 | 144001996902904284 | 199.690246 | 30.003063 | 21.1381 | -999.0 | -999.0 | 1 |
| 731553465612402991 | 144001996909224697 | 199.690881 | 30.00342 | 20.8773 | -999.0 | -999.0 | 1 |
12 rows × 7 columns
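To see why the two calls in this section give the same result, the sketch below emulates the idea with plain pandas: a list of small DataFrames stands in for partitions, and a hypothetical helper applies a function to each one independently, forwarding extra keyword arguments. It is a conceptual model only, not how LSDB or Dask schedule the work.

```python
import pandas as pd


def peek_partitions(partitions, func, **kwargs):
    """Apply func to each partition independently, then concatenate the results."""
    return pd.concat([func(part, **kwargs) for part in partitions])


# Two toy "partitions" of different sizes (hypothetical data).
partitions = [
    pd.DataFrame({"mag": [14.7, 15.2, 15.3, 15.9]}, index=[10, 11, 12, 13]),
    pd.DataFrame({"mag": [12.1, 13.4]}, index=[20, 21]),
]

# A lambda wrapper and a direct function with a keyword argument are equivalent.
with_lambda = peek_partitions(partitions, lambda df: df.head(3))
with_kwargs = peek_partitions(partitions, pd.DataFrame.head, n=3)

print(with_lambda)
```

A partition with fewer than three rows simply contributes all of its rows, so the total is at most three rows per partition.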
6. Random sample#
.random_sample(n) draws approximately n rows distributed proportionally across all partitions. Unlike .head(), which always returns rows from the first partitions, a random sample is representative of the whole catalog.
Use .random_sample() when you need a statistical cross-section of the data, for example to estimate a distribution or spot-check the output of a filter.
Pass a seed for reproducible results.
[13]:
sample = ps1_cone.random_sample(n=20, seed=42)
sample
[13]:
| _healpix_29 | objID | raMean | decMean | gMeanPSFMag | rMeanPSFMag | iMeanPSFMag | nDetections |
|---|---|---|---|---|---|---|---|
| 730987592464451909 | 143791998474879882 | 199.847488 | 29.832714 | -999.0 | -999.0 | 18.0982 | 1 |
| 730979768191957027 | 143561997693775801 | 199.768993 | 29.637711 | -999.0 | 21.940599 | 21.7822 | 3 |
| 730941714285338748 | 143082000227339558 | 200.022729 | 29.240811 | -999.0 | -999.0 | -999.0 | 1 |
| 731247694065860037 | 144882006466747901 | 200.646649 | 30.739434 | -999.0 | -999.0 | -999.0 | 1 |
| 731243980697046059 | 144802007890585255 | 200.789179 | 30.670585 | -999.0 | 21.8682 | -999.0 | 2 |
| 731241340420419965 | 144692009407408602 | 200.940737 | 30.581694 | -999.0 | -999.0 | 20.7384 | 1 |
| 731146208224591432 | 143792004435211822 | 200.44355 | 29.826019 | -999.0 | -999.0 | 22.076099 | 1 |
| 731192000952225339 | 144472006113641954 | 200.611326 | 30.392848 | -999.0 | -999.0 | -999.0 | 1 |
| 731156616980993426 | 144042009426141715 | 200.942654 | 30.034298 | -999.0 | -999.0 | -999.0 | 1 |
| 731211683143986849 | 144362009918346224 | 200.991829 | 30.304672 | -999.0 | -999.0 | -999.0 | 1 |
| 731144478833693507 | 143622004004660485 | 200.40048 | 29.683231 | -999.0 | -999.0 | -999.0 | 1 |
| 731197395115113484 | 144452003568216440 | 200.356829 | 30.379902 | -999.0 | -999.0 | 21.8801 | 1 |
| 731402936031320483 | 144461991756951160 | 199.175695 | 30.383824 | -999.0 | -999.0 | -999.0 | 1 |
| 731377376507491124 | 144361993567192715 | 199.356719 | 30.30174 | -999.0 | -999.0 | -999.0 | 1 |
| 731372143592409611 | 144051991881616369 | 199.188154 | 30.046474 | -999.0 | 22.187201 | -999.0 | 1 |
| 731633434171266716 | 145172001622245865 | 200.162236 | 30.979387 | -999.0 | -999.0 | 20.2589 | 1 |
| 731555288167844199 | 144201996972151629 | 199.697123 | 30.167505 | -999.0 | -999.0 | -999.0 | 1 |
| 731563861512460543 | 144431996518677399 | 199.651808 | 30.364011 | 21.9706 | -999.0 | -999.0 | 1 |
| 731564147993998431 | 144441995693318690 | 199.569358 | 30.373251 | 20.190201 | 20.1436 | 19.7288 | 16 |
| 731589703962829699 | 144591993472016168 | 199.347229 | 30.496328 | -999.0 | -999.0 | -999.0 | 1 |
20 rows × 7 columns
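The idea of drawing rows "proportionally across all partitions" with a reproducible seed can be sketched in plain pandas and NumPy as follows. This is an illustrative model under simple assumptions (partition sizes are known up front, and draws are allocated with a multinomial split); it is not LSDB's actual implementation, and the function name is hypothetical.

```python
import numpy as np
import pandas as pd


def proportional_sample(partitions, n, seed=None):
    """Draw about n rows, allocating draws to partitions in proportion to size."""
    rng = np.random.default_rng(seed)
    sizes = np.array([len(part) for part in partitions])
    # Split the n draws across partitions with probability proportional to size.
    counts = rng.multinomial(n, sizes / sizes.sum())
    pieces = [
        part.sample(n=min(int(k), len(part)), random_state=rng)
        for part, k in zip(partitions, counts)
    ]
    return pd.concat(pieces)


# Two toy partitions: one large, one small (hypothetical data).
partitions = [
    pd.DataFrame({"mag": np.arange(900.0)}),
    pd.DataFrame({"mag": np.arange(100.0)}, index=range(900, 1000)),
]

first = proportional_sample(partitions, n=20, seed=42)
second = proportional_sample(partitions, n=20, seed=42)

print(len(first), first.equals(second))
```

Reusing the same seed reproduces the same rows, while the larger partition contributes more rows on average than the smaller one.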
If you only want to sample from a single partition, use .sample(partition_id, n) instead. This avoids touching any other partition.
[14]:
single_partition_sample = ps1_cone.sample(partition_id=0, n=5, seed=42)
single_partition_sample
[14]:
| _healpix_29 | objID | raMean | decMean | gMeanPSFMag | rMeanPSFMag | iMeanPSFMag | nDetections |
|---|---|---|---|---|---|---|---|
| 730946389974696972 | 143342002348804486 | 200.234888 | 29.453272 | -999.0 | 22.0823 | -999.0 | 1 |
| 730985888958987501 | 143711995013978543 | 199.501397 | 29.764974 | -999.0 | -999.0 | -999.0 | 1 |
| 730972510685471558 | 143461992692653028 | 199.269275 | 29.552045 | -999.0 | -999.0 | -999.0 | 1 |
| 730934491459203067 | 142891997769558262 | 199.776941 | 29.081396 | -999.0 | -999.0 | -999.0 | 1 |
| 730962221932453679 | 143141994768805025 | 199.476866 | 29.287033 | -999.0 | 21.6182 | -999.0 | 1 |
5 rows × 7 columns
Closing the Dask client#
[15]:
client.close()
About#
Authors: Olivia Lynn
Last updated on: May 18, 2026
If you use lsdb for published research, please cite it by following the citation instructions.