Small-Scale Analysis#

In this tutorial, we will cover strategies for working with a small slice of a large catalog before committing to a full-scale computation:

  • narrow the sky area with a spatial filter (e.g., cone_search)

  • inspect a single partition with .partitions[i]

  • filter rows to a manageable subset (e.g., bright stars)

  • peek at the first few rows of every partition with map_partitions

  • draw a random sample with .random_sample()

Introduction#

Large astronomical catalogs can contain billions of rows spread across thousands of partitions. Running a pipeline on the full dataset is expensive, and it is easy to waste hours on a bug that could have been caught in seconds on a small slice.

A good workflow starts small:

  1. Narrow the sky: work only in the patch of sky you actually care about.

  2. Inspect one partition: confirm the data looks right before processing everything.

  3. Filter aggressively: drop rows you do not need as early as possible.

  4. Peek at multiple partitions: cheaply verify your function behaves correctly across partition boundaries.

  5. Draw a random sample: get a statistically representative preview without a full compute.

Each technique in this tutorial reduces the amount of data you touch, so you can iterate quickly and scale up only once you are confident in the result.

[1]:
import pandas as pd

import lsdb
from dask.distributed import Client

1. Open a catalog#

Additional Help

For more information on creating a Dask client, refer to the official Dask documentation and, for LSDB-specific tips, our Dask cluster configuration page. Dask also publishes its own best practices, which are worth consulting.

For tips on accessing remote data, see our Accessing remote data guide.

[2]:
client = Client(n_workers=4, memory_limit="auto")

We open the Pan-STARRS1 (PS1) object catalog. The catalog is loaded lazily: no row data is read yet.

[3]:
ps1_object = lsdb.open_catalog("s3://stpubdata/panstarrs/ps1/public/hats/otmo")
ps1_object
[3]:
lsdb Catalog otmo:
decMean decMeanErr epochMean gFlags gMeanPSFMag gMeanPSFMagErr iFlags iMeanPSFMag iMeanPSFMagErr nDetections ng ni nr ny nz objID objInfoFlag qualityFlag raMean raMeanErr rFlags rMeanPSFMag rMeanPSFMagErr surveyID yFlags yMeanPSFMag yMeanPSFMagErr zFlags zMeanPSFMag zMeanPSFMagErr
npartitions=9577
Order: 5, Pixel: 0 double[pyarrow] double[pyarrow] double[pyarrow] int32[pyarrow] double[pyarrow] double[pyarrow] int32[pyarrow] double[pyarrow] double[pyarrow] int16[pyarrow] int16[pyarrow] int16[pyarrow] int16[pyarrow] int16[pyarrow] int16[pyarrow] int64[pyarrow] int32[pyarrow] int16[pyarrow] double[pyarrow] double[pyarrow] int32[pyarrow] double[pyarrow] double[pyarrow] int16[pyarrow] int32[pyarrow] double[pyarrow] double[pyarrow] int32[pyarrow] double[pyarrow] double[pyarrow]
Order: 5, Pixel: 1 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Order: 5, Pixel: 12286 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Order: 5, Pixel: 12287 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
30 out of 131 available columns in the catalog have been loaded lazily, meaning no data has been read, only the catalog schema
This catalog has an estimated size of 1.9 TB

Note the number of columns shown: only 30 of the 131 available columns were loaded.

Some catalogs come with a pre-specified set of “default columns” that will be loaded automatically (unless the columns='all' keyword is specified). We can also always manually specify which columns we’d like to load.

Let’s reduce the amount of data we’ll handle as we work with this catalog.

[4]:
ps1_object = lsdb.open_catalog(
    "s3://stpubdata/panstarrs/ps1/public/hats/otmo",
    columns=["objID", "raMean", "decMean", "gMeanPSFMag", "rMeanPSFMag", "iMeanPSFMag", "nDetections"],
)
ps1_object
[4]:
lsdb Catalog otmo:
objID raMean decMean gMeanPSFMag rMeanPSFMag iMeanPSFMag nDetections
npartitions=9577
Order: 5, Pixel: 0 int64[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] int16[pyarrow]
Order: 5, Pixel: 1 ... ... ... ... ... ... ...
... ... ... ... ... ... ... ...
Order: 5, Pixel: 12286 ... ... ... ... ... ... ...
Order: 5, Pixel: 12287 ... ... ... ... ... ... ...
7 out of 131 available columns in the catalog have been loaded lazily, meaning no data has been read, only the catalog schema
This catalog has an estimated size of 612.5 GB
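
As a rough check on how much column selection saves, the two size estimates printed above can be compared directly. This is a back-of-the-envelope sketch that assumes both figures use decimal units (1 TB = 1000 GB); the variable names are illustrative only.

```python
# Estimated sizes reported by the two catalog summaries above.
# Assumption: decimal units, i.e. 1 TB = 1000 GB.
default_columns_gb = 1.9 * 1000  # 30 default columns
selected_columns_gb = 612.5      # 7 hand-picked columns

fraction_kept = selected_columns_gb / default_columns_gb
print(f"Data kept after column selection: {fraction_kept:.0%}")
```

Under that assumption, the seven selected columns account for roughly a third of the default-column estimate, before any spatial or row filtering is applied.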

2. Region selection#

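The simplest way to reduce the amount of data you work with is to restrict the sky area. A cone_search keeps only the partitions that overlap a circle defined by a center (ra, dec) and radius_arcsec, so partitions outside the circle are never read. By default, rows in the remaining partitions that fall outside the circle are filtered out as well.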

Starting with a small cone lets you develop and test your pipeline on a tiny fraction of the catalog. Once the pipeline is correct, you can widen the cone or remove it entirely.

[5]:
ps1_cone = ps1_object.cone_search(ra=200.0, dec=30.0, radius_arcsec=1 * 3600)
ps1_cone
[5]:
lsdb Catalog otmo:
objID raMean decMean gMeanPSFMag rMeanPSFMag iMeanPSFMag nDetections
npartitions=4
Order: 5, Pixel: 2596 int64[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] int16[pyarrow]
Order: 5, Pixel: 2597 ... ... ... ... ... ... ...
Order: 5, Pixel: 2598 ... ... ... ... ... ... ...
Order: 5, Pixel: 2599 ... ... ... ... ... ... ...
7 out of 131 available columns in the catalog have been loaded lazily, meaning no data has been read, only the catalog schema
This catalog has an estimated size of 255.8 MB

The number of partitions (npartitions) has dropped from 9577 to 4, so every subsequent step is much cheaper.

We can also use a pre-built ConeSearch object, which lets us reuse the same region across multiple catalogs.

[6]:
from lsdb import ConeSearch

cone = ConeSearch(ra=200.0, dec=30.0, radius_arcsec=1 * 3600)
ps1_cone = ps1_object.search(cone)
ps1_cone
[6]:
lsdb Catalog otmo:
objID raMean decMean gMeanPSFMag rMeanPSFMag iMeanPSFMag nDetections
npartitions=4
Order: 5, Pixel: 2596 int64[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] int16[pyarrow]
Order: 5, Pixel: 2597 ... ... ... ... ... ... ...
Order: 5, Pixel: 2598 ... ... ... ... ... ... ...
Order: 5, Pixel: 2599 ... ... ... ... ... ... ...
7 out of 131 available columns in the catalog have been loaded lazily, meaning no data has been read, only the catalog schema
This catalog has an estimated size of 255.8 MB
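
The geometry behind a cone search can be sketched in plain Python. The helpers below (angular_separation_arcsec and in_cone) are hypothetical names, not part of LSDB, and the test positions are made up. The sketch only illustrates the question a cone search answers, "is this position within radius_arcsec of the center?", using the haversine formula.

```python
import math


def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation between two sky positions, in arcseconds.

    All inputs are in degrees. Uses the haversine formula, which stays
    numerically stable for the small angles typical of cone searches.
    """
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    half_ddec = (dec2 - dec1) / 2
    half_dra = (ra2 - ra1) / 2
    a = math.sin(half_ddec) ** 2 + math.cos(dec1) * math.cos(dec2) * math.sin(half_dra) ** 2
    return math.degrees(2 * math.asin(math.sqrt(a))) * 3600


def in_cone(ra, dec, center_ra=200.0, center_dec=30.0, radius_arcsec=3600.0):
    """Return True if (ra, dec) lies inside the cone used in this tutorial."""
    return angular_separation_arcsec(center_ra, center_dec, ra, dec) <= radius_arcsec


# Hypothetical test positions, not taken from the catalog.
print(in_cone(200.0, 30.5))  # half a degree north of the center
print(in_cone(200.0, 31.5))  # one and a half degrees north of the center
```

A check like this is handy for sanity-testing a handful of computed rows against the region you intended to select.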

3. Partition selection#

Even within a small region you may want to look at a single partition in isolation. Use .partitions[i] to index into the catalog by partition number. The result is a lazy Dask DataFrame for that one partition.

This is useful when you want to call .compute() on just one chunk to inspect values or test a function without touching the rest of the catalog.

[7]:
# Look at the first partition
first_partition = ps1_cone.partitions[0]
first_partition.compute()
[7]:
objID raMean decMean gMeanPSFMag rMeanPSFMag iMeanPSFMag nDetections
_healpix_29
730840911935789259 142942005432773466 200.543267 29.119078 -999.0 -999.0 22.1049 1
730840958034592453 142942005448135918 200.544851 29.121114 -999.0 -999.0 -999.0 1
... ... ... ... ... ... ... ...
730990514047383426 143991996861207234 199.686118 29.99723 -999.0 -999.0 -999.0 1
730990514483634856 144001996877830045 199.687771 29.999566 -999.0 -999.0 22.3608 1

181808 rows × 7 columns

You can find the HEALPix pixel that corresponds to a given partition index using get_healpix_pixels().

[8]:
pixels = ps1_cone.get_healpix_pixels()

print(f"All pixels covered: {pixels}")
print(f"Partition 0 covers: {pixels[0]}")
All pixels covered: [Order: 5, Pixel: 2596, Order: 5, Pixel: 2597, Order: 5, Pixel: 2598, Order: 5, Pixel: 2599]
Partition 0 covers: Order: 5, Pixel: 2596
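
The link between a row and its partition can also be checked by hand. The sketch below assumes that the _healpix_29 index holds an order-29 HEALPix index in the nested scheme (as the name suggests); the helper names npix and pixel_at_order are hypothetical. In the nested scheme every pixel splits into four children per order, so dropping two bits per order step gives the parent pixel.

```python
def npix(order):
    """Number of HEALPix pixels covering the full sky at a given order."""
    return 12 * 4**order


def pixel_at_order(healpix_29, order):
    """Coarsen an order-29 nested HEALPix index to a lower order.

    Each order step adds 2 bits to the nested index, so shifting right by
    2 * (29 - order) bits yields the enclosing pixel at that order.
    """
    return healpix_29 >> (2 * (29 - order))


# _healpix_29 values copied from this tutorial's outputs
# (the first row of each of the four partitions).
first_rows = [
    730840911935789259,
    731028614286063577,
    731343459159795912,
    731553465279730385,
]

print(npix(5))  # 12288, consistent with the largest pixel number 12287 above
print([pixel_at_order(h, 5) for h in first_rows])  # [2596, 2597, 2598, 2599]
```

The recovered pixel numbers match the four partitions listed by get_healpix_pixels(), which is a quick way to confirm which partition a given row belongs to.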

4. Sub-filtering#

Row filters let you trim the data further before any expensive computation. For example, selecting only bright stars reduces the number of rows dramatically and gives you a representative but manageable subset to work with.

[9]:
ps1_cone_and_bright = ps1_cone.query("0 < gMeanPSFMag < 16")
ps1_cone_and_bright
[9]:
lsdb Catalog otmo:
objID raMean decMean gMeanPSFMag rMeanPSFMag iMeanPSFMag nDetections
npartitions=4
Order: 5, Pixel: 2596 int64[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] int16[pyarrow]
Order: 5, Pixel: 2597 ... ... ... ... ... ... ...
Order: 5, Pixel: 2598 ... ... ... ... ... ... ...
Order: 5, Pixel: 2599 ... ... ... ... ... ... ...
7 out of 131 available columns in the catalog have been loaded lazily, meaning no data has been read, only the catalog schema
This catalog has an estimated size of 255.8 MB
[10]:
ps1_cone_and_bright.head(5)
[10]:
objID raMean decMean gMeanPSFMag rMeanPSFMag iMeanPSFMag nDetections
_healpix_29
730844140320140186 142832001473917106 200.147265 29.030416 14.7094 14.0439 13.8106 55
730844900543284476 142842001903509791 200.190449 29.04093 15.1536 14.6864 14.5175 55
730845199124677148 142912002434617897 200.243435 29.097786 15.3492 14.9563 14.8328 69
730845991608593391 142852003372855574 200.337294 29.045833 15.9104 15.4282 15.2398 68
730846292154371814 142912004225284029 200.422788 29.09478 12.0536 11.207 10.9045 49

5 rows × 7 columns

Filters compose: you can chain a spatial filter with a row filter and LSDB will push both into the same pipeline.
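
The lower bound in the query above matters. The -999.0 values in the outputs look like placeholders for missing magnitudes (an assumption based on the outputs shown here), and an upper bound alone would keep them. The following sketch reproduces the effect on a tiny pandas DataFrame; the objID values are hypothetical, and the magnitudes are copied from outputs in this tutorial.

```python
import pandas as pd

# Magnitudes copied from this tutorial's outputs, plus the -999.0 placeholder.
df = pd.DataFrame(
    {
        "objID": [1, 2, 3, 4],  # hypothetical IDs for illustration
        "gMeanPSFMag": [-999.0, 14.7094, 22.1401, 12.0536],
    }
)

# An upper bound alone keeps the placeholder row, since -999.0 < 16.
upper_only = df.query("gMeanPSFMag < 16")

# Adding the lower bound removes it, leaving only genuinely bright objects.
bright = df.query("0 < gMeanPSFMag < 16")

print(len(upper_only), len(bright))
```

The same query string can be passed unchanged to the catalog's query method, as in the cell above.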

5. Peeking at every partition#

map_partitions applies a function to each partition individually. Passing pd.DataFrame.head (or a small wrapper around it) is a cheap way to fetch the first few rows of every partition without loading the full catalog into memory.

This is especially useful for checking that a transformation produces the expected columns and values across all partition boundaries.

[11]:
# Grab the first 3 rows from each partition, then compute
sample_per_partition = ps1_cone.map_partitions(lambda df: df.head(3))
sample_per_partition.compute()
[11]:
objID raMean decMean gMeanPSFMag rMeanPSFMag iMeanPSFMag nDetections
_healpix_29
730840911935789259 142942005432773466 200.543267 29.119078 -999.0 -999.0 22.1049 1
730840958034592453 142942005448135918 200.544851 29.121114 -999.0 -999.0 -999.0 1
730840958090282340 142942005456926339 200.545683 29.121461 -999.0 -999.0 -999.0 1
731028614286063577 142972005930353041 200.593027 29.143732 -999.0 21.5769 -999.0 1
731028614545892412 142972005940344460 200.594009 29.144912 22.1401 -999.0 -999.0 1
731028614842509175 142972005928714895 200.59283 29.145228 -999.0 21.4786 -999.0 2
731343459159795912 143281990771334860 199.077096 29.40356 -999.0 -999.0 -999.0 1
731343462332759277 143281990795017790 199.079454 29.406004 18.150801 -999.0 -999.0 1
731343462646587519 143281990763998560 199.076392 29.406652 -999.0 21.8297 -999.0 1
731553465279730385 144001996876174113 199.687558 30.002921 22.3853 21.153799 20.791201 50
731553465557381929 144001996902904284 199.690246 30.003063 21.1381 -999.0 -999.0 1
731553465612402991 144001996909224697 199.690881 30.00342 20.8773 -999.0 -999.0 1

12 rows × 7 columns

You can pass pd.DataFrame.head directly as the function, along with n as an extra keyword argument.

[12]:
sample_per_partition = ps1_cone.map_partitions(pd.DataFrame.head, n=3)
sample_per_partition.compute()
[12]:
objID raMean decMean gMeanPSFMag rMeanPSFMag iMeanPSFMag nDetections
_healpix_29
730840911935789259 142942005432773466 200.543267 29.119078 -999.0 -999.0 22.1049 1
730840958034592453 142942005448135918 200.544851 29.121114 -999.0 -999.0 -999.0 1
730840958090282340 142942005456926339 200.545683 29.121461 -999.0 -999.0 -999.0 1
731028614286063577 142972005930353041 200.593027 29.143732 -999.0 21.5769 -999.0 1
731028614545892412 142972005940344460 200.594009 29.144912 22.1401 -999.0 -999.0 1
731028614842509175 142972005928714895 200.59283 29.145228 -999.0 21.4786 -999.0 2
731343459159795912 143281990771334860 199.077096 29.40356 -999.0 -999.0 -999.0 1
731343462332759277 143281990795017790 199.079454 29.406004 18.150801 -999.0 -999.0 1
731343462646587519 143281990763998560 199.076392 29.406652 -999.0 21.8297 -999.0 1
731553465279730385 144001996876174113 199.687558 30.002921 22.3853 21.153799 20.791201 50
731553465557381929 144001996902904284 199.690246 30.003063 21.1381 -999.0 -999.0 1
731553465612402991 144001996909224697 199.690881 30.00342 20.8773 -999.0 -999.0 1

12 rows × 7 columns
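
The same pattern works for testing your own transformation on a few rows per partition. The sketch below imitates it with plain pandas so that it runs without a catalog: add_color is a hypothetical function, and the two small DataFrames are stand-ins for partitions, filled with magnitudes copied from this tutorial's outputs.

```python
import pandas as pd


def add_color(df):
    """Hypothetical transformation: append a g - r color column."""
    return df.assign(g_minus_r=df["gMeanPSFMag"] - df["rMeanPSFMag"])


# Two tiny stand-in partitions built from magnitudes shown in this tutorial.
partitions = [
    pd.DataFrame({"gMeanPSFMag": [14.7094, 15.1536], "rMeanPSFMag": [14.0439, 14.6864]}),
    pd.DataFrame({"gMeanPSFMag": [22.3853], "rMeanPSFMag": [21.153799]}),
]

# Run the function on the head of each partition, then stitch the
# small per-partition results together for inspection.
preview = pd.concat([add_color(part.head(3)) for part in partitions])

print(list(preview.columns))
print(len(preview))
```

With a real catalog, the equivalent would be to pass a small wrapper such as lambda df: add_color(df.head(3)) to map_partitions, so that only a few rows per partition are transformed and returned.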

6. Random sample#

.random_sample(n) draws approximately n rows distributed proportionally across all partitions. Unlike .head(), which always returns rows from the first partitions, a random sample is representative of the whole catalog.

Use .random_sample() when you need a statistical cross-section of the data, for example to estimate a distribution or spot-check the output of a filter.

Pass a seed for reproducible results.

[13]:
sample = ps1_cone.random_sample(n=20, seed=42)
sample
[13]:
objID raMean decMean gMeanPSFMag rMeanPSFMag iMeanPSFMag nDetections
_healpix_29
730987592464451909 143791998474879882 199.847488 29.832714 -999.0 -999.0 18.0982 1
730979768191957027 143561997693775801 199.768993 29.637711 -999.0 21.940599 21.7822 3
730941714285338748 143082000227339558 200.022729 29.240811 -999.0 -999.0 -999.0 1
731247694065860037 144882006466747901 200.646649 30.739434 -999.0 -999.0 -999.0 1
731243980697046059 144802007890585255 200.789179 30.670585 -999.0 21.8682 -999.0 2
731241340420419965 144692009407408602 200.940737 30.581694 -999.0 -999.0 20.7384 1
731146208224591432 143792004435211822 200.44355 29.826019 -999.0 -999.0 22.076099 1
731192000952225339 144472006113641954 200.611326 30.392848 -999.0 -999.0 -999.0 1
731156616980993426 144042009426141715 200.942654 30.034298 -999.0 -999.0 -999.0 1
731211683143986849 144362009918346224 200.991829 30.304672 -999.0 -999.0 -999.0 1
731144478833693507 143622004004660485 200.40048 29.683231 -999.0 -999.0 -999.0 1
731197395115113484 144452003568216440 200.356829 30.379902 -999.0 -999.0 21.8801 1
731402936031320483 144461991756951160 199.175695 30.383824 -999.0 -999.0 -999.0 1
731377376507491124 144361993567192715 199.356719 30.30174 -999.0 -999.0 -999.0 1
731372143592409611 144051991881616369 199.188154 30.046474 -999.0 22.187201 -999.0 1
731633434171266716 145172001622245865 200.162236 30.979387 -999.0 -999.0 20.2589 1
731555288167844199 144201996972151629 199.697123 30.167505 -999.0 -999.0 -999.0 1
731563861512460543 144431996518677399 199.651808 30.364011 21.9706 -999.0 -999.0 1
731564147993998431 144441995693318690 199.569358 30.373251 20.190201 20.1436 19.7288 16
731589703962829699 144591993472016168 199.347229 30.496328 -999.0 -999.0 -999.0 1

20 rows × 7 columns

If you only want to sample from a single partition, use .sample(partition_id, n) instead. This avoids touching any other partition.

[14]:
single_partition_sample = ps1_cone.sample(partition_id=0, n=5, seed=42)
single_partition_sample
[14]:
objID raMean decMean gMeanPSFMag rMeanPSFMag iMeanPSFMag nDetections
_healpix_29
730946389974696972 143342002348804486 200.234888 29.453272 -999.0 22.0823 -999.0 1
730985888958987501 143711995013978543 199.501397 29.764974 -999.0 -999.0 -999.0 1
730972510685471558 143461992692653028 199.269275 29.552045 -999.0 -999.0 -999.0 1
730934491459203067 142891997769558262 199.776941 29.081396 -999.0 -999.0 -999.0 1
730962221932453679 143141994768805025 199.476866 29.287033 -999.0 21.6182 -999.0 1

5 rows × 7 columns
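
To build intuition for what "distributed proportionally across all partitions" means, here is one way such an allocation could be drawn. This is a sketch only and not LSDB's implementation: allocate_sample is a hypothetical helper, and all partition sizes except the first (the 181808 rows of partition 0 shown earlier) are made up.

```python
import random
from collections import Counter


def allocate_sample(partition_sizes, n, seed=None):
    """Split a sample of n rows across partitions, weighted by partition size.

    Returns a list with the number of rows to draw from each partition.
    """
    rng = random.Random(seed)
    # Each of the n draws picks a partition with probability proportional to its size.
    picks = rng.choices(range(len(partition_sizes)), weights=partition_sizes, k=n)
    counts = Counter(picks)
    return [counts.get(i, 0) for i in range(len(partition_sizes))]


# 181808 is the row count of partition 0 shown earlier; the rest are made up.
sizes = [181808, 150000, 120000, 90000]
allocation = allocate_sample(sizes, n=20, seed=42)
print(allocation, sum(allocation))
```

Larger partitions tend to contribute more rows, and fixing the seed makes the allocation repeatable, which mirrors the role of the seed argument above.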

Closing the Dask client#

[15]:
client.close()

About#

Authors: Olivia Lynn

Last updated on: May 18, 2026

If you use lsdb for published research, please cite it following the project's citation instructions.