Getting data into LSDB#
The most practical way to load data into LSDB is from catalogs in HATS format, hosted locally or on a remote source. We recommend visiting our cloud repository, data.lsdb.io, where you can find large surveys that are publicly available for use. If you’re looking for how to get external data into LSDB, see the Import Catalogs topic instead.
[1]:
import lsdb
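Note that read_hats accepts a local filesystem path just as well as a remote URL. A minimal sketch, assuming you had previously downloaded a HATS catalog to disk (the path below is hypothetical):

# Hypothetical path to a HATS catalog stored on local disk
local_catalog = lsdb.read_hats("/data/hats/my_catalog/")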
Example: Loading Gaia DR3#
Let’s get Gaia DR3 into our workflow as an example. It is as simple as invoking read_hats with the respective catalog URL, which you can copy directly from our website.
[2]:
gaia_dr3 = lsdb.read_hats("https://data.lsdb.io/hats/gaia_dr3/gaia/")
gaia_dr3
[2]:
 | solution_id | designation | source_id | ref_epoch | ra | ra_error | dec | dec_error | parallax | parallax_error | pm | pmra | pmra_error | pmdec | pmdec_error | phot_g_n_obs | phot_g_mean_flux | phot_g_mean_flux_error | phot_g_mean_mag | phot_bp_n_obs | phot_bp_mean_flux | phot_bp_mean_flux_error | phot_bp_mean_mag | phot_rp_n_obs | phot_rp_mean_flux | phot_rp_mean_flux_error | phot_rp_mean_mag
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
npartitions=3933 | |||||||||||||||||||||||||||
Order: 2, Pixel: 0 | int64[pyarrow] | string[pyarrow] | int64[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | int64[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | int64[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | int64[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] |
Order: 3, Pixel: 4 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Order: 4, Pixel: 3067 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Order: 3, Pixel: 767 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
The Gaia catalog is very wide, so without a column selection you would be requesting its whole set of more than 150 columns.
[3]:
gaia_dr3.columns
[3]:
Index(['solution_id', 'designation', 'source_id', 'ref_epoch', 'ra',
'ra_error', 'dec', 'dec_error', 'parallax', 'parallax_error', 'pm',
'pmra', 'pmra_error', 'pmdec', 'pmdec_error', 'phot_g_n_obs',
'phot_g_mean_flux', 'phot_g_mean_flux_error', 'phot_g_mean_mag',
'phot_bp_n_obs', 'phot_bp_mean_flux', 'phot_bp_mean_flux_error',
'phot_bp_mean_mag', 'phot_rp_n_obs', 'phot_rp_mean_flux',
'phot_rp_mean_flux_error', 'phot_rp_mean_mag'],
dtype='object')
Note that it’s important (and highly recommended) to:

- Pre-select a small subset of columns that satisfies your scientific needs. Loading an unnecessarily large amount of data leads to computationally expensive and inefficient workflows. To see which columns are available before even invoking read_hats, refer to the column descriptions in each catalog’s section on data.lsdb.io.
- Load catalogs with their respective margin caches, when available. These margins are necessary to obtain accurate results in several operations, such as joining and crossmatching. For more information about margins, please visit our Margins topic notebook.
Let’s define the set of columns we need and add the margin catalog’s path to our read_hats call.
[4]:
gaia_dr3 = lsdb.read_hats(
    "https://data.lsdb.io/hats/gaia_dr3/gaia/",
    # Margin cache (10 arcsec) for accurate boundary operations
    margin_cache="https://data.lsdb.io/hats/gaia_dr3/gaia_10arcs/",
    # Load only the columns needed for this workflow
    columns=[
        "source_id",
        "ra",
        "dec",
        "phot_g_mean_mag",
        "phot_proc_mode",
        "azero_gspphot",
        "classprob_dsc_combmod_star",
    ],
)
gaia_dr3
[4]:
 | source_id | ra | dec | phot_g_mean_mag | phot_proc_mode | azero_gspphot | classprob_dsc_combmod_star
---|---|---|---|---|---|---|---
npartitions=3933 | |||||||
Order: 2, Pixel: 0 | int64[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] |
Order: 3, Pixel: 4 | ... | ... | ... | ... | ... | ... | ... |
... | ... | ... | ... | ... | ... | ... | ... |
Order: 4, Pixel: 3067 | ... | ... | ... | ... | ... | ... | ... |
Order: 3, Pixel: 767 | ... | ... | ... | ... | ... | ... | ... |
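With the margin cache attached, gaia_dr3 can also serve as the right-hand side of a crossmatch, where the margin prevents matches from being missed near partition boundaries. A minimal sketch, assuming a hypothetical left-hand catalog (the path is a placeholder):

# Hypothetical: crossmatch a user catalog against Gaia.
# The margin on gaia_dr3 (the right-hand catalog) ensures that
# neighbors just outside a partition boundary are still found.
my_catalog = lsdb.read_hats("/path/to/my_catalog/")
matched = my_catalog.crossmatch(gaia_dr3, radius_arcsec=1.0)

Note that this only builds a plan; no matching is performed yet, which brings us to the next section.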
Data loading is lazy#
When invoking read_hats, only metadata about the catalog (e.g. its sky coverage, total number of rows, and column schema) is loaded into memory! Notice that the ellipses in the previous catalog representation are just placeholders.

You will find that most use cases start with LAZY loading and planning operations, followed by more expensive COMPUTE operations. The data is only loaded into memory when we trigger the workflow computations, usually with a compute call.
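For instance, a cone search over the lazy catalog builds a plan instantly, and only the partitions overlapping the cone are read when the plan is executed. A minimal sketch (the cone parameters are arbitrary):

# Plan a small cone search; no catalog data is read yet.
cone = gaia_dr3.cone_search(ra=10.0, dec=20.0, radius_arcsec=100)

# Trigger the computation: the overlapping partitions are
# loaded and a materialized data frame is returned.
result = cone.compute()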
Visualizing catalog metadata#
Even without loading any data, you can still get a glimpse of the catalog’s structure.
HEALPix map#
You can use plot_pixels to observe the catalog’s sky coverage map and obtain information about its HEALPix distribution. Areas with a higher density of sources are partitioned into higher-order (finer) pixels.
[5]:
gaia_dr3.plot_pixels(plot_title="Gaia DR3 Pixel Map")
[5]:
(<Figure size 1000x500 with 2 Axes>,
<WCSAxes: title={'center': 'Gaia DR3 Pixel Map'}>)

Column schema#
It is also straightforward to have a look at the column names and their respective types.
[6]:
gaia_dr3.dtypes
[6]:
source_id int64[pyarrow]
ra double[pyarrow]
dec double[pyarrow]
phot_g_mean_mag double[pyarrow]
phot_proc_mode double[pyarrow]
azero_gspphot double[pyarrow]
classprob_dsc_combmod_star double[pyarrow]
dtype: object
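If you want to peek at a few actual rows rather than just the schema, head triggers a small computation that materializes the first handful of rows it finds:

# Load only the first few rows of the catalog
gaia_dr3.head(5)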