Getting data into LSDB#

The most practical way to load data into LSDB is from catalogs in HATS format, hosted locally or on a remote source. We recommend visiting our cloud repository, data.lsdb.io, where you can find large surveys that are publicly available for use. If you’re looking for how to get external data into LSDB, see the Import Catalogs topic instead.

[1]:
import lsdb
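
read_hats accepts either a local filesystem path or a remote URL, so the same call works for a catalog you have downloaded and for one hosted at data.lsdb.io. A minimal sketch follows; the local path is hypothetical and only illustrates the pattern.

# Remote HATS catalog hosted at data.lsdb.io
catalog = lsdb.read_hats("https://data.lsdb.io/hats/gaia_dr3/gaia/")

# Local HATS catalog on disk (hypothetical path)
catalog = lsdb.read_hats("/path/to/local/hats/gaia_dr3/")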

Example: Loading Gaia DR3#

As an example, let’s load Gaia DR3 into our workflow. It is as simple as invoking read_hats with the catalog’s URL, which you can copy directly from our website.

[2]:
gaia_dr3 = lsdb.read_hats("https://data.lsdb.io/hats/gaia_dr3/gaia/")
gaia_dr3
[2]:
lsdb Catalog gaia:
solution_id designation source_id ref_epoch ra ra_error dec dec_error parallax parallax_error pm pmra pmra_error pmdec pmdec_error phot_g_n_obs phot_g_mean_flux phot_g_mean_flux_error phot_g_mean_mag phot_bp_n_obs phot_bp_mean_flux phot_bp_mean_flux_error phot_bp_mean_mag phot_rp_n_obs phot_rp_mean_flux phot_rp_mean_flux_error phot_rp_mean_mag
npartitions=3933
Order: 2, Pixel: 0 int64[pyarrow] string[pyarrow] int64[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] int64[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] int64[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] int64[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow]
Order: 3, Pixel: 4 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Order: 4, Pixel: 3067 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Order: 3, Pixel: 767 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
The catalog has been loaded lazily, meaning no data has been read, only the catalog schema

The Gaia catalog is very wide, so without a column selection you would be requesting its whole set of more than 150 columns.

[3]:
gaia_dr3.columns
[3]:
Index(['solution_id', 'designation', 'source_id', 'ref_epoch', 'ra',
       'ra_error', 'dec', 'dec_error', 'parallax', 'parallax_error', 'pm',
       'pmra', 'pmra_error', 'pmdec', 'pmdec_error', 'phot_g_n_obs',
       'phot_g_mean_flux', 'phot_g_mean_flux_error', 'phot_g_mean_mag',
       'phot_bp_n_obs', 'phot_bp_mean_flux', 'phot_bp_mean_flux_error',
       'phot_bp_mean_mag', 'phot_rp_n_obs', 'phot_rp_mean_flux',
       'phot_rp_mean_flux_error', 'phot_rp_mean_mag'],
      dtype='object')

Note that it’s important (and highly recommended) to:

  • Pre-select a small subset of columns that satisfies your scientific needs. Loading an unnecessarily large amount of data leads to computationally expensive and inefficient workflows. To see which columns are available before even having to invoke read_hats, please refer to the column descriptions in each catalog’s section on data.lsdb.io.

  • Load catalogs with their respective margin caches, when available. These margins are necessary to obtain accurate results in several operations such as joining and crossmatching. For more information about margins please visit our Margins topic notebook.

Let’s define the set of columns we need and add the margin catalog’s path to our read_hats call.

[4]:
gaia_dr3 = lsdb.read_hats(
    "https://data.lsdb.io/hats/gaia_dr3/gaia/",
    margin_cache="https://data.lsdb.io/hats/gaia_dr3/gaia_10arcs/",
    columns=[
        "source_id",
        "ra",
        "dec",
        "phot_g_mean_mag",
        "phot_proc_mode",
        "azero_gspphot",
        "classprob_dsc_combmod_star",
    ],
)
gaia_dr3
[4]:
lsdb Catalog gaia:
source_id ra dec phot_g_mean_mag phot_proc_mode azero_gspphot classprob_dsc_combmod_star
npartitions=3933
Order: 2, Pixel: 0 int64[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow]
Order: 3, Pixel: 4 ... ... ... ... ... ... ...
... ... ... ... ... ... ... ...
Order: 4, Pixel: 3067 ... ... ... ... ... ... ...
Order: 3, Pixel: 767 ... ... ... ... ... ... ...
The catalog has been loaded lazily, meaning no data has been read, only the catalog schema

Data loading is lazy#

When invoking read_hats, only metadata about the catalog (e.g. its sky coverage, total number of rows, and column schema) is loaded into memory. Notice that the ellipses in the catalog representation above are just placeholders.

You will find that most use cases start with LAZY loading and planning operations, followed by more expensive COMPUTE operations. The data is only loaded into memory when we trigger the workflow computations, usually with a compute call.

Lazy workflow diagram
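
To make the distinction concrete, here is a minimal sketch: a small cone search is planned lazily and only materialized when compute is called. The coordinates and radius are illustrative.

# Lazy: this only builds the task graph; no pixel files are read yet
cone = gaia_dr3.cone_search(ra=0.0, dec=0.0, radius_arcsec=60.0)

# Compute: triggers the actual read and returns an in-memory data frame
result = cone.compute()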

Visualizing catalog metadata#

Even without loading any data, you can still get a glimpse of the catalog’s structure.

HEALPix map#

You can use plot_pixels to observe the catalog’s sky coverage and obtain information about its HEALPix distribution. Areas with a higher density of points are represented by higher-order pixels.

[5]:
gaia_dr3.plot_pixels(plot_title="Gaia DR3 Pixel Map")
[5]:
(<Figure size 1000x500 with 2 Axes>,
 <WCSAxes: title={'center': 'Gaia DR3 Pixel Map'}>)

Column schema#

It is also straightforward to have a look at column names and their respective types.

[6]:
gaia_dr3.dtypes
[6]:
source_id                      int64[pyarrow]
ra                            double[pyarrow]
dec                           double[pyarrow]
phot_g_mean_mag               double[pyarrow]
phot_proc_mode                double[pyarrow]
azero_gspphot                 double[pyarrow]
classprob_dsc_combmod_star    double[pyarrow]
dtype: object