Joining catalogs#

In this tutorial, we will demonstrate how to perform a JOIN on two catalogs.

Note that this is different from a crossmatch, because the two catalogs share unique identifiers and we can match those values directly. This will still use the spatial properties to perform the JOIN on a per-partition basis.

Introduction#

Gaia is a space telescope that provides excellent astrometric precision, and so is used for determining parallax distances to nearby stars. The parallax distances are not available in all data products, however.

In this notebook, we join Gaia with Gaia Early Data Release 3 (EDR3) and compute the ratio between the distances given by their parallax and r_med_geo columns, respectively.

[1]:
import lsdb
from lsdb import ConeSearch

1. Load the catalogs#

First we load Gaia with its objects source_id, their positions and parallax columns.

[2]:
gaia = lsdb.read_hats(
    "https://data.lsdb.io/hats/gaia_dr3/gaia",
    margin_cache="https://data.lsdb.io/hats/gaia_dr3/gaia_10arcs",
    columns=["source_id", "ra", "dec", "parallax"],
    search_filter=ConeSearch(ra=0, dec=0, radius_arcsec=10 * 3600),
)
gaia
[2]:
lsdb Catalog gaia:
source_id ra dec parallax
npartitions=4
Order: 2, Pixel: 67 int64[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow]
Order: 2, Pixel: 70 ... ... ... ...
Order: 2, Pixel: 73 ... ... ... ...
Order: 2, Pixel: 76 ... ... ... ...
The catalog has been loaded lazily, meaning no data has been read, only the catalog schema

We will do the same with Gaia EDR3 but the distance column we will use is called r_med_geo, the median of the geometric distance estimate.

[3]:
gaia_edr3 = lsdb.read_hats(
    "https://data.lsdb.io/hats/gaia_dr3/gaia_edr3_distances",
    margin_cache="https://data.lsdb.io/hats/gaia_dr3/gaia_edr3_distances_10arcs",
    columns=["source_id", "ra", "dec", "r_med_geo"],
    search_filter=ConeSearch(ra=0, dec=0, radius_arcsec=10 * 3600),
)
gaia_edr3
[3]:
lsdb Catalog gaia_edr3_distances:
source_id ra dec r_med_geo
npartitions=4
Order: 2, Pixel: 67 int64[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow]
Order: 2, Pixel: 70 ... ... ... ...
Order: 2, Pixel: 73 ... ... ... ...
Order: 2, Pixel: 76 ... ... ... ...
The catalog has been loaded lazily, meaning no data has been read, only the catalog schema

2. Join Operation#

We are now able to join both catalogs on the source_id column, as follows:

[4]:
joined = gaia.join(gaia_edr3, left_on="source_id", right_on="source_id")
joined
[4]:
lsdb Catalog gaia:
source_id_gaia ra_gaia dec_gaia parallax_gaia source_id_gaia_edr3_distances ra_gaia_edr3_distances dec_gaia_edr3_distances r_med_geo_gaia_edr3_distances
npartitions=4
Order: 2, Pixel: 67 int64[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] int64[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow]
Order: 2, Pixel: 70 ... ... ... ... ... ... ... ...
Order: 2, Pixel: 73 ... ... ... ... ... ... ... ...
Order: 2, Pixel: 76 ... ... ... ... ... ... ... ...
The catalog has been loaded lazily, meaning no data has been read, only the catalog schema

3. Joint analysis#

Let’s calculate a histogram with the ratio in catalog distances.

[5]:
results = (1e3 / joined["parallax_gaia"]) / joined["r_med_geo_gaia_edr3_distances"]
ratios = results.compute().to_numpy()
ratios
[5]:
array([  1.04369179,   0.99821673,   0.96884423, ...,   1.03218566,
       147.42381857,   1.41197562])
[6]:
import numpy as np
import matplotlib.pyplot as plt

plt.hist(ratios, bins=np.linspace(0.8, 1.2, 100))
plt.title("Histogram of Gaia distance / Gaia EDR3 distance")
plt.show()
../../_images/tutorials_pre_executed_join_catalogs_10_0.png

About#

Authors: Sandro Campos

Last updated on: April 17, 2025

If you use lsdb for published research, please cite following instructions.