Joining catalogs
In this tutorial, we will demonstrate how to perform a JOIN on two catalogs.
A join differs from a crossmatch: the two catalogs share unique identifiers, so rows are matched directly on those values rather than by sky position. The join still uses the catalogs' spatial partitioning, so it is performed on a per-partition basis.
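To make the per-partition idea concrete, here is a minimal conceptual sketch in plain Python. It is not how lsdb implements its join; the rows, pixel numbers, values, and the helper names `group_by_pixel` and `join_per_partition` are all made up for illustration. It assumes each row already carries the pixel (partition) it belongs to, and that a given identifier lands in the same pixel in both catalogs.

```python
from collections import defaultdict

# Toy rows: (pixel, source_id, value). All numbers here are invented.
left_rows = [(67, 101, 2.5), (67, 102, 0.8), (70, 201, 1.1)]
right_rows = [(67, 101, 395.0), (70, 201, 880.0), (70, 202, 1500.0)]


def group_by_pixel(rows):
    """Group (pixel, source_id, value) rows into {pixel: {source_id: value}}."""
    partitions = defaultdict(dict)
    for pixel, source_id, value in rows:
        partitions[pixel][source_id] = value
    return partitions


def join_per_partition(left, right):
    """Inner-join two row lists on source_id, one pixel at a time."""
    left_parts = group_by_pixel(left)
    right_parts = group_by_pixel(right)
    joined = []
    # Only pixels present in both catalogs can contain matching identifiers.
    for pixel in sorted(left_parts.keys() & right_parts.keys()):
        right_index = right_parts[pixel]
        for source_id, left_value in left_parts[pixel].items():
            if source_id in right_index:
                joined.append((pixel, source_id, left_value, right_index[source_id]))
    return joined


joined_rows = join_per_partition(left_rows, right_rows)
print(joined_rows)
```

Because identifiers are compared for exact equality, no matching radius or distance computation is involved; the partitioning only limits which rows need to be compared with each other.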
Introduction
Gaia is a space telescope that provides excellent astrometric precision, and so is used for determining parallax distances to nearby stars. The parallax distances are not available in all data products, however.
In this notebook, we join Gaia DR3 with the Gaia Early Data Release 3 (EDR3) distance catalog and compute the ratio between the distances derived from their parallax and r_med_geo columns, respectively.
[1]:
import lsdb
from lsdb import ConeSearch
1. Load the catalogs
First, we load Gaia with its source_id, position (ra, dec) and parallax columns, restricted to a cone of radius 10 degrees (10 * 3600 arcseconds) centered on ra=0, dec=0.
[2]:
gaia = lsdb.read_hats(
"https://data.lsdb.io/hats/gaia_dr3/gaia",
margin_cache="https://data.lsdb.io/hats/gaia_dr3/gaia_10arcs",
columns=["source_id", "ra", "dec", "parallax"],
search_filter=ConeSearch(ra=0, dec=0, radius_arcsec=10 * 3600),
)
gaia
[2]:
| | source_id | ra | dec | parallax |
|---|---|---|---|---|
| npartitions=4 | | | | |
| Order: 2, Pixel: 67 | int64[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] |
| Order: 2, Pixel: 70 | ... | ... | ... | ... |
| Order: 2, Pixel: 73 | ... | ... | ... | ... |
| Order: 2, Pixel: 76 | ... | ... | ... | ... |
We do the same with Gaia EDR3, but the distance column we use is r_med_geo, the median of the geometric distance estimate.
[3]:
gaia_edr3 = lsdb.read_hats(
"https://data.lsdb.io/hats/gaia_dr3/gaia_edr3_distances",
margin_cache="https://data.lsdb.io/hats/gaia_dr3/gaia_edr3_distances_10arcs",
columns=["source_id", "ra", "dec", "r_med_geo"],
search_filter=ConeSearch(ra=0, dec=0, radius_arcsec=10 * 3600),
)
gaia_edr3
[3]:
| | source_id | ra | dec | r_med_geo |
|---|---|---|---|---|
| npartitions=4 | | | | |
| Order: 2, Pixel: 67 | int64[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] |
| Order: 2, Pixel: 70 | ... | ... | ... | ... |
| Order: 2, Pixel: 73 | ... | ... | ... | ... |
| Order: 2, Pixel: 76 | ... | ... | ... | ... |
2. Join operation
We can now join the two catalogs on the source_id column. In the result, each column name is suffixed with the name of the catalog it came from.
[4]:
joined = gaia.join(gaia_edr3, left_on="source_id", right_on="source_id")
joined
[4]:
| | source_id_gaia | ra_gaia | dec_gaia | parallax_gaia | source_id_gaia_edr3_distances | ra_gaia_edr3_distances | dec_gaia_edr3_distances | r_med_geo_gaia_edr3_distances |
|---|---|---|---|---|---|---|---|---|
| npartitions=4 | | | | | | | | |
| Order: 2, Pixel: 67 | int64[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | int64[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] |
| Order: 2, Pixel: 70 | ... | ... | ... | ... | ... | ... | ... | ... |
| Order: 2, Pixel: 73 | ... | ... | ... | ... | ... | ... | ... | ... |
| Order: 2, Pixel: 76 | ... | ... | ... | ... | ... | ... | ... | ... |
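3. Joint analysis
Let’s compute the ratio between the two distance estimates and plot its histogram. The Gaia parallax is given in milliarcseconds, so 1e3 / parallax is the distance in parsecs, the same unit as r_med_geo.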
[5]:
results = (1e3 / joined["parallax_gaia"]) / joined["r_med_geo_gaia_edr3_distances"]
ratios = results.compute().to_numpy()
ratios
[5]:
array([ 1.04369179, 0.99821673, 0.96884423, ..., 1.03218566,
147.42381857, 1.41197562])
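Most ratios are close to 1, but large values such as 147.4 also appear. A likely contributor is that inverting a small or noisy parallax gives an unreliable distance, and inverting a zero or negative parallax gives no meaningful distance at all. The sketch below shows one possible way to guard the conversion. The helper names `parallax_to_distance_pc` and `distance_ratio` are hypothetical and not part of lsdb, and the input values are made up.

```python
import math


def parallax_to_distance_pc(parallax_mas):
    """Invert a parallax in milliarcseconds into a distance in parsecs.

    Returns NaN when the parallax is missing, NaN, zero, or negative,
    since a naive inversion is not meaningful in those cases.
    """
    if parallax_mas is None or math.isnan(parallax_mas) or parallax_mas <= 0:
        return math.nan
    return 1e3 / parallax_mas


def distance_ratio(parallax_mas, r_med_geo_pc):
    """Ratio of the inverse-parallax distance to a catalog distance in parsecs."""
    return parallax_to_distance_pc(parallax_mas) / r_med_geo_pc


# Made-up example values: a 2 mas parallax corresponds to 500 pc.
print(parallax_to_distance_pc(2.0))
print(distance_ratio(2.0, 500.0))
print(distance_ratio(-0.3, 500.0))  # NaN: negative parallax is rejected
```

Returning NaN rather than dropping rows keeps the result aligned with the input, so the invalid entries can be filtered out later in a single step.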
[6]:
import numpy as np
import matplotlib.pyplot as plt
plt.hist(ratios, bins=np.linspace(0.8, 1.2, 100))
plt.title("Histogram of Gaia distance / Gaia EDR3 distance")
plt.show()
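The bins above only span 0.8 to 1.2, so ratios outside that interval are not drawn. As a rough check on how much of the data the plot covers, one could count the values that fall inside the range. The sketch below does this in plain Python for the six values visible in the truncated array printed earlier; they are only a small sample, so the resulting fraction says nothing about the full catalog.

```python
# The six ratios visible in the truncated array output of the notebook.
ratios_sample = [
    1.04369179,
    0.99821673,
    0.96884423,
    1.03218566,
    147.42381857,
    1.41197562,
]

# Same limits as the histogram bins: np.linspace(0.8, 1.2, 100).
lower, upper = 0.8, 1.2

in_range = [r for r in ratios_sample if lower <= r <= upper]
outside = [r for r in ratios_sample if not (lower <= r <= upper)]

fraction_in_range = len(in_range) / len(ratios_sample)
print(f"{len(in_range)} of {len(ratios_sample)} sample ratios fall inside the plotted range")
print("outside the range:", outside)
```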

About
Authors: Sandro Campos
Last updated on: April 17, 2025
If you use lsdb for published research, please cite it following the project's citation instructions.