Joining catalogs

In this tutorial we join a small cone region of Gaia with the Gaia Early Data Release 3 (EDR3) distance catalog, then compute the ratio between two distance estimates for each object: the distance derived from the parallax column of the first catalog and the r_med_geo column of the second.

[1]:
import lsdb
from lsdb import ConeSearch

First we load Gaia, keeping only the source_id, position (ra, dec) and parallax columns, and restrict it to a cone around (ra, dec) = (0, 0).

[2]:
gaia = lsdb.read_hats(
    "https://data.lsdb.io/hats/gaia_dr3/gaia",
    margin_cache="https://data.lsdb.io/hats/gaia_dr3/gaia_10arcs",
    columns=["source_id", "ra", "dec", "parallax"],
    search_filter=ConeSearch(ra=0, dec=0, radius_arcsec=10 * 3600),
)
gaia
[2]:
lsdb Catalog gaia:
source_id ra dec parallax
npartitions=4
Order: 2, Pixel: 67 int64[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow]
Order: 2, Pixel: 70 ... ... ... ...
Order: 2, Pixel: 73 ... ... ... ...
Order: 2, Pixel: 76 ... ... ... ...
The catalog has been loaded lazily, meaning no data has been read, only the catalog schema
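
The radius_arcsec=10 * 3600 argument above is 36,000 arcseconds, that is, a cone of 10 degrees around the center. As a rough illustration of the test a cone filter performs, the sketch below computes the angular separation between two sky positions with the haversine formula and compares it with the radius. The helper names are hypothetical, and this is not LSDB's implementation, which works on HEALPix partitions.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two (ra, dec) positions in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine formula, numerically stable for small separations
    h = (
        math.sin((dec2 - dec1) / 2) ** 2
        + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2
    )
    # Clamp to 1.0 to guard against rounding just above the valid asin domain
    return math.degrees(2 * math.asin(math.sqrt(min(1.0, h))))


def in_cone(ra, dec, center_ra, center_dec, radius_arcsec):
    """True if (ra, dec) lies within radius_arcsec of the cone center."""
    separation_arcsec = angular_separation_deg(ra, dec, center_ra, center_dec) * 3600
    return separation_arcsec <= radius_arcsec


radius_arcsec = 10 * 3600  # same radius as in the tutorial: 10 degrees
print(in_cone(5.0, 5.0, 0.0, 0.0, radius_arcsec))     # roughly 7 degrees from the center
print(in_cone(355.0, -3.0, 0.0, 0.0, radius_arcsec))  # position on the other side of ra = 0
print(in_cone(12.0, 0.0, 0.0, 0.0, radius_arcsec))    # 12 degrees from the center
```

Note that the separation handles the wrap-around at ra = 0, which matters here because the cone is centered exactly on that boundary.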

We do the same with Gaia EDR3, using the same cone. Here the distance column we need is r_med_geo, the median of the geometric distance estimate.

[3]:
gaia_edr3 = lsdb.read_hats(
    "https://data.lsdb.io/hats/gaia_dr3/gaia_edr3_distances",
    margin_cache="https://data.lsdb.io/hats/gaia_dr3/gaia_edr3_distances_10arcs",
    columns=["source_id", "ra", "dec", "r_med_geo"],
    search_filter=ConeSearch(ra=0, dec=0, radius_arcsec=10 * 3600),
)
gaia_edr3
[3]:
lsdb Catalog gaia_edr3_distances:
source_id ra dec r_med_geo
npartitions=4
Order: 2, Pixel: 67 int64[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow]
Order: 2, Pixel: 70 ... ... ... ...
Order: 2, Pixel: 73 ... ... ... ...
Order: 2, Pixel: 76 ... ... ... ...
The catalog has been loaded lazily, meaning no data has been read, only the catalog schema

We are now able to join both catalogs on the source_id column, as follows:

[4]:
joined = gaia.join(gaia_edr3, left_on="source_id", right_on="source_id")
joined
[4]:
lsdb Catalog gaia:
source_id_gaia ra_gaia dec_gaia parallax_gaia source_id_gaia_edr3_distances ra_gaia_edr3_distances dec_gaia_edr3_distances r_med_geo_gaia_edr3_distances
npartitions=4
Order: 2, Pixel: 67 int64[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow] int64[pyarrow] double[pyarrow] double[pyarrow] double[pyarrow]
Order: 2, Pixel: 70 ... ... ... ... ... ... ... ...
Order: 2, Pixel: 73 ... ... ... ... ... ... ... ...
Order: 2, Pixel: 76 ... ... ... ... ... ... ... ...
The catalog has been loaded lazily, meaning no data has been read, only the catalog schema
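
The output above shows that every column of the joined catalog carries the name of the catalog it came from as a suffix (_gaia or _gaia_edr3_distances). The sketch below mimics that naming with a join on plain Python dictionaries. It assumes inner-join semantics, the rows and the inner_join helper are made up for illustration, and LSDB's partition-wise join is considerably more involved.

```python
def inner_join(left_rows, right_rows, key, left_suffix, right_suffix):
    """Inner-join two lists of dict rows on `key`, suffixing every column name."""
    # Index the right-hand rows by key so each lookup is constant time
    right_index = {}
    for row in right_rows:
        right_index.setdefault(row[key], []).append(row)

    joined = []
    for left in left_rows:
        for right in right_index.get(left[key], []):
            merged = {f"{name}{left_suffix}": value for name, value in left.items()}
            merged.update({f"{name}{right_suffix}": value for name, value in right.items()})
            joined.append(merged)
    return joined


# Hypothetical rows, not real Gaia data
gaia_rows = [
    {"source_id": 1, "parallax": 2.0},
    {"source_id": 2, "parallax": 0.5},
]
distance_rows = [
    {"source_id": 2, "r_med_geo": 1900.0},
    {"source_id": 3, "r_med_geo": 750.0},
]

rows = inner_join(gaia_rows, distance_rows, "source_id", "_gaia", "_gaia_edr3_distances")
print(rows)
```

Only the row whose source_id appears on both sides survives, and its columns are named the same way as in the joined catalog above.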

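Let’s compute the ratio between the two catalog distances and plot it as a histogram. Gaia parallaxes are given in milliarcseconds, so 1e3 / parallax is a distance in parsecs, the same unit as r_med_geo.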

[5]:
results = (1e3 / joined["parallax_gaia"]) / joined["r_med_geo_gaia_edr3_distances"]
ratios = results.compute().to_numpy()
ratios
[5]:
array([  1.04369179,   0.99821673,   0.96884423, ...,   1.03218566,
       147.42381857,   1.41197562], shape=(825678,))
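
Most ratios in the output above are close to 1, but values such as 147.4 also occur. Because the parallax-based distance is the reciprocal of the parallax, a very small measured parallax inflates that distance relative to the geometric estimate. The sketch below works through the arithmetic on a few made-up values; the function name and the numbers are illustrative only and are not taken from the catalogs.

```python
def parallax_to_distance_pc(parallax_mas):
    """Naive distance in parsecs from a parallax in milliarcseconds."""
    return 1e3 / parallax_mas


# Hypothetical (parallax [mas], r_med_geo [pc]) pairs, not real Gaia data
samples = [
    (2.0, 500.0),    # parallax and geometric distance agree
    (0.5, 1900.0),   # mild disagreement
    (0.01, 2000.0),  # tiny parallax inflates the naive distance
]

for parallax_mas, r_med_geo_pc in samples:
    ratio = parallax_to_distance_pc(parallax_mas) / r_med_geo_pc
    print(f"parallax={parallax_mas} mas, r_med_geo={r_med_geo_pc} pc -> ratio={ratio:.3f}")
```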
[6]:
import numpy as np
import matplotlib.pyplot as plt

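# 100 bin edges (99 bins) between 0.8 and 1.2; ratios outside this range are not drawn
plt.hist(ratios, bins=np.linspace(0.8, 1.2, 100))
plt.title("Histogram of Gaia distance / Gaia EDR3 distance")
plt.xlabel("Distance ratio")
plt.ylabel("Number of objects")
plt.show()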
[Figure: histogram of the Gaia distance / Gaia EDR3 distance ratio, binned between 0.8 and 1.2]
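
Because the histogram bins span only 0.8 to 1.2, ratios outside that window (such as the 147.4 seen earlier) are left out of the plot. A quick way to judge how much is left out is to count the fraction of values inside the window. The sketch below does this on a short made-up list; the helper name and the numbers are hypothetical.

```python
def fraction_in_range(values, low, high):
    """Fraction of values that fall inside the closed interval [low, high]."""
    values = list(values)
    inside = sum(1 for v in values if low <= v <= high)
    return inside / len(values)


# Hypothetical ratios, loosely modeled on the array printed earlier
example_ratios = [1.04, 0.998, 0.969, 1.03, 147.4, 1.41]
print(fraction_in_range(example_ratios, 0.8, 1.2))
```

On the real NumPy array the same quantity could be obtained with np.mean((ratios >= 0.8) & (ratios <= 1.2)).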