Joining catalogs
In this tutorial we join a small cone region of Gaia with the Gaia Early Data Release 3 (EDR3) distances catalog, then compute the ratio between the distances derived from their parallax and r_med_geo columns, respectively.
[1]:
import lsdb
from lsdb import ConeSearch
First we load Gaia, keeping only the source_id, position (ra, dec) and parallax columns.
[2]:
gaia = lsdb.read_hats(
"https://data.lsdb.io/hats/gaia_dr3/gaia",
margin_cache="https://data.lsdb.io/hats/gaia_dr3/gaia_10arcs",
columns=["source_id", "ra", "dec", "parallax"],
search_filter=ConeSearch(ra=0, dec=0, radius_arcsec=10 * 3600),
)
gaia
[2]:
lsdb Catalog gaia:
| | source_id | ra | dec | parallax |
|---|---|---|---|---|
| npartitions=4 | | | | |
| Order: 2, Pixel: 67 | int64[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] |
| Order: 2, Pixel: 70 | ... | ... | ... | ... |
| Order: 2, Pixel: 73 | ... | ... | ... | ... |
| Order: 2, Pixel: 76 | ... | ... | ... | ... |
The catalog has been loaded lazily, meaning no data has been read, only the catalog schema
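The ConeSearch filter used above has a radius of 10 * 3600 arcseconds, i.e. 10 degrees, around (ra, dec) = (0, 0). As a rough illustration of what such a filter selects, the sketch below tests cone membership with the haversine formula in plain NumPy. It is not LSDB's implementation; the helper name in_cone and the test positions are hypothetical.

```python
import numpy as np


def in_cone(ra_deg, dec_deg, center_ra_deg, center_dec_deg, radius_arcsec):
    """Return a boolean mask of points within radius_arcsec of the cone center.

    Hypothetical helper for illustration only; uses the haversine formula.
    """
    ra = np.radians(np.asarray(ra_deg, dtype=float))
    dec = np.radians(np.asarray(dec_deg, dtype=float))
    ra0 = np.radians(center_ra_deg)
    dec0 = np.radians(center_dec_deg)

    # Haversine formula for the angular separation on the sphere.
    sin_ddec = np.sin((dec - dec0) / 2.0)
    sin_dra = np.sin((ra - ra0) / 2.0)
    a = sin_ddec**2 + np.cos(dec) * np.cos(dec0) * sin_dra**2
    separation_rad = 2.0 * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

    separation_arcsec = np.degrees(separation_rad) * 3600.0
    return separation_arcsec <= radius_arcsec


# Made-up positions on the equator: 5 deg away, 5 deg away across the
# RA = 0/360 wrap, and 15 deg away from the cone center.
ra = np.array([5.0, 355.0, 15.0])
dec = np.array([0.0, 0.0, 0.0])
mask = in_cone(ra, dec, center_ra_deg=0.0, center_dec_deg=0.0, radius_arcsec=10 * 3600)
print(mask)
```

The second point shows why the separation is computed on the sphere rather than by subtracting coordinates: a cone centered at ra=0 also covers right ascensions just below 360 degrees.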
We do the same with Gaia EDR3, but the distance column we use is r_med_geo, the median of the geometric distance estimate.
[3]:
gaia_edr3 = lsdb.read_hats(
"https://data.lsdb.io/hats/gaia_dr3/gaia_edr3_distances",
margin_cache="https://data.lsdb.io/hats/gaia_dr3/gaia_edr3_distances_10arcs",
columns=["source_id", "ra", "dec", "r_med_geo"],
search_filter=ConeSearch(ra=0, dec=0, radius_arcsec=10 * 3600),
)
gaia_edr3
[3]:
lsdb Catalog gaia_edr3_distances:
| | source_id | ra | dec | r_med_geo |
|---|---|---|---|---|
| npartitions=4 | | | | |
| Order: 2, Pixel: 67 | int64[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] |
| Order: 2, Pixel: 70 | ... | ... | ... | ... |
| Order: 2, Pixel: 73 | ... | ... | ... | ... |
| Order: 2, Pixel: 76 | ... | ... | ... | ... |
The catalog has been loaded lazily, meaning no data has been read, only the catalog schema
We can now join both catalogs on the source_id column:
[4]:
joined = gaia.join(gaia_edr3, left_on="source_id", right_on="source_id")
joined
[4]:
lsdb Catalog gaia:
| | source_id_gaia | ra_gaia | dec_gaia | parallax_gaia | source_id_gaia_edr3_distances | ra_gaia_edr3_distances | dec_gaia_edr3_distances | r_med_geo_gaia_edr3_distances |
|---|---|---|---|---|---|---|---|---|
| npartitions=4 | | | | | | | | |
| Order: 2, Pixel: 67 | int64[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] | int64[pyarrow] | double[pyarrow] | double[pyarrow] | double[pyarrow] |
| Order: 2, Pixel: 70 | ... | ... | ... | ... | ... | ... | ... | ... |
| Order: 2, Pixel: 73 | ... | ... | ... | ... | ... | ... | ... | ... |
| Order: 2, Pixel: 76 | ... | ... | ... | ... | ... | ... | ... | ... |
The catalog has been loaded lazily, meaning no data has been read, only the catalog schema
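In the joined schema above, every column carries the name of the catalog it came from as a suffix. For readers more familiar with pandas, the following sketch reproduces that naming on two tiny made-up tables. It uses pandas.DataFrame.merge, not the LSDB API, and it assumes an inner join (only rows whose source_id appears in both tables are kept).

```python
import pandas as pd

# Made-up rows standing in for the two catalogs.
left = pd.DataFrame({"source_id": [1, 2, 3], "parallax": [2.0, 0.5, 10.0]})
right = pd.DataFrame({"source_id": [2, 3, 4], "r_med_geo": [1900.0, 101.0, 50.0]})

# Suffix every column with its catalog name, mimicking the joined schema above.
left = left.add_suffix("_gaia")
right = right.add_suffix("_gaia_edr3_distances")

# Inner merge (an assumption of this sketch): keep only matching source_ids.
joined_df = left.merge(
    right,
    how="inner",
    left_on="source_id_gaia",
    right_on="source_id_gaia_edr3_distances",
)
print(joined_df)
```

Only source_id values 2 and 3 are present in both tables, so the merged frame has two rows and four suffixed columns.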
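Let’s compute the ratio between the two catalog distances and then plot its histogram. Gaia parallaxes are given in milliarcseconds, so 1e3 / parallax is a distance in parsecs, the same unit as r_med_geo.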
[5]:
results = (1e3 / joined["parallax_gaia"]) / joined["r_med_geo_gaia_edr3_distances"]
ratios = results.compute().to_numpy()
ratios
[5]:
array([ 1.04369179, 0.99821673, 0.96884423, ..., 1.03218566,
147.42381857, 1.41197562], shape=(825678,))
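The value 147.42 in the output above suggests that a plain inverse parallax can disagree strongly with r_med_geo for some sources, and Gaia also reports zero or negative parallaxes, for which the inverse is not a meaningful distance. The sketch below shows one possible way to guard against those cases. It uses made-up values and plain NumPy arrays rather than the lazy LSDB columns, and masking to NaN is only one of several reasonable choices.

```python
import numpy as np

# Made-up values: parallax in milliarcseconds, r_med_geo in parsecs.
parallax_mas = np.array([10.0, 1.0, 0.0, -0.5])
r_med_geo_pc = np.array([100.0, 950.0, 3000.0, 4000.0])

# Only strictly positive parallaxes can be inverted into a distance.
valid = parallax_mas > 0

# distance [pc] = 1000 / parallax [mas]; invalid entries stay NaN.
distance_pc = np.full_like(parallax_mas, np.nan)
distance_pc[valid] = 1e3 / parallax_mas[valid]

ratio = distance_pc / r_med_geo_pc
print(ratio)
```

Keeping the invalid entries as NaN preserves the row alignment with the joined catalog; they can be dropped later with np.isfinite before plotting.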
[6]:
import numpy as np
import matplotlib.pyplot as plt
plt.hist(ratios, bins=np.linspace(0.8, 1.2, 100))
plt.title("Histogram of Gaia distance / Gaia EDR3 distance")
plt.show()
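The histogram only shows ratios between 0.8 and 1.2, so values such as 147.42 or 1.41 from the output above are silently left out of the plot. A small sketch of how one might count what falls outside the plotted range, using np.histogram on the six ratio values printed earlier (the variable names here are illustrative):

```python
import numpy as np

# The six ratio values visible in the printed array above.
ratios_demo = np.array(
    [1.04369179, 0.99821673, 0.96884423, 1.03218566, 147.42381857, 1.41197562]
)

# np.linspace(0.8, 1.2, 100) yields 100 bin edges, i.e. 99 bins.
bins = np.linspace(0.8, 1.2, 100)
counts, edges = np.histogram(ratios_demo, bins=bins)

# Values outside [0.8, 1.2] are not counted in any bin.
n_outside = np.count_nonzero((ratios_demo < bins[0]) | (ratios_demo > bins[-1]))

print("bins:", len(counts))
print("inside plotted range:", counts.sum())
print("outside plotted range:", n_outside)
```

Running the same count on the full ratios array would show how much of the sample the plot actually represents.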
