Types of Crossmatches#
In this tutorial, we will:
- explain what how='inner' (the default) and how='left' mean when crossmatching catalogs
- demonstrate how each join type affects the row count and column values in the result
- explain how nested crossmatching works with how='left' and how='inner'
Additional Help
For tips on accessing remote data, see our Accessing remote data guide. For a full crossmatching walkthrough, see the Crossmatching catalogs tutorial.
1. Opening catalogs#
We open a small cone of ZTF DR22 (left catalog) and Gaia DR3 (right catalog).
[3]:
import warnings
import lsdb
warnings.filterwarnings("ignore")
[4]:
ztf = lsdb.open_catalog(
"https://data.lsdb.io/hats/ztf_dr22",
columns=["objectid", "objra", "objdec", "nepochs"],
search_filter=lsdb.ConeSearch(280, 0, radius_arcsec=36),
)
gaia = lsdb.open_catalog(
"s3://stpubdata/gaia/gaia_dr3/public/hats",
columns=["source_id", "ra", "dec", "phot_g_mean_mag"],
search_filter=lsdb.ConeSearch(280, 0, radius_arcsec=36),
)
[5]:
n_ztf = len(ztf.compute())
n_gaia = len(gaia.compute())
print(f"ZTF sources in cone : {n_ztf}")
print(f"Gaia sources in cone : {n_gaia}")
ZTF sources in cone : 129
Gaia sources in cone : 17
2. What is the difference between the “left” and “right” catalogs?#
When you write
catalog_a.crossmatch(catalog_b, ...)
catalog_a is the left catalog (the one you call .crossmatch() on) and catalog_b is the right catalog (the one you pass as the first argument).
The crossmatch algorithm iterates over every row in the left catalog and searches for its nearest neighbor in the right catalog. As a result, when n_neighbors=1 (the default), left-catalog rows are never duplicated: each one appears at most once in the output. Right-catalog rows can be duplicated: if several left-catalog rows are close to the same right-catalog row, they all match to it and each match appears in the result.
The how parameter determines how rows in the left catalog that have no counterpart in the right catalog are handled. If how='inner' (the default), those rows are dropped from the result. If how='left', those rows are kept in the result, but all right-catalog columns for those rows are filled with <NA>.
If you’re using the lsdb.crossmatch function, the left and right catalogs are determined by the order of the arguments:
result = lsdb.crossmatch(catalog_a, catalog_b, ...)
Here catalog_a is the left catalog and catalog_b is the right catalog.
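The matching behavior described in this section can be illustrated with a small brute-force sketch. This is not how lsdb implements crossmatching; the function names, the catalog contents, and the coordinates below are all hypothetical, chosen so that two left rows share one right row and one left row has no match.

```python
import math


def separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds (haversine formula, inputs in degrees)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (
        math.sin((dec2 - dec1) / 2) ** 2
        + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2
    )
    return math.degrees(2 * math.asin(math.sqrt(a))) * 3600


def toy_crossmatch(left, right, radius_arcsec=1.0, how="inner"):
    """Brute-force nearest-neighbor match; `left` and `right` map id -> (ra, dec)."""
    rows = []
    for left_id, (ra, dec) in left.items():
        # Distance from this left row to every right row, then keep the nearest.
        distances = [
            (rid, separation_arcsec(ra, dec, rra, rdec))
            for rid, (rra, rdec) in right.items()
        ]
        right_id, dist = min(distances, key=lambda pair: pair[1])
        if dist <= radius_arcsec:
            rows.append((left_id, right_id, dist))
        elif how == "left":
            # Unmatched left row is kept; the right-hand side stays empty.
            rows.append((left_id, None, None))
    return rows


# Hypothetical coordinates: L1 and L2 both sit near R1; L3 is far from everything.
left = {"L1": (280.0, 0.0), "L2": (280.0001, 0.0), "L3": (280.01, 0.0)}
right = {"R1": (280.00005, 0.0), "R2": (280.005, 0.0)}

inner = toy_crossmatch(left, right, how="inner")
left_join = toy_crossmatch(left, right, how="left")
print([(l, r) for l, r, _ in inner])      # [('L1', 'R1'), ('L2', 'R1')]
print([(l, r) for l, r, _ in left_join])  # [('L1', 'R1'), ('L2', 'R1'), ('L3', None)]
```

In this toy model R1 appears twice in the output, each left row appears at most once, and L3 survives only when how="left".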
3. Inner crossmatch (how='inner', the default)#
how='inner' keeps only the rows where a match was found in both catalogs. ZTF sources that fall outside the 1-arcsecond search radius of every bright Gaia source are dropped from the result.
This is the default, so ztf.crossmatch(gaia_bright, radius_arcsec=1.0) and ztf.crossmatch(gaia_bright, radius_arcsec=1.0, how='inner') are equivalent.
[6]:
gaia_bright = gaia.query("phot_g_mean_mag < 20")
n_bright = len(gaia_bright.compute())
inner_result = ztf.crossmatch(
gaia_bright,
radius_arcsec=1.0,
how="inner",
suffix_method="overlapping_columns",
log_changes=False,
).compute()
inner_result[["objectid", "nepochs", "source_id", "phot_g_mean_mag", "_dist_arcsec"]]
[6]:
| _healpix_29 | objectid | nepochs | source_id | phot_g_mean_mag | _dist_arcsec |
|---|---|---|---|---|---|
| 2136230511175494507 | 435313300083253 | 174 | 4272461018841219072 | 15.848508 | 0.043069 |
| 2136230511186576907 | 435113300001005 | 377 | 4272461018841219072 | 15.848508 | 0.09387 |
| 2136230511186589223 | 1481108400021268 | 14 | 4272461018841219072 | 15.848508 | 0.105821 |
| 2136230511186591124 | 1481208400054176 | 31 | 4272461018841219072 | 15.848508 | 0.111156 |
| 2136230511187068011 | 435213300027173 | 921 | 4272461018841219072 | 15.848508 | 0.116384 |
| 2136230517889210485 | 435213300004315 | 498 | 4272461014537456000 | 19.777922 | 0.059946 |
| 2136230517889237321 | 435313300112378 | 169 | 4272461014537456000 | 19.777922 | 0.151346 |
| 2136230518300411795 | 435313300008254 | 174 | 4272461014536964480 | 18.72089 | 0.05034 |
| 2136230518300436707 | 1481208400054105 | 23 | 4272461014536964480 | 18.72089 | 0.087248 |
| 2136230518300465965 | 435213300004348 | 837 | 4272461014536964480 | 18.72089 | 0.123049 |
| 2136230518301721633 | 435113300025722 | 17 | 4272461014536964480 | 18.72089 | 0.291385 |
| 2136230519233805001 | 435313300008165 | 166 | 4272461014537456128 | 19.814991 | 0.033115 |
| 2136230519233829973 | 435213300045612 | 457 | 4272461014537456128 | 19.814991 | 0.019936 |
| 2136230519234225171 | 1481208400144813 | 10 | 4272461014537456128 | 19.814991 | 0.09365 |
| 2136230529701382414 | 1481208300087470 | 19 | 4272461048901519744 | 18.847134 | 0.171081 |
| 2136230529701464098 | 435313300048022 | 171 | 4272461048901519744 | 18.847134 | 0.121662 |
| 2136230529701839554 | 435213300068888 | 746 | 4272461048901519744 | 18.847134 | 0.215594 |
| 2136230529707217992 | 435113300019002 | 5 | 4272461048901519744 | 18.847134 | 0.882785 |
| 2136230531210057600 | 1481208300036047 | 21 | 4272461048896704384 | 17.559347 | 0.135669 |
| 2136230531211503084 | 1481108300010329 | 7 | 4272461048896704384 | 17.559347 | 0.150126 |
| 2136230531211541068 | 435313300083184 | 174 | 4272461048896704384 | 17.559347 | 0.059375 |
| 2136230531211583811 | 435113300013833 | 235 | 4272461048896704384 | 17.559347 | 0.065542 |
| 2136230531211681091 | 435213300027132 | 909 | 4272461048896704384 | 17.559347 | 0.09833 |
23 rows × 5 columns
[7]:
print(f"ZTF sources (left) : {n_ztf}")
print(f"Bright Gaia sources (right): {n_bright}")
print(
f"Inner result rows : {len(inner_result)} \u2190 {n_ztf - len(inner_result)} ZTF rows dropped"
)
ZTF sources (left) : 129
Bright Gaia sources (right): 6
Inner result rows : 23 ← 106 ZTF rows dropped
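The inner result also shows right-catalog rows being reused: 23 result rows point at only 6 bright Gaia sources. As a quick check, the sketch below tallies the source_id values copied by hand from the table above using only the standard library. On the computed result itself, something like inner_result["source_id"].value_counts() should give the same tally, assuming the result behaves like a pandas DataFrame.

```python
from collections import Counter

# source_id column copied from the 23 rows of the inner result shown above.
matched_source_ids = [
    4272461018841219072, 4272461018841219072, 4272461018841219072,
    4272461018841219072, 4272461018841219072,
    4272461014537456000, 4272461014537456000,
    4272461014536964480, 4272461014536964480, 4272461014536964480,
    4272461014536964480,
    4272461014537456128, 4272461014537456128, 4272461014537456128,
    4272461048901519744, 4272461048901519744, 4272461048901519744,
    4272461048901519744,
    4272461048896704384, 4272461048896704384, 4272461048896704384,
    4272461048896704384, 4272461048896704384,
]

# How many ZTF rows matched each Gaia source.
matches_per_gaia_source = Counter(matched_source_ids)

print(len(matched_source_ids))                   # 23
print(len(matches_per_gaia_source))              # 6
print(sorted(matches_per_gaia_source.values()))  # [2, 3, 4, 4, 5, 5]
```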
4. Left crossmatch (how='left')#
how='left' keeps every row from the left catalog, regardless of whether a match was found. For ZTF sources that have no bright Gaia counterpart within 1 arcsecond, all Gaia columns in that row are filled with <NA>.
The row count of the result equals the row count of the left catalog (when n_neighbors=1).
[8]:
left_result = ztf.crossmatch(
gaia_bright,
radius_arcsec=1.0,
how="left",
suffix_method="overlapping_columns",
log_changes=False,
).compute()
left_result[["objectid", "nepochs", "source_id", "phot_g_mean_mag", "_dist_arcsec"]]
[8]:
| _healpix_29 | objectid | nepochs | source_id | phot_g_mean_mag | _dist_arcsec |
|---|---|---|---|---|---|
| 2136230511175494507 | 435313300083253 | 174 | 4272461018841219072 | 15.848508 | 0.043069 |
| 2136230511186576907 | 435113300001005 | 377 | 4272461018841219072 | 15.848508 | 0.09387 |
| ... | ... | ... | ... | ... | ... |
| 2136230547704034321 | 435213300004122 | 2 | <NA> | <NA> | <NA> |
| 2136230550490352432 | 435313300007934 | 60 | <NA> | <NA> | <NA> |
129 rows × 5 columns
[9]:
print(f"ZTF sources (left) : {n_ztf}")
print(f"Bright Gaia sources (right): {n_bright}")
print(f"Left result rows : {len(left_result)} \u2190 all ZTF rows preserved")
ZTF sources (left) : 129
Bright Gaia sources (right): 6
Left result rows : 129 ← all ZTF rows preserved
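The row-count rule for how='left' holds only when n_neighbors=1. The toy sketch below illustrates why, under the assumption that a larger n_neighbors lets one left row pair with several right rows. It uses a hypothetical one-dimensional matcher with made-up positions, not lsdb itself.

```python
def toy_left_crossmatch(left, right, radius, n_neighbors=1):
    """Toy 1-D left join: keep up to `n_neighbors` right rows within `radius` of each left row."""
    rows = []
    for left_id, x in left.items():
        # Right rows inside the radius, nearest first, truncated to n_neighbors.
        near = sorted(
            (abs(x - rx), rid) for rid, rx in right.items() if abs(x - rx) <= radius
        )[:n_neighbors]
        if near:
            rows.extend((left_id, rid) for _, rid in near)
        else:
            rows.append((left_id, None))  # unmatched left row is still kept
    return rows


# Hypothetical positions on a line, in arbitrary units.
left = {"L1": 0.0, "L2": 5.0}
right = {"R1": 0.2, "R2": -0.4}

one = toy_left_crossmatch(left, right, radius=1.0, n_neighbors=1)
two = toy_left_crossmatch(left, right, radius=1.0, n_neighbors=2)
print(one)  # [('L1', 'R1'), ('L2', None)]
print(two)  # [('L1', 'R1'), ('L1', 'R2'), ('L2', None)]
```

With one neighbor the output has exactly as many rows as the left catalog. With two, L1 appears twice and the output grows beyond the left catalog's row count.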
4.1 Identifying unmatched sources#
Because unmatched rows have <NA> in all right-catalog columns, you can filter them with a null check on any right-catalog column — for example source_id:
[10]:
n_matched = left_result["source_id"].notna().sum()
n_unmatched = left_result["source_id"].isna().sum()
print(f"ZTF sources with a bright Gaia match : {n_matched:3d} ({n_matched/len(left_result):.1%})")
print(f"ZTF sources with no bright Gaia match : {n_unmatched:3d} ({n_unmatched/len(left_result):.1%})")
ZTF sources with a bright Gaia match : 23 (17.8%)
ZTF sources with no bright Gaia match : 106 (82.2%)
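The same null check can split a left result into its matched and unmatched parts. The sketch below uses a small hand-made pandas DataFrame as a stand-in for left_result; the values are hypothetical, and it assumes the real result supports ordinary pandas boolean indexing.

```python
import pandas as pd

# Toy stand-in for `left_result`: two matched rows and two unmatched rows.
toy_left = pd.DataFrame(
    {
        "objectid": [101, 102, 103, 104],
        "source_id": pd.array([9001, pd.NA, 9001, pd.NA], dtype="Int64"),
        "_dist_arcsec": pd.array([0.04, pd.NA, 0.09, pd.NA], dtype="Float64"),
    }
)

has_match = toy_left["source_id"].notna()
matched = toy_left[has_match]     # the rows an inner crossmatch would keep
unmatched = toy_left[~has_match]  # the rows that only how='left' preserves

print(len(matched), len(unmatched))   # 2 2
print(unmatched["objectid"].tolist())  # [102, 104]
```

Filtering a left result down to its matched rows gives the same set of rows as the inner result, which is why the counts above (23 matched, 106 unmatched) add up to the 129 ZTF sources.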
Why no how='right' or how='outer'?
The how parameter in lsdb.crossmatch supports only 'inner' and 'left' because of the way lsdb maintains partitioning for efficient crossmatching. To preserve the HATS partitioning scheme, every row of the output catalog needs ra and dec values that determine its position. Because the crossmatch relies on the right catalog's margin to be correct, the right catalog's ra and dec cannot serve this purpose. Every output row must therefore correspond to a row in the left catalog, which is incompatible with how='right' or how='outer', where the output could contain rows that have no counterpart in the left catalog.
That said, you can achieve an effect similar to how='right' by swapping the order of the catalogs and using how='left'.
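Swapping the catalogs gives a similar result to a right join, not an identical one, because the nearest-neighbor search then runs from the other catalog's point of view. The toy sketch below shows the difference with a hypothetical one-dimensional matcher and made-up positions; it is an illustration, not lsdb's implementation.

```python
def nearest_within(left, right, radius, how="inner"):
    """Toy 1-D crossmatch: pair each left row with its nearest right row within `radius`."""
    rows = []
    for left_id, x in left.items():
        right_id, dist = min(
            ((rid, abs(x - rx)) for rid, rx in right.items()), key=lambda p: p[1]
        )
        if dist <= radius:
            rows.append((left_id, right_id))
        elif how == "left":
            rows.append((left_id, None))  # unmatched left row kept
    return rows


# Hypothetical positions on a line, in arbitrary units.
catalog_a = {"A1": 0.0, "A2": 0.3}
catalog_b = {"B1": 0.1, "B2": 9.0}

a_left = nearest_within(catalog_a, catalog_b, radius=1.0, how="left")
b_left = nearest_within(catalog_b, catalog_a, radius=1.0, how="left")
print(a_left)  # [('A1', 'B1'), ('A2', 'B1')]
print(b_left)  # [('B1', 'A1'), ('B2', None)]
```

The swapped call keeps every row of catalog_b, including the unmatched B2, which is the part a right join would provide. It pairs B1 only with its nearest neighbor A1, though, so the A2 to B1 match from the original direction is not reproduced.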
5. Nested Crossmatch with how='left' vs how='inner'#
When you perform a nested crossmatch, the how parameter works in the same way. With how='inner', only rows with matches in both catalogs are kept in the output catalog, and the right catalog nested column always contains a non-empty table. With how='left', all rows from the left catalog are kept, and the right catalog nested column contains a None value for rows with no match.
[16]:
# Create a nested crossmatch with how='inner'
nested_inner = ztf.crossmatch_nested(
gaia_bright,
radius_arcsec=1.0,
how="inner",
).compute()
nested_inner[["objectid", "nepochs", "gaia"]]
[16]:
| | objectid | nepochs | gaia |
|---|---|---|---|
| 2136230511175494507 | 435313300083253 | 174 | (nested table) |
| 2136230511186576907 | 435113300001005 | 377 | (nested table) |
| 2136230511186589223 | 1481108400021268 | 14 | (nested table) |
| 2136230511186591124 | 1481208400054176 | 31 | (nested table) |
| 2136230511187068011 | 435213300027173 | 921 | (nested table) |
| 2136230517889210485 | 435213300004315 | 498 | (nested table) |
| 2136230517889237321 | 435313300112378 | 169 | (nested table) |
| 2136230518300411795 | 435313300008254 | 174 | (nested table) |
| 2136230518300436707 | 1481208400054105 | 23 | (nested table) |
| 2136230518300465965 | 435213300004348 | 837 | (nested table) |
| 2136230518301721633 | 435113300025722 | 17 | (nested table) |
| 2136230519233805001 | 435313300008165 | 166 | (nested table) |
| 2136230519233829973 | 435213300045612 | 457 | (nested table) |
| 2136230519234225171 | 1481208400144813 | 10 | (nested table) |
| 2136230529701382414 | 1481208300087470 | 19 | (nested table) |
| 2136230529701464098 | 435313300048022 | 171 | (nested table) |
| 2136230529701839554 | 435213300068888 | 746 | (nested table) |
| 2136230529707217992 | 435113300019002 | 5 | (nested table) |
| 2136230531210057600 | 1481208300036047 | 21 | (nested table) |
| 2136230531211503084 | 1481108300010329 | 7 | (nested table) |
| 2136230531211541068 | 435313300083184 | 174 | (nested table) |
| 2136230531211583811 | 435113300013833 | 235 | (nested table) |
| 2136230531211681091 | 435213300027132 | 909 | (nested table) |
[17]:
print(f"Nested inner result rows : {len(nested_inner)} \u2190 all ZTF rows with a bright Gaia match")
Nested inner result rows : 23 ← all ZTF rows with a bright Gaia match
[18]:
# Create a nested crossmatch with how='left'
nested_left = ztf.crossmatch_nested(
gaia_bright,
radius_arcsec=1.0,
how="left",
).compute()
nested_left[["objectid", "nepochs", "gaia"]]
[18]:
| | objectid | nepochs | gaia |
|---|---|---|---|
| 2136230497219052831 | 435313300008704 | 18 | None |
| 2136230497608086572 | 435213300074278 | 15 | None |
| 2136230497630747626 | 435313300083450 | 81 | None |
| 2136230497631224467 | 1481208400054451 | 1 | None |
| 2136230498196657357 | 435313300008591 | 114 | None |
| ... | ... | ... | ... |
[19]:
print(f"Nested left result rows : {len(nested_left)} \u2190 all ZTF rows preserved")
Nested left result rows : 129 ← all ZTF rows preserved
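The nested behavior shown in this section can be mimicked with plain Python structures. The sketch below is a hypothetical one-dimensional model with made-up positions, not the nested-column machinery lsdb uses. Each output row corresponds to one left row, matched rows carry a non-empty nested list, and with how="left" unmatched rows carry None.

```python
def toy_crossmatch_nested(left, right, radius, how="inner"):
    """Toy 1-D nested crossmatch: one output row per left row, match stored in a nested list."""
    rows = []
    for left_id, x in left.items():
        # Right rows inside the radius; the nearest one becomes the nested entry.
        candidates = [
            (abs(x - rx), rid, rx) for rid, rx in right.items() if abs(x - rx) <= radius
        ]
        if candidates:
            dist, rid, rx = min(candidates)
            nested = [{"id": rid, "pos": rx, "dist": dist}]  # non-empty nested table
            rows.append({"id": left_id, "pos": x, "nested": nested})
        elif how == "left":
            rows.append({"id": left_id, "pos": x, "nested": None})  # kept, nothing nested
    return rows


# Hypothetical positions on a line, in arbitrary units.
left = {"L1": 0.0, "L2": 5.0}
right = {"R1": 0.25}

nested_inner_toy = toy_crossmatch_nested(left, right, radius=1.0, how="inner")
nested_left_toy = toy_crossmatch_nested(left, right, radius=1.0, how="left")

# Unmatched left rows are the ones whose nested value is None.
unmatched_ids = [row["id"] for row in nested_left_toy if row["nested"] is None]
print(len(nested_inner_toy), len(nested_left_toy), unmatched_ids)  # 1 2 ['L2']
```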
About#
Author(s): Sean McGuire
Last updated on: 19 May 2026
If you use lsdb for published research, please cite it following the project's citation instructions.