Types of Crossmatches

In this tutorial, we will:

  • explain what how='inner' (the default) and how='left' mean when crossmatching catalogs

  • demonstrate how each join type affects the row count and column values in the result

  • explain how nested crossmatching works with how='left' and how='inner'

Additional Help

For tips on accessing remote data, see our Accessing remote data guide. For a full crossmatching walkthrough, see the Crossmatching catalogs tutorial.

1. Opening catalogs

We open a small cone of ZTF DR22 (left catalog) and Gaia DR3 (right catalog).

[3]:
import warnings
import lsdb

warnings.filterwarnings("ignore")
[4]:
ztf = lsdb.open_catalog(
    "https://data.lsdb.io/hats/ztf_dr22",
    columns=["objectid", "objra", "objdec", "nepochs"],
    search_filter=lsdb.ConeSearch(280, 0, radius_arcsec=36),
)

gaia = lsdb.open_catalog(
    "s3://stpubdata/gaia/gaia_dr3/public/hats",
    columns=["source_id", "ra", "dec", "phot_g_mean_mag"],
    search_filter=lsdb.ConeSearch(280, 0, radius_arcsec=36),
)
[5]:
n_ztf = len(ztf.compute())
n_gaia = len(gaia.compute())
print(f"ZTF sources in cone  : {n_ztf}")
print(f"Gaia sources in cone : {n_gaia}")
ZTF sources in cone  : 129
Gaia sources in cone : 17

2. What is the difference between the “left” and “right” catalogs?

When you write

catalog_a.crossmatch(catalog_b, ...)

catalog_a is the left catalog (the catalog you call .crossmatch() on), and catalog_b is the right catalog (the catalog you pass as the first argument).

The crossmatch algorithm iterates over every row in the left catalog and searches for its nearest neighbor in the right catalog. With one neighbor per row (n_neighbors=1), left-catalog rows are therefore never duplicated: each one appears at most once in the output. Right-catalog rows, however, can be duplicated: if multiple left-catalog rows are close to the same right-catalog row, they all match to it, and each match appears in the result.

The how parameter determines how rows in the left catalog that have no counterpart in the right catalog are handled. If how='inner' (the default), those rows are dropped from the result. If how='left', those rows are kept in the result, but all right-catalog columns for those rows are filled with <NA>.
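The behavior described above can be illustrated with a brute-force toy in plain Python. This is only a sketch under stated assumptions: toy_crossmatch and angular_sep_arcsec are hypothetical helpers written for this illustration, the coordinates are made up, and LSDB's actual implementation works on partitioned catalogs rather than nested loops.

```python
import math


def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds (haversine); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (
        math.sin((dec2 - dec1) / 2) ** 2
        + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2
    )
    return math.degrees(2 * math.asin(math.sqrt(a))) * 3600


def toy_crossmatch(left, right, radius_arcsec, how="inner"):
    """Hypothetical brute-force nearest-neighbor match (not LSDB's code)."""
    if how not in ("inner", "left"):
        raise ValueError("how must be 'inner' or 'left'")
    rows = []
    for l_id, l_ra, l_dec in left:
        best = None  # (distance, right_id) of the closest right row in range
        for r_id, r_ra, r_dec in right:
            d = angular_sep_arcsec(l_ra, l_dec, r_ra, r_dec)
            if d <= radius_arcsec and (best is None or d < best[0]):
                best = (d, r_id)
        if best is not None:
            rows.append((l_id, best[1], best[0]))
        elif how == "left":
            # Keep the unmatched left row; None stands in for <NA>.
            rows.append((l_id, None, None))
    return rows


# Made-up coordinates: "a" and "b" both sit next to "X"; "c" is isolated.
left = [("a", 280.0000, 0.0), ("b", 280.0001, 0.0), ("c", 280.0100, 0.0)]
right = [("X", 280.00005, 0.0), ("Y", 280.0200, 0.0)]

inner = toy_crossmatch(left, right, radius_arcsec=1.0, how="inner")
left_join = toy_crossmatch(left, right, radius_arcsec=1.0, how="left")

print([(l, r) for l, r, _ in inner])      # [('a', 'X'), ('b', 'X')]
print([(l, r) for l, r, _ in left_join])  # [('a', 'X'), ('b', 'X'), ('c', None)]
```

In this toy, the right row "X" appears twice because two left rows match it, the right row "Y" never appears because no left row is near it, and the left row "c" survives only with how="left".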

If you’re using the lsdb.crossmatch function, the left and right catalogs are determined by the order of the arguments:

result = lsdb.crossmatch(catalog_a, catalog_b, ...)

Here, catalog_a is the left catalog and catalog_b is the right catalog.

3. Inner crossmatch (how='inner', the default)

how='inner' keeps only the left-catalog rows for which a match was found in the right catalog. Here the right catalog is gaia_bright, the Gaia sources with phot_g_mean_mag < 20. ZTF sources that have no bright Gaia source within the 1-arcsecond search radius are dropped from the result.

This is the default, so ztf.crossmatch(gaia_bright, radius_arcsec=1.0) and ztf.crossmatch(gaia_bright, radius_arcsec=1.0, how='inner') are equivalent.

[6]:
gaia_bright = gaia.query("phot_g_mean_mag < 20")
n_bright = len(gaia_bright.compute())

inner_result = ztf.crossmatch(
    gaia_bright,
    radius_arcsec=1.0,
    how="inner",
    suffix_method="overlapping_columns",
    log_changes=False,
).compute()

inner_result[["objectid", "nepochs", "source_id", "phot_g_mean_mag", "_dist_arcsec"]]
[6]:
objectid nepochs source_id phot_g_mean_mag _dist_arcsec
_healpix_29
2136230511175494507 435313300083253 174 4272461018841219072 15.848508 0.043069
2136230511186576907 435113300001005 377 4272461018841219072 15.848508 0.09387
2136230511186589223 1481108400021268 14 4272461018841219072 15.848508 0.105821
2136230511186591124 1481208400054176 31 4272461018841219072 15.848508 0.111156
2136230511187068011 435213300027173 921 4272461018841219072 15.848508 0.116384
2136230517889210485 435213300004315 498 4272461014537456000 19.777922 0.059946
2136230517889237321 435313300112378 169 4272461014537456000 19.777922 0.151346
2136230518300411795 435313300008254 174 4272461014536964480 18.72089 0.05034
2136230518300436707 1481208400054105 23 4272461014536964480 18.72089 0.087248
2136230518300465965 435213300004348 837 4272461014536964480 18.72089 0.123049
2136230518301721633 435113300025722 17 4272461014536964480 18.72089 0.291385
2136230519233805001 435313300008165 166 4272461014537456128 19.814991 0.033115
2136230519233829973 435213300045612 457 4272461014537456128 19.814991 0.019936
2136230519234225171 1481208400144813 10 4272461014537456128 19.814991 0.09365
2136230529701382414 1481208300087470 19 4272461048901519744 18.847134 0.171081
2136230529701464098 435313300048022 171 4272461048901519744 18.847134 0.121662
2136230529701839554 435213300068888 746 4272461048901519744 18.847134 0.215594
2136230529707217992 435113300019002 5 4272461048901519744 18.847134 0.882785
2136230531210057600 1481208300036047 21 4272461048896704384 17.559347 0.135669
2136230531211503084 1481108300010329 7 4272461048896704384 17.559347 0.150126
2136230531211541068 435313300083184 174 4272461048896704384 17.559347 0.059375
2136230531211583811 435113300013833 235 4272461048896704384 17.559347 0.065542
2136230531211681091 435213300027132 909 4272461048896704384 17.559347 0.09833

23 rows × 5 columns

[7]:
print(f"ZTF sources (left)         : {n_ztf}")
print(f"Bright Gaia sources (right):   {n_bright}")
print(
    f"Inner result rows          :  {len(inner_result)}  \u2190 {n_ztf - len(inner_result)} ZTF rows dropped"
)
ZTF sources (left)         : 129
Bright Gaia sources (right):   6
Inner result rows          :  23  ← 106 ZTF rows dropped

4. Left crossmatch (how='left')

how='left' keeps every row from the left catalog, regardless of whether a match was found. For ZTF sources that have no bright Gaia counterpart within 1 arcsecond, all Gaia columns in that row are filled with <NA>.

The row count of the result equals the row count of the left catalog (when n_neighbors=1).

[8]:
left_result = ztf.crossmatch(
    gaia_bright,
    radius_arcsec=1.0,
    how="left",
    suffix_method="overlapping_columns",
    log_changes=False,
).compute()

left_result[["objectid", "nepochs", "source_id", "phot_g_mean_mag", "_dist_arcsec"]]
[8]:
objectid nepochs source_id phot_g_mean_mag _dist_arcsec
_healpix_29
2136230511175494507 435313300083253 174 4272461018841219072 15.848508 0.043069
2136230511186576907 435113300001005 377 4272461018841219072 15.848508 0.09387
... ... ... ... ... ...
2136230547704034321 435213300004122 2 <NA> <NA> <NA>
2136230550490352432 435313300007934 60 <NA> <NA> <NA>

129 rows × 5 columns

[9]:
print(f"ZTF sources (left)         : {n_ztf}")
print(f"Bright Gaia sources (right):   {n_bright}")
print(f"Left result rows           : {len(left_result)}  \u2190 all ZTF rows preserved")
ZTF sources (left)         : 129
Bright Gaia sources (right):   6
Left result rows           : 129  ← all ZTF rows preserved
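The n_neighbors=1 caveat on the row count can be made concrete with a small toy. The sketch below assumes that a larger n_neighbors keeps up to that many right-catalog matches per left row; toy_k_nearest is a hypothetical helper, and positions are flat (x, y) offsets in arcseconds rather than sky coordinates.

```python
import math


def toy_k_nearest(left, right, radius, n_neighbors=1):
    """Hypothetical how='left' match keeping up to n_neighbors right rows."""
    out = []
    for l_id, lx, ly in left:
        # All right rows within the radius, closest first.
        cands = sorted(
            (math.hypot(lx - rx, ly - ry), r_id)
            for r_id, rx, ry in right
            if math.hypot(lx - rx, ly - ry) <= radius
        )
        kept = cands[:n_neighbors]
        if kept:
            out.extend((l_id, r_id) for _, r_id in kept)
        else:
            out.append((l_id, None))  # unmatched left row is still kept
    return out


left = [("a", 0.0, 0.0), ("b", 10.0, 0.0)]
right = [("X", 0.1, 0.0), ("Y", 0.0, 0.3)]

one = toy_k_nearest(left, right, radius=1.0, n_neighbors=1)
two = toy_k_nearest(left, right, radius=1.0, n_neighbors=2)

print(one)  # [('a', 'X'), ('b', None)]
print(two)  # [('a', 'X'), ('a', 'Y'), ('b', None)]
```

With one neighbor, the toy result has exactly one row per left row. With two neighbors, the left row "a" appears twice, so the row count can exceed the size of the left catalog.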

4.1 Identifying unmatched sources

Because unmatched rows have <NA> in all right-catalog columns, you can filter them with a null check on any right-catalog column, for example source_id:

[10]:
n_matched = left_result["source_id"].notna().sum()
n_unmatched = left_result["source_id"].isna().sum()
print(f"ZTF sources with a bright Gaia match    : {n_matched:3d} ({n_matched/len(left_result):.1%})")
print(f"ZTF sources with no bright Gaia match   : {n_unmatched:3d} ({n_unmatched/len(left_result):.1%})")
ZTF sources with a bright Gaia match    :  23 (17.8%)
ZTF sources with no bright Gaia match   : 106 (82.2%)

Why no how='right' or how='outer'?

The how parameter in lsdb.crossmatch supports only 'inner' and 'left' because of the way lsdb maintains partitioning for efficient crossmatching. To preserve the HATS partitioning scheme, the output catalog must have ra and dec columns that determine each row's position. Because the right catalog's margin is needed to ensure that the crossmatch is correct, the right catalog's ra and dec cannot serve this purpose. Every row in the output must therefore correspond to a row in the left catalog, which is incompatible with how='right' or how='outer', where the output could contain rows that have no counterpart in the left catalog.

That said, you can achieve an effect similar to how='right' by swapping the order of the catalogs and using how='left'.
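A toy example shows what swapping the catalogs does and does not give you. This is a sketch, not LSDB code: nearest_match is a hypothetical helper, the catalogs are made-up lists, and positions are flat (x, y) offsets in arcseconds.

```python
import math


def nearest_match(left, right, radius, how="inner"):
    """Hypothetical nearest-neighbor match on flat (x, y) offsets in arcsec."""
    out = []
    for l_id, lx, ly in left:
        cands = [(math.hypot(lx - rx, ly - ry), r_id) for r_id, rx, ry in right]
        cands = [c for c in cands if c[0] <= radius]
        if cands:
            out.append((l_id, min(cands)[1]))  # closest right row wins
        elif how == "left":
            out.append((l_id, None))
    return out


ztf_like = [("z1", 0.0, 0.0), ("z2", 0.2, 0.0), ("z3", 30.0, 0.0)]
gaia_like = [("g1", 0.05, 0.0), ("g2", 60.0, 0.0)]

forward = nearest_match(ztf_like, gaia_like, radius=1.0, how="left")
swapped = nearest_match(gaia_like, ztf_like, radius=1.0, how="left")

print(forward)  # [('z1', 'g1'), ('z2', 'g1'), ('z3', None)]
print(swapped)  # [('g1', 'z1'), ('g2', None)]
```

After the swap, every row of the former right catalog is preserved, including the unmatched "g2". The result is similar to, but not the same as, a right join: "g1" is now paired only with its single nearest neighbor "z1", whereas the forward match paired it with both "z1" and "z2".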

5. Nested Crossmatch with how='left' vs how='inner'

When you perform a nested crossmatch, the how parameter works in the same way. With how='inner', only rows with matches in both catalogs are kept in the output catalog, and the right catalog nested column always contains a non-empty table. With how='left', all rows from the left catalog are kept, and the right catalog nested column contains a None value for rows with no match.

[16]:
# Create a nested crossmatch with how='inner'
nested_inner = ztf.crossmatch_nested(
    gaia_bright,
    radius_arcsec=1.0,
    how="inner",
).compute()
nested_inner[["objectid", "nepochs", "gaia"]]
[16]:
  objectid nepochs gaia
2136230511175494507 435313300083253 174
source_id ra dec phot_g_mean_mag _dist_arcsec
4272461018841219072 280.00653 -0.002397 15.848508 0.043069
+0 rows ... ... ... ...
2136230511186576907 435113300001005 377
source_id ra dec phot_g_mean_mag _dist_arcsec
4272461018841219072 280.00653 -0.002397 15.848508 0.09387
+0 rows ... ... ... ...
2136230511186589223 1481108400021268 14
source_id ra dec phot_g_mean_mag _dist_arcsec
4272461018841219072 280.00653 -0.002397 15.848508 0.105821
+0 rows ... ... ... ...
2136230511186591124 1481208400054176 31
source_id ra dec phot_g_mean_mag _dist_arcsec
4272461018841219072 280.00653 -0.002397 15.848508 0.111156
+0 rows ... ... ... ...
2136230511187068011 435213300027173 921
source_id ra dec phot_g_mean_mag _dist_arcsec
4272461018841219072 280.00653 -0.002397 15.848508 0.116384
+0 rows ... ... ... ...
2136230517889210485 435213300004315 498
source_id ra dec phot_g_mean_mag _dist_arcsec
4272461014537456000 280.005341 0.000974 19.777922 0.059946
+0 rows ... ... ... ...
2136230517889237321 435313300112378 169
source_id ra dec phot_g_mean_mag _dist_arcsec
4272461014537456000 280.005341 0.000974 19.777922 0.151346
+0 rows ... ... ... ...
2136230518300411795 435313300008254 174
source_id ra dec phot_g_mean_mag _dist_arcsec
4272461014536964480 280.00182 -0.000606 18.72089 0.05034
+0 rows ... ... ... ...
2136230518300436707 1481208400054105 23
source_id ra dec phot_g_mean_mag _dist_arcsec
4272461014536964480 280.00182 -0.000606 18.72089 0.087248
+0 rows ... ... ... ...
2136230518300465965 435213300004348 837
source_id ra dec phot_g_mean_mag _dist_arcsec
4272461014536964480 280.00182 -0.000606 18.72089 0.123049
+0 rows ... ... ... ...
2136230518301721633 435113300025722 17
source_id ra dec phot_g_mean_mag _dist_arcsec
4272461014536964480 280.00182 -0.000606 18.72089 0.291385
+0 rows ... ... ... ...
2136230519233805001 435313300008165 166
source_id ra dec phot_g_mean_mag _dist_arcsec
4272461014537456128 280.002917 0.001124 19.814991 0.033115
+0 rows ... ... ... ...
2136230519233829973 435213300045612 457
source_id ra dec phot_g_mean_mag _dist_arcsec
4272461014537456128 280.002917 0.001124 19.814991 0.019936
+0 rows ... ... ... ...
2136230519234225171 1481208400144813 10
source_id ra dec phot_g_mean_mag _dist_arcsec
4272461014537456128 280.002917 0.001124 19.814991 0.09365
+0 rows ... ... ... ...
2136230529701382414 1481208300087470 19
source_id ra dec phot_g_mean_mag _dist_arcsec
4272461048901519744 279.99174 -0.001291 18.847134 0.171081
+0 rows ... ... ... ...
2136230529701464098 435313300048022 171
source_id ra dec phot_g_mean_mag _dist_arcsec
4272461048901519744 279.99174 -0.001291 18.847134 0.121662
+0 rows ... ... ... ...
2136230529701839554 435213300068888 746
source_id ra dec phot_g_mean_mag _dist_arcsec
4272461048901519744 279.99174 -0.001291 18.847134 0.215594
+0 rows ... ... ... ...
2136230529707217992 435113300019002 5
source_id ra dec phot_g_mean_mag _dist_arcsec
4272461048901519744 279.99174 -0.001291 18.847134 0.882785
+0 rows ... ... ... ...
2136230531210057600 1481208300036047 21
source_id ra dec phot_g_mean_mag _dist_arcsec
4272461048896704384 279.990748 -0.000366 17.559347 0.135669
+0 rows ... ... ... ...
2136230531211503084 1481108300010329 7
source_id ra dec phot_g_mean_mag _dist_arcsec
4272461048896704384 279.990748 -0.000366 17.559347 0.150126
+0 rows ... ... ... ...
2136230531211541068 435313300083184 174
source_id ra dec phot_g_mean_mag _dist_arcsec
4272461048896704384 279.990748 -0.000366 17.559347 0.059375
+0 rows ... ... ... ...
2136230531211583811 435113300013833 235
source_id ra dec phot_g_mean_mag _dist_arcsec
4272461048896704384 279.990748 -0.000366 17.559347 0.065542
+0 rows ... ... ... ...
2136230531211681091 435213300027132 909
source_id ra dec phot_g_mean_mag _dist_arcsec
4272461048896704384 279.990748 -0.000366 17.559347 0.09833
+0 rows ... ... ... ...
23 rows x 3 columns
[17]:
print(f"Nested inner result rows : {len(nested_inner)}  \u2190 all ZTF rows with a bright Gaia match")
Nested inner result rows : 23  ← all ZTF rows with a bright Gaia match
[18]:
# Create a nested crossmatch with how='left'
nested_left = ztf.crossmatch_nested(
    gaia_bright,
    radius_arcsec=1.0,
    how="left",
).compute()
nested_left[["objectid", "nepochs", "gaia"]]
[18]:
  objectid nepochs gaia
2136230497219052831 435313300008704 18 None
2136230497608086572 435213300074278 15 None
2136230497630747626 435313300083450 81 None
2136230497631224467 1481208400054451 1 None
2136230498196657357 435313300008591 114 None
... ... ... ...
129 rows x 3 columns
[19]:
print(f"Nested left result rows  : {len(nested_left)}  \u2190 all ZTF rows preserved")
Nested left result rows  : 129  ← all ZTF rows preserved
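The shape of the two nested results can be mimicked with plain Python structures. The sketch below is an assumption-laden illustration only: nest_matches is a hypothetical helper, nested tables are represented as lists of dicts, and the identifiers are made up. It does not use or reproduce the nested column type that LSDB returns.

```python
def nest_matches(left_ids, matches, how="inner"):
    """Attach each left row's matches as a nested list (toy stand-in)."""
    by_left = {}
    for left_id, right_row in matches:
        by_left.setdefault(left_id, []).append(right_row)
    rows = []
    for left_id in left_ids:
        nested = by_left.get(left_id)
        if nested is not None:
            rows.append({"objectid": left_id, "gaia": nested})
        elif how == "left":
            # No match: keep the left row with None in the nested column.
            rows.append({"objectid": left_id, "gaia": None})
    return rows


left_ids = [101, 102, 103]
matches = [
    (101, {"source_id": 9001, "_dist_arcsec": 0.04}),
    (103, {"source_id": 9001, "_dist_arcsec": 0.09}),
]

nested_inner_toy = nest_matches(left_ids, matches, how="inner")
nested_left_toy = nest_matches(left_ids, matches, how="left")
unmatched = [row["objectid"] for row in nested_left_toy if row["gaia"] is None]

print(len(nested_inner_toy), len(nested_left_toy), unmatched)  # 2 3 [102]
```

As in the real outputs above, the toy how="inner" result contains only rows whose nested value is non-empty, while the how="left" result keeps every left row and marks the unmatched ones with None.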

About

Author(s): Sean McGuire

Last updated on: 19 May 2026

If you use lsdb for published research, please cite it by following the project's citation instructions.