Manual catalog verification#

This notebook shows how to verify that a directory contains a valid HATS catalog, and how to manually inspect the catalog's metadata and contents.

Directory verification#

The HATS library provides a method to verify that a directory contains the appropriate metadata files.

There are a few flavors of validation, and the quickest one doesn’t require any additional flags:

[1]:
from hats.io.validation import is_valid_catalog
import hats
from upath import UPath

gaia_catalog_path = UPath("https://data.lsdb.io/hats/gaia_dr3/gaia/")
is_valid_catalog(gaia_catalog_path)
[1]:
True

Explaining the input and output#

The strict argument takes us through a different code path that rigorously tests the contents of all ancillary metadata files and the consistency of the partition pixels.

Here, we use the verbose=True argument to print out a little bit more information about our catalog. It will repeat the path that we’re looking at, display the total number of partitions, and calculate the approximate sky coverage, based on the area of the HATS tiles.

The fail_fast argument determines whether we break out of the method at the first sign of trouble or keep looking for further validation problems. Setting fail_fast=False can be useful if you’re debugging multiple points of failure in a catalog.

[2]:
is_valid_catalog(gaia_catalog_path, verbose=True, fail_fast=False, strict=True)
Validating catalog at path https://data.lsdb.io/hats/gaia_dr3/gaia/ ...
Found 3933 partitions.
Approximate coverage is 100.00 % of the sky.
[2]:
True

Columns and data types#

HATS tables are backed by parquet files. These files store metadata about their columns, the data types, and even the range of values.

The columns and their types are stored on the catalog.schema attribute as a pyarrow.Schema object. You can find more details on this object and its use in the pyarrow documentation.

Gaia has a lot of columns, so this display is long!

[3]:
catalog_object = hats.read_hats(gaia_catalog_path)
catalog_object.schema
[3]:
_healpix_29: int64
solution_id: int64
designation: string
source_id: int64
random_index: int64
ref_epoch: double
ra: double
ra_error: double
dec: double
dec_error: double
parallax: double
parallax_error: double
parallax_over_error: double
pm: double
pmra: double
pmra_error: double
pmdec: double
pmdec_error: double
ra_dec_corr: double
ra_parallax_corr: double
ra_pmra_corr: double
ra_pmdec_corr: double
dec_parallax_corr: double
dec_pmra_corr: double
dec_pmdec_corr: double
parallax_pmra_corr: double
parallax_pmdec_corr: double
pmra_pmdec_corr: double
astrometric_n_obs_al: int64
astrometric_n_obs_ac: int64
astrometric_n_good_obs_al: int64
astrometric_n_bad_obs_al: int64
astrometric_gof_al: double
astrometric_chi2_al: double
astrometric_excess_noise: double
astrometric_excess_noise_sig: double
astrometric_params_solved: int64
astrometric_primary_flag: bool
nu_eff_used_in_astrometry: double
pseudocolour: double
pseudocolour_error: double
ra_pseudocolour_corr: double
dec_pseudocolour_corr: double
parallax_pseudocolour_corr: double
pmra_pseudocolour_corr: double
pmdec_pseudocolour_corr: double
astrometric_matched_transits: int64
visibility_periods_used: int64
astrometric_sigma5d_max: double
matched_transits: int64
new_matched_transits: int64
matched_transits_removed: int64
ipd_gof_harmonic_amplitude: double
ipd_gof_harmonic_phase: double
ipd_frac_multi_peak: int64
ipd_frac_odd_win: int64
ruwe: double
scan_direction_strength_k1: double
scan_direction_strength_k2: double
scan_direction_strength_k3: double
scan_direction_strength_k4: double
scan_direction_mean_k1: double
scan_direction_mean_k2: double
scan_direction_mean_k3: double
scan_direction_mean_k4: double
duplicated_source: bool
phot_g_n_obs: int64
phot_g_mean_flux: double
phot_g_mean_flux_error: double
phot_g_mean_flux_over_error: double
phot_g_mean_mag: double
phot_bp_n_obs: int64
phot_bp_mean_flux: double
phot_bp_mean_flux_error: double
phot_bp_mean_flux_over_error: double
phot_bp_mean_mag: double
phot_rp_n_obs: int64
phot_rp_mean_flux: double
phot_rp_mean_flux_error: double
phot_rp_mean_flux_over_error: double
phot_rp_mean_mag: double
phot_bp_rp_excess_factor: double
phot_bp_n_contaminated_transits: double
phot_bp_n_blended_transits: double
phot_rp_n_contaminated_transits: double
phot_rp_n_blended_transits: double
phot_proc_mode: double
bp_rp: double
bp_g: double
g_rp: double
radial_velocity: double
radial_velocity_error: double
rv_method_used: double
rv_nb_transits: double
rv_nb_deblended_transits: double
rv_visibility_periods_used: double
rv_expected_sig_to_noise: double
rv_renormalised_gof: double
rv_chisq_pvalue: double
rv_time_duration: double
rv_amplitude_robust: double
rv_template_teff: double
rv_template_logg: double
rv_template_fe_h: double
rv_atm_param_origin: double
vbroad: double
vbroad_error: double
vbroad_nb_transits: double
grvs_mag: double
grvs_mag_error: double
grvs_mag_nb_transits: double
rvs_spec_sig_to_noise: double
phot_variable_flag: string
l: double
b: double
ecl_lon: double
ecl_lat: double
in_qso_candidates: bool
in_galaxy_candidates: bool
non_single_star: int64
has_xp_continuous: bool
has_xp_sampled: bool
has_rvs: bool
has_epoch_photometry: bool
has_epoch_rv: bool
has_mcmc_gspphot: bool
has_mcmc_msc: bool
in_andromeda_survey: bool
classprob_dsc_combmod_quasar: double
classprob_dsc_combmod_galaxy: double
classprob_dsc_combmod_star: double
teff_gspphot: double
teff_gspphot_lower: double
teff_gspphot_upper: double
logg_gspphot: double
logg_gspphot_lower: double
logg_gspphot_upper: double
mh_gspphot: double
mh_gspphot_lower: double
mh_gspphot_upper: double
distance_gspphot: double
distance_gspphot_lower: double
distance_gspphot_upper: double
azero_gspphot: double
azero_gspphot_lower: double
azero_gspphot_upper: double
ag_gspphot: double
ag_gspphot_lower: double
ag_gspphot_upper: double
ebpminrp_gspphot: double
ebpminrp_gspphot_lower: double
ebpminrp_gspphot_upper: double
libname_gspphot: string
Norder: int8
Dir: int64
Npix: int64

Column statistics#

Parquet maintains basic statistics about the data inside its files. This includes the minimum value, maximum value, and the number of null (None, or unspecified) values for each column.

We provide a method that consumes all of the per-file min, max, and null counts, and aggregates them into a global min and max and a total null count for each column.

[4]:
catalog_object.aggregate_column_statistics()
[4]:
min_value max_value null_count
column_names
solution_id 1636148068921376768 1636148068921376768 0
designation Gaia DR3 1000000057322000000 Gaia DR3 999999988604363776 0
source_id 4295806720 6917528997577384320 0
random_index 0 1811709770 0
ref_epoch 2016.0 2016.0 0
... ... ... ...
ag_gspphot_upper 0.0001 7.4111 1340950508
ebpminrp_gspphot -0.0 4.2257 1340950508
ebpminrp_gspphot_lower -0.0 4.2245 1340950508
ebpminrp_gspphot_upper 0.0001 4.2262 1340950508
libname_gspphot A PHOENIX 1340950508

152 rows × 3 columns

Again, Gaia has a lot of columns. To make the most of this output, you can either use a pandas option to display all of the rows:

import pandas as pd
pd.set_option('display.max_rows', None)

Or restrict the columns to those you care about with a keyword argument:

[5]:
catalog_object.aggregate_column_statistics(include_columns=["ra", "dec", "ref_epoch"])
[5]:
min_value max_value null_count
column_names
ref_epoch 2.016000e+03 2016.000000 0.0
ra 3.409624e-07 360.000000 0.0
dec -8.999288e+01 89.990052 0.0
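The aggregated statistics come back as a regular pandas DataFrame, so ordinary filtering and sorting apply. For example, to find the columns with missing data, sketched here on a hand-built frame that mimics the output shape above:

```python
import pandas as pd

# A small stand-in for the aggregate_column_statistics result.
stats = pd.DataFrame(
    {
        "min_value": [2016.0, 3.4e-07, -89.99, 0.0001],
        "max_value": [2016.0, 360.0, 89.99, 7.4111],
        "null_count": [0, 0, 0, 1340950508],
    },
    index=pd.Index(["ref_epoch", "ra", "dec", "ag_gspphot_upper"], name="column_names"),
)

# Columns with any nulls, sorted by how many values are missing.
with_nulls = stats[stats["null_count"] > 0].sort_values("null_count", ascending=False)
print(with_nulls.index.tolist())  # ['ag_gspphot_upper']
```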