docs/source/working-notes/troyraen/ingest-ztf-archive/README.md
Ingest ZTF Archive
The ZTF archive was ingested into the “ztf_alerts_v*” buckets and tables in early 2023. The following files contain the working notes and code that were used:
A cleaned-up version of the code that can be used in the future is at:
Instructions
Setup a VM to use for the ingestion (setup.sh)
Setup other GCP resources like buckets and tables if needed (setup.sh)
Ingest tarballs (ingest_tarballs.py). Basic usage:
import ingest_tarballs as it
it.run(tardf=it.fetch_tarball_names(clean=["done"]))
This does the following:
load list of tarballs available from ZTF archive cleaned of any that have already been logged as done
ingest all tarballs, one at a time:
download and extract the tarball
fix the avro schemas to conform to bigquery’s strict standards (parallelized)
upload the alerts (avro files) to the bucket (parallelized)
load the alerts to the table
write logs to logs/ingest_tarballs.log and create other files in the logs directory to keep track of what’s completed thru various stages
Some steps are parallelized.
Defaults should work reasonably well on the SSD machine created in setup.sh.
In addition, a single machine can probably do a few it.run()
in parallel using (e.g.,) screen
and submitting slices of tardf
to each run
call.