Building a Federated Data Catalog with Client Implementations - Meeting Data Where It Is
Much of the dialogue and technical advancement surrounding the use of earth observation data is centered on data creators and providers, whose official responsibilities end at data delivery. While providers like NOAA, NASA, and the USGS have taken critical steps to collect and publicly host data, these holdings are spread across a range of locations that require varying data access protocols. Finding, accessing, and extracting subsets of data from these varied providers is a burdensome and often challenging task that could be minimized with an automatically refreshing ("Automatic Refresh"), federated data catalog ("Federated Flat Catalog") paired with common-language software for access ("Programmatic Access").
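To make the "Federated Flat Catalog" idea concrete, the sketch below shows what "Programmatic Access" against such a catalog could look like in Python: each dataset/variable/endpoint combination is one row, so discovery is an ordinary dataframe filter rather than a protocol-specific search. The file name and the field names used here ("id", "varname", "URL") are illustrative assumptions, not the catalog's authoritative schema.

```python
import pandas as pd

# Load the published flat catalog. The path is a placeholder; the real file
# is shared as JSON and Parquet from the project's github.io page.
catalog = pd.read_parquet("catalog.parquet")

# Because the catalog is flat, finding an asset is a plain row filter.
# The id/varname values below are hypothetical examples, not verified entries.
polaris = catalog[
    (catalog["id"] == "polaris") & (catalog["varname"] == "mean_om_0_5")
]

# Each matching row carries the endpoint a client needs to fetch the data.
print(polaris[["id", "varname", "URL"]])
```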
This combination of an auto-refreshing catalog and multi-language client implementations (R and Python) has allowed the catalog to grow its data holdings from 11 to over 2,000 data providers and to be shared as JSON and Parquet files from a github.io page. To highlight how these might be used, "Examples: Figure (A)" shows how one might extract elevation from the USGS National Map s3 account, POLARIS soils data from a Duke FTP server, land cover from the USGS LCMAP team over HTTPS, and a derived wetness index from a Lynker s3 bucket for the city of Fort Collins, all using the catalog and the generic dap() function. In "Examples: Figure (B)", we subset 4 days of rainfall data for the state of Florida using a climatePy shortcut for TerraClimate data.
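A minimal Python sketch of the Figure (A) pattern follows, assuming climatePy exposes a generic dap() whose keyword arguments (URL, AOI) mirror climateR's dap(); the AOI file name and the endpoint placeholder are hypothetical, and the exact argument names are not verified here.

```python
import climatePy
import geopandas as gpd

# Area of interest for the city of Fort Collins; the file is a placeholder
# standing in for however the AOI geometry is obtained.
AOI = gpd.read_file("fort_collins.geojson")

# One generic call retrieves a catalog-described resource clipped to the AOI,
# regardless of whether the bytes live on s3, an FTP server, or HTTPS.
# The URL would come from the catalog row for the desired asset.
lcmap = climatePy.dap(URL="<catalog URL for LCMAP land cover>", AOI=AOI)
```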
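The Figure (B) pattern trades the generic call for a dataset shortcut that wraps the catalog lookup and dap() call. The sketch assumes climatePy mirrors climateR's getTerraClim(AOI, varname, startDate, endDate) signature; the Florida boundary file and the date window are illustrative.

```python
import climatePy
import geopandas as gpd

# State boundary for Florida; AOI acquisition is schematic here.
FL = gpd.read_file("florida.geojson")

# Shortcut for the TerraClimate holding: no URL or catalog filtering needed.
# "ppt" is TerraClimate's precipitation variable; the argument names are
# assumed to carry over from climateR and are not verified.
ppt = climatePy.getTerraClim(
    AOI=FL,
    varname="ppt",
    startDate="2018-01-01",  # illustrative 4-day window
    endDate="2018-01-04",
)
```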