Which Levels of Data Processing Should We Persist and Make FAIR?
In the year of Open-source Science, researchers are being encouraged to make a “commitment to the open sharing of samples, software, data, and knowledge (algorithms, papers, documents, ancillary information) as early as possible in the scientific process”.
Most datasets today begin as raw, full-resolution, unprocessed data collected by either remote or in situ instruments. These are followed by Analysis Ready Data (ARD), in which the raw field data are calibrated, annotated and georeferenced (L0, L1), and then by Interpretation Ready Data (IRD), in which multiple data products (L2, L3, L4) can be derived from any L0 or L1 ARD.
Many ARDs can be created from each raw field dataset, and hundreds, if not thousands, of IRDs can be created from each ARD. In the year of Open-source Science, reproducibility and transparency are paramount. Traceability from the raw field data through each of the L0-L4 processing levels is critical, not only for vouching for the integrity of the derived products, but also for enabling attribution and credit to the researchers, institutions and funders of the raw field data and of each level of processing.
However, the reality is that in ‘Big Data’ research areas such as geophysics, remote sensing and climate, where data volumes are measured in TBs and PBs, keeping copies of the raw field data, all subsequent levels of data processing, and each individual derivative data product is not feasible. So how do we decide which raw field datasets and which levels of processing we need to persist and make FAIR? This poster was presented at the January 2023 Earth Science Information Partners (ESIP) Meeting, held virtually on Jan 23-27, 2023.