An Automated Approach to Labelling Datasets in Earth Science Publications
NASA Distributed Active Archive Centers (DAACs) ingest, store, and distribute data acquired from satellites, ground systems, and reanalysis models. Many authors use these data in their research. However, most of the datasets used in Earth Science Publications are cited incorrectly or not cited at all. As a result, there is no direct link between the datasets used and the scientific publications that reference them. This leads to issues with reproducibility of results, attribution of research results, and discovery of new datasets. This project began by exploring various methods of automatically labelling Goddard Earth Sciences Data and Information Services Center (GES DISC) datasets using Supervised Machine Learning and Earth Data Search Common Metadata Repository (CMR) queries. The ultimate goal was to create a library of citations that uses automated citation labelling to directly link research publications to the data they use. Supervised Machine Learning approaches struggled because of the limited amount of labelled training data available. Increasing the volume of training data is difficult, as it requires subject matter experts to devote time to manually reviewing journal articles and determining the datasets used. The CMR queries were inconsistent because the underlying metadata is continuously updated. Thus, it is hard to generalize the effectiveness of the CMR results, as they depend on the internal state of CMR. These findings helped inform the decision to transition the project to a Knowledge Graph. Another key aspect of this project was the automated extraction of features (platform, instrument, variables, etc.) and explicit citations from Earth Science Publications. These automated extractions were used to classify research papers based on their platform/instrument couples. This information was input into the Citation Management System for GES DISC.
These platform/instrument couples also provide an additional facet that can be searched on the GES DISC website. This poster was presented at the ESIP Summer Meeting held online in July 2021.
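
The abstract does not say how the platform/instrument classification was implemented. As a rough illustration only, the following minimal sketch assumes a simple keyword-matching approach: it counts mentions of known platform and instrument names in a paper's text and labels the paper with the most strongly supported couple. The vocabulary `PLATFORM_INSTRUMENTS` and the functions `extract_couples` and `classify_paper` are hypothetical names introduced here, and the three platforms listed are only a small illustrative subset, not the project's actual feature list.

```python
import re
from collections import Counter

# Hypothetical, illustrative vocabulary (an assumption, not the project's list).
# A real system would presumably draw these names from curated metadata.
PLATFORM_INSTRUMENTS = {
    "Aura": ["OMI", "MLS"],
    "Aqua": ["AIRS", "MODIS"],
    "Terra": ["MODIS", "MISR"],
}


def count_mentions(name, text):
    """Count whole-word, case-sensitive occurrences of name in text."""
    return len(re.findall(rf"\b{re.escape(name)}\b", text))


def extract_couples(text):
    """Score each platform/instrument couple whose two names both appear."""
    scores = Counter()
    for platform, instruments in PLATFORM_INSTRUMENTS.items():
        platform_hits = count_mentions(platform, text)
        if platform_hits == 0:
            continue
        for instrument in instruments:
            instrument_hits = count_mentions(instrument, text)
            if instrument_hits:
                # A couple is only as well supported as its weaker half.
                scores[(platform, instrument)] = min(platform_hits, instrument_hits)
    return scores


def classify_paper(text):
    """Return the best-supported (platform, instrument) couple, or None."""
    scores = extract_couples(text)
    if not scores:
        return None
    return scores.most_common(1)[0][0]


sample = (
    "We use OMI ozone columns from the Aura satellite. "
    "The Aura OMI retrievals were regridded before analysis."
)
label = classify_paper(sample)
```

Plain keyword matching of this kind would miss spelled-out instrument names and cannot tell a dataset that was actually used from one that is merely mentioned, so it should be read as a baseline for illustration rather than a description of the project's extraction pipeline.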