Posted on 2021-08-09, 16:22. Authored by Edward Jahoda.
<p>NASA Distributed Active Archive Centers, or DAACs, ingest,
store, and distribute data acquired from satellites, ground systems, and
reanalysis models. Many authors use this data in their research. However, most
of the datasets used in Earth science publications are cited incorrectly or
not at all. Thus, there is no direct link between the datasets used and
the scientific publications which reference them. This leads to issues with
reproducibility of the results, attribution of the research results, and
discovery of new datasets. This project began by exploring various methods of
automatically labeling Goddard Earth Sciences Data and Information Services
Center (GES DISC) datasets using supervised machine learning and Earthdata
Search Common Metadata Repository (CMR) queries. The
ultimate goal was to create a library of citations that utilized automated
citation labeling to directly link the research publications to the data they
use. Supervised machine learning approaches struggled because of the limited amount
of labeled training data. Increasing the volume of training data
is difficult because it requires subject matter experts to devote time to manually
reviewing journal articles and determining which datasets they use. The CMR queries
were inconsistent because the underlying metadata is continuously being
updated. Thus, it is hard to generalize the effectiveness of the CMR results,
because they depend on the internal state of CMR. These approaches helped inform
the decision to transition the project into using a Knowledge Graph. Another
key aspect of this project focused on the automated extraction of features
(platform, instrument, variables, etc.) and explicit citations from within Earth
science publications. These automated extractions were used to classify research
papers based on their platform/instrument couples. This information was input
into the Citation Management System for GES DISC. These platform/instrument
couples also provide an additional facet that can be searched on the GES DISC
website. This poster was presented at the ESIP Summer Meeting held online in July 2021.</p>
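<p>As a rough illustration of classifying a paper by its platform/instrument couples, the sketch below matches a small vocabulary against publication text. It is not the project's actual extraction method: the vocabulary <code>PLATFORM_INSTRUMENTS</code>, the function <code>extract_couples</code>, and the rule that a couple counts only when both names appear in the text are all assumptions made for this example.</p>

```python
import re

# Hypothetical, illustrative vocabulary; a real system would presumably draw
# platform and instrument names from a curated metadata source.
PLATFORM_INSTRUMENTS = {
    "Aura": ["OMI", "MLS"],
    "Aqua": ["AIRS"],
    "GPM": ["GMI", "DPR"],
}


def extract_couples(text):
    """Return sorted (platform, instrument) couples whose names both occur in text."""

    def mentioned(term):
        # Whole-word, case-sensitive match so "OMI" does not match inside other words.
        return re.search(r"\b" + re.escape(term) + r"\b", text) is not None

    couples = set()
    for platform, instruments in PLATFORM_INSTRUMENTS.items():
        if not mentioned(platform):
            continue
        for instrument in instruments:
            if mentioned(instrument):
                couples.add((platform, instrument))
    return sorted(couples)


if __name__ == "__main__":
    sample = "We used Aura OMI ozone columns and AIRS retrievals from Aqua."
    print(extract_couples(sample))
```

<p>Plain co-occurrence like this over-matches when a paper mentions a platform and an instrument in unrelated contexts, so it is best read as a baseline rather than a finished classifier.</p>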