Automated classification of scientific publications linked to GES DISC datasets

posted on 2021-07-29, 14:30 authored by Rohan Dayal, Irina Gerasimov, Armin Mehrabian, Jennifer Wei, Mohammad Khayat, Andrey Savtchenko
The data collections archived and distributed by the NASA GES DISC data center are widely used in Earth science studies. As these collections are created, many research works are published about the collections and their algorithms, validation, and applications. Because GES DISC collects these publications and provides their citations to users, it is helpful to categorize them by how they relate to the datasets they are associated with: specifically, whether a publication linked to a GES DISC dataset uses it for applied research, describes the algorithm used to create the dataset, validates the dataset, or provides a general overview of the data collection. Currently this categorization is done by simple manual labeling, which makes it a good candidate for automation. To approach this problem, we developed machine learning classifiers that predict the category a publication belongs to. We used manually labeled publications as training data for two supervised machine learning algorithms: Random Forest and Naive Bayes. We achieved classification accuracy that is substantially better than the baseline accuracy, which greatly improves the efficiency of internal publication analysis. This poster was presented at the ESIP Summer Meeting in July 2021.
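
As a rough illustration of the approach described above, the following is a minimal sketch of a multinomial Naive Bayes text classifier compared against a majority-class baseline. It is not the authors' implementation: the example texts, the four label names (`algorithm`, `validation`, `application`, `overview`), the whitespace tokenizer, and the choice of majority-class prediction as the baseline are all assumptions made for illustration. Random Forest is not sketched here.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy examples; the real training data are manually labeled
# publications, and these label names are assumed from the four categories.
TRAIN = [
    ("we describe the retrieval algorithm used to generate the product", "algorithm"),
    ("algorithm theoretical basis for the precipitation retrieval", "algorithm"),
    ("validation of the product against ground station measurements", "validation"),
    ("comparison with in situ observations to validate the dataset", "validation"),
    ("we use the dataset to study drought trends in the region", "application"),
    ("impact of aerosols on regional rainfall using satellite data", "application"),
    ("an overview of the data collection and its products", "overview"),
    ("summary of the mission data collection and available products", "overview"),
]


def tokenize(text):
    # Assumed preprocessing: lowercase and split on whitespace.
    return text.lower().split()


class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.n_docs = len(labels)
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        # Words never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        scores = {}
        for label, n_label in self.class_counts.items():
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(n_label / self.n_docs)  # log prior
            for t in tokens:
                score += math.log((self.word_counts[label][t] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


def accuracy(predictions, labels):
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


texts = [t for t, _ in TRAIN]
labels = [y for _, y in TRAIN]

model = NaiveBayesClassifier().fit(texts, labels)

# Baseline (an assumption): always predict the most frequent label.
majority_label = Counter(labels).most_common(1)[0][0]
baseline_acc = accuracy([majority_label] * len(labels), labels)
model_acc = accuracy([model.predict(t) for t in texts], labels)

print("baseline accuracy:", baseline_acc)
print("naive bayes accuracy (training set):", model_acc)
print(model.predict("validation against ground station measurements"))
```

The accuracy printed here is measured on the training examples themselves, so it is optimistic. A realistic comparison against the baseline would score the classifier on labeled publications held out from training.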