MaTableGPT: GPT-Based Table Data Extractor from Materials Science Literature.

Journal: Advanced science (Weinheim, Baden-Wurttemberg, Germany)
Published Date:

Abstract

Efficiently extracting data from tables in the scientific literature is pivotal for building large-scale databases. However, tables reported in materials science papers take highly diverse forms, which makes rule-based extraction ineffective. To overcome this challenge, the study presents MaTableGPT, a GPT-based table data extractor for the materials science literature. MaTableGPT features key strategies of table data representation and table splitting for better GPT comprehension, and it filters hallucinated information through follow-up questions. When applied to a vast volume of water splitting catalysis literature, MaTableGPT achieves an extraction accuracy (total F1 score) of up to 96.8%. Through comprehensive evaluations of the GPT usage cost, labeling cost, and extraction accuracy for zero-shot, few-shot, and fine-tuning learning methods, the study presents a Pareto-front mapping in which few-shot learning is found to be the most balanced solution owing to both its high extraction accuracy (total F1 score >95%) and low cost (GPT usage cost of 5.97 US dollars and labeling cost of 10 I/O paired examples). Statistical analyses of the database generated by MaTableGPT reveal valuable insights into the distribution of the overpotential and elemental utilization across the reported catalysts in the water splitting literature.
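
The abstract names table splitting as one of the strategies but does not describe how it is done. The following is only a minimal sketch of one plausible reading, assuming a table is split into one-row sub-tables that each repeat the caption and header so that every piece can be understood on its own. The function name split_table, the dictionary layout, and the toy table values are hypothetical illustrations and are not taken from the paper.

```python
from typing import Dict, List


def split_table(header: List[str], rows: List[List[str]], caption: str = "") -> List[Dict]:
    """Split a table into one-row sub-tables that each repeat the caption and header.

    Hypothetical helper: the actual splitting rule used by MaTableGPT is not
    given in the abstract.
    """
    if any(len(row) != len(header) for row in rows):
        raise ValueError("every row must have one cell per header column")
    return [
        {"caption": caption, "header": list(header), "rows": [list(row)]}
        for row in rows
    ]


# Toy input: placeholder names and values, not data from the paper.
header = ["Catalyst", "Overpotential (mV)", "Electrolyte"]
rows = [
    ["catalyst_A", "300", "1 M KOH"],
    ["catalyst_B", "250", "1 M KOH"],
]

pieces = split_table(header, rows, caption="Toy table")
for piece in pieces:
    print(piece)
```

Each resulting piece could then be sent to the model separately, which keeps every prompt short and keeps the header next to the single row it describes.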

Authors

  • Gyeong Hoon Yi
    Computational Science Research Center, Korea Institute of Science and Technology, Seoul, 02792, Republic of Korea.
  • Jiwoo Choi
    Computational Science Research Center, Korea Institute of Science and Technology, Seoul, 02792, Republic of Korea.
  • Hyeongyun Song
    Computational Science Research Center, Korea Institute of Science and Technology, Seoul, 02792, Republic of Korea.
  • Olivia Miano
    Global Security Computing Applications Division, Lawrence Livermore National Laboratory, Livermore, CA, 94550, USA.
  • Jaewoong Choi
    Computational Science Research Center, Korea Institute of Science and Technology, Seoul, 02792, Republic of Korea.
  • Kihoon Bang
    Computational Science Research Center, Korea Institute of Science and Technology, Seoul, 02792, Republic of Korea.
  • Byungju Lee
    Research Institute of Advanced Materials (RIAM), Department of Materials Science and Engineering, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul, 151-742, Republic of Korea. matlgen1@snu.ac.kr.
  • Seok Su Sohn
    Department of Materials Science and Engineering, Korea University, Seoul, 02841, Republic of Korea.
  • David Buttler
    Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, CA, 94550, USA.
  • Anna Hiszpanski
    Materials Science Division, Lawrence Livermore National Laboratory, Livermore, CA, 94550, USA.
  • Sang Soo Han
    Computational Science Research Center, Korea Institute of Science and Technology, Seoul, Republic of Korea. sangsoo@kist.re.kr.
  • Donghun Kim
    Computational Science Research Center, Korea Institute of Science and Technology, Seoul, Republic of Korea. donghun@kist.re.kr.
