BERT Fine-Tuning the Covid-19 Open Research Dataset for Named Entity Recognition

Shin Thant; Teeradaj Racharak; Frederic Andres

doi:10.1007/978-981-99-7969-1_19

What is it about?

In this study, the researchers use BERT, a pretrained language model, to improve Named Entity Recognition (NER) in biomedical literature from the CORD-19 corpus. Fine-tuning BERT on an NER dataset teaches the model the context and semantics of biomedical named entities. However, fine-tuning such models on large datasets is computationally demanding, and the study proposes two sampling methodologies to address this issue. The first uses Latent Dirichlet Allocation (LDA) topic modeling to select data for NER on CORD-19, maintaining sentence structure while focusing on related content. The second employs a straightforward greedy approach to extract informative data from the CORD-NER dataset across 25 entity types. The study shows that BERT can comprehend this content without requiring supercomputers, and it transforms the document-level corpus into a source of NER data, improving data accessibility. The findings contribute to the advancement of NLP applications in various sectors and have implications for knowledge graph creation, ontology learning, and conversational AI.
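
This summary does not spell out the greedy procedure, so the following is only a minimal sketch of one plausible reading: repeatedly pick the annotated sentence that covers the most entity types still under-represented in the sample. The function name `greedy_sample`, the `per_type_quota` parameter, the toy sentences, and the entity type labels are all assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter


def greedy_sample(sentences, budget, per_type_quota=1):
    """Select up to `budget` sentences from (text, entity_types) pairs.

    Hypothetical helper: at each step it takes the sentence covering the most
    entity types that are still below `per_type_quota` in the sample.
    """
    counts = Counter()  # entity type -> number of selected sentences containing it
    remaining = list(range(len(sentences)))
    chosen = []

    def gain(i):
        # Number of distinct, still-needed entity types in sentence i.
        return sum(1 for t in set(sentences[i][1]) if counts[t] < per_type_quota)

    while remaining and len(chosen) < budget:
        best = max(remaining, key=gain)  # ties go to the earliest sentence
        if gain(best) == 0:
            break  # nothing left that adds a still-needed entity type
        chosen.append(best)
        remaining.remove(best)
        counts.update(set(sentences[best][1]))

    return [sentences[i] for i in chosen]


# Toy input; the type labels are illustrative placeholders.
toy = [
    ("Remdesivir inhibits viral RNA polymerase.", ["CHEMICAL", "GENE_OR_GENOME"]),
    ("Patients reported fever and cough.", ["SIGN_OR_SYMPTOM"]),
    ("The virus binds ACE2 in lung tissue.", ["VIRUS", "GENE_OR_GENOME", "TISSUE"]),
    ("Remdesivir was given daily.", ["CHEMICAL"]),
]
selected = greedy_sample(toy, budget=3)
for text, types in selected:
    print(text, types)
```

With this toy input, the sentence that only repeats an already covered type is left out of the sample. A real pipeline would apply the same idea to the 25 entity types mentioned above, possibly with a larger quota per type.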


Why is it important?

This research holds significance for several reasons:
(1) Improving Biomedical Information Extraction: By using advanced language models like BERT for Named Entity Recognition (NER) in biomedical literature, the study helps extract more accurate and contextually relevant information from the CORD-19 corpus. This can be crucial for researchers, clinicians, and other professionals in the biomedical field, who need to find and use pertinent information.
(2) Addressing Challenges in Large Dataset Fine-tuning: The research addresses a common challenge in fine-tuning Large Language Models (LLMs) like BERT on large datasets. The proposed sampling methodologies, including Latent Dirichlet Allocation (LDA) topic modeling, improve the efficiency of training these models, making them more accessible and applicable to real-world scenarios.
(3) Enhancing Accessibility of Biomedical Data: By converting the document-level corpus into a source of NER data, the research makes biomedical data more accessible. This has implications for accelerating research, enabling more advanced natural language processing (NLP) applications, and fostering collaboration within the scientific community.
(4) Potential for Advancements in NLP Applications: The study's outcomes suggest that BERT-based models can comprehensively understand content without the need for supercomputers. This finding has broader implications for NLP applications across various sectors, and it opens avenues for advancements in knowledge graph creation, ontology learning, and conversational AI.
(5) Contribution to Future Research Directions: The research demonstrates the effectiveness of the proposed methodologies and sheds light on how more sophisticated NLP applications might develop. This contributes to the ongoing dialogue in the research community, guiding future studies on language models, biomedical information extraction, and NLP applications.

Perspectives

Let me share some potential perspectives and avenues for further exploration:
(1) Integration with Clinical Decision Support Systems: The improved Named Entity Recognition (NER) capabilities demonstrated in this research could be integrated into Clinical Decision Support Systems (CDSS). This could enhance the systems' ability to extract relevant information from biomedical literature, providing valuable insights to healthcare professionals for making informed decisions.
(2) Cross-disciplinary Collaboration: The study focused on biomedical literature, but the methodologies developed for fine-tuning Large Language Models (LLMs) like BERT could be extended to other domains. Collaborations with researchers from diverse fields could explore the applicability of these techniques in areas such as environmental science, social sciences, or humanities.
(3) Real-time Information Extraction for Biomedical Research: Investigate the potential for real-time Named Entity Recognition in biomedical literature. This could be particularly valuable for staying updated on the latest research findings and facilitating rapid responses to emerging health challenges.
(4) Scalability and Generalization: Evaluate the scalability and generalization of the proposed methodologies on datasets beyond CORD-19. Assessing performance on diverse biomedical corpora would show how robust and adaptable the fine-tuned models are.
(5) User-friendly Tools for Researchers: Develop user-friendly tools or interfaces that allow researchers without extensive technical expertise to leverage the advancements in NER. This could democratize access to sophisticated natural language processing techniques, making them more widely applicable in the scientific community.
(6) Ethical Considerations in Biomedical NLP: Explore ethical considerations related to the use of advanced language models in biomedical research. Consider issues such as patient privacy, bias in data, and the responsible deployment of these models in healthcare applications.
(7) Interactive Knowledge Extraction: Investigate methods to make knowledge extraction more interactive. This could involve creating systems where users can refine and guide the NER process based on their specific research questions, leading to more tailored and user-centric information retrieval.
(8) Benchmarking and Comparative Studies: Conduct benchmarking studies to compare the performance of the proposed methodologies with other state-of-the-art approaches in biomedical NLP. This could provide insights into the strengths and limitations of different techniques and help guide future research directions.
(9) Exploration of Multimodal Approaches: Investigate the integration of multimodal information (text, images, etc.) for more comprehensive biomedical information extraction. This could involve combining language models with computer vision techniques to enhance the understanding of biomedical content.
(10) Long-term Impact Assessment: Assess the long-term impact of enhanced NER capabilities on biomedical research and healthcare. Monitor how the implementation of these techniques influences the speed and quality of scientific discoveries, clinical decision-making, and overall advancements in the field.
Dr. HDR. Frederic ANDRES, IEEE Senior Member, IEEE CertifAIEd Authorized Lead Assessor
National Institute of Informatics

This article is useful for NLP and knowledge graph (KG) researchers who want to construct a Covid-19 KG from text by exploiting language models.
Teeradaj Racharak
Japan Advanced Institute of Science and Technology

This page is a summary of: BERT Fine-Tuning the Covid-19 Open Research Dataset for Named Entity Recognition, January 2023, Springer Science + Business Media,
DOI: 10.1007/978-981-99-7969-1_19.

