Natural Language Processing Methods for Analysis of the Effects of Plants on the Human Body in Biomedical Literature

Author(s)
Baeksoo Kim
Type
Thesis
Degree
Doctor
Department
Graduate School, School of Electrical Engineering and Computer Science
Advisor
Lee, Hyunju
Abstract
Many new medicines have been derived from natural sources such as plants, which have long been used to treat disease. Their benefits and side effects have therefore been studied, and plant-related information, including relations between plants and diseases, has accumulated in Medline articles. Analyzing the known relations between plants and the human body can be a starting point for research on new plant-derived medicines. Because Medline contains numerous articles written in natural language, text mining is essential. However, there are not enough corpora for studying and evaluating natural language processing techniques. In particular, unlike corpora in general domains, biomedical corpora must be annotated by experts with background knowledge of the biomedical domain. Because such annotation requires considerable expert effort, corpora have been created for popular concepts such as genes, diseases, and chemicals.

In this dissertation, we examined the effects of plants on the human body from the perspective of diseases and phenotypes. We also studied the relations between phenotypes and phytochemicals, which are chemicals found in plants. We defined each entity type and the relations between entities, and annotated Medline abstracts accordingly. The corpus was annotated by two or more experts to ensure its reliability. Using the created corpus, we then propose a named entity recognition technique and a relation extraction technique. For relation extraction, we propose a convolutional neural network that uses features of the shortest dependency path, as well as a model fine-tuned from an unsupervised representation learning model. For named entity recognition, we propose a model fine-tuned from an unsupervised representation learning model. We applied the named entity recognition and relation extraction models, trained with the proposed methods on the plant-related corpus, to the entire Medline database. To reduce false positives, we analyzed the characteristics of plant names, identified sentences and abbreviations, and assigned unique numbers to the extracted objects.

The proposed method extracts plant-related diseases and phenotypes from Medline abstracts, and a case study showed that the method can be put to practical use. The plant-related corpus introduces a new task in biomedical natural language processing, and we hope it will contribute to the study of the medicinal effects of plants through natural language processing.
URI
https://scholar.gist.ac.kr/handle/local/32927
Fulltext
http://gist.dcollection.net/common/orgView/200000908206
Access and License
  • Access type: Open

Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.