
Hate Speech Detection Using Multi-channel BERT for Different Languages and Data Augmentation

Author(s)
Hajung Sohn
Type
Thesis
Degree
Master
Department
Graduate School, School of Electrical Engineering and Computer Science
Advisor
Lee, Hyunju
Abstract
Motivation The growth of Social Networking Services (SNS) has altered the way and scale of communication in cyberspace. However, the amount of online hate speech is increasing because of the anonymity and mobility such services provide. Hate speech is generally defined as any communication that disparages a person or a group on the basis of characteristics such as race, color, ethnicity, or gender. Since manual hate speech detection by human annotators is both costly and time consuming, there is a need to develop algorithms for automatic detection.
Methods Fine-tuning a pre-trained language model has been shown to be effective for improving many downstream tasks. BERT (Bidirectional Encoder Representations from Transformers) is a language model that is pre-trained on a very large corpus. In this study, we propose a multi-channel model for hate speech detection that combines three versions of BERT: the English BERT, the Chinese BERT, and the multilingual BERT. We used the Google translation API to translate training sentences into the languages required by the three BERT models. Moreover, since the dataset sizes are small, we generated pseudo training data by back translation: each original sentence is translated into another language and then translated back into its source language. Finally, we propose a wider pooling mechanism that uses more information from the pre-trained model while fine-tuning BERT.
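The abstract describes the augmentation and model only at a high level; the sketches below illustrate the two ideas. First, back translation: each training sentence is translated to a pivot language and back, and the paraphrase is added as a pseudo example with the original label. The translate() helper is a hypothetical placeholder for the Google translation API mentioned above, and the default language codes are illustrative assumptions.

```python
# Sketch of back-translation augmentation (assumptions noted in comments).
from typing import List, Tuple

def translate(text: str, src: str, dest: str) -> str:
    """Placeholder for a machine-translation call (e.g. the Google
    translation API used in the thesis)."""
    raise NotImplementedError("plug in a real translation backend here")

def back_translate(sentence: str, src_lang: str, pivot_lang: str) -> str:
    """Translate a sentence into a pivot language and back, producing a
    paraphrase usable as pseudo training data."""
    pivot = translate(sentence, src=src_lang, dest=pivot_lang)
    return translate(pivot, src=pivot_lang, dest=src_lang)

def augment(sentences: List[str], labels: List[int],
            src_lang: str = "es",   # e.g. the HatEval Spanish data
            pivot_lang: str = "en") -> Tuple[List[str], List[int]]:
    """Double the dataset: each sentence gains one back-translated copy
    that keeps the original label."""
    aug_sents, aug_labels = list(sentences), list(labels)
    for sent, label in zip(sentences, labels):
        aug_sents.append(back_translate(sent, src_lang, pivot_lang))
        aug_labels.append(label)
    return aug_sents, aug_labels
```

Second, a minimal sketch of the multi-channel model, written with PyTorch and Hugging Face transformers. It assumes the three channels' pooled representations are concatenated before a linear classifier, and that the "wider pooling" is a masked mean-plus-max pooling over all final-layer token states rather than the [CLS] vector alone; the abstract does not specify these details.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class MultiChannelBert(nn.Module):
    """Each sentence is fed, in the appropriate translation, to an English,
    a Chinese, and a multilingual BERT; the pooled outputs are concatenated
    and classified. Aggregation details are assumptions, not the thesis's
    exact architecture."""

    def __init__(self, num_labels: int = 2):
        super().__init__()
        # Three pre-trained channels, one per language/translation.
        self.bert_en = BertModel.from_pretrained("bert-base-uncased")
        self.bert_zh = BertModel.from_pretrained("bert-base-chinese")
        self.bert_multi = BertModel.from_pretrained("bert-base-multilingual-cased")
        hidden = self.bert_en.config.hidden_size  # 768 for base models
        # mean + max pooling -> 2 * hidden per channel, 3 channels.
        self.classifier = nn.Linear(3 * 2 * hidden, num_labels)

    @staticmethod
    def _wide_pool(last_hidden_state, attention_mask):
        # "Wider" pooling over all token states instead of only [CLS]:
        # masked mean pooling concatenated with max pooling (an assumption).
        mask = attention_mask.unsqueeze(-1).float()
        mean = (last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        masked = last_hidden_state.masked_fill(mask == 0, float("-inf"))
        mx = masked.max(dim=1).values
        return torch.cat([mean, mx], dim=-1)

    def forward(self, en_ids, en_mask, zh_ids, zh_mask, multi_ids, multi_mask):
        feats = []
        for bert, ids, mask in [
            (self.bert_en, en_ids, en_mask),
            (self.bert_zh, zh_ids, zh_mask),
            (self.bert_multi, multi_ids, multi_mask),
        ]:
            out = bert(input_ids=ids, attention_mask=mask)
            feats.append(self._wide_pool(out.last_hidden_state, mask))
        return self.classifier(torch.cat(feats, dim=-1))
```

In use, each sentence would be tokenized three times, once per channel, using the corresponding BERT tokenizer on the Google-translated rendering of the sentence in that channel's language.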
Results In this research, we used three datasets in three different non-English languages. The first dataset is the SemEval-2019 HatEval Spanish dataset, on which the multi-channel BERT model achieved 76.9% accuracy and an F1 score of 76.6. The second dataset is from the GermEval 2018 shared task on the Identification of Offensive Language; here the multi-channel BERT achieved the best accuracy, 80.1%, while fine-tuning the English BERT yielded the highest F1 score, 77.0. The third dataset is the EVALITA 2018 HaSpeeDe Italian dataset, on which fine-tuning the multilingual BERT achieved the best accuracy, 82.2%, and the best F1 score, 79.9.
Conclusion We propose a multi-channel model that aggregates BERT models trained on different languages by means of translation. We also demonstrated the effects of data augmentation through back translation and of wider pooling while fine-tuning BERT. As a result, we achieved state-of-the-art or comparable performance on three different hate speech datasets.
URI
https://scholar.gist.ac.kr/handle/local/32730
Fulltext
http://gist.dcollection.net/common/orgView/200000909252
Access and License
  • Access type: Open
File List
  • No related files exist.

Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.