What is it about?

Data is essential for building effective machine learning models. This paper presents methods for generating synthetic hate speech datasets in languages with little training data. Our goal is to leverage the comparatively greater availability of hate speech datasets in a high-resource language such as English to augment the training set for low-resource languages. We tested our hypothesis that hateful sentiment can be transferred semi-heuristically across languages while incorporating context tokens of the limited-resource setting. To this end, we built synthetic hate speech detection datasets in Hindi and Vietnamese via a combination of machine translation, contextual entity substitution, and language generation. Our results show that the entity substitution method improves performance over machine translation. We present and discuss scenarios in which this method outperforms a language generation model and vice versa.
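
The contextual entity substitution step mentioned above can be illustrated with a minimal sketch. This is not the paper's implementation: the mapping `ENTITY_MAP`, the function `substitute_entities`, and the benign example sentence are all hypothetical, chosen only to show the general idea of swapping entities from the source context for entities relevant to the target context. In practice the substitution would presumably be applied to translated text and driven by a curated entity list.

```python
import re

# Hypothetical mapping from source-context entities (lowercased) to
# candidate target-context entities. Not taken from the paper.
ENTITY_MAP = {
    "london": ["Mumbai", "Delhi"],
    "premier league": ["Indian Premier League"],
}


def substitute_entities(text, entity_map, choose=lambda options: options[0]):
    """Replace each known source entity in `text` with a target-context entity.

    `choose` picks one replacement from the candidate list; the default takes
    the first candidate so the output is deterministic.
    """
    # Try longer entity names first so multi-word entities are matched
    # before any shorter entity they might contain.
    keys = sorted(entity_map, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b",
        re.IGNORECASE,
    )

    def replace(match):
        return choose(entity_map[match.group(1).lower()])

    return pattern.sub(replace, text)


if __name__ == "__main__":
    sentence = "People from London watch the Premier League."
    print(substitute_entities(sentence, ENTITY_MAP))
```

Passing a random selector such as `random.choice` as `choose` would yield several synthetic variants from one source sentence, which is one plausible way to enlarge a small training set.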

Why is it important?

Hate speech datasets are costly to curate in terms of time, mental effort, and money. This makes it challenging to scale, sustain, and replicate the success of hate speech detection efforts, since datasets curated in one context cannot be easily transferred to new contexts. Our work introduces a method for rapidly bootstrapping hate speech data in a new context using existing data from high-resource contexts. Experiments conducted in Hindi and Vietnamese show that our method produces a more useful dataset than simply translating the dataset into the language of interest. This contribution is important to researchers, practitioners, and civil society organizations who work on automatic approaches for detecting and responding to problematic content online and who encounter limited-data roadblocks, especially in low-resource language contexts.

Read the Original

This page is a summary of: Hate Speech Detection in Limited Data Contexts using Synthetic Data Generation, ACM Journal on Computing and Sustainable Societies, October 2023, ACM (Association for Computing Machinery),
DOI: 10.1145/3625679.