What is it about?
Data is essential for building effective machine learning models. This paper presents methods for generating synthetic hate speech datasets in languages with little training data. Our goal is to leverage the comparatively greater availability of hate speech datasets in a high-resource language such as English to boost the training set for limited-resource languages. We tested the hypothesis that hateful sentiment can be transferred across languages semi-heuristically while incorporating context tokens from the limited-resource setting. To this end, we built synthetic hate speech detection datasets in Hindi and Vietnamese via a combination of machine translation, contextual entity substitution, and language generation. Our results show that the entity substitution method improves performance over machine translation. We present and discuss scenarios in which this method outperforms a language generation model and vice versa.
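To make the idea of contextual entity substitution concrete, here is a minimal sketch. It is not the paper's implementation: the function name `substitute_entities`, the mapping `ENTITY_MAP`, and the placeholder entities are all hypothetical, and the example assumes a hand-built mapping from source-context entities to target-context entities. The translation step that would carry the text into Hindi or Vietnamese is left out.

```python
import re
from typing import Dict


def substitute_entities(text: str, entity_map: Dict[str, str]) -> str:
    """Replace each source-context entity in `text` with its target-context counterpart.

    Matching is case-insensitive and restricted to whole words or phrases.
    """
    if not entity_map:
        return text
    # Try longer keys first so multi-word entities win over their substrings.
    keys = sorted(entity_map, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b",
        re.IGNORECASE,
    )
    lookup = {k.lower(): v for k, v in entity_map.items()}
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)


# Hypothetical placeholder mapping (assumption): real mappings would pair
# groups, places, and other context tokens across the two settings.
ENTITY_MAP = {
    "source group": "target group",
    "Sourceville": "Targetpur",
}

example = "People from Sourceville say the source group is to blame."
print(substitute_entities(example, ENTITY_MAP))
```

The sentence structure, and with it the sentiment, is left untouched while the context tokens change. In a full pipeline the substituted text would still need to be rendered in the target language, for example by machine translation.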
Why is it important?
Hate speech datasets are costly to curate in terms of time, mental effort, and money. This makes it challenging to scale, sustain, and replicate the success of hate speech detection efforts, since datasets curated in one context cannot be easily transferred to new contexts. Our work introduces a method for rapidly bootstrapping hate speech data in a new context using existing data from high-resource contexts. Experiments conducted in Hindi and Vietnamese show that our method produces a more useful dataset than simply translating the dataset into the language of interest. This contribution is important to researchers, practitioners, and civil society organizations who work on automatic approaches for detecting and responding to problematic content online and who encounter limited-data roadblocks, especially in low-resource language contexts.
Read the Original
This page is a summary of: Hate Speech Detection in Limited Data Contexts using Synthetic Data Generation, ACM Journal on Computing and Sustainable Societies, October 2023, ACM (Association for Computing Machinery), DOI: 10.1145/3625679.
Resources
Hate Speech Detection in Limited Data Contexts using Synthetic Data Generation
Paper Presentation at ACM SIGCAS/SIGCHI Conference on Computing and Sustainable Societies, August 2023, Cape Town, South Africa. Presented in the session on Machine Learning and AI.
Hate Speech Detection in Limited Data Contexts using Synthetic Data Generation
A growing body of work has focused on text classification methods for detecting the increasing amount of hate speech posted online. This progress has been limited to a select number of high-resource languages, causing detection systems to either underperform or not exist in limited data contexts. This is largely caused by a lack of training data, which is expensive to collect and curate in these settings. In this work, we propose a data augmentation approach that addresses the lack of data for online hate speech detection in limited data contexts using synthetic data generation techniques. Given a handful of hate speech examples in a high-resource language such as English, we present three methods to synthesize new examples of hate speech data in a target language that retain the hate sentiment of the original examples but transfer the hate targets. We apply our approach to generate training data for hate speech classification tasks in Hindi and Vietnamese. Our findings show that a model trained on synthetic data performs comparably to, and in some cases outperforms, a model trained only on the samples available in the target domain. This method can be adopted to bootstrap hate speech detection models from scratch in limited data contexts. As the growth of social media within these contexts continues to outstrip response efforts, this work furthers our capacity for detecting, understanding, and responding to hate speech.