What is it about?

The researchers set out to build a better machine translator from Cantonese to English. The main challenge is that little parallel data exists for this language pair, so they merged several online resources into a larger dataset and applied techniques such as back-translation and model switching to improve performance. They then tested their systems against existing commercial translators, and their models matched or outperformed them on several automatic quality metrics. Finally, they released an online tool so anyone can try the translator and compare it with others.
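Back-translation, one of the techniques mentioned above, creates extra training data by machine-translating monolingual text from one side of the language pair and pairing it with the original. The sketch below illustrates the idea only; `translate_en_to_yue` is a hypothetical placeholder for a real English-to-Cantonese reverse model, not the system used in the paper.

```python
# Sketch of back-translation for synthetic parallel data.
# `translate_en_to_yue` is a hypothetical stand-in: a real pipeline
# would run a trained English-to-Cantonese NMT model here.

def translate_en_to_yue(sentence: str) -> str:
    # Toy "reverse model" backed by a lookup table, for illustration only.
    lookup = {
        "Where is the train station?": "火車站喺邊度呀？",
        "I would like some milk tea.": "我想要杯奶茶。",
    }
    return lookup.get(sentence, "")

def back_translate(monolingual_english):
    """Pair each English sentence with its machine-generated Cantonese
    back-translation, yielding (source, target) examples that can be
    mixed into the training data of the forward (yue->en) model."""
    synthetic_pairs = []
    for english in monolingual_english:
        cantonese = translate_en_to_yue(english)
        if cantonese:  # drop sentences the reverse model could not handle
            synthetic_pairs.append((cantonese, english))
    return synthetic_pairs

pairs = back_translate(
    ["Where is the train station?", "I would like some milk tea."]
)
# pairs now holds (Cantonese, English) synthetic training examples
```

The key point is that only monolingual data is needed on one side: the noisy machine-generated sentences become the *source* side, while the clean human-written sentences stay on the *target* side the model learns to produce.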


Why is it important?

Cantonese, a Sinitic language spoken primarily in Hong Kong, Macau, and southern China, is significantly understudied in Natural Language Processing despite its vast number of native speakers (approximately 80 million). While it ranks second in terms of native speakers among Sinitic languages, the ACL Anthology reveals a stark disparity in research: only 47 papers focus on Cantonese compared to 2355 for Mandarin Chinese. This scarcity of research is reflected in the quality of commercial translation services, many of which either lack Cantonese support or offer subpar translations to English. This limitation poses challenges for individuals seeking Cantonese resources, especially in informal contexts where tonal nuances are crucial for accurate understanding.

Perspectives

The research focused on data augmentation techniques (back- and forward-translation) for synthetic data generation and model-switch mechanisms for Cantonese-to-English Neural Machine Translation (NMT). The project also produced the open-source CANTONMT toolkit and the underlying corpora. The four primary objectives set out at the start were accomplished:

Dataset Construction: A new parallel dataset was built by merging existing corpora, offering a valuable resource for researchers in Cantonese Natural Language Processing (NLP). In addition, a substantial Cantonese monolingual dataset was scraped from an online forum, then anonymised and cleaned. Given its unique nature and limited public availability, this dataset is particularly valuable for future research, such as training Cantonese word embeddings for downstream NLP tasks.

Model Investigations: Several models were developed and trained using back-translation to generate a synthetic parallel corpus for fine-tuning. The best-performing models achieved results comparable to state-of-the-art commercial translators such as Microsoft Bing and Baidu, despite limited computational resources and corpus size. This work is the first to apply back-translation-generated synthetic data and model-switch mechanisms to Cantonese-English NMT.

Experimental Evaluations: Extensive experiments compared the trained models with commercial translators using a range of automatic metrics, both lexicon-based (BLEU, hLEPOR) and embedding-based (COMET, BERTScore). A modified HOPE framework was also used for human evaluation, giving a clearer picture of each model's strengths and weaknesses.

CANTONMT User Interface: A highly modular, full-stack web application was designed and released as an open-source translation platform. It serves as a toolkit for researchers to add further models and language pairs, fulfilling the objective of creating an open-source Cantonese-to-English translation tool.
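Lexicon-based metrics such as BLEU score a system translation by its n-gram overlap with a human reference. The toy sentence-level version below is only a sketch of the idea; real evaluations like those in the paper use established packages (e.g. sacreBLEU), which handle tokenisation, multiple references, and smoothing.

```python
# Toy sentence-level BLEU: clipped n-gram precision up to max_n,
# geometric mean, and a brevity penalty. Illustrative only.
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(candidate, reference, max_n=2):
    """Score a candidate translation against one reference (0.0 to 1.0)."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each n-gram's count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean

score = simple_bleu("the cat sat on the mat", "the cat sat on the mat")
# an exact match scores 1.0
```

Embedding-based metrics such as COMET and BERTScore instead compare sentences in a learned vector space, so they can reward correct paraphrases that share few surface n-grams with the reference, which is why the paper reports both families of metrics.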

Lifeng Han
University of Manchester

Read the Original

This page is a summary of: CantonMT: Investigating Back-Translation and Model-Switch Mechanisms for Cantonese-English Neural Machine Translation, ACM Transactions on Asian and Low-Resource Language Information Processing, October 2024, ACM (Association for Computing Machinery),
DOI: 10.1145/3698236.

