What is it about?

Manual, library-style techniques for classifying digital records do not scale in the era of Big Data. Legislation giving users stronger rights over their data (such as the GDPR) has only increased the compliance burden on organizations responsible for records. This paper assesses a broad variety of algorithms on how well they classify real-world records based on their text content. A focus group of records professionals then discussed the findings and how useful and trustworthy such systems would be in their day-to-day work.
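To make the idea concrete, the sketch below shows what a simple text classifier for records might look like. It is not code from the paper: it is a minimal illustration, assuming Python with scikit-learn, of the kind of classical statistical baseline (TF-IDF features plus logistic regression) that the study compares against neural networks and pre-trained language models. The record texts and category labels are hypothetical.

# Illustrative sketch only; not the paper's data or code.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical record texts and classification labels.
records = [
    "Invoice for office supplies, payable within 30 days",
    "Minutes of the quarterly board meeting",
    "Employee onboarding checklist and signed policy acknowledgement",
    "Invoice for catering services at the annual conference",
]
labels = ["finance", "governance", "human-resources", "finance"]

# TF-IDF features feeding a logistic regression classifier: cheap to train
# and run at scale, which is why such baselines remain competitive.
classifier = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=1)),
    ("model", LogisticRegression(max_iter=1000)),
])
classifier.fit(records, labels)

print(classifier.predict(["Invoice for printer maintenance"]))  # e.g. ['finance']

In practice a model like this would be trained on many thousands of labelled records and evaluated on held-out data before its predictions were trusted for compliance decisions.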


Why is it important?

This is perhaps the first study to compare the classification performance of statistical models, a variety of neural network architectures, and pre-trained language models on real-world records data. The newer technologies perform better, but the older algorithms remain competitive and are cheaper to run at scale. A focus group of records professionals was optimistic about the adoption of these technologies in their workplaces and saw it as a first step toward being able to synthesize meaningful narratives from a vast body of records.

Perspectives

People think records are dull reams of old paper, but they are much more than that. Records can be any form of data and they contain the whole of human knowledge. I hope this study will help to show the potential of machine learning systems to unlock insights from the ever-greater volumes of records that we generate as we navigate the challenges of the 21st century.

Jason Franks
Monash University

Read the Original

This page is a summary of: Text Classification for Records Management, Journal on Computing and Cultural Heritage, September 2022, ACM (Association for Computing Machinery). DOI: 10.1145/3485846.
You can read the full text via the DOI above.

