Benchmarking OCR Tools for Historical Postcards: A Dataset and Evaluation

Salvatore Tabbone; Matthieu Pelingre

doi:10.1145/3746273.3760201

What is it about?

This article presents a dataset of 4,293 historical postcards from the Grand Est region of France, dating from 1899 to 1930. The dataset includes annotations for the text regions, classified as print, handwritten, or scene text, as well as manual transcriptions of the printed text for a subset of the postcards. Using this dataset, we benchmark open-source OCR models, including EasyOCR, Tesseract OCR, docTR, PaddleOCR and Calamari, to assess their performance without fine-tuning. Our results highlight the challenges posed by varied fonts, text orientations, and image quality: EasyOCR stands out for its text recognition accuracy, while Tesseract OCR excels at orientation detection. The best-performing models are then used to complete the dataset by automatically transcribing the printed text of all postcards.
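As an illustration of how OCR output can be scored against a manual transcription, here is a minimal sketch of the character error rate (CER), a metric commonly used in OCR evaluation. This summary does not state which metrics the paper uses, so CER is an assumption here, and the function names `levenshtein` and `character_error_rate` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insertion, deletion and substitution each cost 1."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the manual transcription."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

For example, `character_error_rate("NANCY", "NANGY")` counts one substituted character out of five. A score of 0 means the OCR output matches the manual transcription exactly, and values above 1 are possible when the output is much longer than the reference.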


Why is it important?

Postcards are invaluable sources of historical information, but that information is unstructured. Furthermore, traditional algorithms, trained on contemporary sources, struggle to extract information from older documents. This paper provides a valuable resource for the analysis of historical postcards and lays the foundations for future advances in OCR adapted to them.

Perspectives

We intend to significantly expand this collection to cover a wider range of postcards. We also plan to add Named Entity Recognition and keyword identification for printed text. After that, we will add detailed annotations for other elements, including date stamps (through segmentation/binarization and transcription), along with transcriptions of handwritten and scene text.
Matthieu PELINGRE
Universite de Lorraine

This page is a summary of: Benchmarking OCR Tools for Historical Postcards: A Dataset and Evaluation, October 2025, ACM (Association for Computing Machinery),
DOI: 10.1145/3746273.3760201.

Resources

Data
Historical Postcards Dataset
This deposit contains Historical Postcards Dataset (COCO) — v1.0 (2025), a Common Objects in COntext (COCO) format dataset of historical postcard images and structured annotations intended for text and postal markings detection. Printed text detections include transcriptions (manual and OCR), text orientation, and OCR confidence scores — suitable for detection and historical OCR benchmarking. Transcriptions of postal markings, handwritten texts, and scene texts will be added in future versions.
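To show what working with a COCO-format deposit can look like, here is a minimal sketch that groups annotations by category name. The top-level `images`, `annotations` and `categories` lists, and the `image_id`, `category_id` and `bbox` fields, are standard COCO. The category names, the `transcription` and `ocr_confidence` keys, the file name and every value in the sample are assumptions made for illustration; the deposit's actual field names may differ.

```python
import json
from collections import defaultdict

# Tiny in-memory stand-in for an annotation file. The "transcription" and
# "ocr_confidence" keys and the category names are hypothetical.
SAMPLE = """
{
  "images": [
    {"id": 1, "file_name": "postcard_0001.jpg", "width": 1400, "height": 900}
  ],
  "categories": [
    {"id": 1, "name": "print"},
    {"id": 2, "name": "handwritten"}
  ],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1, "bbox": [120, 40, 600, 55],
     "transcription": "NANCY", "ocr_confidence": 0.93},
    {"id": 11, "image_id": 1, "category_id": 2, "bbox": [200, 500, 800, 120]}
  ]
}
"""


def annotations_by_category(coco: dict) -> dict:
    """Map each category name to the list of annotations carrying it."""
    names = {cat["id"]: cat["name"] for cat in coco["categories"]}
    grouped = defaultdict(list)
    for ann in coco["annotations"]:
        grouped[names[ann["category_id"]]].append(ann)
    return dict(grouped)


coco = json.loads(SAMPLE)
grouped = annotations_by_category(coco)
for ann in grouped["print"]:
    # COCO bounding boxes are [x, y, width, height] in pixels.
    print(ann["bbox"], ann.get("transcription"))
```

With a real file, `json.loads(SAMPLE)` would be replaced by `json.load` on the opened annotation file, and the lookups would use whatever keys the deposit documents.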

Contributors

The following have contributed to this page

Matthieu PELINGRE
Universite de Lorraine
