What is it about?

Until recently, most of our documentation, from receipts and financial records to healthcare documents, existed only on paper. These documents hold a wealth of raw data and information, often in tables. Since tables are the most compact method of representing relational data, their contents must be made available in an indexable and searchable format. Our work focuses on precisely recognizing tabular data in a scanned document and then extracting it into a standard format such as CSV or Excel while preserving the table's structure.
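
As a rough illustration of the final export step only, the sketch below writes already-recognized cells to CSV while keeping each cell in its original row and column. It is a minimal sketch under stated assumptions, not the method from the paper: the `(row, column, text)` cell format and the function name `cells_to_csv` are hypothetical, and the recognition of cells in the scanned image is assumed to have happened beforehand.

```python
import csv
import io


def cells_to_csv(cells):
    """Convert (row, col, text) tuples into CSV text.

    The tuple format is a hypothetical stand-in for the output of a
    table-recognition step; rows and columns are zero-based indices.
    """
    cells = list(cells)
    if not cells:
        return ""

    # Size the grid from the largest row and column index seen.
    n_rows = max(row for row, _, _ in cells) + 1
    n_cols = max(col for _, col, _ in cells) + 1

    # Cells that were not recognized stay as empty strings, so the
    # remaining cells keep their original positions in the table.
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for row, col, text in cells:
        grid[row][col] = text

    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(grid)
    return buffer.getvalue()


if __name__ == "__main__":
    example = [
        (0, 0, "Item"), (0, 1, "Price"),
        (1, 0, "Tea"), (1, 1, "2.50"),
        (2, 0, "Milk"),  # the price cell is missing on purpose
    ]
    print(cells_to_csv(example), end="")
```

Padding missing cells with empty strings is one simple way to preserve structure: every row keeps the same number of columns, so a spreadsheet program still lines the values up under the right headers.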


Why is it important?

Manually navigating through large numbers of document images to find specific data is highly inefficient and costly. Moreover, the time and manual labor required to identify the needed document are not sustainable in large organizations with ever-growing data. Our paper proposes a cost-effective, time-efficient approach that saves organizations time, money, and effort. It attempts to solve a real-life industry problem and create a meaningful impact using technology.

Perspectives

The problem of table extraction from printed documents and images sounds trivial, but we encountered many challenges once we started working on it. Overcoming all those challenges was an enriching experience.

Dr. Sanjay Singh
Manipal Institute of Technology, Manipal

This method is computationally efficient while producing good results at the same time.

Tanvi Anand
Manipal Institute of Technology

I have a different view of our work. Beyond the corporate side of document digitization, such as making data available to Business Intelligence applications to improve organizational decision-making, our work can also have a significant societal impact. The most notable example could be leveraging this data for better healthcare outcomes, which can benefit developing nations like India that have low doctor-to-patient ratios. Data extracted from health records can be used to create disease progression models. These models can be shared to aid doctors in diagnostics and be used in telemedicine to provide preliminary care. To summarize, the vast amounts of data available to us in the form of physical documents can be used to improve healthcare and other services, such as finance, for everyone.

Tejas Kashinath
University of Southern California

We hope this paper is helpful to organizations that have already digitized or are looking to shift toward digitization. The driving force for this paper has been the will to ease the workload of employees by automating the time-consuming task of table identification and extraction from document images. Working on this paper has been a memorable experience and has helped me strengthen my understanding of machine-learning concepts. I have relished the opportunity to collaborate with my co-authors and am thankful for the exposure to research I have received.

Twisha Jain
Columbia University

Read the Original

This page is a summary of: End-to-end table structure recognition and extraction in heterogeneous documents, Applied Soft Computing, May 2022, Elsevier,
DOI: 10.1016/j.asoc.2022.108942.
