Detecting CSV file dialects by table uniformity measurement and data type inference

Wilfredo García

doi:10.3233/ds-240062

What is it about?

This paper introduces a method for automatically inferring the configuration of a CSV file (its dialect) at the data import stage. On the dataset used in the evaluation, the proposed method outperforms state-of-the-art solutions such as CleverCSV. It is robust enough to produce very accurate results from a small data sample.
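
To make the problem concrete, the short Python sketch below shows why the import configuration matters. It is an illustration only and is not taken from the paper: the sample text and the helper name `parse` are assumptions made for this example, and only the standard library `csv` module is used. The same text yields a usable table with one delimiter and a single fused column with another.

```python
import csv
import io

# Illustrative sample (an assumption, not data from the paper):
# a small file that uses ';' as its field delimiter.
raw = "name;qty\nbolt;12\nnut;30\n"


def parse(text, delimiter, quotechar='"'):
    """Parse CSV text with an explicitly chosen delimiter and quote character."""
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter=delimiter,
        quotechar=quotechar,
    )
    return list(reader)


wrong = parse(raw, ",")   # wrong guess: every row collapses into one field
right = parse(raw, ";")   # correct guess: two fields per row

print(wrong)
print(right)
```

Dialect detection automates the choice between such candidate configurations, so that the user does not have to inspect the file by hand before importing it.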

Why is it important?

This methodology represents a significant advancement in CSV dialect detection. Its core strength is that it evaluates table uniformity across multiple parsing attempts. Combined with data type detection at the table scoring stage, this makes it more accurate than the tools it was compared against, especially on the messy and inconsistent CSV files often encountered in real-world scenarios. This matters in data science, data engineering, and other fields where working with diverse and poorly formatted datasets is common.
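
The general idea of scoring several parsing attempts can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the scoring function from the paper: the candidate lists, the regular expressions used as a stand-in for data type inference, the formula `uniformity * (1 + type_ratio)`, and the names `score_dialect` and `detect_dialect` are all assumptions made for this example.

```python
import csv
import io
import re
from itertools import product
from statistics import pstdev

# Hypothetical candidate sets; the paper's actual search space may differ.
DELIMITERS = [",", ";", "\t", "|"]
QUOTECHARS = ['"', "'"]

# A few simple patterns standing in for real data type inference.
TYPE_PATTERNS = [
    re.compile(r"[+-]?\d+"),             # integer
    re.compile(r"[+-]?\d*\.\d+"),        # decimal
    re.compile(r"\d{4}-\d{2}-\d{2}"),    # ISO date
]


def has_known_type(cell):
    """Return True if the whole cell matches one of the simple type patterns."""
    cell = cell.strip()
    return any(p.fullmatch(cell) for p in TYPE_PATTERNS)


def score_dialect(text, delimiter, quotechar):
    """Score one parsing attempt: uniform row widths and typed cells score higher."""
    reader = csv.reader(
        io.StringIO(text, newline=""), delimiter=delimiter, quotechar=quotechar
    )
    rows = [row for row in reader if row]
    widths = [len(row) for row in rows]
    if not rows or max(widths) < 2:
        return 0.0  # the delimiter never split a row
    # 1.0 when every row has the same number of fields, lower otherwise.
    uniformity = 1.0 / (1.0 + pstdev(widths))
    cells = [cell for row in rows for cell in row]
    type_ratio = sum(has_known_type(c) for c in cells) / len(cells)
    return uniformity * (1.0 + type_ratio)


def detect_dialect(text):
    """Try every candidate (delimiter, quotechar) pair and keep the best score."""
    candidates = product(DELIMITERS, QUOTECHARS)
    return max(candidates, key=lambda d: score_dialect(text, *d))


sample = (
    "id;name;joined\n"
    '1;"García; W";2024-07-01\n'
    '2;"Doe, J";2023-11-15\n'
    '3;"Roe; R";2022-03-09\n'
)
print(detect_dialect(sample))
```

In this sketch the semicolon with double quotes wins because it is the only candidate that gives every row the same number of fields while also producing cells that look like integers and dates. Ties are resolved by candidate order, which is a simplification a real detector would need to handle more carefully.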

Perspectives

The Table Uniformity method presented in this research paper is a leading solution and a substantial improvement over existing methods. Its research backing, distinctive approach, and practical advantages make it the preferred choice for most CSV dialect detection needs. It bridges the gap between basic built-in tools and more complex, database-integrated solutions, offering a highly accurate, efficient, and versatile answer to a common data processing problem.
Ing. Wilfredo García
ECP Solutions

This page is a summary of: Detecting CSV file dialects by table uniformity measurement and data type inference, Data Science, July 2024, SAGE Publications, DOI: 10.3233/ds-240062.

Resources

Image
Table Uniformity variation
Scoring variation of three different delimiters and their dialects when applying the uniformity heuristic over tables from the dd_Wickenburg_nobmp_623.csv file.

Contributors

The following have contributed to this page

Ing. Wilfredo García
ECP Solutions
