Publication
HPCC/SmartCity/DSS 2016
Conference paper

A machine-learning approach to automatic detection of delimiters in tabular data files

Abstract

Detection of string and column delimiters is a critical first step in the automated ingestion of files containing tabular data. In this paper we present an algorithm that uses a logistic-regression classifier to score each candidate choice of delimiters; the candidate with the highest score is selected as the one most likely to be correct. On a test data set of files with a variety of delimiters, the algorithm makes the correct choice over 90% of the time.
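
The scoring idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's features, candidate sets, or trained weights, so everything named here is a hypothetical stand-in: the candidate lists, the two features (`multi_column`, `consistency`), and the hand-picked `WEIGHTS`. A real system would learn the weights from labeled files.

```python
import csv
import io
import math
from itertools import product
from statistics import mean, pstdev

# Hypothetical candidate sets; the abstract does not list the ones used in the paper.
COLUMN_DELIMITERS = [",", ";", "\t", "|"]
STRING_DELIMITERS = ['"', "'"]

# Hand-picked illustrative weights, NOT trained values from the paper.
WEIGHTS = {"bias": -2.0, "multi_column": 3.0, "consistency": 3.0}


def features(text, col_delim, str_delim):
    """Parse the text with one delimiter choice and summarize the result."""
    reader = csv.reader(io.StringIO(text), delimiter=col_delim, quotechar=str_delim)
    counts = [len(row) for row in reader if row]
    if not counts:
        return {"multi_column": 0.0, "consistency": 0.0}
    return {
        # 1.0 when the parse yields more than one column on average.
        "multi_column": 1.0 if mean(counts) > 1 else 0.0,
        # 1.0 when every row has the same column count, lower otherwise.
        "consistency": 1.0 / (1.0 + pstdev(counts)),
    }


def score(text, col_delim, str_delim):
    """Logistic-regression score: sigmoid of a weighted feature sum."""
    f = features(text, col_delim, str_delim)
    z = WEIGHTS["bias"] + sum(WEIGHTS[name] * value for name, value in f.items())
    return 1.0 / (1.0 + math.exp(-z))


def detect_delimiters(text):
    """Return the (column delimiter, string delimiter) pair with the highest score."""
    candidates = product(COLUMN_DELIMITERS, STRING_DELIMITERS)
    return max(candidates, key=lambda pair: score(text, *pair))


sample = 'name;note\n"Smith; J";"a,b"\n"Doe; A";"c"\n'
best = detect_delimiters(sample)
```

In this sample the semicolon also appears inside quoted strings, so only the pairing of `;` as column delimiter with `"` as string delimiter yields the same column count on every row, and that pair receives the highest score. Ties between candidates are broken by list order, which is another simplification of this sketch.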

Date

20 Jan 2017
