Publication
ICDMW 2016
Conference paper
Using Machine Learning to Accelerate Data Wrangling
Abstract
70% of the time spent on data analytics is not actually spent on data analytics, but rather on data wrangling: the process of finding, interpreting, extracting, preparing and recombining the data to be analyzed. For data that is collected as free-form text, the lack of standards, or the presence of competing standards, often results in a variety of formats for expressing the same type of data, making the data wrangling step a tedious and error-prone process. For example, US street addresses may be expressed with a house number, PO Box, rural or military route, and/or a direction, all of which can be abbreviated or spelled out in a variety of ways. In this paper, we present an algorithm that uses machine learning to efficiently and automatically identify categories of attributes, such as geo-spatial, that are present in a data file, and we discuss results on a variety of real data sets. Our implementation can be used to automatically prepare data for consumption by other tools and services, such as mapping and visualization tools, and is motivated by and in support of a customizable severe weather alerting service.
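The abstract does not describe the algorithm's features or model, so the following is only an illustrative sketch of the general idea of learning to label a column of free-form text by category. It is not the paper's method: it swaps in a simple nearest-centroid classifier over a few hand-picked character statistics. The names `STREET_TOKENS`, `column_features`, `train_centroids`, `classify_column`, and the tiny training set are all hypothetical.

```python
from collections import defaultdict
from math import dist

# Hypothetical list of tokens that hint at a US street address (an assumption,
# not taken from the paper).
STREET_TOKENS = {"st", "street", "ave", "avenue", "rd", "road",
                 "blvd", "dr", "drive", "box", "route", "rte"}


def column_features(values):
    """Summarize a non-empty column of strings as a small numeric vector."""
    if not values:
        raise ValueError("column must contain at least one value")
    n = len(values)
    total_chars = sum(len(v) for v in values) or 1
    digit_frac = sum(ch.isdigit() for v in values for ch in v) / total_chars
    alpha_frac = sum(ch.isalpha() for v in values for ch in v) / total_chars
    # Fraction of values containing at least one address-like token.
    street_frac = sum(
        any(tok.strip(".,").lower() in STREET_TOKENS for tok in v.split())
        for v in values
    ) / n
    # Mean token count, scaled down so it does not dominate the distance.
    mean_tokens = sum(len(v.split()) for v in values) / n / 10
    return (digit_frac, alpha_frac, street_frac, mean_tokens)


def train_centroids(labeled_columns):
    """Average the feature vectors of the example columns for each label."""
    grouped = defaultdict(list)
    for label, values in labeled_columns:
        grouped[label].append(column_features(values))
    return {
        label: tuple(sum(dim) / len(vecs) for dim in zip(*vecs))
        for label, vecs in grouped.items()
    }


def classify_column(values, centroids):
    """Return the label whose centroid is closest to the column's features."""
    feats = column_features(values)
    return min(centroids, key=lambda label: dist(feats, centroids[label]))


# Toy training data, invented for illustration only.
TRAINING = [
    ("address", ["12 Main St", "PO Box 450", "7 N. Oak Avenue"]),
    ("address", ["Rural Route 3", "1600 Elm Rd.", "42 West Pine Blvd"]),
    ("latitude", ["41.2033", "-12.5", "38.8977"]),
    ("name", ["Alice Smith", "Bob Jones", "Carol Diaz"]),
]

centroids = train_centroids(TRAINING)
print(classify_column(["99 Lake Dr", "P.O. Box 12", "3 South Hill Road"], centroids))
```

Classifying whole columns rather than individual cells lets the sketch tolerate the format variety the abstract mentions, since a few unusual values barely shift the column-level statistics. A realistic system would need far richer features and training data than this toy example.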