Data wrangling and exploratory data analysis explained

Novice data scientists sometimes have the notion that all they need to do is find the right model for their data and then fit it. Nothing could be further from the actual practice of data science. In fact, data wrangling (also called data cleansing and data munging) and exploratory data analysis are often said to consume as much as 80% of a data scientist’s time.

Data wrangling and exploratory data analysis are simple in concept, but they can be hard to get right. Uncleansed or badly cleansed data is garbage, and the GIGO principle (garbage in, garbage out) applies to modeling and analysis just as much as it does to any other aspect of data processing.

What is data wrangling?

Data rarely comes in usable form. It’s often contaminated with errors and omissions, rarely has the desired structure, and usually lacks context. Data wrangling is the process of discovering the data, cleaning the data, validating it, structuring it for usability, enriching the content (possibly by adding information from public data such as weather and economic conditions), and in some cases aggregating and transforming the data.
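To make the cleaning and validating steps concrete, here is a minimal sketch in plain Python. The sample rows, the field names (`id`, `temp_c`, `city`), and the plausible temperature range are all invented for illustration; a real pipeline would take its rules from the data's own documentation.

```python
from typing import Optional

# Hypothetical raw rows, as they might arrive from a CSV export.
RAW_ROWS = [
    {"id": "1", "temp_c": "21.5", "city": " Boston "},
    {"id": "2", "temp_c": "", "city": "boston"},
    {"id": "3", "temp_c": "n/a", "city": "Chicago"},
    {"id": "4", "temp_c": "999", "city": "Chicago"},
    {"id": "1", "temp_c": "21.5", "city": "Boston"},  # repeats id 1
]


def parse_float(text: str) -> Optional[float]:
    """Return the number in text, or None if it is empty or not numeric."""
    try:
        return float(text)
    except ValueError:
        return None


def clean(rows, low=-90.0, high=60.0):
    """Split rows into cleaned records and (row, reason) rejections.

    low and high are assumed bounds for a plausible air temperature.
    """
    seen_ids = set()
    cleaned, rejected = [], []
    for row in rows:
        temp = parse_float(row["temp_c"].strip())
        if row["id"] in seen_ids:
            rejected.append((row, "duplicate id"))
        elif temp is None:
            rejected.append((row, "missing or non-numeric temperature"))
        elif not low <= temp <= high:
            rejected.append((row, "temperature out of range"))
        else:
            seen_ids.add(row["id"])
            cleaned.append({
                "id": int(row["id"]),
                "temp_c": temp,
                "city": row["city"].strip().title(),  # normalize spelling
            })
    return cleaned, rejected


cleaned, rejected = clean(RAW_ROWS)
for row, reason in rejected:
    print(f"rejected id={row['id']}: {reason}")
```

Keeping the rejected rows along with a reason, rather than silently dropping them, makes it possible to audit how much data each rule removes.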

Exactly what goes into data wrangling can vary. If the data comes from instruments or IoT devices, data transfer can be a major part of the process. If the data will be used for machine learning, transformations can include normalization or standardization as well as dimensionality reduction. If exploratory data analysis will be performed on personal computers with limited memory and storage, the wrangling process may include extracting subsets of the data. If the data comes from multiple sources, the field names and units of measurement may need consolidation through mapping and transformation.
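Two of those transformations can be sketched briefly: consolidating field names and units from different sources, and rescaling a numeric column. The field names (`temp_f`, `temperature`, `temp_c`) and the mapping table below are hypothetical, and the example assumes only one non-Celsius unit needs converting.

```python
from statistics import mean, pstdev

# Hypothetical mapping from each source's field name to a common name.
FIELD_MAP = {"temp_f": "temp_c", "temperature": "temp_c"}


def to_common_schema(record):
    """Rename fields to the common schema and convert Fahrenheit to Celsius."""
    out = {}
    for key, value in record.items():
        if key == "temp_f":
            value = (value - 32.0) * 5.0 / 9.0
        out[FIELD_MAP.get(key, key)] = value
    return out


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Rescale linearly so the smallest value is 0 and the largest is 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


source_a = [{"station": "A1", "temp_f": 68.0}, {"station": "A2", "temp_f": 86.0}]
source_b = [{"station": "B1", "temperature": 25.0}]

merged = [to_common_schema(r) for r in source_a + source_b]
temps = [r["temp_c"] for r in merged]
print(merged)
print(standardize(temps))
print(min_max(temps))
```

Which rescaling to use depends on the model that will consume the data; the sketch shows both only to illustrate the difference.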

What is exploratory data analysis?

Exploratory data analysis is closely associated with John Tukey, of Princeton University and Bell Labs. Tukey proposed exploratory data analysis in 1961, and wrote a book about it in 1977. Tukey’s interest in exploratory data analysis influenced the development of the S statistical language at Bell Labs, which later led to S-Plus and R.

Exploratory data analysis was Tukey’s reaction to what he perceived as over-emphasis on statistical hypothesis testing, also called confirmatory data analysis. The difference between the two is that in exploratory data analysis you investigate the data first and use it to suggest hypotheses, rather than jumping right to hypotheses and fitting lines and curves to the data.
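As a small illustration of investigating the data first, the sketch below computes a five-number summary and flags points that lie more than 1.5 interquartile ranges beyond the quartiles, two summaries commonly associated with Tukey's approach. The sample values are invented, and the "inclusive" quartile method is just one of several conventions.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return the minimum, quartiles, and maximum of values."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    return {"min": data[0], "q1": q1, "median": median,
            "q3": q3, "max": data[-1]}


def outliers(values, k=1.5):
    """Return values more than k interquartile ranges outside the quartiles."""
    s = five_number_summary(values)
    iqr = s["q3"] - s["q1"]
    low, high = s["q1"] - k * iqr, s["q3"] + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical measurements with one suspicious value.
sample = [2, 4, 4, 5, 6, 7, 8, 9, 40]
print(five_number_summary(sample))
print(outliers(sample))
```

A flagged value is a prompt for a question (a data-entry error, or a real extreme?), not a verdict; answering it is where exploration can suggest hypotheses.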

Copyright © 2021 IDG Communications, Inc.
