We’ve all read articles extolling the limitless potential of Artificial Intelligence (AI) and its cousin, machine learning. As a data analyst, you may be nervous about the sluggish progress your AI project team is making after it has slipped past multiple milestones. You are worried that management will lose patience and cancel what you see as an exciting project with tangible business value.
Most likely, your problem is the tedium, high effort, and unpredictability of data wrangling, which cause:
- Stalled AI projects.
- Increased AI project costs.
- Doubts around AI project insights and recommendations.
- Disappointing benefits from AI projects.
Here are the considerations that will position your AI project’s data for success.
Data wrangling struggle
Data wrangling refers to all the effort data scientists and software developers invest in preparing data before data exploration and data analytics can reveal the actionable insights you hope to gain.
You are acutely aware that data wrangling has consumed 60–80% of the effort of data analysts and data scientists for decades. The percentage has not decreased over time, despite vastly improved software and an increased emphasis on data stewardship in many organizations, because data volumes have grown exponentially over the same period. This growth has consumed the value of all the data management improvements that many organizations have implemented.
Improving data wrangling
The following categories of software can help you bring data wrangling under control:
- Data visualization.
- Data preparation.
- Extract, Transform, and Load (ETL).
- AI-driven data improvement insights.
- Robotic Process Automation (RPA).
- Custom SQL.
You can speed up and improve data wrangling by using this software and implementing clearer data improvement steps to make data preparation:
- Less tedious for expensive, experienced data experts.
- Less consumptive of staff effort overall.
- Quicker and therefore cheaper.
- More likely to achieve significant data quality improvements.
- More operationalized to run on a predefined schedule, in a governed and scalable manner.
All the steps below are performed to varying degrees by every AI project before moving on to the data analytics step that produces the business value. If you can’t relate the planned work of your project team to these steps, then it’s highly likely that your AI project is inadequately organized and is at risk of not reaching the project goal.
Discovering data
Data discovery is the first step. Its goal is to identify the data sources required to achieve the goal of your AI project. Typical data sources include:
- Operational datastores related to on-premises applications and to externally hosted Software-as-a-Service (SaaS) applications.
- Small informal datastores, usually Excel workbooks, that house crucial bits of information.
- Data warehouses and data lakes.
- Public free and subscription data.
The benefit of data discovery is that it provides an initial indication of the:
- Extent to which your organization’s data can support the goal of the AI project.
- Likely data preparation effort.
Data discovery is typically performed by analysts interviewing business and IT staff and does not require software support.
If your AI project team hasn’t quickly identified the required data sources, then your project is off to a poor start. You may be dealing with:
- Fundamental disagreements about the project goal and how to reach it.
- An overly ambitious AI project goal.
- Significant gaps in your application portfolio that must be addressed before the AI project can succeed.
Profiling data
Data profiling is the process of detecting the magnitude of various data problems in every selected data source. Data problems typically include:
- Missing data – either missing rows or nulls in columns where a value must exist.
- Invalid data – values that are not on the list of allowed values for a column, or that fall outside the allowed range.
- Invalid relationships – where the values in related columns, within a row or across related tables, cannot both be accurate.
- Inconsistent data – where values of the same column in multiple datastores do not match.
- Coverage limitations – when some data sources contain only a geographic or date range subset of the data.
- Misleading patterns – values across many rows that are likely erroneous.
- Unlikely outliers – values or rows that are not interesting aberrations but are more likely the result of problems in the data acquisition process.
Data profiling results are used to plan the work to improve the quality and completeness of the data.
Data profiling is too often performed by simple visual inspection of the data by experts in the data. This approach is tedious and expensive, and the results can be uneven. Data profiling can be performed more quickly and consistently using simple SQL queries and data visualization software to highlight and quantify data problems for resolution.
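As a rough sketch of what those simple profiling queries can look like, the example below counts missing, invalid, and duplicate values in a hypothetical customer_orders table. The table, column names, and allowed status values are assumptions for illustration, not a reference to any particular schema.

```sql
-- Quantify missing and invalid values in a hypothetical customer_orders table.
SELECT
    COUNT(*)                                              AS total_rows,
    SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END)  AS missing_customer_id,
    SUM(CASE WHEN order_date  IS NULL THEN 1 ELSE 0 END)  AS missing_order_date,
    SUM(CASE WHEN order_total < 0 THEN 1 ELSE 0 END)      AS negative_order_total,
    SUM(CASE WHEN status NOT IN ('OPEN', 'SHIPPED', 'CLOSED')
             THEN 1 ELSE 0 END)                           AS invalid_status
FROM customer_orders;

-- Quantify duplicated business keys that will need de-duplication.
SELECT order_number, COUNT(*) AS duplicate_count
FROM customer_orders
GROUP BY order_number
HAVING COUNT(*) > 1;
```

Queries like these are cheap to write, and their output quantifies each data problem so the clean-up work can be planned and prioritized.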
If your AI project team appears to be stuck in endless data profiling work, then the team needs to focus on how to enhance the automation of this step.
Structuring data
Structuring the data includes the following tasks:
- Change values in all date columns to a common data type and date format.
- Change values in key columns to the agreed format. This action often means reordering values in compound keys.
- Change the same column in all selected data sources to the same name and data type.
- Change values in numeric columns to conform to the standard for real or scientific notation and for the number of decimal places.
- Recalculate values that don’t conform to the unit of measure standards for the organization.
- Parse values from unstructured text columns into new columns.
The benefit of structuring the data includes minimizing the risk that:
- Data scientists will inadvertently add incompatible numeric values together that first need to be recalculated.
- Data scientists will perform date arithmetic on incompatible date values.
- Confusion about column definition and contents will produce misleading results.
- Joins will produce nonsensical results or fail when they should not.
Structuring the data is too complex to perform manually. Data preparation software can handle most of the structuring. Custom SQL software is typically developed for the rest.
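A minimal sketch of the kind of custom SQL involved, assuming a hypothetical orders_staging table; the key format, unit conversion, and text-parsing rule are illustrative standards rather than recommendations.

```sql
-- Restructure a hypothetical staging table into agreed formats and types.
CREATE TABLE orders_structured AS
SELECT
    -- Reorder the compound key into the agreed region + order number format.
    region_code || '-' || order_number                    AS order_key,
    region_code,
    customer_id,
    -- Convert text dates to a common DATE type.
    CAST(order_date_text AS DATE)                         AS order_date,
    -- Standardize numeric precision to two decimal places.
    CAST(order_total AS DECIMAL(12, 2))                   AS order_total,
    -- Recalculate to the organization's standard unit of measure (assumed: kg).
    CASE WHEN weight_unit = 'LB' THEN weight * 0.453592
         ELSE weight END                                  AS weight_kg,
    -- Parse a purchase order reference out of an unstructured notes column.
    CASE WHEN POSITION('PO:' IN notes) > 0
         THEN SUBSTRING(notes FROM POSITION('PO:' IN notes) + 3 FOR 10)
         ELSE NULL
    END                                                   AS purchase_order_ref
FROM orders_staging;
```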
If your AI project team appears to be struggling to complete data structuring, the problem is often that the team is immersed in endless debates about ideal data standards. As a data analyst, you can remind them that the project schedule is important and that compromises are acceptable within the project scope.
Cleaning data
Cleaning up the data corrects the problems that the data profiling step identified. This step is traditionally the most time-consuming part of the data preparation process. Data clean-up tasks implement master data management concepts. Major tasks include:
- Filling in missing values.
- Correcting invalid data and relationships.
- Selecting the best value when inconsistent data exists.
- Removing duplicate, extraneous and outlier data rows.
- Masking private or sensitive values.
The benefit of cleaning up the data is that it ensures high-confidence, reproducible insights from data analysis.
The following software is typically used, where possible, to automate the clean-up:
- Data preparation.
- Custom SQL software.
- A user-assisted RPA bot for trickier or ambiguous cases.
The remaining data clean-up that is not amenable to automation is performed manually by data experts. However, organizing the work into these steps and applying software dramatically reduces the effort and improves the data quality.
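To make the automatable portion concrete, here is a minimal SQL sketch of typical clean-up statements, continuing with the hypothetical tables from the earlier steps. The customer_master table, its fallback columns, the surrogate row_id, and the masking rule are all assumptions for illustration.

```sql
-- Select the best value when inconsistent data exists: prefer the CRM value,
-- otherwise fall back to the billing system value (assumed precedence rule).
UPDATE customer_master
SET country_code = COALESCE(crm_country_code, billing_country_code)
WHERE country_code IS NULL;

-- Remove duplicate rows, keeping the highest surrogate row_id per customer
-- (assumed to be the most recently loaded copy).
DELETE FROM customer_master
WHERE row_id NOT IN (
    SELECT MAX(row_id)
    FROM customer_master
    GROUP BY customer_id
);

-- Mask sensitive values before the data reaches the analytics environment.
UPDATE customer_master
SET email_address = 'masked@example.com'
WHERE email_address IS NOT NULL;
```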
The clean-up can introduce a risk that the data analytics conclusions will be challenged. It’s important that the software used for this task produce a detailed audit trail of the changes made for manual review.
Manual review of the clean-up results by data experts will increase assurance that the data analytics conclusions are credible.
If your AI project team appears to be daunted by the amount of data clean-up work, you can help the team prioritize the work and scale back self-imposed perfectionism.
Validating data
Once data has been cleaned up, it must be validated by testing for errors introduced or missed by the data preparation process up to this point. Validation is typically used to:
- Confirm the accuracy and completeness of the data preparation work.
- Identify more subtle, more complex, or more hidden problems in the data.
The benefit of validating the data is that previously unrecognized errors in the data and in the design of the data integration will become apparent and can then be corrected.
Validation is typically performed quickly and consistently using data visualization software and ETL software to highlight data integration problems for resolution by data experts.
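A hedged sketch of the kind of reconciliation queries that support validation, reusing the hypothetical staging, structured, and master tables from the earlier steps.

```sql
-- Confirm completeness: row counts and totals should reconcile between the
-- source and the prepared data (allowing for rows deliberately removed).
SELECT
    (SELECT COUNT(*)         FROM orders_staging)    AS source_rows,
    (SELECT COUNT(*)         FROM orders_structured) AS prepared_rows,
    (SELECT SUM(order_total) FROM orders_staging)    AS source_total,
    (SELECT SUM(order_total) FROM orders_structured) AS prepared_total;

-- Surface subtler integration problems: orders whose customer is missing
-- from the cleaned customer master.
SELECT o.order_key, o.customer_id
FROM orders_structured o
LEFT JOIN customer_master c ON c.customer_id = o.customer_id
WHERE c.customer_id IS NULL;
```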
If your AI project team insists that all the errors identified during validation must be addressed before the AI project can move forward, then help the team prioritize the errors and scale back self-imposed over-validation.
Enriching data
Enriching data adds value to the data in the datastores. Data can be enriched in a number of ways, including:
- Integrating tables of external data that supplement internal data.
- Performing code lookups and storing the retrieved values as a calculated column.
- Adding calculated columns and persisting them in the data.
- Creating new tables of aggregated data, sometimes known as data marts, where the keys to the summarization are predictable.
- Performing joins and persisting the results in new tables.
The benefit of this enrichment step is that it greatly speeds up subsequent data analytics processes. The cost of creating the enriched data is typically a small fraction of the benefit it delivers to the many data analytics processes that are repeated multiple times.
The following software is typically used to enrich data:
- ETL software is used to integrate the external data and create the aggregated tables.
- RPA bots can handle trickier cases where logic must be evaluated.
- Custom SQL software is typically developed for performing code lookups and joins.
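A sketch of what that enrichment can look like in SQL, again using the hypothetical tables from the earlier steps; the region lookup table and the flat margin rate are assumptions for illustration only.

```sql
-- Persist a code lookup and a calculated column so analysts need not repeat them.
CREATE TABLE orders_enriched AS
SELECT
    o.order_key,
    o.order_date,
    o.order_total,
    r.region_name,                             -- code lookup persisted as a column
    o.order_total * 0.10 AS estimated_margin   -- assumed flat 10% margin, for illustration
FROM orders_structured o
JOIN region_codes r ON r.region_code = o.region_code;

-- Persist an aggregated table (a small data mart) keyed on predictable
-- summarization keys: region and month.
CREATE TABLE sales_by_region_month AS
SELECT
    r.region_name,
    EXTRACT(YEAR FROM o.order_date)  AS order_year,
    EXTRACT(MONTH FROM o.order_date) AS order_month,
    SUM(o.order_total)               AS total_sales,
    COUNT(*)                         AS order_count
FROM orders_structured o
JOIN region_codes r ON r.region_code = o.region_code
GROUP BY r.region_name,
         EXTRACT(YEAR FROM o.order_date),
         EXTRACT(MONTH FROM o.order_date);
```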
If your AI project team has added too much scope to enriching data, then help the team to prioritize the ideas and leave some of the ideas for another day.
Now your AI project is ready for the data analytics step that produces the business value.
What strategies would you recommend to reduce the tedium, effort, and cost of the data wrangling needed to clean datastore contents? Let us know in the comments below.