Data scientists’ actions are out of focus

Yogi Schulz

6 years ago

Many data scientists appear firmly focused on chasing excessive complexity by:

Accessing more data volume and more data sources.
Making their models fancier and fancier.
Consuming large amounts of compute resources.

Data scientists seem convinced these actions are essential prerequisites to finding new insights that provide business value. But what if this drive to complexity is often out of focus?

What if valuable insights can just as easily be identified by sticking with close-at-hand data and by viewing the current model as sufficient? What if most of the business value is actually achieved by turning data science projects into easily accessible production-quality applications?

More data does not achieve better insights

Data is the raw material that almost always contains useful insights. The challenge is that these valuable insights are often well camouflaged in the data. Beyond a surprisingly easy-to-reach point, more data volume or more data sources won’t:

Produce superior or groundbreaking insights.
Result in a larger number of insights.
Mean you’re able to squeeze more value from your data.

However, better understanding of your current data often leads to better insights. So, before you rush away looking for more data volume or more data sources, how well do you understand:

How the various pieces of your data were collected?
How your data was manipulated, aggregated or filtered before it came into your possession?
What the quality limitations of your data are?
Whether the data is appropriate for your data science hypothesis?
Whether the data is incomplete, skewed or biased?
How well the data structures align with your data science hypothesis?
How the original goal for collecting the data differs from your current data science hypothesis?
How you can verify or validate your data?
Your data cleaning strategy? You will eventually have to clean some data even if you don’t think you will need to at the beginning of your data science project.
The rationale for omitting outlier data?

Addressing the issues revealed by the answers to these and similar questions can lead you to a superior data solution without adding more data volume or additional data sources.
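As a rough illustration of what understanding your current data can look like in practice, here is a minimal sketch that profiles a single numeric column for missing values and outlier candidates. The function name `profile_column`, the sample `readings` and the 1.5 × IQR rule are assumptions chosen for illustration, not a prescribed method. The point is only that a few lines of inspection can surface quality limitations before anyone asks for more data.

```python
from statistics import quantiles


def profile_column(values):
    """Summarize one numeric column: missing count and IQR-based outlier candidates."""
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)

    # Quartile cut points of the values that are actually present.
    q1, _, q3 = quantiles(present, n=4)
    spread = q3 - q1

    # Assumption: flag anything beyond 1.5 * IQR from the quartiles.
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    outliers = [v for v in present if v < low or v > high]

    return {"count": len(values), "missing": missing, "outliers": outliers}


# Hypothetical sensor readings with two gaps and one suspicious value.
readings = [10.0, 11.0, None, 12.0, 11.5, 10.5, 95.0, None, 11.2]
report = profile_column(readings)
print(report)
```

A flagged value is only a candidate. Whether to omit it still needs the documented rationale asked for in the list above.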

Fancier models do not achieve better insights

To extract insights from your data, you need a model that is defined by software. A better model design can produce better insights.

However, before you make your model fancier, are you sure that you:

Understand the problem statement well?
Have crafted a clear hypothesis for which you can reach an unequivocal conclusion with the data and model that you have at hand?
Have adequately considered the business context and not just the algorithms in your model design?
Have developed an approach to testing or verifying your model for coherence?
Have conducted some exploratory and sensitivity analysis to confirm your model design?
Understand the uncertainty, confidence level, and error term inherent in your model?
Have assessed your model design for the presence of potentially misleading statistics?
Have considered how you will move the model into production?

Responding to the answers for these and similar questions can lead you to a superior and defensible model without making your model fancier.
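The sensitivity analysis mentioned above does not require a fancier model. Below is a minimal sketch of a one-at-a-time sensitivity check, in which each input is raised by 10 percent on its own and the relative change in the output is recorded. The toy model `forecast_revenue`, its inputs and the baseline values are hypothetical and exist only to demonstrate the technique.

```python
def forecast_revenue(price, units, churn_rate):
    """Hypothetical toy model, used only to illustrate the technique."""
    return price * units * (1.0 - churn_rate)


def one_at_a_time_sensitivity(model, baseline, bump=0.10):
    """Relative change in model output when each input is raised by `bump` on its own."""
    base_output = model(**baseline)
    effects = {}
    for name, value in baseline.items():
        # Copy the baseline and perturb only the input under test.
        perturbed = dict(baseline, **{name: value * (1.0 + bump)})
        effects[name] = (model(**perturbed) - base_output) / base_output
    return effects


# Assumed baseline values for the hypothetical model.
baseline = {"price": 20.0, "units": 1000.0, "churn_rate": 0.05}
effects = one_at_a_time_sensitivity(forecast_revenue, baseline)

# List inputs from most to least influential.
for name, change in sorted(effects.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {change:+.1%}")
```

If an input barely moves the output, refining how the model treats it is unlikely to yield better insights.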

Huge computational resources do not achieve better insights

To extract insights from your data, you must obviously run your model and that action consumes computational resources. Many data scientists appear to believe you need tons of computational resources to run the model repeatedly to better understand model accuracy and sensitivity as key variable values are varied.

However, before you consume tons of computational resources to run your model with all its associated data, have you assessed how your model might still be effective while consuming fewer resources:

If the time interval in your data became longer?
If one or more data sources were removed?
If the date range of your data were reduced?
If use of floating-point arithmetic could be reduced or eliminated in favor of integer arithmetic?
If the underlying databases were better optimized for performance?
If the arithmetic could be simplified by respecting significant digits?
If the algorithms in the model could be simplified?

Responding to the answers for these and similar questions can lead you to an optimized and adequate model that consumes significantly fewer computational resources without undermining the business value of the model.
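One inexpensive way to test the first question in the list, a longer time interval, is to fit the same simple model on the original data and on a coarser aggregate, then compare the results. The sketch below does this for a least-squares trend line. The synthetic `demand` series, the six-hour block size and the helper names `fit_line` and `downsample_mean` are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def downsample_mean(xs, ys, block):
    """Average consecutive blocks of `block` points to lengthen the time interval."""
    xs_out, ys_out = [], []
    for start in range(0, len(xs) - block + 1, block):
        xs_out.append(sum(xs[start:start + block]) / block)
        ys_out.append(sum(ys[start:start + block]) / block)
    return xs_out, ys_out


# Synthetic hourly series: a linear trend plus a small alternating wiggle.
hours = list(range(240))
demand = [50.0 + 0.3 * h + (2.0 if h % 2 else -2.0) for h in hours]

full_slope, _ = fit_line(hours, demand)

# Same model on six-hour averages: one sixth of the data points.
coarse_hours, coarse_demand = downsample_mean(hours, demand, block=6)
coarse_slope, _ = fit_line(coarse_hours, coarse_demand)

print(f"hourly slope:   {full_slope:.4f} from {len(hours)} points")
print(f"six-hour slope: {coarse_slope:.4f} from {len(coarse_hours)} points")
```

When the two estimates agree within the tolerance the business decision needs, the coarser interval is likely sufficient. Real data will not behave as neatly as this synthetic series, so treat the comparison as a check to run, not a guaranteed outcome.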

How would you push back when your data scientists keep asking for more and more data? How would you respond if your data scientists present a list of reasons to make the model fancier? What would you say if your data scientists want to spend more money on cloud compute resources?