
If you build it, a data lakehouse, they will come


If your IT department builds a data lakehouse, will business end-users come? Unfortunately, some CIOs forget they’re not working with Kevin Costner on a sequel to the movie Field of Dreams. Instead, they are sucked into sponsoring an enterprise data lakehouse project by their IT staff. A data lakehouse combines the low operating cost of a data lake with the data management and structure features of a data warehouse on one platform.

These CIOs are genuinely shocked when almost no one cares about, or wants to come and use, the shiny new data lakehouse for business intelligence (BI) applications. They are even more astounded when the organization complains about wasted money. The CIOs expected the organization to sing their praises for an initiative that improves data integration, accessibility and analytics.

What could possibly have gone wrong?

IT sponsorship vs. business sponsorship

When a well-intentioned CIO sponsors a data lakehouse project, the project will operate without the following:

  1. The essential high-level guidance about business priorities that senior management provides.
  2. The support of middle management to allocate resources for improving data quality.
  3. The involvement of business analysts needed to understand the detailed business requirements.

A data lakehouse project dominated by IT leadership will lose momentum as development costs climb and no deliverables valuable to end-users, such as reports and charts, are produced. Eventually, the project is cancelled, and the reputation of the IT leadership takes a hit.

A superior approach is to build BI applications with business sponsorship supported by IT leadership. The priority then becomes addressing specific business problems, not IT’s assumptions about business data and requirements. The stakeholders understand that the underlying data lakehouse is critical supporting infrastructure. However, that infrastructure does not dominate the project.

Technology focus vs. business benefit focus

A data lakehouse project dominated by IT staff will tend to use the latest technology for developing and operating a data lakehouse, data lake or data warehouse. This focus occurs because the staff:

  1. Is convinced the latest technology will best support robust BI applications.
  2. Typically builds robust custom applications with extensive data validation, operational features, security and backup/recovery included.
  3. Enjoys exploring the newest technology.
  4. Is building their resumes in anticipation of a call from a headhunter.

A dramatically cheaper approach to building BI applications is to leave as much of the data as possible in the operational datastores (ODSs) where it resides and query it in place, as sketched below. Only copy and transform data to a data lakehouse if the ODS structure is seriously unworkable in a BI context. This approach leaves more of the project budget for developing the BI reports and charts that deliver the needed business benefits.
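As a hedged illustration, a federated query engine such as DuckDB can read an ODS in place and join it with only the tables that genuinely earned a spot in the lakehouse. The sketch below assumes a Postgres ODS and one Parquet dimension table; every connection string, path, table and column name is hypothetical.

```python
# A minimal sketch of the "leave data in place" approach, assuming a
# Postgres ODS and a Parquet dimension table. All names are hypothetical.
import duckdb

con = duckdb.connect()
con.execute("INSTALL postgres")
con.execute("LOAD postgres")

# Attach the operational datastore and query it in place: no copy, no transform.
con.execute(
    "ATTACH 'dbname=sales host=ods.example.com user=bi_reader' "
    "AS ods (TYPE postgres, READ_ONLY)"
)

# Join live ODS orders with the one dimension that was worth moving,
# stored in the lakehouse as Parquet files.
report = con.execute("""
    SELECT p.product_category,
           SUM(o.order_total) AS revenue
    FROM ods.public.orders AS o
    JOIN read_parquet('lakehouse/dim_product/*.parquet') AS p
      ON o.product_id = p.product_id
    GROUP BY p.product_category
    ORDER BY revenue DESC
""").fetchdf()
print(report)
```

Only if queries like this prove unworkable against the ODS schema does the table graduate to a copy-and-transform pipeline into the lakehouse.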

Simple data sources vs. valuable data sources

A data lakehouse project dominated by IT staff will tend to import simple internal data sources into the data lakehouse because the development effort is low. Also, the IT staff is typically unaware of useful external data sources.

A superior approach to building BI applications is to collaborate with business analysts to rank data sources in decreasing order of business value. Then add the internal and external data sources to the BI environment one at a time, each as a new release. Only add another data source once most of the previous release’s BI reports and charts have been completed. This approach minimizes time to value, ensures the most business value is achieved, and maintains stakeholder support for the BI project.

Elaborate architecture vs. minimal architecture

IT architects who dominate a project will design the data lakehouse around an idealized framework. The resulting architecture is often too elaborate to understand easily, challenging to load and expensive to maintain.

A superior approach to architecting a data lakehouse environment is to balance trade-offs among the following design goals carefully:

  1. Query performance.
  2. Minimizing the amount of data copied and transformed from ODSs.
  3. Query development complexity.
  4. Data lakehouse load complexity.
  5. Operating and maintenance costs.

Every design idea that improves query performance is worth implementing, even if it adds complexity to the data lakehouse load. Allowing idealized frameworks, however widely admired, to dominate the design is always a bad idea.

Data quantity vs. data quality

A data lakehouse project sponsored by the CIO will gravitate toward data quantity because the team doesn’t know which data sources are most valuable.

However, this quantity approach is blind to data quality issues. These issues will slow or inhibit the:

  1. Acceptance of the data lakehouse as a functional BI environment.
  2. Development of enterprise and departmental BI applications.

Poor data quality first manifests itself through these IT technical issues:

  1. Hindering the integration of data from multiple sources.
  2. Creating summation errors.
  3. Causing software crashes.
  4. Causing system performance problems.

Then poor data quality leads to these business issues:

  1. A lack of confidence in reports and charts.
  2. Uninformed or misinformed decision-making that adds risk.
  3. Inaccurate problem analysis that adds cost.
  4. Poor customer relationships that reduce sales and market share.
  5. Disappointing product launches that slow growth.

A data lakehouse project sponsored by the CIO has no clout with the business to address data quality issues. The project will fail because the end-user-visible deliverables are sparse and not helpful.

A superior approach to building BI applications is to:

  1. Prioritize data sources for inclusion in the BI project based on business value.
  2. Expect data quality issues and allocate business resources to improve data quality.
  3. Assess data sources for data quality issues. To reduce time to value, fix easy data problems first.

This approach ensures that the BI reports and charts are accurate and will build confidence in the BI applications.
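To make step 3 concrete, here is a minimal sketch of a first-pass quality assessment, assuming a source that arrives as a CSV extract. The file name and column names (invoice_id, invoice_date, amount) are hypothetical; the checks mirror the technical issues listed earlier.

```python
# A first-pass data quality profile of a hypothetical CSV extract.
import pandas as pd

df = pd.read_csv("vendor_invoices.csv", dtype=str)

report = {}
# Missing values block integration and cause summation errors downstream.
report["null_rate_pct"] = (df.isna().mean() * 100).round(1).to_dict()
# Duplicate business keys hinder joining this source to others.
report["duplicate_invoice_ids"] = int(df["invoice_id"].duplicated().sum())
# Unparseable dates are an easy-fix-first candidate: standardize at load time.
parsed_dates = pd.to_datetime(df["invoice_date"], errors="coerce")
report["unparseable_dates"] = int(
    parsed_dates.isna().sum() - df["invoice_date"].isna().sum()
)
# Amounts that fail numeric conversion usually hide embedded currency text.
parsed_amounts = pd.to_numeric(
    df["amount"].str.replace(",", "", regex=False), errors="coerce"
)
report["unparseable_amounts"] = int(
    parsed_amounts.isna().sum() - df["amount"].isna().sum()
)

for check, result in report.items():
    print(check, "->", result)
```

Counts like these give the business sponsor a concrete, prioritized list of quality problems, with the cheap fixes (date and amount formats) separated from the expensive ones (duplicate keys).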

Data inconsistencies vs. data standards

Data inconsistencies make integration difficult, complicate query development and slow query performance. Inconsistencies can occur in reference, master and transaction data. For example:

  1. Incompatible identifiers for key data, such as vendor or product, across IT systems.
  2. Variations on $1,000, such as 1,000, 1000 CDN, CDN 1000, 1000.00 or “one thousand dollars.”
  3. Variations in unit-of-measure abbreviations, such as kg, Kg, kilogr and KG.
  4. Numbers that are not left zero-filled.
  5. Text that is right-justified rather than left-justified.
  6. Multiple date formats.
  7. The letter O used instead of zero.
  8. Incorrect conversions between EBCDIC and ASCII.

A data lakehouse project may ignore these inconsistencies because resolving them complicates the ETL software that integrates data from diverse sources. However, the result is a data lakehouse that end-users cannot use. The organization is better served by the CIO championing the setting of data standards, enforced by normalization rules like those sketched below.
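As a hedged illustration, an ETL step could apply normalization rules such as the following before loading. The mapping table and regular expression are illustrative assumptions covering just two of the inconsistencies above, not a complete data standard.

```python
# Illustrative normalization rules for money strings and unit-of-measure
# abbreviations. The mappings and regex are assumptions, not a standard.
import re

UNIT_STANDARD = {"kg": "kg", "kilogr": "kg", "g": "g", "lb": "lb", "lbs": "lb"}

def normalize_amount(raw: str) -> float | None:
    """Reduce '1,000', 'CDN 1000', '1000 CDN' or '$1000.00' to a float."""
    cleaned = re.sub(r"(?i)\b(cdn|cad|usd)\b|[$,]", "", raw).strip()
    try:
        return float(cleaned)
    except ValueError:
        return None  # Route to a data-quality queue instead of guessing.

def normalize_unit(raw: str) -> str | None:
    """Map kg / Kg / KG / kilogr to the single standard 'kg'."""
    return UNIT_STANDARD.get(raw.strip().lower())

assert normalize_amount("CDN 1000") == 1000.0
assert normalize_amount("1,000.00") == 1000.0
assert normalize_unit("KG") == "kg"
```

The point of a standard is that rules like these are written once, agreed to by the business, and applied identically by every pipeline, rather than re-invented in each ETL job.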

To ensure that they, the business end-users, will come, the CIO should quit listening to their ambitious techies and champion building BI applications with business sponsorship supported by IT leadership.

What ideas can you contribute to help organizations build BI applications that deliver business value? We’d love to read your opinion.
