Over the past decade, powerful Big Data storage and retrieval technologies have emerged that give us added power to harvest the information cornucopia surrounding us. But as our world becomes more complex, the sheer variety of these tools has made our choices and our environments more complicated as well.
At the CIO Peer Forum in Vancouver on April 10-11, I will explore how to de-clutter the available database styles by doing something novel: discussing what each of these technologies is NOT good at, a topic that vendors shy away from.
What am I referring to? Along with relational databases, we now have Hadoop, different flavors of NoSQL (document, key/value, wide column, etc.), graph, columnar, and hybrids of these.
Typical pontificating high-tech writers or speakers (perhaps like me) use the Volume statistics of the 3, 4, or 5 V’s (Volume, Variety, Velocity, Veracity, Value) to drive home the challenge of Big Data.
If pure volume were our problem, more and faster hardware would be our answer. Our real challenge, however, is Variety: how do we bring together new types of data that were never meant to go together? Yet we never see statistics for Variety.
To cope with the complexity caused by the variety of data, we now have a variety of new power tools coming to the rescue. However, each of these tools has its own complexity: its own language (GET and PUT instead of SQL, MapReduce, or Spark), and therefore its own skills and its own use cases.
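To make that concrete, here is a tiny, illustrative sketch in Python: sqlite3 stands in for a relational database, and a plain in-memory dict stands in for a key/value store (the customer data and key names are made up for illustration). Even this toy comparison shows how different the two access styles feel:

```python
import sqlite3
import json

# Relational style: declarative SQL over a schema (sqlite3 stands in here).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT)")
db.execute("INSERT INTO customers VALUES (1, 'Acme Corp', 'West')")
rows = db.execute("SELECT name FROM customers WHERE region = 'West'").fetchall()
print(rows)  # [('Acme Corp',)]

# Key/value style: no schema, no query language -- just PUT and GET by key.
# (A plain dict stands in for the store; real products differ in the details.)
kv_store = {}
kv_store["customer:1"] = json.dumps({"name": "Acme Corp", "region": "West"})  # PUT
doc = json.loads(kv_store["customer:1"])                                      # GET
print(doc["name"])  # Acme Corp

# Note: answering "all customers in the West" is trivial in SQL but, in a pure
# key/value store, means scanning keys or maintaining your own index -- a small
# taste of why each tool has its own use cases.
```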
Along with the variety of data comes a variety of skills. Besides each of these languages or dialects, we have to learn to deal with new file types (JSON, Avro, etc.), plus there are hundreds of middleware technologies. (John Schmidt and I counted about 500 in the mid-2000s. For every consolidation, there seems to be equal proliferation. And old technologies never die or get turned off, do they?)
Back in my early days, a software manager of mine wrote an equation on the whiteboard: the value of software equals the power of the software multiplied by the knowledge of the user. For instance, Excel is a relatively simple tool, but the value derived comes from the knowledge of the user. With more ‘powerful’ enterprise software, users may have only rudimentary training, which is perhaps why so many users stay with Excel.
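For what it’s worth, here is that whiteboard equation as a toy calculation; the 0-to-10 scales and the specific numbers are invented purely for illustration:

```python
# A toy rendering of the whiteboard heuristic: value = power * user knowledge.
def software_value(tool_power: float, user_knowledge: float) -> float:
    """Value of software = power of the software x knowledge of the user."""
    return tool_power * user_knowledge

# A modest tool in expert hands can deliver more value than a powerful tool
# used with only rudimentary training -- the Excel effect.
print(software_value(tool_power=4, user_knowledge=9))  # 36
print(software_value(tool_power=9, user_knowledge=2))  # 18
```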
So, which is better: a non-top-of-the-line tool that is widely adopted, or a great tool that isn’t adopted? Obviously, you want both: a great tool that is widely adopted. But frequently, teams are stuck with what’s available or what has been mandated.
Back to data store technologies: it would be extremely helpful to have a “decision tree” that walks teams through some kind of common thought process, pointing to a style of technology based on access pattern, ACID requirements, analytic needs, retrieval patterns, and so on.
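To give a flavor of what I mean (and only a flavor), here is a hypothetical, greatly over-simplified decision function in Python. The criteria come from the list above, but the specific mappings are my own illustrative assumptions, not a finished decision tree:

```python
# A hypothetical, greatly simplified decision sketch -- not a real recommender.
def suggest_store_style(needs_acid: bool,
                        access_pattern: str,     # "key_lookup", "relationship_traversal", "ad_hoc_query"
                        analytic_need: str) -> str:  # "none", "column_scans", "batch_over_files"
    if needs_acid and access_pattern == "ad_hoc_query":
        return "relational"
    if access_pattern == "relationship_traversal":
        return "graph"
    if analytic_need == "column_scans":
        return "columnar"
    if analytic_need == "batch_over_files":
        return "Hadoop / data lake"
    if access_pattern == "key_lookup":
        return "key/value or document NoSQL"
    return "needs more discussion"

print(suggest_store_style(needs_acid=True, access_pattern="ad_hoc_query", analytic_need="none"))
# -> relational
print(suggest_store_style(needs_acid=False, access_pattern="key_lookup", analytic_need="none"))
# -> key/value or document NoSQL
```

The point is not the particular branches, which reasonable people will argue about, but that a shared, explicit set of questions beats ad hoc tool selection.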
And it is the use cases that I find most problematic. I used to be a vendor (though not of data store technologies), and I know that vendors have a very hard time saying they CAN’T do something, or that they AREN’T good at solving that problem, or that you SHOULDN’T use their product for your specific use case.
Well, I’m going to try to lay that out in simple terms at the CIO Peer Forum in Vancouver on April 10-11. I promise to write up my observations after the conference. I’m looking forward to incorporating the collected thinking of the last decade and to bringing folks up to date on production-ready, next-generation data storage and retrieval capabilities, with attention to how they are used in and around analytics environments. Stay tuned!
David Lyle of Pacific Data Integrators will be speaking at the CIO Peer Forum in Vancouver.