Originally published on the DataArt blog.
Information technology has always been full of surprisingly contradictory beliefs, and every market, product or community has its own FAQ list or Top 10 Myths whitepaper. This week brought another “myth case” to my desk. Though it has been around for several years already, it is still hot. While my fellow database developers were busy completing another data warehousing project (a “traditional” relational solution, by the way) for a travel firm, our marketing department approached me to discuss how we should define our new data warehousing offering. Their question and concern was: “Hasn’t big data killed data warehousing already?”
The question seems tricky and invites a dive into architectural details and pros and cons: which solution better supports data intake, business analytics or interactive visualisation. I have to confess, I’m no saint, so I started with the categories that a database professional’s mind dictates: read and write efficiency, scalability, data consistency, data query technologies. The list kept growing, but it was not taking me any closer to the answer. I spent some time trying to sort out differentiators for each technology, but with no success. (Technology is the key word here, so remember it and continue reading.)
The reason I failed to produce a good comparison is quite simple: my database pro’s brain assumed that the term “data warehouse” is equal to “relational data warehouse”. We know that relational data warehouses (or “traditional” data warehouses, as some marketing whitepapers say) are in fact relational databases that host structured data. But what if we remove “relational” from the equation? What does “data warehouse” mean then? Can we have a non-relational DW?
To answer these rhetorical questions, let us take a quick look at the definitions. There are several authoritative sources in this area; I will quote two. Here is how Ralph Kimball describes the enterprise data warehouse:
“The complete end-to-end data warehouse and business intelligence system (DW/BI System). Although some would argue that you can theoretically deliver business intelligence without a data warehouse and vice versa, that is ill-advised from our perspective. Linking the two together in the DW/BI acronym reinforces their dependency. Independently, we refer to the queryable data in your DW/BI system as the enterprise data warehouse, and value-add analytics as BI (business intelligence) applications.”
(Source: The Data Warehouse Lifecycle Toolkit, Second Edition (Wiley, 2008). Ralph Kimball and colleagues)
Note that the word “relational” does not appear here. Not even “database”. Another widely used definition is given by Bill Inmon, who is considered the father of data warehousing:
“The data warehouse is a collection of integrated subject-oriented databases designed to support the DSS (decision support system) function, where each unit of data is relevant to some moment in time. The data warehouse contains atomic data and lightly summarized data.”
(Source: Building the Data Warehouse, Fourth Edition (Wiley, 2005), W.H. Inmon)
Though we can see the term “database” here, Mr. Inmon says nothing about which particular technology should be used. In my personal opinion, “database” in the latter definition is pretty much equivalent to “queryable data” in the first one, or to “an array of queryable data sets” in my own words.
Thus, theoretically speaking, data warehouse implementations are not bound to any particular database technology. Today we can see a very diverse landscape of products and solutions that make it possible to build a coherent and mature “single point of truth”. These could be “classic” relational data warehouses, NoSQL data warehouses, cloud data warehouses, hybrid data warehouses, and … surprise! … data warehouses based on big data technologies.
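To make the point a little more tangible, here is a minimal Python sketch, not a real warehouse design. It assumes a hypothetical travel-firm data set (`BOOKINGS`) and a hypothetical interface (`QueryableDataSet`), neither of which comes from the definitions above. The same business question is answered by a relational backend (in-memory SQLite) and by a schemaless, document-style backend, and the reporting code cannot tell them apart.

```python
import sqlite3
from collections import defaultdict
from typing import Dict, Protocol

# Hypothetical sample facts, invented purely for illustration.
BOOKINGS = [
    {"destination": "Rome", "revenue": 1200.0},
    {"destination": "Oslo", "revenue": 800.0},
    {"destination": "Rome", "revenue": 300.0},
]


class QueryableDataSet(Protocol):
    """What the business layer needs: an answer, not a storage engine."""

    def revenue_by_destination(self) -> Dict[str, float]: ...


class RelationalBookings:
    """Backend 1: a relational table queried with SQL (in-memory SQLite)."""

    def __init__(self, rows):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE bookings (destination TEXT, revenue REAL)"
        )
        self.conn.executemany(
            "INSERT INTO bookings VALUES (:destination, :revenue)", rows
        )

    def revenue_by_destination(self) -> Dict[str, float]:
        cur = self.conn.execute(
            "SELECT destination, SUM(revenue) FROM bookings GROUP BY destination"
        )
        return dict(cur.fetchall())


class DocumentBookings:
    """Backend 2: schemaless records aggregated in application code."""

    def __init__(self, rows):
        self.docs = list(rows)

    def revenue_by_destination(self) -> Dict[str, float]:
        totals = defaultdict(float)
        for doc in self.docs:
            totals[doc["destination"]] += doc["revenue"]
        return dict(totals)


def report(dataset: QueryableDataSet) -> Dict[str, float]:
    """A stand-in for a BI application; it never inspects the backend type."""
    return dataset.revenue_by_destination()


relational_answer = report(RelationalBookings(BOOKINGS))
document_answer = report(DocumentBookings(BOOKINGS))
print(relational_answer == document_answer)
```

Both backends are toys, of course, but the design choice is the one the definitions hint at: the consumer depends on queryable data, and the storage technology behind it can be swapped.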
Let’s now get a good definition of big data and see how it compares to the definitions given for the term “data warehouse”. Unlike data warehousing, big data is quite young, so its definition has long been debated at conferences and on Wikipedia. Here is what we have there at the moment of writing:
“Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time. Big data “size” is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data. Big data requires a set of techniques and technologies with new forms of integration to reveal insights from datasets that are diverse, complex, and of a massive scale.”
Reviewing the definitions, you can see how big data differs from a data warehouse: big data is a set of technologies that provide a way to acquire, store and process large amounts of data, whereas a DW is a collection of data sets built for a purpose, namely enabling the business to discover information in the data and unlock insights from that information. Can we compare a technology with purpose-built data sets? No. It is like comparing apples to oranges or nails to trees: these are not comparable things.
When we speak about data warehouses, we should keep in mind that they cannot be implemented without data warehousing. Data warehousing is the colloquial term for the process of moving through the DW lifecycle: planning, designing, implementing and so on. There are several versions of what the process should look like, but the key piece is the same: the process should be business oriented, and business stakeholders should be highly involved in defining requirements, evaluating impact and value, and modelling the enterprise data glossary. All these activities influence the final data warehouse architecture and, of course, the technology stack. As shown above, big data technologies may be involved. In fact, today they are highly involved.
Now we can answer the headline question: hasn’t big data killed data warehousing already? Of course not! Technology cannot kill the process, and it cannot kill the mature architecture and the related design patterns and best practices. Personally, I prefer to think that big data removes the limitations that were holding back “traditional” data warehouse implementations. And I’m not alone in my thoughts. Should you still be in doubt, take a look at what recognised thought leaders say.
Further reading on big data vs. data warehousing
- Big Data Implementation vs. Data Warehousing (Bill Inmon)
- Newly Emerging Best Practices for Big Data (A Kimball Group White Paper by Ralph Kimball)
- Hadoop and the Data Warehouse: When to Use Which (Dr. Amr Awadallah, Founder and CTO, Cloudera, Inc; Dan Graham, General Manager, Enterprise Systems, Teradata Corporation)