photo by Tsahi Levent-Levi
by Iggy Fernandez
excerpted from The Twelve Days of NoSQL (http://iggyfernandez.wordpress.com/)
NoSQL and Big Data are “disruptive innovations” in the sense used by Harvard professor Clayton M. Christensen. In The Innovator’s Dilemma: When New Technologies Cause Great Firms to Fail, Christensen defines disruptive innovations and explains how they eventually lead to the failure of the established players who ignored them:
“Generally, disruptive innovations were technologically straightforward, consisting of off-the-shelf components put together in a product architecture that was often simpler than prior approaches. They offered less of what customers in established markets wanted and so could rarely be initially employed there. They offered a different package of attributes valued only in emerging markets remote from, and unimportant to, the mainstream.”
Established players usually ignore disruptive innovations because they do not see them as a threat to their bottom lines. In fact, they are more than happy to abandon the low-margin segments of the market, and their profitability actually increases when they do so. The disruptive technologies eventually take over most of the market and the established players fail.
An example of a disruptive innovation is the personal computer. The personal computer was initially targeted only at the home computing segment of the market. Established manufacturers of mainframe computers and minicomputers did not see PC technology as a threat to their bottom lines. Eventually, however, PC technology came to dominate the market, and established computer manufacturers such as Digital Equipment Corporation, Prime, Wang, Nixdorf, Apollo, and Silicon Graphics went out of business.
So where lies the dilemma? Christensen explains:
“In every company, every day, every year, people are going into senior management, knocking on the door saying: ‘I got a new product for us.’ And some of those entail making better products that you can sell for higher prices to your best customers. A disruptive innovation generally has to cause you to go after new markets, people who aren’t your customers. And the product that you want to sell them is something that is just so much more affordable and simple that your current customers can’t buy it. And so the choice that you have to make is: Should we make better products that we can sell for better profits to our best customers. Or maybe we ought to make worse products that none of our customers would buy that would ruin our margins. What should we do? And that really is the dilemma.”
Exactly in the manner that Christensen described, the e-commerce pioneer Amazon.com created an in-house product called Dynamo in 2007 to meet the performance, scalability, and availability needs of its own e-commerce platform after it concluded that mainstream database management systems were not capable of satisfying those needs. The most notable aspect of Dynamo was the apparent break with the relational model; there was no mention of relations, relational algebra, or SQL.
One of the innovations of the NoSQL camp is “schemaless design.” Data is stored in “blobs” and documents and the NoSQL database management system does not police their structure.
Let’s do a thought experiment. Suppose that we don’t have a schema and that the following facts are known.
- Iggy Fernandez is an employee with EMPLOYEE_ID = 1 and SALARY = $1000.
- Mogens Norgaard is an employee with EMPLOYEE_ID = 2, SALARY = €1000, and COMMISSION_PCT = 25.
- Morten Egan is an employee with EMPLOYEE_ID = 3, SALARY = €1000, and unknown COMMISSION_PCT.
Could we ask the following questions and expect to receive correct answers?
- Question: What is the salary of Iggy Fernandez?
- Correct answer: $1000.
- Question: What is the commission percentage of Iggy Fernandez?
- Correct answer: Invalid question.
- Question: What is the commission percentage of Mogens Norgaard?
- Correct answer: 25%
- Question: What is the commission percentage of Morten Egan?
- Correct answer: Unknown.
If we humans can process the above data and correctly answer the above questions, then surely we can program computers to do so.
The above data could be modeled with the following three relations. It is certainly disruptive to suggest that the database management system construct such relations on the fly, but it is not outside the realm of possibility.
[Three relation listings appeared here; only the first column of each, EMPLOYEE_ID NOT NULL NUMBER(6), survives the formatting.]
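The decomposition can be sketched as follows. Since only the EMPLOYEE_ID column survives in the listing above, the relation and column names here are illustrative guesses, not the author’s originals; the point is that three relations can distinguish “inapplicable” from “unknown” without nulls.

```python
# Hypothetical three-relation decomposition of the employee facts above.
# Employees to whom commission does not apply:
UNCOMMISSIONED_EMPLOYEES = {1: {"SALARY": "$1000"}}
# Employees with a known commission percentage:
COMMISSIONED_EMPLOYEES = {2: {"SALARY": "€1000", "COMMISSION_PCT": 25}}
# Employees who earn a commission whose percentage is unknown:
COMMISSIONED_EMPLOYEES_UNKNOWN_PCT = {3: {"SALARY": "€1000"}}

def commission_pct(employee_id):
    """Answer the commission question with three distinct outcomes:
    a value, "unknown", or "invalid question"."""
    if employee_id in COMMISSIONED_EMPLOYEES:
        return COMMISSIONED_EMPLOYEES[employee_id]["COMMISSION_PCT"]
    if employee_id in COMMISSIONED_EMPLOYEES_UNKNOWN_PCT:
        return "unknown"
    if employee_id in UNCOMMISSIONED_EMPLOYEES:
        return "invalid question"
    raise KeyError(employee_id)
```

Which relation a fact lands in carries the information that a single nullable column would blur: membership in the relation, not a special marker value, records whether the question even applies.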
A SQL-on-Hadoop company called Hadapt has already stepped forward with such a feature:
“While it is true that SQL requires a schema, it is entirely untrue that the user has to define this schema in advance before query processing. There are many data sets out there, including JSON, XML, and generic key-value data sets that are self-describing—each value is associated with some key that describes what entity attribute this value is associated with [emphasis added]. If these data sets are stored in Hadoop, there is no reason why Hadoop cannot automatically generate a virtual schema against which SQL queries can be issued. And if this is true, users should not be forced to define a schema before using a SQL-on-Hadoop solution—they should be able to effortlessly issue SQL against a schema that was automatically generated for them when data was loaded into Hadoop.
Thanks to the hard work of many people at Hadapt from several different groups, including the science team who developed an initial design of the feature, the engineering team who continued to refine the design and integrate it into Hadapt’s SQL-on-Hadoop solution, and the customer solutions team who worked with early customers to test and collect feedback on the functionality of this feature, this feature is now available in Hadapt.” (http://hadapt.com/blog/2013/10/28/all-sql-on-hadoop-solutions-are-missing-the-point-of-hadoop/)
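What “automatically generated virtual schema” means can be illustrated in miniature. The sketch below is not Hadapt’s implementation; it merely shows that a schema can be derived from self-describing records by taking the union of the keys (and observed types) across a data set of JSON lines.

```python
import json

def infer_virtual_schema(json_lines):
    """Union the keys seen across self-describing JSON records
    into a virtual column list, noting the value types observed."""
    columns = {}
    for line in json_lines:
        record = json.loads(line)
        for key, value in record.items():
            columns.setdefault(key, set()).add(type(value).__name__)
    return columns

rows = [
    '{"EMPLOYEE_ID": 1, "SALARY": 1000}',
    '{"EMPLOYEE_ID": 2, "SALARY": 1000, "COMMISSION_PCT": 25}',
]
schema = infer_virtual_schema(rows)
# Virtual columns: EMPLOYEE_ID, SALARY, COMMISSION_PCT
```

A SQL engine could then expose these inferred columns for querying, with absent keys surfacing as missing values, exactly the “effortless” experience the Hadapt post describes.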
This is not really new ground. Oracle Database provides the ability to convert XML documents into relational tables (http://docs.oracle.com/cd/E11882_01/appdev.112/e23094/xdb01int.htm#ADXDB0120), though it ought to be possible to view XML data as tables while physically storing it in XML format in order to benefit certain use cases. It should also be possible to redundantly store data in both XML and relational formats in order to benefit other use cases.
In Extending the Database Relational Model to Capture More Meaning, Dr. Codd explains how a “formatted database” arises from an unorganized collection of facts:
“Suppose we think of a database initially as a set of formulas in first-order predicate logic. Further, each formula has no free variables and is in as atomic a form as possible (e.g., A & B would be replaced by the component formulas A, B). Now suppose that most of the formulas are simple assertions of the form Pab…z (where P is a predicate and a, b, … , z are constants), and that the number of distinct predicates in the database is few compared with the number of simple assertions. Such a database is usually called formatted, because the major part of it lends itself to rather regular structuring. One obvious way is to factor out the predicate common to a set of simple assertions and then treat the set as an instance of an n-ary relation and the predicate as the name of the relation.”
In other words, a collection of facts can always be organized into relations if necessary.
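Codd’s factoring step is mechanical enough to sketch directly. Assuming an illustrative encoding of each simple assertion Pab…z as a tuple of a predicate name followed by constants, grouping by predicate yields the n-ary relations:

```python
from collections import defaultdict

# Simple assertions of the form Pab...z: a predicate plus constants.
assertions = [
    ("salary", 1, "$1000"),
    ("salary", 2, "€1000"),
    ("commission_pct", 2, 25),
]

def factor_into_relations(assertions):
    """Factor out the predicate common to a set of simple assertions;
    each predicate names an n-ary relation, and each assertion's
    constants become one tuple of that relation."""
    relations = defaultdict(set)
    for predicate, *constants in assertions:
        relations[predicate].add(tuple(constants))
    return dict(relations)

relations = factor_into_relations(assertions)
```

The three assertions collapse into two relations, “salary” (binary) and “commission_pct” (binary), which is precisely the regular structuring that makes a database “formatted.”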
Big Data in a Nutshell
In 1998, Sergey Brin and Larry Page invented the PageRank algorithm for ranking web pages (The Anatomy of a Large-Scale Hypertextual Web Search Engine by Brin and Page) and founded Google.
The PageRank algorithm required very large matrix-vector multiplications (Mining of Massive Datasets Ch. 5 by Rajaraman and Ullman) so the MapReduce technique was invented to handle such large computations (MapReduce: Simplified Data Processing on Large Clusters).
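The MapReduce formulation of matrix-vector multiplication can be sketched in a single process, under the simplifying assumption (made in the textbook treatment) that the vector fits in memory; the function names and the in-process shuffle are mine, not part of any real MapReduce framework:

```python
from collections import defaultdict

def map_phase(matrix_entries, v):
    """Matrix stored sparsely as (i, j, m_ij) triples;
    emit the partial product (i, m_ij * v_j) for each entry."""
    for i, j, m_ij in matrix_entries:
        yield i, m_ij * v[j]

def reduce_phase(pairs):
    """Sum the partial products grouped by row index i,
    yielding the components of the product vector."""
    sums = defaultdict(float)
    for i, x in pairs:
        sums[i] += x
    return dict(sums)

# M = [[1, 2], [3, 4]], v = [1, 1]  ->  M v = [3, 7]
entries = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
result = reduce_phase(map_phase(entries, [1, 1]))
```

When the vector is too large for memory, as with the web graph, the textbook technique stripes the matrix and vector into blocks, but the map/reduce shape of the computation stays the same.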
Smart people then realized that the MapReduce technique could be used for other classes of problems, and an open-source project called Hadoop was created to popularize the MapReduce technique (http://gigaom.com/2013/03/04/the-history-of-hadoop-from-4-nodes-to-the-future-of-data/).
Other smart people realized that MapReduce could handle the operations of relational algebra such as join, anti-join, semi-join, union, difference, and intersection (Mining of Massive Datasets Ch. 2) and began looking at the possibility of processing large volumes of business data (a.k.a. “Big Data”) better and more cheaply than mainstream database management systems.
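The join case can be sketched as a classic reduce-side join. The sketch below runs in one process with invented function names; a real MapReduce job would distribute the map tasks, shuffle by key, and distribute the reduce tasks, but the logic is the same: tag each tuple with its relation of origin, then pair up tuples that share a join key.

```python
from collections import defaultdict

def map_join(r_tuples, s_tuples):
    """Tag each (key, value) tuple with its relation of origin,
    keyed by the join attribute."""
    for key, value in r_tuples:
        yield key, ("R", value)
    for key, value in s_tuples:
        yield key, ("S", value)

def reduce_join(tagged):
    """For each join key, pair every R value with every S value
    (the natural join R ⋈ S on the shared key)."""
    buckets = defaultdict(lambda: {"R": [], "S": []})
    for key, (tag, value) in tagged:
        buckets[key][tag].append(value)
    return [(key, r, s)
            for key, bucket in buckets.items()
            for r in bucket["R"] for s in bucket["S"]]

R = [(1, "a"), (2, "b")]
S = [(2, "x"), (3, "y")]
joined = reduce_join(map_join(R, S))
```

Union, difference, intersection, and the semi-joins follow the same pattern with different reduce logic, which is why SQL workloads mapped so naturally onto Hadoop.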
Initially, programmers had to write Java code for the “mappers” and “reducers” used by MapReduce. However, smart people soon realized that SQL queries could be automatically translated into the necessary Java code, and “SQL-on-Hadoop” was born. Big Data thus became about processing large volumes of business data with SQL, but better and more cheaply than mainstream database management systems.
However, the smart people have now realized that MapReduce is not the best solution for low-latency queries (http://gigaom.com/2013/11/06/facebook-open-sources-its-sql-on-hadoop-engine-and-the-web-rejoices/). Big Data has finally become about processing large volumes of business data with SQL but better and more cheaply than mainstream database management systems and with or without MapReduce.
That’s the fast-moving story of Big Data in a nutshell. ▲
The statements and opinions expressed here are the author’s and do not necessarily represent those of Oracle Corporation.