In this Special Report on the future of the Hadoop platform, the BigInsights team examines the major changes that Hadoop 2 will bring to all the major distributions and the factors that will decide which distributions dominate.
By: David Triggs, CTO, BigInsights
The use of Hadoop to create a next generation architecture
A common theme in discussions on Big Data architectures at Strata + Hadoop World in New York, USA was how to overcome the challenges that plagued many Enterprise Data Warehouse (EDW) projects; especially the time and complexity in defining a single consistent schema for the data warehouse before any data was loaded.
Jim Walker, Hortonworks’ Director of Product Marketing, recently spoke to Raj Dalal, CEO, BigInsights and David Triggs, CTO, BigInsights on how Hadoop was becoming central to next generation architecture, and how Hortonworks aimed to lead with Hadoop 2. Jim saw overcoming the limits of what he called “schema on write” as a key contribution of Hadoop because “You don’t have to apply structure on write. We’re talking about structure on read, and that changes the game.”
“With Hadoop”, Jim continued, “you can literally throw a bunch of data into it, and you have this kind of dynamic schema where I’m looking at data one way, and David has a completely different view of it, and Raj has a different view of it.” So while Hadoop running on commodity hardware offers a lower cost per terabyte, and so enables leveraging data that previously wasn’t cost effective to retain, Hadoop is more than just a low cost alternative. The ability to define new schema for new analyses not only overcomes limitations of the single schema, but without the need to define the schema first, these next generation architectures can also reduce the time to value from new data sources. As Jim put it “different structure, different cost, different game.”
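The schema-on-read idea Jim describes can be sketched in a few lines of plain Python (a hypothetical illustration, not Hadoop code): the raw records are stored exactly as they arrive, and each analyst imposes their own structure only at the moment the data is read.

```python
# Hypothetical illustration of schema-on-read: raw records are kept
# untyped; structure is applied only when the data is queried.
raw_events = [
    "2013-10-28|web|alice|42.50",
    "2013-10-29|app|bob|17.00",
]

# One analyst's view of the data: revenue by channel.
def revenue_view(line):
    _, channel, _, amount = line.split("|")
    return {"channel": channel, "amount": float(amount)}

# Another analyst's view of the SAME raw records: activity by user.
def activity_view(line):
    date, _, user, _ = line.split("|")
    return {"user": user, "date": date}

revenue = [revenue_view(line) for line in raw_events]
activity = [activity_view(line) for line in raw_events]
```

Neither view required changing how the data was written; a third analyst could define yet another schema tomorrow without reloading anything, which is the "different structure, different cost, different game" point.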
Jim, however, also recognised that Hadoop, as it exists today, does not avoid all the challenges faced by data warehouses, and Hadoop will continue to be built out to address them. It would be wrong to view Hadoop as something finished that you simply build on. Jim said, “There is still a lot of work to be done in the Hadoop community.” This means it is important for any organisation planning to use Hadoop to understand the changes being made in the Hadoop platform and how they are being developed and brought to the market.
At Strata + Hadoop World, the BigInsights team also caught up with Jack Norris, leader of worldwide marketing efforts at MapR, who argued that the structure of this community is a major strength of Hadoop; including when choosing to deploy a NoSQL database. “There are many commercial distributions, seemingly more entrants every month, but the reality is we all share the same Apache open source code. This is one of the first markets that has actually been created by open source technology,” Jack argued. “This is in quite a contrast to the NoSQL market. In the NoSQL market there is no consensus. There is no common API. There is no ability to seamlessly move workloads across solutions. There is, however, one NoSQL solution that has an inherent advantage and that is HBase. HBase is integrated with Hadoop and included in every commercial distribution.”
Apache Hadoop 2 bringing enhanced interactive SQL performance
Until now, running Hadoop has meant running MapReduce; even a query submitted through the Hive data warehouse is executed as MapReduce jobs. This has not always been responsive enough when there is a person waiting on the results. As Jim explains, “Hive has really served as the de facto standard SQL in Hadoop since Facebook created it in 2008. It’s been built out, and it was great for batch processing, however the processing engine underneath it was MapReduce. MapReduce isn’t really good at doing complex joins, it wasn’t built for that. It was built for batch processing.”
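The batch model Jim describes can be pictured with a toy map/shuffle/reduce pipeline in plain Python (illustrative only; the function names are assumptions, not Hadoop APIs, and real Hadoop distributes these phases across a cluster). It shows why even a simple SQL aggregation, once compiled to MapReduce, becomes a full batch pass over the data.

```python
from collections import defaultdict

# Toy model of the batch MapReduce pattern Hive originally compiled
# SQL queries into. Here the "query" is a simple GROUP BY / SUM.
def map_phase(records):
    # Emit (key, value) pairs: one per input record.
    for channel, amount in records:
        yield channel, amount

def shuffle(pairs):
    # Group all values by key before any reducer can run.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate each group; nothing is returned until every group is done.
    return {key: sum(values) for key, values in groups.items()}

sales = [("web", 42.5), ("app", 17.0), ("web", 10.0)]
totals = reduce_phase(shuffle(map_phase(sales)))
# Roughly equivalent to: SELECT channel, SUM(amount) FROM sales GROUP BY channel
```

Because no result is available until the full map, shuffle, and reduce phases complete, and a multi-way join needs several such passes chained together, the model is fine for overnight batch jobs but frustrating for a person waiting at a prompt, which is the gap Stinger, Tez, and Impala all target.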
Hortonworks has taken a lead in improving Hive SQL performance as part of the Stinger project, and Hadoop 2, which the Apache Foundation announced as generally available in mid-October, delivers Phase 2 of that investment. However, while the focus was on performance, Jim felt performance was not the only objective. There was a need to balance performance with other requirements. This is why “Stinger is about speed, scale, and SQL. Human acceptable queries is imperative. You can’t wait 1,400 seconds for something, you need to wait 10; 10 is great. It’s also important to do this at petabyte scale, because we are talking about massive sets of data. Doing joins at petabyte scale is not simple.” There has also been a focus on enhanced SQL processing, such as handling more complex joins, star schema joins, as well as supporting a lot of analytic functions.
Stinger, however, is just one initiative aimed at improving interactive SQL performance on Hadoop. Cloudera built Impala, a new high performance, interactive SQL engine running natively in a distributed way on your big data Hadoop infrastructure. Previously when BigInsights caught up with Mike Olson, Chief Strategy Officer and Chairman of Cloudera, he explained “Impala is just an engine that goes to the data. It does not take advantage of any of the MapReduce infrastructure. [It is] an entirely separate scale-out database engine the way you would design it in 2013, a query processing engine, and we know how to build distributed query processors. So that’s what we’ve built.”
While other initiatives to improve SQL on Hadoop are building out new solutions such as Cloudera Impala and the MapR supported Apache Drill project, Hortonworks has focussed on enhancing the current Hadoop components such as Hive. Jim gives the rationale behind it. “I worked with one of the original core developers at Oracle, and he almost laughs when people say they’re going to build these things over the course of a year, because that’s not simple to do.” Nor is Hortonworks trying to do this alone, Jim says. “We’re engaging the broader community. Which is why Stinger is important. Because it’s not just us. It’s Microsoft, SAP, Yahoo, Facebook, we’re talking about 60 to 70 developers, over 20 companies. You have brilliant people like Eric Hanson at Microsoft, who virtually built vectorization in SQL Server, working on Stinger.” “Stinger represents how the community can push forward. Our job at Hortonworks is to rally the community, and to make things more enterprise grade.”
Significance of YARN in the Apache Hadoop 2 platform
While the Stinger enhancements have been eagerly awaited, perhaps the key long-term innovation in Hadoop 2 is YARN, which allows a single Hadoop cluster to run multiple processing models at the same time, with MapReduce being just one model. As Jim says, “YARN really does change the whole platform. It turns Hadoop into a multi-use platform that you can do all sorts of processing on in a distributed environment – making it an operating system for distributed applications.”
According to Jim, this opens up Hadoop to a world of innovation, allowing other applications to run not just on top of Hadoop, but natively in Hadoop. “YARN allows us to build different processing models, so we have already introduced something called Tez, being developed as part of the Apache Tez project, which is an alternative to Map Reduce.” Tez will allow projects in the Apache Hadoop ecosystem such as Apache Hive and Apache Pig to meet demands for fast response times and extreme throughput at petabyte scale. In addition, work is already under way with Microsoft and SAS to expose their current applications to this new distributed framework.
Hortonworks’ Open Source strategy for Hadoop 2 Leadership
Given that there are competing Hadoop distributions, not only from Hortonworks but also from Cloudera, MapR, and others, which distribution should a software developer or enterprise adopt?
Hortonworks believes a pure open source model will be best for their customers over the long term. Jim explains, “The rate of innovation is unrivalled, but more importantly to our customers and prospects is lock-in. They don’t want to be locked into a vendor anymore. They don’t want proprietary solutions. Learning from history very few of them want one single vendor for all things.”
“The pure open source strategy also plays into all the partnerships too, because they don’t want to be tied into one thing. They want to be able to rely on the community. There are two ways to do open source. You can either fork early and patch often; which we like to call franken-patch, because you get so far off the trunk you can’t ever bring back the innovation from the community, or you stay as close to the trunk as you possibly can; and by doing so everybody who works with you enjoys the innovation of everybody in that community.” The value of this approach, Jim argues, was shown in the General Availability (GA) of Hortonworks HDP 2.0 only two weeks after GA of Apache Hadoop 2.2 on which it is based; and this wasn’t a minor change as these are the first versions of Hadoop to include YARN.
The business models for Hadoop distributions are not entirely new, and in many ways resemble those for Linux distributions, but they are very new in the data warehouse space, and will require many organisations to rethink their purchasing strategies. In fact, most organisations should welcome having multiple sources of supply for Hadoop solutions and support services, which gives Hadoop suppliers strong motivation to remain competitive. As Jim says, “We don’t sell software. If you think what that means to us, we have to provide the single best level of support we can possibly do, because we have nothing else other than our people.”
In Part 2, we will look at how the evolving Hadoop platform is affecting the traditional database and business intelligence companies, how those companies are responding, and future areas of Hadoop-based innovation.