Will the rapid evolution of Hadoop lead to fragmentation or be sufficiently consistent to realize the promise of becoming an Enterprise Data Hub? Hadoop co-founder Doug Cutting shares his predictions for the future with the BigInsights team.
By David Triggs, CTO, BigInsights
If you are planning the future of your organization’s IT architecture, at least one thing has become clear over the last year or two: Hadoop has rapidly emerged as the de-facto standard for storing and processing large amounts of data in its original form – allowing that data to be structured to answer new business questions and to respond rapidly to new business challenges as they arise. Almost every major database and BI player, from Oracle to Microsoft to the new EMC/VMware Pivotal Labs, has Hadoop as a key part of its strategic architecture. But Hadoop continues to evolve as fast as the Internet sector that pioneered it, and the question now is whether this will lead to fragmentation, or to Hadoop taking a central role in enterprise IT as what Cloudera calls the Enterprise Data Hub.
The rapid innovation pushing Hadoop toward fragmentation
Once almost synonymous with MapReduce, Hadoop is evolving along many dimensions, especially to support interactive use cases. For example, Hadoop has been extended with Cloudera Impala and the Hortonworks-led Stinger project to allow more interactive queries. Hadoop 2.0 brought YARN, with support for MapReduce operating alongside other processing frameworks, and more recently one of the hottest Big Data topics has been the integration of the real-time-optimized Apache Spark technology from the Berkeley AMPLab with Hadoop. This diverse range of innovation could easily become a recipe for fragmentation. Trying to guide this evolution, and hopefully avoid that fragmentation, is the Apache Hadoop project co-founded by Doug Cutting.
We could think of no one better to help us understand the likely future direction of Hadoop than Doug Cutting, who co-founded Hadoop while at Yahoo!, naming it after his son’s toy elephant. He brought the toy with him to Australia recently, resulting in lots of people appearing in photos with Doug and the original Hadoop (see below our picture with Hadoop!). Doug went on to co-found the Apache Hadoop project, and in his current role as Chief Architect at Cloudera he continues to be closely involved in guiding the future of Hadoop through the Apache Foundation. While Doug seems unlikely even to aspire to become the benevolent dictator of Hadoop, those meeting him are quickly impressed by his extensive experience in open source projects and his keen insight into the future potential of Hadoop. Perhaps he, more than anyone else, can succeed in guiding the Hadoop community the way Linus Torvalds guides the Linux community, and encourage Hadoop innovation without it leading to fragmentation.
Why some fragmentation is actually good
We asked Doug how it might be possible to drive innovation through the Apache Foundation processes while avoiding the kind of fragmentation where you can’t have a real data hub because the technologies don’t integrate. "I think it’s going to be a little messy," was Doug’s typically candid answer, "and it is a little messy already. Some of that I think is inherent and good. If you want to evolve systems and it’s not clear which is the best approach, then having multiple approaches attempted and having one succeed is a good way. That said, once people adopt technologies, particularly with things like data formats, it’s very hard for them to abandon them, and so you end up with this fragmentation problem. It’s a double-edged sword there. What’s particularly irksome is when people develop incompatible approaches in order to be incompatible; to have some unique interest. I don’t have a solution to that. Hopefully what we will see, and we’ve seen this certainly in Hadoop, is that the core technologies of the file system and the scheduler – the things that really permit you to share the hardware resources, and permit you to try alternate higher-level tools on a common data platform and on common data sets – are standardised and are more or less undisputed. But in some sense they ought to be disputed and there ought to be competitors, and people ought to try other things out to some degree.
"As long as you’ve only got a couple, the fragmentation is probably tolerable. There is some waste, but you get some benefit from it as well; even if they never converge they are egging each other on in features. We see that a lot. HBase has recently added cell-level ACLs. That was largely triggered by Accumulo getting popular. That kind of thing, I think, is healthy, and we’ll probably have both HBase and Accumulo for a long time because there are institutions that have built things around them. We don’t need ten, but two or three might be a decent cost of doing business. The open source approach of having these different projects is much more an evolutionary, survival-of-the-fittest approach. The standards approach has a bit of that, but I think is trying to exert a bit more top-down control rather than just letting things happen; which is more the open source way."
Why Apache Hadoop has the approach and enterprise backing to be successful
When thought of in this way you can see an innovation model with fascinating potential, but are the mechanisms of the Apache Foundation, which defines what Apache Hadoop is, going to be able to keep all these innovations operating together on the same platform?
“Yes,” was Doug’s direct answer, “with Apache there’s a certain amount of tension, it’s not a perfect process, but we also see people who are bitter rivals collaborating on a daily basis and making real progress on their joint goals; and I think that’s a real testament to the success of the approach at Apache. It’s a place where people can come together, and it’s a requirement of an Apache project that it have a diverse set of contributors working together.
In the Hadoop space it’s become a brand that people look for. People have learned that if it’s an Apache project it means something. It’s a project that you can trust, that it’s not going to be controlled by any one vendor. People have learned that there are different kinds of open source, and some open source isn’t as open as it might appear: it can be controlled entirely by one company, the license terms can have some teeth that don’t really allow you to do some things that you want; and Apache has built up a brand where people understand that projects that come from Apache can be trusted to last a long time, to be available on reasonable terms, and not to be owned by any one vendor. That’s a neat thing we’ve seen in this space – the whole ecosystem gets that in a way that we haven’t seen in a previous area of IT where there’s been one license, one open source model, where people have said this is the right way to do things and we need to see that before we’ll trust the technology.”
What impact will there be on existing data warehouse and BI suppliers?
However, Hadoop is also being deployed in organizations that mostly already have not only relational databases supporting a range of applications, but also data warehouses and other BI systems such as the newer columnar databases. Many of the companies behind these databases, data warehouse systems, and other BI solutions are adopting Hadoop to complement or enhance their current solutions; and this is also driving the evolution of the larger Hadoop ecosystem.
For example, we recently met with Pivotal Labs president Scott Yara to discuss their strategy. Scott was co-founder of the MPP data warehouse Greenplum, which EMC spun out in forming Pivotal. In its evolving strategy, Pivotal has recently moved from re-distributing MapR’s Hadoop to building its own Hadoop distribution, Pivotal HD, which integrates their HAWQ SQL-on-Hadoop technology, GemFire XD in-memory database technology, and other innovations into a platform with extensive capabilities.
However, in general “the big traditional vendors are in a tough position,” Doug suggests. “They’ve got a cash cow in their existing business and they see this new competitor, and they don’t want to deny its existence, they want to get in on it; but it’s the innovator’s dilemma: how can they embrace it if it cuts into their traditional business. So they have to try to position Big Data and to pigeonhole it in a way. They are working hard, some of them, to define it in a way that is complementary, separate and subsidiary.”
By contrast, Cloudera’s vision places Hadoop more centrally in enterprise IT, and not just from a technical standpoint. As Chris Poulos, Cloudera’s Vice President for Asia Pacific/Japan, put it, “Hadoop was a world that a CEO couldn’t get his or her head around.” Chris explained that not only did Cloudera choose the more enterprise-friendly Enterprise Data Hub naming, but critically it was also structuring the whole ecosystem into a very logical enterprise-class solution. This is a vision that places Hadoop more at the centre of enterprise architecture, and is better suited to driving business value as organizations look to gain insight from new and bigger data sources shared across the organization. “Most institutions today are predominantly using databases for data they collected themselves; and that is their secret sauce that they have and others don’t,” Doug adds. For example, “if you look at companies like railway companies – knowledge of where their cars are. That is inherently their information and they need to be able to analyse that, but incorporating other sources is a huge thing,” and that data can quickly become vastly more useful once you look at the broader business context in which those rail cars are used.
Doug argues Hadoop-based architectures like Cloudera’s will be the better architecture as the scale of Big Data processing increases. If you look at architectures that simply add Hadoop on, “they are schlepping,” a technical term Doug uses, “a lot of data back and forth across those connections, which is a bottleneck – not just for performance of queries, but it’s a bottleneck on processes and what you are willing to consider doing. Are you willing to consider a tight integration between SQL and a streaming process that is based on Spark Streaming, where those two are feeding each other in an integrated manner? You can’t do that if you’ve got that separation. People talk about having great big network connections, but still, having disk spindles next to CPU cores, the amount of throughput you can get there is hard to beat.”
If I adopt the Enterprise Data Hub today, is that too going to change rapidly?
BigInsights CEO Raj Dalal was also keen to find out from Doug what Cloudera planned to do with its recent massive cash injection from capital raising, which also saw Intel abandon its own Hadoop distribution in favour of working with Cloudera, and especially where Doug saw the technology innovation that would build out the Cloudera platform. “I don’t think we are going to make any substantial new bets,” Doug said. “We are going to be able to hire more aggressively, we are going to continue to work on improving Impala and related storage engines to try to move closer to being able to permit the sort of operations you could do in a traditional database, but on the Hadoop ecosystem.”
Doug added, “Security is an ongoing effort. We made tremendous leaps in improving the kinds of security that we support, but we need to keep going. Something we hear again and again from customers is that they want to be able to easily configure things, to permit the access they want and not anything else, and I think it’s inherently a more difficult security domain than a traditional approach where, if you’ve got silos, in some ways you’ve already restricted things. Only people who have access to that system could ever touch that data, and it’s divided around the organisation. When you start pulling things together and you have a wide range of data with a wide range of tools, then the opportunity for inappropriate use is higher; so the risk is higher. So you need to develop new and more thorough security mechanisms, which then means you need to think a lot about ease of use and ease of configuration. Just because it’s possible to configure things securely doesn’t mean people will if it’s not simple. So we spend a lot of time not just working on the technology, but then also working on ease of use with Cloudera Manager and Cloudera Navigator – letting people configure their users and groups but then monitor and audit and look at data lineage to see who’s created what from what. Those tools – that’s something we won’t be done with for a long time. As this is more adopted in institutions they need more of these kinds of features.”
This will come as good news to those in enterprise IT who understand the amount of effort and cost involved in operating large scale systems like Hadoop without the right tools. This in turn, and perhaps as much as the innovation we have discussed, may influence businesses to consider placing Hadoop based solutions at the heart of their IT architectures, realizing the vision of the Enterprise Data Hub as a platform around which businesses can create new value from Big Data.
David Triggs is Chief Technology Officer at BigInsights. David has over 20 years of IT industry experience in a variety of technical, consulting, solution marketing, and operational roles. Recently he was a Principal Solution Architect at Hewlett-Packard, working with enterprise customers across the Asia Pacific region. Previous experience includes performance and scalability of database systems, next generation datacentre technology architecture, and development of multilingual software by distributed teams. Most of these roles have involved the deployment of leading edge technologies.
BigInsights (www.BigInsights.co) is an Australia-based research & advisory firm focused on Big Data Analytics. Its express aim is to help companies with best practices and ROI on using Big Data technologies for customer and operational insights, and to help them track emerging trends in new Big Data technologies.