By: David Triggs, CTO, BigInsights
MapR CTO and co-founder M.C. Srivas has a rare perspective on Big Data and Hadoop. While most people's understanding of the Google systems that inspired Hadoop was limited to the few papers Google published, Srivas worked with those systems at Google. When Srivas met with the BigInsights team he made a convincing case not only that MapR has built a better cloud-scale Hadoop platform than Hortonworks or Cloudera, but that in what it has done for Big Data systems of record, MapR is also ahead of Google.
Hadoop as seen from inside Google
Even a decade ago the Google architectures that inspired Hadoop, running across thousands of servers, would have seemed radically different from the systems most IT professionals were familiar with; indeed, to many today they still seem radical. But Srivas had previously managed the engineering team for the highly distributed Andrew File System (AFS) at Transarc (now IBM), where he “did a lot of work on how to scale out at very high scale, and how to do security, and how to make it very manageable”. So, joining Google in 2007, he decided: “In Google you’ve got to do what Google is good at, which is search. Either Android or search, and in 2007 Android was still too new. So I worked on that for about three years, and as part of that work I saw how Google had built this infrastructure called GFS, and MapReduce, and Bigtable, and how we used it in search to get these fantastic results. A search engine is literally sifting through garbage and finding the gems. Think about what’s on the web, literally garbage, and the kind of processing you have to do; and you’re finding the best things out of there. Google was not the first or second search engine, it was about the twentieth search engine, arrived really late to the game, but completely conquered it.”
The story of Hadoop is usually told from the perspective of its founders, like Doug Cutting, then at Yahoo! and now at Cloudera, as they sought to replicate the advanced capabilities they had read about in the few papers published by Google; however, Srivas also shed light on the view from within Google. In fact, Google collaborated in the creation of Hadoop and helped a great deal with the design. Explained Srivas, “The authors of MapReduce and GFS helped the Yahoo! team build Hadoop, because Google had a shortage of engineers and wanted to hire people who already understood this technology. It was always a nine-month to one-year learning curve once you joined Google to learn the Google technology before you could start using it. So Google actually initiated Hadoop classes across different universities, especially the University of Washington, the University of Wisconsin, Carnegie Mellon, MIT and so on, and started funding university programs around Hadoop.
The Google people who developed MapReduce helped Doug Cutting quite a bit on how to build Hadoop, and in fact Google was very generous. Google had patented this technology. They had patents on it, and they could have said ‘no, this is ours,’ but they donated it to open source and said ‘we will make sure nobody else will sue you, because we hold the patents.’ So it goes beyond just helping build it. They even gave up the intellectual property to make this successful.”
However, when Hadoop started to be more widely adopted by consumer Internet companies, Srivas and others at Google became concerned. As he describes it, “I was in Google and I was watching this, as I was in search. Then you see companies like Facebook, LinkedIn and Twitter starting to adopt Hadoop for their own internal processing. This was in 2007-2008, and in Google we were worried: did we let the cat out of the bag? These companies are like Google competitors, using this technology to compete against Google; was there cause for worry?
“I was in Google on search, and I said let’s take a close look at Hadoop, is it a threat to Google - and it wasn’t, but I saw this tremendous adoption going on, so I thought it was an opportunity to leave Google and found a company to fix some of the issues around it. Hadoop appeared to be this great way of processing data, but it was low quality; not the quality that Google was used to.”
Why just copying Google isn’t the complete answer
Srivas's experience before Google also gave him insight into the issues that could arise when Hadoop was deployed, as it almost always was, on Linux. Srivas said, “After leaving Transarc I was part of starting another company called Spinnaker Networks, in 2000. Spinnaker Networks was actually a file system company, but we built Network Attached Storage. It was the first NFS scale-out appliance of its type, and we went after the NFS server market. The primary contenders there were NetApp and EMC, and a lot of other startups. When we came out we were three times the performance of a NetApp on the same hardware, and that was a single box then. You could scale out and cluster this – run 500 of these at once – and completely blow away anybody else. By the time we came out of stealth it was 2002 or something and Web 2.0 was taking off. A lot of web sites needed this, a lot of rendering farms. We won customers like Pixar – when Toy Story first came out; I think at that time Steve Jobs was still at Pixar, he wasn’t at Apple yet. So we won a lot of sales and NetApp wound up acquiring the company in 2004.”
With this background Srivas recognised something about Hadoop on Linux that many others hadn’t. “Linux is a good platform for computing, but nobody puts their data in Linux. They always put their data in a NAS or a SAN, like a NetApp, or an EMC Isilon, or a Compellent, or something else. The reason is that the bar for enterprise storage is now so high: you cannot lose any bits, no matter what goes wrong. To provide that kind of guarantee for something like Hadoop you really need to build special-purpose hardware; it’s pretty much like rocket science, like putting a force field around it so it doesn’t crash and burn. If something happens the bits are copied somewhere else, everything is battery-backed, and there are special transducers that will detect power failure and write it out to disk immediately, and maybe slurp it out somewhere else, and so on. That’s what they build. That’s very expensive. You can’t go and buy a standard Dell box or an HP box and have that be as resilient. You can put Linux on it, but nobody puts data on there.
“Hadoop doesn’t solve that; that’s the problem. Hadoop relies on Linux to do its storage, so when you write in Hadoop you’re writing to one of those standard Linux file systems like ext3 or ZFS. But no enterprise puts their data there; nobody does it today.”
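Srivas's point is visible in any stock Hadoop deployment: the HDFS DataNode stores its block files as ordinary files in directories on whatever local Linux filesystem is mounted, and compensates for unreliable disks with whole-block replication rather than hardened hardware. A minimal illustrative hdfs-site.xml fragment (the mount paths under /data are hypothetical):

```xml
<!-- hdfs-site.xml: HDFS delegates storage to the local Linux filesystem. -->
<configuration>
  <property>
    <!-- Block files live as plain files on local ext3/ext4/XFS mounts. -->
    <name>dfs.datanode.data.dir</name>
    <value>/data/1/dfs/dn,/data/2/dfs/dn</value>
  </property>
  <property>
    <!-- Resilience comes from keeping three copies across nodes,
         not from the underlying disks or filesystem. -->
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```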
This is a problem Google could have solved, but consider the Google search application. If you have crawled and indexed the Internet, and then you lose part of that data – perhaps a big part – what do you do? The answer, beyond keeping multiple copies of the data for redundancy, is that you aren’t too worried, because you can always go out and crawl and index again, and then you actually have fresher data. This is a very different situation from an enterprise building a system of record, where Hadoop holds the master copy of valuable data for the enterprise.
Why MapR was founded
With this opportunity in mind, Srivas said, “We started MapR in 2009. That was right at the bottom of the financial and stock market crash. That timing was beautiful, because at that time all the companies were on their heels. Nobody was funding anything, they were cutting down on R&D – the perfect time to start a company. So I started with John Schroeder, who was my co-founder, and we were in stealth for about two years. Stealth is usually when you get to build your technology without anybody finding out what you are doing, so that when you actually announce a product they don’t spoil your announcement by building similar products or diluting your message. We launched in 2011” – exactly three years to the day before our meeting with BigInsights, he added.
While most people working with Hadoop are content to assume Google is ahead in its internal technology, Srivas clearly sees MapR as up to the challenge, pointing out, “I think we are ahead of Google in what we have done, if I look at what Google had and what we’ve done at MapR. My role, whenever we did a startup, whether at Transarc or Spinnaker, was always to raise the bar – to take the state of the art to the next level – and I think what we have done at MapR has taken it to the next level in terms of resiliency and scalability in a different dimension. Google has a lot of systems there, they’re a fantastically smart company, but what we picked was a standard problem in computer science: take completely shared-nothing hardware and do completely random updates on it. Try to make data resilient, without any special-purpose hardware, no matter how randomly you write it. It’s actually a very hard problem that nobody had solved from a technical perspective. Search doesn’t have this problem. Google search is read-only: the web page you get isn’t going to change until the next crawl. E-mail is actually write-once: you never modify an e-mail, you always write a new one. So a lot of Google’s stuff is actually write-once, you don’t really modify it. If you look at how all the database companies run today, they run with special-purpose hardware. Oracle, EMC, DB2, Teradata all need special-purpose hardware. To make the data resilient you have to do things like battery-backed memory, special power supplies, and special connectivity which works in a limited space and doesn’t work on general-purpose hardware” of the kind used in cloud-scale data center infrastructure.
Srivas argues this has been recognised by the cloud infrastructure providers. “So it is a very hard problem to solve, but we solved it, and I think Google saw that, and I think Amazon saw that. Anything can run on Amazon, but if you look at Amazon they sell only software they develop themselves, whether it is Redshift, SimpleDB or DynamoDB. MapR is the only software not developed by Amazon that they resell, because it is a very hard problem that we solved: to make a file system perform very fast and be very resilient.”
Of the Hadoop distributions, MapR is notable for cloud adoption, and Srivas reveals this is by design. “It’s only with MapR that we’ve brought the kind of reliability that you expect from an EMC or NetApp to the barebones Linux platform. It’s the first time anyone has done that; Google doesn’t do it either. So how does that translate into the cloud? Think about somebody running a VMware shop on premises. They run a cloud, running a bunch of virtual machines. If a virtual machine fails you have vMotion to move it somewhere else, but the data is never on that machine – it’s in a SAN or a NAS – so when the VM moves it can find the data again. How are you going to do that in the cloud? At Amazon you’re not going to put a SAN in the cloud, you’re not going to put a NetApp in the Google cloud; it’s not going to happen. So you end up compromising today. You run S3 or you run EBS, and S3 is very slow and not really useful for putting a database on: it’s cold storage, you write it and forget about it, and you can’t do random updates. So what MapR did was provide this kind of resiliency in the cloud.”
BigInsights CEO Raj Dalal wanted to know if this went beyond cloud providers offering Hadoop as a service – an insight Google itself appears to have reached. Srivas explained, “For example, if you now want to run MySQL in a resilient way, you can run it on MapR. You can run Postgres. You can run SAP in the cloud. How about running Vertica? With Vertica we have a very close relationship; Vertica runs on MapR natively now, you don’t need a SAN. You can run it on MapR. So that’s where the innovation in MapR was, and that is what I think everybody saw. For example, Google invested something like $100 million in MapR, and they provide MapR’s Hadoop as their Hadoop of choice in their cloud. And this is Google – they could build their own Hadoop.”
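The mechanism that lets conventional databases run unchanged on MapR is that MapR-FS is a POSIX-style file system that can be NFS-mounted, so an application simply sees a local path. As a hedged sketch (the host name, cluster name and paths below are hypothetical), a MySQL data directory could be pointed at a MapR mount:

```ini
# /etc/my.cnf (illustrative): MySQL data files placed on MapR-FS.
# The cluster would first be NFS-mounted on the database host, e.g.:
#   mount -o hard,nolock cldb-host:/mapr /mapr     (hypothetical host name)
[mysqld]
datadir=/mapr/my.cluster.com/mysql-data
```

The same pattern – mount MapR-FS, then point the application's storage path at it – is what removes the need for a SAN or NAS behind the database.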
Building Hadoop based Systems of Record
When an enterprise considers an architecture for something like a Data Lake, the lake frequently contains data that exists nowhere else and that could drive value in the future. This makes a Data Lake, to varying degrees, a system of record. While each enterprise must determine for itself the degree to which its Data Lake is a system of record, many organisations will face the challenge of making large amounts of data resilient without incurring a level of cost that undermines the potential value proposition of that data. That, Srivas believes, is MapR’s opportunity.
“If you want Hadoop to become a Data Lake, or whatever people talk about, which means this has to become the system of record for the enterprise – how are you going to do that on Linux unless you make it resilient? You’ll wind up putting a NetApp behind it. Only MapR can give you that quality of storage which can be a system of record for the enterprise without requiring a NAS or a SAN. Standard Hadoop cannot do that.”
About David Triggs: He is the Chief Technology Officer at BigInsights, and is responsible for advising on Data Science Strategy and Big Data Architecture, and for research in Digital Business Platforms supporting Mobile, Social, Cloud and Big Data based innovation. David has over 20 years of IT industry experience in a variety of technical, consulting, solution marketing, and operational roles. Recently he was a Principal Solution Architect at Hewlett-Packard, working with enterprise customers across the Asia Pacific region. Previous experience includes performance and scalability of database systems, next generation datacentre technology architecture, and development of multilingual software by distributed teams. Most of these roles have involved the deployment of leading edge technologies.