Sunday, April 20, 2014

Hadoop Distributions? Which version? Hype wheel Spins again!

It looks like the major Hadoop vendors are battling it out for mind share in the Big Data community.

First came the major announcement from Hortonworks about their next round of funding.

Followed by the major announcement that Intel was dropping their distro and backing Cloudera with $740M for an 18% stake.

The only player not jumping into the media blitz is MapR.

This latest round of buffoonery is driving a lot of back-and-forth marketing as the market consolidates and each vendor tries to assert its value and differentiation. It is also driving some misleading press releases from both Cloudera and Hortonworks. Hortonworks is fighting for its life, and honestly I have no idea how they plan to stay relevant.

The tired go-to-market position of, "We've got the most Hadoop committers and ours is 100% Open Source" is wearing thin.

The latest announcement of Hadoop 2.4 from Arun Murthy is a thinly veiled Hortonworks product announcement that in many ways violates the spirit of the Apache Software Foundation's recommendations on what constitutes Hadoop and on procedures for press announcements. I guess they are getting ready for Hadoop Summit in June and want to steal the spotlight for a moment.

And for the educated reader (someone who takes the time to decode this mess), the last Hadoop release labeled "GA" is 2.2, not 2.4.

Furthermore, I leave it to the reader to attempt to build and verify Apache Hadoop 2.4 so that it builds and runs like the derivative 2.4 from Hortonworks. I think this will be an enlightening exercise.

To Hortonworks' credit, they are still (as of 4/20/2014) listing 2.1 as their current version, which in some ways is even more confusing than just sticking with the 2.2 Apache GA version.

The announcement of Query Grid by Hortonworks+Teradata is nothing more than a continued refinement of Teradata's UDA strategy. Eventually Teradata is going to wake up and realize that Hortonworks is a dead end that is only riding their coattails into major accounts. Bolting SQL-H onto Stinger only makes the stack more prone to failure.

Cloudera is not blameless in this war of announcement and counter-announcement. The latest series of videos and infomercials about Impala versus Presto | Stinger | Hive is just plain junk, especially in the case of Facebook's Presto capabilities at scale, which at the moment Impala has a "snowball's chance in h311" of being able to handle. And how does the TPC ignore their constant, very loose use of "TPC-DS" when discussing benchmark results plainly meant as a sales pitch?

So what to do??

Well, MapR makes no bones about its value proposition and is quietly building a reputation for quality and reliability. They are embracing emerging technology from Berkeley's AMPLab in the form of a partnership with Databricks. Shark/Spark is rapidly becoming the hot tech around real-world analytic projects that deliver business value.

Going with MapR does carry some risk, since MapR has decided to replace some critical pieces of Apache code with their own.

Then there are the newly minted independents that are building off the Apache main source code trunk.
Bunnyworks.net is a group quietly putting out an Apache "derivative" based solely on the approved 2.2 code line without additions. They are calling it pHd 2.2. An analogy would be: pHd 2.2 is to Apache 2.2 as CentOS is to Red Hat.

More than ever, users of Hadoop-based technology really need to investigate and understand what they're getting when buying into different Hadoop-based product versions.

Caveat emptor!