Sunday, April 20, 2014

Hadoop Distributions? Which Version? The Hype Wheel Spins Again!

It looks like the major Hadoop vendors are battling it out for mind share in the Big Data community.

First came the major announcement from Hortonworks about their next round of funding.

Then came the news that Intel was dropping its own distro and backing Cloudera with $740M for an 18% stake.

The only player not jumping into the media blitz is MapR.

This latest round of buffoonery is driving a lot of back-and-forth marketing noise as the market consolidates and each vendor tries to assert its value and differentiation.  It is also driving some misleading press releases from both Cloudera and Hortonworks.  Hortonworks is fighting for its life, and honestly I have no idea how they plan to stay relevant.

The tired go-to-market position of, "We've got the most Hadoop committers and ours is 100% Open Source" is wearing thin.

This latest announcement from Arun Murthy of Hadoop 2.4 is a thinly veiled Hortonworks product announcement that in many ways violates the spirit of the Apache Software Foundation's recommendations on what constitutes Hadoop and its procedures for press announcements. I guess they are getting ready for Hadoop Summit in June and want to steal the spotlight for a moment.

And for the educated reader (someone who takes the time to decode this mess), the last Hadoop release labeled "GA" is 2.2, not 2.4.

Furthermore, I leave it to the reader to attempt to build and verify Apache Hadoop 2.4 so that it builds and runs like the derivative 2.4 from Hortonworks.  I think this will be an enlightening exercise.

To Horton's credit, they are still (as of 4/20/2014) listing 2.1 as their current version, which in some ways is even more confusing than just sticking with the 2.2 Apache GA version.  The announcement of Query Grid by Hortonworks+Teradata is nothing more than a continued refinement of Teradata's UDA strategy. Eventually Teradata is going to wake up and realize that Hortonworks is a dead end and is only riding their coat-tails into major accounts. Bolting SQL-H onto Stinger only makes the stack more prone to failure.

Cloudera is not blameless in this war of announcement and counter-announcement. The latest series of videos and infomercials about Impala versus Presto | Stinger | Hive is just plain junk, especially regarding Facebook's Presto capabilities at scale, which at the moment Impala has a "snowball's chance in h311" of being able to handle.  And how does the TPC ignore their constant, very loose use of "TPC-DS" when discussing benchmark results plainly meant as a sales pitch?

So what to do??

Well, MapR makes no bones about their value proposition and is quietly building a reputation for quality and reliability. They are embracing emerging technology from Berkeley's AMPLab in the form of a partnership with Databricks. Shark/Spark is rapidly becoming the hot tech around real-world analytic projects that deliver business value.

Going with MapR does have some risks since MapR has decided to replace some critical pieces of Apache code with their own.

Then there are the newly minted independents building off the Apache main source code trunk. One group is quietly putting out an Apache "derivative" based solely on the approved 2.2 code line without additions. They are calling it pHd 2.2.  An analogy: "pHd 2.2 is to Apache 2.2" as "CentOS is to Redhat".

More than ever, users of Hadoop based technology really need to investigate and understand what they're getting when buying into different Hadoop based product versions.

Caveat emptor!

Tuesday, August 27, 2013

Simple way to get 30% better HDFS read performance

I was working with a high-speed cluster this week doing some benchmarks.

I had assumed that the vendor engineering team was sharp enough to have set up the SLES 11 system with best practices.

I ran my first tests and noticed that the individual datanodes were performing some write activity during a purely read-only operation.  This was puzzling since I was trying to get maximum read numbers.  I double-checked my tests to verify I wasn't doing some unintentional write op.

After spending about 30 minutes looking it over, it dawned on me to check the file system.  I issued a 'mount' command and didn't see the 'noatime' flag set.

I looked at /etc/fstab and sure enough it was missing.
So I added the noatime flag to each data mount and issued a 'mount -o remount'.

Reran the tests and pow! About 30% better performance.

What is noatime?

This flag tells Linux not to update the last-accessed time on the files that make up the HDFS blocks.  There is no point in maintaining this information underneath Hadoop.
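A quick way to audit this across a cluster is to scan /proc/mounts-style output and flag data disks whose option list lacks noatime. Here is a minimal sketch; the device and mount-point names are made up for illustration:

```python
def mounts_missing_noatime(mounts_text, fstypes=("ext3", "ext4", "xfs")):
    """Return mount points whose mount options lack 'noatime'.

    mounts_text is text in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    missing = []
    for line in mounts_text.splitlines():
        parts = line.split()
        if len(parts) < 4:
            continue
        device, mountpoint, fstype, options = parts[:4]
        if fstype in fstypes and "noatime" not in options.split(","):
            missing.append(mountpoint)
    return missing

# Hypothetical datanode mounts: /data2 is missing the flag
sample = """\
/dev/sda1 / ext4 rw,noatime 0 0
/dev/sdb1 /data1 ext4 rw,noatime 0 0
/dev/sdc1 /data2 ext4 rw 0 0
"""
print(mounts_missing_noatime(sample))  # -> ['/data2']
```

Once a mount is flagged, add noatime to its options in /etc/fstab and apply it live with 'mount -o remount', as described above.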

Dave W

Wednesday, December 5, 2012

Big Data ETL Patterns and Architectures

Traditional approaches to Data Integration (DI) involve processing data via an ETL system that moves data from source to target with data transformation as a middle step. This can be performed in stream, at source, at target (ELT), or in combinations in series or parallel. As the volume, velocity, and variety of data push ETL to extremes, customers are adopting clever architectures to handle Big Data, and the location and techniques of traditional ETL are transforming.
The cases listed here are not necessarily distinct, and there is overlap between approaches. Combinations and hybrids are common.

1 - Unknown Source and Target Schema, evolving metadata.

Traditional ETL to/from RDBMS and flat-file sources have assumed that target schema, source data format, and organization of data (usually by row) was fixed and known.  In Hadoop platforms this assumption is not always valid and is in many ways counter to Hadoop’s philosophy of unstructured and flexible storage of massive amounts of data. 

Hadoop and noSQL platforms empower users to store larger data sets with lower "value" density and of indeterminate structure. This "store the data now, before you know how and what you're going to use it for" philosophy poses great challenges for SQL DBMS processing.

Extracting useful data out of a data dump requires placing metadata descriptions on the data after loading. This need was the motivation behind creation of the Hive metastore. Hive allows a relational style of access to data previously stored in HDFS files. Natively this metadata is kept in Apache Derby, but most installations use MySQL.  This was further refined by the introduction of HCatalog, which rides on top of a Hive metastore.

This paradigm is “schema-on-read” and requires highly skilled staff to analyze, develop, and maintain. Early adopters built complex DI processes and even home-brew ETL tools to support a specific environment. Facebook has a tool like this called DataBee that requires highly skilled engineers to do even simple ETL tasks.

An architecture that supports this type of data environment and requirements was described by Ben Werther and Kevin Beyer of Platfora as “Agile Iterative ETL.”  This approach assumes that the use of data will be refined over time as more information is gained about the contents of the data dump and the needs of the target system.

As Werther and Beyer describe it:
1.       Land raw data in Hadoop,
2.       Lazily add metadata (Hive or HCatalog),
3.       Iteratively construct and refine marts/cubes based on metadata from step 2.
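The three steps above can be sketched in miniature: land raw text as-is, attach a schema only at read time, and refine a derived view as understanding improves. A toy illustration in Python; the field names are invented, not taken from any real metastore:

```python
import csv
import io

# Step 1: raw data landed as-is, no schema enforced at write time
RAW = "2012-12-05,us-west,200\n2012-12-06,us-east,503\n"

# Step 2: metadata added lazily, after the fact (the role Hive/HCatalog play)
SCHEMA = ["event_date", "region", "http_status"]

def read_with_schema(raw_text, schema):
    """Schema-on-read: interpret the stored bytes with the current metadata."""
    return [dict(zip(schema, row)) for row in csv.reader(io.StringIO(raw_text))]

# Step 3: iteratively construct a derived "mart" from the interpreted records
records = read_with_schema(RAW, SCHEMA)
errors = [r for r in records if r["http_status"].startswith("5")]
print(len(records), len(errors))  # -> 2 1
```

If the schema later turns out to be wrong, only step 2's metadata changes; the landed data in step 1 is never rewritten, which is the whole point of the lazy approach.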

Where does (1) work best:

-          Existing piles of data that have indeterminate and varying structure
-          Evolving requirements in target systems
-          Limited and minimal joins or complex transformations from Hadoop
-          Where target and consumer systems can accept dump-and-pump cycles
-          {More cases?}

2 – Streaming sources, data subset EDW, continuous enrichment

In cases where there is a large volume and velocity of data that is either continuous or has a high acceleration profile, a data bus plus Hadoop/EDW architecture may be a solution. Typical sources of data streams are telco services and devices, web click streams, instrumentation, and sensor devices. Many of this class of solutions drive a configuration where known stream values are dashboarded for near-real-time user consumption.  Time-limited subsets of the stream are also put into an EDW (Teradata) for short-range look-back and drill-in analysis.
LinkedIn, Facebook, and multiple financial services companies have a message-based data bus that simultaneously feeds the EDW and Hadoop.  These feeds are raw data conduits providing little if any filtering or transformational support to stream subscribers. The EDW has either a filter or a lite transformational front end that lands data in predefined structures.  From this point BI tools and analysis proceed as they would on any traditional EDW.
Hadoop or noSQL feeds are loaded in raw stream format for bulk storage with minimal filtering or transformation. This creates a stream log of data.  The purpose is to satisfy retrospective and “what-if” requests from unknown and unanticipated data consumer requirements.  In Hadoop implementations the business justification is not always ROI but easing the fear of missing critical data.
A Hadoop data store can be referenced as businesses discover missing data elements or require access to past data. The ETL process is performed either one time (catch-up) or intermittently based on data feed frequency.  This is sometimes referred to as continuous EDW enrichment.  Users with large datasets (> 1PB) struggle with getting the EDW right on the first pass and use this approach as a safety net.
Here is a typical flow:
1.       Data sources placed on a data bus: Tibco, BEA, BizTalk, JBoss ESB, WebSphere … Kafka (LinkedIn)
2.       EDW subscribes to specific stream types and sources, known BI cases
3.       Hadoop records most (or all) of the data stream
4.       BI/Data appliances subscribe to data bus and EDW to provide BI
5.       New requirements drive ETL from Hadoop to enrich EDW
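In miniature, the bus fan-out in the flow above looks like this: one raw subscriber records (nearly) everything for Hadoop, while the EDW subscriber filters down to known BI cases. A sketch using an invented in-process bus, not any particular ESB's API:

```python
class DataBus:
    """Toy publish/subscribe bus; real deployments use Kafka, Tibco, etc."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, handler):
        self.subscribers.append(handler)

    def publish(self, message):
        # Every subscriber sees every message; filtering is the subscriber's job
        for handler in self.subscribers:
            handler(message)

hadoop_log, edw = [], []
bus = DataBus()

# Step 3: Hadoop records the raw stream with minimal filtering
bus.subscribe(hadoop_log.append)
# Step 2: the EDW subscribes only to known, structured BI cases
bus.subscribe(lambda m: edw.append(m) if m["type"] == "click" else None)

for msg in [{"type": "click", "url": "/home"},
            {"type": "sensor", "value": 17},
            {"type": "click", "url": "/cart"}]:
    bus.publish(msg)

print(len(hadoop_log), len(edw))  # -> 3 2
```

Step 5 then amounts to a later batch job reading back from the raw log (here, hadoop_log) to enrich the EDW once new requirements surface.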

Where does (2) work best:

-          Streaming data sources like click stream, web logs, sensors, SMD (smart mobile devices)
-          BI platform has limited capacity and can only subset data
-          User BI requirements are evolving and look-back is expected
-          EDW and BI platforms require enrichment as analysis discovers hidden data of value
-          Data sources are connected and disconnected based on need, dynamic
-          {More cases?}

Oracle’s view of streaming BigData Ecosystem

3 – Massive size, distributed platform, computationally intense or complex

One emerging Big Data scenario presenting a real challenge to ETL and EDW applications is massive, computationally complex, or distributed data sources. There are many applications where augmenting real-world systems with data processing and data gathering exceeds the ability of centralized collection and processing. This does not minimize or negate business requirements for extracting value and intelligence.  To illustrate possible scenarios represented by this case, I'll briefly describe two current applications with these characteristics.  Both cases are currently solved by waterfalling data through systems, eventually landing in a traditional Teradata-like EDW.
The first case to consider is cellular service providers and cell tower data.  Each cell tower has an impressive amount of technology used to authorize, monitor, route, and record call and data quality. Requirements for <10ms response times have driven these sites to localize a significant database of information for cellular subscriber services. These databases hold technical information related to the RF environment and monitoring of phones currently online. Sites must record information about call duration, quality, data connections, and routing for transient handsets. Demand includes responding to incoming command-and-control or forwarded data/voice traffic.  It is estimated that an average cellular site processes and stores 1Tb/day of new data!  Much of that data must be discarded simply because the overhead to backhaul it would consume an unacceptable percentage of available bandwidth. Verizon reportedly has almost 40,000 sites in the USA.
The solution currently used is to simply cull essential bits of data from each site to aggregation sites, which then roll the data up into an EDW.  Telcos are trying to address this massive problem and have yet to field a reliable and dependable solution. They are motivated by increasing demands to provide cellular user features and improve operational efficiencies at macro and micro levels. In this case ETL is more fixed and closely tied to infrastructure, and it provides little flexibility to business users if their data requirements change. Each change is a massive rollout with significant risks.
The second case is sensor-like data from utility Smart Grid systems.  PG&E (California) has approximately 16 million meters that report every 15 minutes to every hour, 24x7x365. This stream of data is continuous and very structured in format.  The problem PG&E faces is the need to process data in fixed time frames to satisfy regulatory compliance and customer expectations. A significant amount of processing must also be done in specialized systems that can interpret analog values into digestible integer data. While the volume is relatively predictable, the distributed grid and the computational conversion processing for the dataflow are enormous.
Just addressing PG&E's problem of landing the incoming streams is only 30% of the requirement. This data then feeds a wide array of back-end systems critical to PG&E operations: billing, service provisioning, monitoring, safety audits, usage and Green Energy programs, power and gas capacity planning, disaster alerting and service interruption intervention, and service personnel dispatching are just a few.
These two cases look like:
1.       Data site source - initial computations, filter and forward to hub
2.       Hub site - filters, aggregates, stores and forwards to collection center
3.       Collection center - combines data and resolves conflicts, lite transformation, maybe some ETL
4.       Centralized data landing platform – first locale where most of data is combined into an integrated data architecture
5.       Backend Data Systems – platforms that ETL data from (4) as needed to feed line of business platforms.
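The five stages above amount to progressive reduction: each tier keeps less than it receives, so most raw data never leaves its source. A compressed sketch of the first few stages; the thresholds and summary fields are invented for illustration:

```python
def site_stage(readings, threshold=10):
    """Stage 1: a site filters locally and forwards only interesting readings."""
    return [r for r in readings if r >= threshold]

def hub_stage(site_batches):
    """Stage 2: a hub aggregates each site's batch into a small summary."""
    return [{"n": len(batch), "total": sum(batch)} for batch in site_batches]

def center_stage(hub_summaries):
    """Stages 3-4: the collection center combines summaries into one record."""
    return {"n": sum(s["n"] for s in hub_summaries),
            "total": sum(s["total"] for s in hub_summaries)}

# Two sites report; most raw readings are discarded at the edge
site_a = site_stage([3, 12, 45, 7])   # -> [12, 45]
site_b = site_stage([50, 2, 11])      # -> [50, 11]
combined = center_stage(hub_stage([site_a, site_b]))
print(combined)  # -> {'n': 4, 'total': 118}
```

Stage 5's back-end systems would then ETL from the combined landing record as needed. The rigidity the post describes follows directly from this shape: changing what a site keeps means redeploying the edge filter everywhere.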

Where does (3) work best:

-          Systems with distributed and complex network topographies
-          Data has computational complexity based on source localized parameters
-          Size+Number of data sources prevents mass ETL to central site
-          Flow is continuous, dispersed, and unpredictable
-          Central EDW has fixed time SLA for processing across all the source data
-          {more cases?}

Smart Meter Data EcoSystem

Cellular Network


Using Hadoop to do Agile Iterative ETL; Ben Werther (Platfora), Kevin Beyer (Platfora)

Sunday, October 14, 2012

Big Piles Of Data - Facebook, LinkedIn, etc.

I've had the chance to work onsite at Facebook for a few weeks with their BI group doing a POC for Informatica's ETL tools against their platforms.

It was an interesting experience and I enjoyed the great food and hospitality of my Facebook colleagues. While I'm still not a fan of Facebook as a user, I do admire the energy and fast-paced environment. It is the type of place that breeds creativity and out-of-the-box thinking. The Menlo Park campus vibrates with excitement.

Some of the mottos stenciled everywhere: "Move fast and break things!"  "HACK" is plastered everywhere and is the paradigm for their development mentality. "Done is better than perfect." Think of it like Agile after downing a couple of Red Bulls!

They typically show up at 10am and work until ... it gets done!

With all the innovation and creativity Facebook still has to wrestle with down to earth problems.  How do I get data from point A to B?  What does the data mean?  Is the quality of the data good enough to make decisions and spend resources on?

The scale they work at is LARGE.   How large?  The main HDFS is pushing 110PB and growing fast. They just passed 1 billion active users and have plans to expand usage as much as possible across the globe.  Zuckerberg gave every employee a little Red Book with some thoughts on this milestone. The basic theme is "1% is not done" and we've only just started with 1% of the population.

Back to reality: the fact is that they have discovered that Big Data is really just "Big Piles Of Data," which is totally useless until you extract value from it. Relationships, likes, dislikes, needs, desires, and dreams.

And the fact that everyone is distracted by the hype around Big Data is not lost on some of them.  I think that companies like Cloudera, Hortonworks, MapR, and other wannabes for "The Hadoop Standard" are the future Sybase, Informix, and Borlands of this wave of technology.

Yes there is room for these folks to make money and some will.  They will leverage their VC money and gain market share then cash out and move to the Next Gig.  But I doubt their products will have a lasting legacy. Market consolidation is not only a certainty it has already begun.

Ultimately, as the Facebook folks realize, Hadoop is nothing but a cheap, commodity technology for storing Big Piles of Data.  As more businesses come off the euphoria of Big Data hype and realize it is NOT the silver bullet to solve their problems, more traditional software companies like Teradata/Aster, Oracle, IBM, Terracotta/Software AG, and other data analytics companies will make large inroads by supplying software and systems to perform useful business analytics.

Watch this trend in the next 12 to 18 months.  I predict it will be an exciting time on the other side of the Big Data Wave.