Wednesday, December 5, 2012

Big Data ETL Patterns and Architectures

Introduction:
Traditional approaches to Data Integration (DI) involve processing data via an ETL system that moves data from source to target with data transformation as a middle step. This can be performed in stream, at the source, at the target (ELT), and in combinations in series or parallel. As the volume, velocity, and variety of data push ETL to extremes, customers are adopting clever architectures to handle Big Data, and the location and techniques of traditional ETL are changing.
The cases listed here are not entirely distinct, and there is overlap between the approaches; combinations and hybrids are common.

1 – Unknown source and target schema, evolving metadata

Traditional ETL to/from RDBMS and flat-file sources has assumed that the target schema, source data format, and organization of the data (usually by row) are fixed and known. On Hadoop platforms this assumption is not always valid, and it is in many ways counter to Hadoop’s philosophy of unstructured, flexible storage of massive amounts of data.

Hadoop and NoSQL platforms empower users to store larger data sets with lower “value” density and indeterminate structure. This “store the data now, before you know how and what you’re going to use it for” philosophy poses real challenges for SQL DBMS processing.

Extracting useful data out of a data dump requires placing metadata descriptions on the data after it has been loaded. This need was the motivation behind the Hive metadata project. Hive allows relational-style access to data previously stored as files in HDFS. Natively this metadata is kept in Apache Derby, though most installations use MySQL. The approach was further refined by the introduction of HCatalog, which rides on top of a Hive metastore.
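To make the schema-on-read idea concrete, here is a minimal sketch (Python, shelling out to the Hive CLI) that lays an EXTERNAL table over raw files already sitting in HDFS. The path, table name, and columns are hypothetical; nothing about the files themselves is moved or rewritten.

```python
import subprocess

# Hypothetical example: raw, tab-delimited web logs were already landed in HDFS
# without any declared structure. A Hive EXTERNAL table drapes metadata over
# the existing files in place (schema-on-read); nothing is copied or reformatted.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS raw_weblogs (
  event_time STRING,
  user_id    STRING,
  url        STRING,
  payload    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
LOCATION '/data/landing/weblogs/';
"""

# Run it through the Hive CLI; the definition lands in the metastore
# (Derby or MySQL), and HCatalog can expose it to Pig or MapReduce jobs as well.
subprocess.check_call(["hive", "-e", ddl])
```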

This paradigm is “schema-on-read” and requires highly skilled staff to analyze, develop, and maintain. Early adopters built complex DI processes and even home-brew ETL tools to support a specific environment. Facebook has such a tool, called DataBee, which requires highly skilled engineers even for simple ETL tasks.

An architecture that supports this type of data environment and its requirements was described by Ben Werther and Kevin Beyer of Platfora as “Agile Iterative ETL.” This approach assumes that use of the data will be refined over time as more is learned about the contents of the data dump and the needs of the target system.

As Werther and Beyer describe it:
1. Land raw data in Hadoop,
2. Lazily add metadata (Hive or HCatalog),
3. Iteratively construct and refine marts/cubes based on the metadata from step 2 (sketched below).
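A rough sketch of what one pass through step 3 might look like, reusing the hypothetical raw_weblogs table from the earlier example; on each iteration the SELECT is adjusted, or the mart is rebuilt, as more columns are named in the metadata and the target requirements firm up.

```python
import subprocess

# One pass through step 3: derive a small mart from the raw table named in the
# lazily added metadata, then refine or rebuild it on later iterations as the
# understanding of the data and the target requirements evolve.
build_mart = """
CREATE TABLE IF NOT EXISTS weblog_daily_mart AS
SELECT to_date(event_time)     AS event_day,
       url,
       COUNT(*)                AS hits,
       COUNT(DISTINCT user_id) AS unique_users
FROM raw_weblogs
GROUP BY to_date(event_time), url;
"""
subprocess.check_call(["hive", "-e", build_mart])
```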

Where does (1) work best:

- Existing piles of data that have indeterminate and varying structure
- Evolving requirements in target systems
- Only limited joins or complex transformations are needed from Hadoop
- Target and consumer systems can accept dump-and-pump cycles
- {More cases?}

2 – Streaming sources, data subset EDW, continuous enrichment

In cases where there is a large volume and velocity of data that is either continuous or has a high acceleration profile, a data bus plus Hadoop/EDW architecture may be a solution. Typical sources of data streams are telco services and devices, web click streams, instrumentation, and sensor devices. Many solutions in this class drive toward a configuration where known stream values are dashboarded for near-real-time user consumption. Time-limited subsets of the stream are also put into an EDW (e.g., Teradata) for short-range look-back and drill-in analysis.
LinkedIn, Facebook, Ancestry.com, and multiple financial services companies have a message-based data bus that simultaneously feeds the EDW and Hadoop. These feeds are raw data conduits providing little if any filtering or transformational support to stream subscribers. The EDW has either a filter or a light transformational front end that lands data in predefined structures. From this point, BI tools and analysis proceed as they would on any traditional EDW.
Hadoop or NoSQL feeds are loaded in raw stream format for bulk storage with minimal filtering or transformation, creating a stream log of the data. The purpose is to satisfy retrospective and “what-if” requests from unknown and unanticipated data consumers. In Hadoop implementations the business justification is not always ROI but easing the fear of missing critical data.
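As a hedged illustration of the dual-feed pattern, the sketch below assumes a Kafka-style bus and the kafka-python client: every message is appended untouched to a stream log directory in HDFS, while a lightly filtered and shaped subset is staged for the EDW’s bulk loader. The topic, fields, and paths are all made up for the example.

```python
import json
import subprocess
import tempfile

from kafka import KafkaConsumer  # assumed client library for a Kafka-style bus

# Subscribe to one stream on the data bus; broker address and topic are made up.
consumer = KafkaConsumer("clickstream", bootstrap_servers="databus.example.com:9092")

raw_batch, edw_batch = [], []
for msg in consumer:
    raw_batch.append(msg.value)                  # Hadoop side: keep everything, untouched
    event = json.loads(msg.value)
    if event.get("event_type") == "purchase":    # EDW side: a known BI case only
        edw_batch.append("%s|%s|%s" % (event["ts"], event["user_id"], event["amount"]))

    if len(raw_batch) >= 10000:                  # micro-batch the writes
        with tempfile.NamedTemporaryFile("wb", delete=False, suffix=".log") as f:
            f.write(b"\n".join(raw_batch))
        # Land the raw micro-batch in the HDFS stream log directory.
        subprocess.check_call(["hadoop", "fs", "-put", f.name,
                               "/data/streamlog/clickstream/"])
        # edw_batch would be handed off to the warehouse's bulk-load utility here.
        raw_batch, edw_batch = [], []
```

In practice the EDW-side staging would use whatever load utility the warehouse provides; the point is that both subscribers read the same bus and neither blocks the other.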
The Hadoop data store can be referenced as the business discovers missing data elements or requires access to past data. The ETL process is performed either one time (catch-up) or intermittently based on data feed frequency. This is sometimes referred to as continuous EDW enrichment. Users with large datasets (> 1 PB) struggle to get the EDW right on the first pass and use this approach as a safety net.
Here is a typical flow:
1. Data sources are placed on a data bus: Tibco, BEA, BizTalk, JBoss ESB, WebSphere … Kafka (LinkedIn)
2. The EDW subscribes to specific stream types and sources for known BI cases
3. Hadoop records most (if not all) of the data stream
4. BI/data appliances subscribe to the data bus and the EDW to provide BI
5. New requirements drive ETL from Hadoop to enrich the EDW (sketched below)
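A minimal sketch of step 5, again shelling out to Hive: when a new BI requirement surfaces a field that was never loaded into the EDW but has been sitting in the stream log all along, a one-time (catch-up) query exports the history, and the same query can then be scheduled per feed cycle. Table, column, and directory names are illustrative only.

```python
import subprocess

# Catch-up extraction of a newly required field from the Hadoop stream log.
backfill = """
INSERT OVERWRITE DIRECTORY '/data/export/referrer_backfill'
SELECT event_day, user_id, referrer_url
FROM   clickstream_streamlog
WHERE  event_day >= '2012-01-01';
"""
subprocess.check_call(["hive", "-e", backfill])

# The exported files are then handed to the EDW's bulk-load utility
# to enrich the existing warehouse tables (continuous EDW enrichment).
```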

Where does (2) work best:

- Streaming data sources such as click streams, web logs, sensors, and SMDs (smart mobile devices)
- The BI platform has limited capacity and can hold only a subset of the data
- User BI requirements are evolving and look-back is expected
- The EDW and BI platforms require enrichment, and analysis needs to discover hidden data of value
- Data sources are connected and disconnected based on need (dynamic)
- {More cases?}


Oracle’s view of the streaming Big Data ecosystem

3 – Massive size, distributed platform, computationally intense or complex

One emerging Big Data scenario presenting a real challenge to ETL and EDW applications involves massive, computationally complex, or distributed data sources. In many applications, the augmentation of real-world systems with data processing and data gathering exceeds the ability of centralized collection and processing. This does not minimize or negate the business requirement to extract value and intelligence. To illustrate the scenarios represented by this case, I’ll briefly describe two current applications with these characteristics. Both are currently solved by waterfalling data through systems that eventually land in a traditional, Teradata-like EDW.
The first case to consider is cellular service provider and cell tower data. Each cell tower has an impressive amount of technology used to authorize, monitor, route, and record call and data quality. Requirements for <10 ms response times have driven these sites to localize a significant database of information for cellular subscriber services. These databases hold technical information about the RF environment and monitoring of phones currently online. Sites must record information about call duration, quality, data connections, and routing for transient handsets, while also responding to incoming command-and-control traffic and forwarded data/voice traffic. It is estimated that an average cellular site processes and stores 1 TB/day of new data! Much of that data must be discarded simply because the overhead of backhauling it would consume an unacceptable percentage of available bandwidth. Verizon reportedly has almost 40,000 sites in the USA.
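As a rough back-of-the-envelope figure, 40,000 sites at roughly 1 TB/day each works out to something on the order of 40 PB of new data per day across the network, which is why wholesale backhaul to a central collection point is a non-starter.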
The solution currently used is simply to cull essential bits of data from each site into aggregation sites, which then roll the data up into an EDW. Telcos are trying to address this massive problem and have yet to field a reliable solution. They are motivated by increasing demands to provide cellular user features and to improve operational efficiencies at macro and micro levels. In this case ETL is relatively fixed, closely tied to infrastructure, and provides little flexibility to business users if their data requirements change. Each change is a massive rollout with significant risks.
The second case is sensor-like data from utility Smart Grid systems. PG&E (California) has approximately 16 million meters that report every 15 minutes to every hour, 24x7x365. This stream of data is continuous and highly structured in format. The problem PG&E faces is the need to process the data within fixed time frames to satisfy regulatory compliance and customer expectations. A significant amount of processing must also be done in specialized systems that interpret analog values into digestible integer data. While the volume is relatively predictable, the distributed grid and the computational conversion processing for the dataflow are enormous.
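A quick estimate of the scale, assuming those 15-minute to hourly intervals: 16 million meters generating roughly 24 to 96 readings per meter per day is somewhere between about 0.4 and 1.5 billion interval readings daily, each of which has to be converted and validated inside the compliance window.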
Addressing PG&E’s problem of landing the incoming streams is only about 30% of the requirement. This data then feeds a wide array of back-end systems critical to PG&E operations: billing, service provisioning, monitoring, safety audits, usage and Green Energy programs, power and gas capacity planning, disaster alerting and service interruption intervention, and service personnel dispatching, to name just a few.
The flow in these two cases looks like this:
1. Data site source – initial computations; filter and forward to a hub (see the sketch after this list)
2. Hub site – filters, aggregates, stores, and forwards to a collection center
3. Collection center – combines data and resolves conflicts; light transformation, maybe some ETL
4. Centralized data landing platform – the first place where most of the data is combined into an integrated data architecture
5. Back-end data systems – platforms that ETL data from (4) as needed to feed line-of-business platforms
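Here is a hedged sketch of step 1, the data-site source, in Python; the hub address, record fields, and aggregation rules are hypothetical. The point is that the bulk of the raw readings never leave the site: only locally computed summaries and flagged exceptions are forwarded, and each hub repeats the same pattern upward.

```python
import json
import socket
import time

HUB = ("hub.region1.example.net", 9000)  # hypothetical upstream hub

def summarize(window):
    """Collapse one local measurement window into the few fields worth backhauling."""
    return {
        "site_id":     "cell-0421",
        "window_end":  time.time(),
        "calls":       len(window),
        "avg_quality": sum(r["quality"] for r in window) / max(len(window), 1),
        "dropped":     sum(1 for r in window if r["dropped"]),
    }

def forward(summary):
    """Ship only the summary to the hub; the raw detail stays (or ages out) on site."""
    with socket.create_connection(HUB) as conn:
        conn.sendall((json.dumps(summary) + "\n").encode("utf-8"))
```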

Where does (3) work best:

- Systems with distributed and complex network topologies
- Data has computational complexity based on source-localized parameters
- The size and number of data sources prevent mass ETL to a central site
- Flow is continuous, dispersed, and unpredictable
- The central EDW has a fixed-time SLA for processing across all the source data
- {more cases?}

Smart Meter Data Ecosystem

Cellular Network

Citations:

Ben Werther and Kevin Beyer (Platfora), “Using Hadoop to do Agile Iterative ETL”: http://strataconf.com/stratany2012/public/schedule/detail/26361