[Image via Apache Incubator]

Overview

On 02/22/2012 I attended a Hortonworks webinar detailing the overall capabilities of a new Hadoop tool/layer called HCatalog. The basic premise is that HCatalog provides a shared table abstraction over data in Hadoop, so the same data can be read and written from any of the standard Hadoop tools (Pig, Hive, or MapReduce) regardless of where or in what format it is stored. This makes it much easier to access data, since custom loaders for each data source become unnecessary.
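To give a feel for what that looks like from plain MapReduce, here is a minimal sketch of a job that reads an HCatalog-managed table by name rather than by file path. This wasn't shown in the webinar; the package and class names follow the 0.4-era Java API as I understand it and may differ between HCatalog versions, and the "web_logs" table (with a string in column 0) is just a made-up example.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hcatalog.data.HCatRecord;
import org.apache.hcatalog.mapreduce.HCatInputFormat;
import org.apache.hcatalog.mapreduce.InputJobInfo;

public class ReadWebLogs {

    // The mapper sees each row as an HCatRecord, regardless of the
    // underlying file format (Text, Sequence, RCFile, ...).
    public static class LogMapper
            extends Mapper<WritableComparable, HCatRecord, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(WritableComparable key, HCatRecord row, Context ctx)
                throws IOException, InterruptedException {
            // Assumes column 0 of the hypothetical web_logs table is a string.
            ctx.write(new Text(row.get(0).toString()), ONE);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "read-web-logs");
        job.setJarByClass(ReadWebLogs.class);

        // Ask HCatalog for the table by name: no file paths, no custom loaders.
        HCatInputFormat.setInput(job,
                InputJobInfo.create("default", "web_logs", null));
        job.setInputFormatClass(HCatInputFormat.class);

        job.setMapperClass(LogMapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The same table would be read from Pig with HCatLoader, and from Hive it is just another table to query; none of those users need to know where the files actually live.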

Presentation

The talk itself was given by Alan F. Gates (github | twitter), one of the co-founders of Hortonworks. He is a committer on Pig and HCatalog and wrote the O'Reilly book Programming Pig. HCatalog itself was apparently started at Yahoo.

One of the biggest strengths I saw in HCatalog came from some early slides where Alan basically said that "sharing data is hard." The specific example given was that a programmer using Pig might load and process some data, then dump it somewhere in HDFS for an analyst to use. That analyst wants to do their own work using Hive, since it has a SQL-like language they understand. The analyst has to figure out where the data is and then use a rather complicated command to load it into Hive. Only then can they finally run whatever it is they want on it, and they still have to kick all of this off manually.

HCatalog attempts to solve these pain points in two ways. The first, as mentioned, is that it provides a layer of abstraction over the physical location of the data. Pig could instead store the data into an arbitrary "ProcessedData" table in HCatalog, and the analyst could open that same HCatalog "ProcessedData" table with Hive. In doing this, they also won't have to worry about transforming the data from a form that Pig outputs into a form that Hive understands. Instead, it just works. The second major strength is that the analyst in question doesn't even need to start anything manually. HCatalog currently provides a rudimentary event system over JMS, so upon completion, the Pig job above could notify the Hive job to start. No manual interaction required; again, it just works.
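For the notification piece, a downstream consumer would just listen on whatever JMS topic the HCatalog server publishes to. The sketch below is my own minimal example using the standard JMS API with an ActiveMQ client; the broker URL and topic name are assumptions (HCatalog's actual topic naming and message contents weren't covered in the webinar), and the "kick off the Hive job" part is left as a comment.

```java
import javax.jms.Connection;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.MessageListener;
import javax.jms.Session;
import javax.jms.Topic;

import org.apache.activemq.ActiveMQConnectionFactory;

public class ProcessedDataListener {
    public static void main(String[] args) throws Exception {
        // Assumed broker URL and topic name; in practice these would come
        // from the HCatalog/metastore configuration.
        Connection conn = new ActiveMQConnectionFactory("tcp://localhost:61616")
                .createConnection();
        Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Topic topic = session.createTopic("hcat.default.ProcessedData");

        MessageConsumer consumer = session.createConsumer(topic);
        consumer.setMessageListener(new MessageListener() {
            @Override
            public void onMessage(Message message) {
                // A real listener would inspect the event type (e.g. a
                // partition being added) and launch the analyst's Hive
                // job from here.
                System.out.println("HCatalog event received: " + message);
            }
        });

        conn.start();                 // begin delivering messages
        Thread.sleep(Long.MAX_VALUE); // keep the listener alive
    }
}
```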

Operations

There was a segue into the operations aspects of using HCatalog. It is capable of treating disparate underlying data layouts as part of the same table, which means that old data can coexist with new data that has new columns, all in the same table. Calling something like ALTER TABLE therefore doesn't require reformatting any of the pre-existing data; the new schema only applies to data written afterwards, and the missing columns in the old data simply come back as nulls. Another operational benefit comes from hiding the underlying file locations: tables can be physically moved around without breaking user applications.
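As a rough illustration of that schema-evolution point, here is a minimal sketch of adding a column over Hive's JDBC interface. The driver class, connection URL, and table/column names are assumptions (they vary between Hive versions and setups, and weren't part of the webinar); the point is only that the ALTER TABLE is a metadata-only change.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class AddColumnExample {
    public static void main(String[] args) throws Exception {
        // Assumed HiveServer JDBC driver and URL; adjust for your Hive version
        // (e.g. org.apache.hive.jdbc.HiveDriver and jdbc:hive2://... for HiveServer2).
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default", "", "");
        try {
            Statement stmt = conn.createStatement();
            // Metadata-only change: no existing files are rewritten. Rows
            // written before this point simply return NULL for the new column.
            stmt.execute("ALTER TABLE processed_data ADD COLUMNS (referrer STRING)");
            stmt.close();
        } finally {
            conn.close();
        }
    }
}
```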

Future Work

The next release, 0.4, should come out next month. It will include the Hive, Pig, and MapReduce integration. It can support any data format that has a Hive SerDe (currently Text, Sequence, RCFile, and JSON), since HCatalog just uses the Hive storage formats underneath. The JMS event notification will also be present. They claim to have "basic" HBase integration, but didn't say what that entails. For future versions, they are hoping to improve that integration, particularly around the new security features. Currently HCatalog relies entirely on HDFS permissions for its security model. They are also hoping to soon have a complete REST API over JSON.

Future Directions

Basically, they want to be able to store semi-structured and unstructured data. They did not go into details about how. They did go into some detail about the data lifecycle process and how HCatalog can fit into a few parts of it. One example was archiving, for legal reasons and the like; most archiving goes to another Hadoop cluster or a data warehouse. Another area is replication, specifically getting the same data sets to global companies' clusters all over the world. Compaction is generally performed on data more than a few days or a week old, and that data currently gets stuffed into .har files, an HDFS archive format. The really old stuff gets deleted in the cleaning phase of the data lifecycle. The way HCatalog fits into the data lifecycle is by providing basic implementations of these phases along with interfaces around them. For example, metadata on an HCatalog table could say to delete its data after a month; that would be the basic implementation. The interface would then allow more sophisticated plugins to change this behavior.
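To make that last idea concrete, here is a purely hypothetical sketch of what such a pluggable lifecycle interface might look like. Nothing like this was shown in the webinar; the interface, class, and method names are all invented for illustration.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical plugin interface: given a table and one of its partitions,
// decide what the lifecycle step should do with it.
interface RetentionPolicy {
    enum Action { KEEP, COMPACT, ARCHIVE, DELETE }

    Action evaluate(String tableName, String partitionName, long partitionAgeMillis);
}

// The "basic implementation" from the talk's example: delete anything
// older than a configured number of days (e.g. a month).
class MaxAgePolicy implements RetentionPolicy {
    private final long maxAgeMillis;

    MaxAgePolicy(int maxAgeDays) {
        this.maxAgeMillis = TimeUnit.DAYS.toMillis(maxAgeDays);
    }

    @Override
    public Action evaluate(String tableName, String partitionName, long partitionAgeMillis) {
        return partitionAgeMillis > maxAgeMillis ? Action.DELETE : Action.KEEP;
    }
}
```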

Another area they want to look into is partitioning data across different storage systems. It would be awesome if fresh new data could be stored in HBase, to be looked at piece by piece, and then after a few days be shoved into HDFS to be used for batch processing. They would also like to expand HCatalog to other massively parallel storage platforms, like Cassandra and MongoDB. Most companies have a bunch of different storage platforms, so supporting multiple data stores makes it easier for everyone to work with Hadoop. One last piece of future work is storing HCatalog's metadata in HBase instead of an RDBMS, because oftentimes there is simply too much metadata.

Comment