A blog about security, privacy, algorithms, and email in the enterprise. 



FSDataInputStream from a byte array

We're in the process of adding Hive support to Timberwolf, which involves writing files into HDFS so that they can get loaded into Hive tables. Writing to HDFS involves FSDataOutputStreams and FSDataInputStreams, which are all fine and good until you want to start writing tests. My normal approach when testing something that writes to a stream is to create it with a stream that's ultimately backed by a byte array (generally through ByteArrayOutputStream), then pull those bytes out and verify that they're all what I expect them to be. In this case, I was writing a sequence file, so I figured I could use SequenceFile.Reader to pull out my key/value pairs and check that they're correct. That is, until I tried constructing an FSDataInputStream with a ByteArrayInputStream.

Turns out, FSDataInputStream imposes requirements on its backing stream that aren't reflected in the constructor's type signature: the constructor accepts any InputStream, but throws an IllegalArgumentException at runtime unless that stream also implements Seekable and PositionedReadable. So I needed a stream that I could construct from a byte array and that also implemented those two interfaces. As it turns out, there isn't one in the org.apache.hadoop.fs namespace, so I went ahead and rolled my own: SeekablePositionedReadableByteArrayInputStream. It's not complete, since I wasn't sure what exactly seekToNewSource should do and I didn't need it for my tests, but it gets enough of the job done. Maybe it'll help you, too?
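For reference, here's a minimal sketch of the idea (not the exact class from Timberwolf). The protected fields buf, pos, and count come from java.io.ByteArrayInputStream; the method signatures below match Hadoop's Seekable and PositionedReadable interfaces, but the implements clause is left out so the sketch compiles without Hadoop on the classpath:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;

// In real code this would also declare
//   implements org.apache.hadoop.fs.Seekable, org.apache.hadoop.fs.PositionedReadable
// which define exactly the methods below.
public class SeekablePositionedReadableByteArrayInputStream extends ByteArrayInputStream {

    public SeekablePositionedReadableByteArrayInputStream(byte[] buf) {
        super(buf);
    }

    // Seekable: move the read cursor to an absolute offset.
    public void seek(long newPos) throws IOException {
        if (newPos < 0 || newPos > count) {
            throw new IOException("cannot seek to " + newPos);
        }
        this.pos = (int) newPos;
    }

    public long getPos() throws IOException {
        return pos;
    }

    public boolean seekToNewSource(long targetPos) throws IOException {
        // There's only one "source" (the array), so there's never a new one to try.
        return false;
    }

    // PositionedReadable: read at an offset without moving the cursor.
    public int read(long position, byte[] buffer, int offset, int length) throws IOException {
        if (position >= count) {
            return -1;
        }
        int toRead = Math.min(length, count - (int) position);
        System.arraycopy(buf, (int) position, buffer, offset, toRead);
        return toRead;
    }

    public void readFully(long position, byte[] buffer, int offset, int length) throws IOException {
        if (read(position, buffer, offset, length) < length) {
            throw new IOException("not enough bytes to read fully");
        }
    }

    public void readFully(long position, byte[] buffer) throws IOException {
        readFully(position, buffer, 0, buffer.length);
    }
}
```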




[Image via Apache Incubator]


On 02/22/2012 I attended a Hortonworks webinar detailing the overall capabilities of a new Hadoop tool/layer called HCatalog. The basic premise is that HCatalog provides a shared table abstraction through which the standard Hadoop tools (Pig, Hive, and MapReduce) can each read and write data, regardless of where that data is stored or which tool produced it. This makes it much easier to share data, since custom loaders for each data source become unnecessary.


The talk itself was given by Alan F. Gates (github | twitter), one of the co-founders of Hortonworks. He's a committer on Pig and HCatalog, and wrote O'Reilly's Programming Pig. HCatalog itself was apparently started at Yahoo.

One of the biggest strengths I saw in HCatalog came from some early slides where Alan basically said that "sharing data is hard." The specific example given was that a programmer using Pig might load and process some data, then dump it in HDFS somewhere for an analyst to use. That analyst wants to do their own work using Hive, since it has a SQL-like language they understand. The analyst has to figure out where the data is and then use a rather complicated command to load it into Hive. Only then can they run whatever it is they want on it, and every step of this is manual.

HCatalog attempts to solve these pain points in two ways. The first, as mentioned, is that it provides a layer of abstraction over the physical location of the data. Pig could instead store the data into an arbitrary "ProcessedData" table in HCatalog, and the analyst could open that same "ProcessedData" table with Hive. In doing this, they also won't have to worry about transforming the data from the form Pig outputs into a form Hive understands. Instead, it just works. The second major strength is that the analyst doesn't even need to manually start anything. HCatalog currently provides a rudimentary event system over JMS, so upon completion, the Pig job above could notify the Hive job to start. No manual interaction required; again, it just works.
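To make that hand-off concrete, here's a toy sketch of the notify-on-completion pattern. This is not the HCatalog API (which delivers these events over a real JMS broker); TableEventBroker and its methods are hypothetical stand-ins, with a plain in-process listener list playing the role of the message bus:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// The Pig job publishes a "table ready" event for its output table when it
// finishes; the analyst's Hive job is registered up front and starts when
// the event arrives. No one has to kick anything off by hand.
class TableEventBroker {
    private final List<Consumer<String>> listeners = new ArrayList<>();

    // Register a callback to run whenever a table's data is ready.
    void onTableReady(Consumer<String> listener) {
        listeners.add(listener);
    }

    // Called by the producing job (e.g. Pig) on completion.
    void publishTableReady(String table) {
        for (Consumer<String> listener : listeners) {
            listener.accept(table);
        }
    }
}
```

In the webinar's example, the Pig job would call publishTableReady("ProcessedData") when it finishes, and the registered callback would launch the analyst's Hive query against that same table.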


There was a segue into the operations aspects of using HCatalog. It is capable of treating disparate underlying data structures as part of the same table. This means that old data can coexist in the same table with new data that has new columns. Calling something like alter table therefore doesn't require reformatting any of the pre-existing data; only newly written data uses the new schema, and columns missing from the old data simply read as nulls. Another operations benefit comes from hiding the underlying file locations: tables can be physically moved around without breaking user applications.
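A minimal sketch of that null-filling read behavior, with a plain Map standing in for HCatalog's actual record and schema types (which I haven't looked at; this just illustrates the semantics described above):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SchemaProjector {
    // Project a stored row (possibly written under an older schema) onto the
    // table's current column list. Columns the old row predates come back as
    // null; no rewrite of the stored data is ever needed.
    static Map<String, Object> project(Map<String, Object> row, List<String> currentColumns) {
        Map<String, Object> out = new HashMap<>();
        for (String col : currentColumns) {
            out.put(col, row.get(col)); // null when the old row lacks the column
        }
        return out;
    }
}
```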

Future Work

The next release, 0.4, should come out next month. It will include the Hive, Pig, and MapReduce integration. It can currently support any data format that has a Hive SerDe (currently Text, SequenceFile, RCFile, and JSON), because HCatalog just uses the Hive formats underneath. The JMS event notification will also be present. They claim to have "basic" HBase integration, but didn't say what that entails. For future versions, they are hoping to improve that integration, particularly around the new security features. Currently HCatalog relies entirely on HDFS permissions for its security model. They are also hoping to soon have a complete REST API over JSON.

Future Directions

Basically, they want to be able to store semi-structured and unstructured data. They did not go into details about how. They did go into some detail about the data lifecycle process, and how HCatalog can fit into several of its stages. One example was archiving, for legal reasons and the like; most archived data goes to another Hadoop cluster or a data warehouse. Another area is replication, specifically getting the same data sets to a global company's clusters all over the world. Compaction is generally performed on data more than a few days or a week old, which currently gets stuffed into .har files, an HDFS archiving format. The really old stuff gets deleted in the cleaning phase of the data lifecycle. HCatalog fits into this lifecycle by providing basic implementations of these stages along with interfaces for customizing them. For example, metadata on an HCatalog table could say "delete after a month" (the basic implementation), while the interface would allow more sophisticated plugins to change this behavior.
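The "basic implementation plus pluggable interface" idea for the cleaning phase might look something like this. These type names are hypothetical, not HCatalog's; the point is just that a simple age-based policy ships by default while plugins can substitute smarter logic:

```java
import java.time.Duration;
import java.time.Instant;

// The extension point: plugins decide whether a table's data is due for cleanup.
interface RetentionPolicy {
    boolean shouldDelete(String table, Instant lastModified, Instant now);
}

// The basic implementation: "delete after a month" (or any fixed age).
class MaxAgePolicy implements RetentionPolicy {
    private final Duration maxAge;

    MaxAgePolicy(Duration maxAge) {
        this.maxAge = maxAge;
    }

    @Override
    public boolean shouldDelete(String table, Instant lastModified, Instant now) {
        return Duration.between(lastModified, now).compareTo(maxAge) > 0;
    }
}
```

A legal-hold plugin, say, could implement the same interface and refuse deletion for tables under litigation regardless of age.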

Another area they want to look into is partitioning data across different kinds of storage. It would be awesome if fresh data could be stored in HBase, to be looked at piece by piece, and then after a few days be shoved into HDFS to be used for batch processing. They would also like to expand HCatalog to other massively parallel data platforms, like Cassandra and MongoDB. Most companies have a bunch of different storage platforms, so supporting multiple data stores makes Hadoop easier for everyone to work with. One last piece of future work is storing HCatalog's metadata in HBase instead of an RDBMS, because oftentimes there is simply too much metadata.



Review: Karmasphere Analyst

This week, I took some time to evaluate Karmasphere Analyst. In particular, I was interested in how it worked with Hadoop (as opposed to MapR, which it also supports).

Setting up

The setup for Karmasphere is rather painless: a simple installer on Windows and a shell script on Linux. However, the Windows version does require Cygwin. Once open, Karmasphere divides itself into three major steps.


This is where you set up connections to existing HDFS databases. Karmasphere only supports Hive, but it's pretty nice about it... kind of. It will go through the process of installing Hive for you through a rather nice GUI, which allows you to easily specify a Derby database, MySQL database, or whatever other database you have a Java connector for. The downside to this is you can't easily use an already-existing Hive installation. This was a major shortcoming for me, but I get the impression that it should be possible to import an existing Hive database. I'll let you know as soon as the Karmasphere rep gets back to me.


Once I decided to install a new Hive metastore (which was rather painless), importing new tables from sequence files was simple for all the steps that involved Karmasphere (making the sequence file was annoying, though). I don't have a problem with how Karmasphere does this. My only real problem is that it hides away the shell it uses to interact with the Hive cluster, which seems like it might be limiting. I could be wrong, but I don't see how you could ever import anything without working through Karmasphere.


Supposedly, this is where the magic happens. The interface here was much simpler than other analytic tools', but that may be because there is no fancy drag-and-drop interface or amazing visual features. It turns out Karmasphere is a glorified query writer. In its defense, though, it's very glorified. I've written queries against Hive before, but I've never managed to write them as quickly or as painlessly as Karmasphere allows me to. The bells and whistles it brings to the table include:

  • immediate and clear feedback regarding any errors or warnings in your queries
  • one-click execution of any written queries
  • caching of past queries and results
  • effective sampling of data to test queries on smaller subsets
  • indexes of tables, columns, and the function library
  • a "Query Plan" view which shows you exactly how your query will translate into Hadoop MapReduce jobs

Once you have your data, it's pretty simple to export it in various useful formats: Excel files, SQL tables, or perhaps back into Hive. There is also some charting functionality that was relatively simple to use, although I didn't look into it much since it wasn't of interest to me.


All this makes the tool worthwhile, but I'm not sure it's worth the price (we were unable to obtain pricing information at time of publication, but will update if they get back to us). Since ultimately you are just writing queries, it doesn't add any analytic functionality we couldn't manage before. Technically, once you've written your query, you don't even need Karmasphere anymore. That said, once you have your data, it does let you do several things with it that would otherwise be difficult (export, graphing, etc.).

If you're looking to analyze your unstructured data, I would say Karmasphere is ill-suited for the task, as unstructured data tends to take more than just the SQL-like queries Hive offers. All in all, this product is useful. But once my trial runs out, I will discontinue use.


Mo' Data, Mo' Problems, E04: Pig vs. Hive

Well, really it should be "Pig or Hive" or "Pig and Hive," because these two methods of querying Hadoop tend to serve different functions. We discuss some of them, along with bandying about a little ig-pay atin-lay and an announcement of our new Exchange/HBase importer, Project Timberwolf, in this episode of our big data series.