Pentaho has two major parts: the business intelligence side and the data integration side.
Pentaho Business Analytics
The business intelligence side worked pretty well with Pentaho’s sample data, producing attractive graphs and dashboards. That sample data appears to be entirely numerical: count this, sum that, etc. The BI tool is entirely web-based and looks pretty slick; you can try an online demo here: http://www.pentaho.com/get-started/. Unfortunately, Pentaho Business Analytics only supports JDBC and SQL. Hive does have a JDBC driver, but I got strange, useless errors with it (see below). There are two issues filed for Pentaho-to-Hive support, both of which appear to be closed.
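For comparison, here is roughly what a direct Hive JDBC connection looks like outside of Pentaho. This is a sketch, not our working setup: the port (10000, Hive’s default Thrift port), the `default` database, and the URL-building helper are my own assumptions. Note the `://` separator; a URL missing the colon (`jdbc:hive//host:port`) is exactly the kind of “Invalid URL” the errors below complain about.

```java
// Sketch: connecting to Hive with the HiveServer1-era JDBC driver.
// Host "hdhbase01" and port 10000 are assumptions for illustration.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {

    // Note the "://" after "jdbc:hive" -- dropping the colon
    // ("jdbc:hive//host:port") yields an "Invalid URL" error.
    static String hiveUrl(String host, int port) {
        return "jdbc:hive://" + host + ":" + port + "/default";
    }

    public static void main(String[] args) throws Exception {
        // Register the driver, then connect and list tables.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        try (Connection conn =
                 DriverManager.getConnection(hiveUrl("hdhbase01", 10000), "", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```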
Pentaho Kettle for Big Data
Spoon is the frontend for Kettle, Pentaho’s ETL engine, which is open source. I would describe Spoon as visual SQL, but with more features. After a brief look at the wiki, I quickly found out how to get data into HBase, and could easily modify that to use IMAP to pull the data from my riparian data email into HBase (though it imported a little more than one email per second, compared to Timberwolf’s 63 emails/second). Spoon supports a ton of other sources, and you can also put data straight into HDFS or MongoDB. It also has a bunch of tools for getting data out of those data sources and into other things (Excel, Salesforce, SQL, XML, S3, and about a dozen others). There is map/reduce support, but this is where it starts to get hairy: instead of using one of the clean input and output steps, you have to use a map/reduce input and a map/reduce output, and unwrapping that information gets cryptic and confusing quickly.
HBase supports map/reduce (part of the reason we picked it), but in order to use it with Spoon, you need the mapper input format to be of the type org.apache.hadoop.hbase.mapred.TableInputFormat. That was hard to figure out, and I may or may not have gotten it correct, because nothing else ever worked. I eventually gave up after a few days, when I arrived at a NullPointerException in TableInputFormat (line 51), which might be related to this issue. Also of note: when we originally set up HBase on hdhbase01, we didn’t put its storage in HDFS, which caused problems, and I needed to move it over to HDFS. One thing to note, though, is that there are a lot of transformation operations that can be done with Spoon without involving map/reduce; they just probably won’t be as fast.
You can check out the open source version of Spoon by downloading the Kettle client here (which I had trouble finding).
More notes on trying to get HBase to work with Spoon
Hadoop's authorization is based on user name, and I was running Spoon as a local user on my workstation, not as “hduser,” the user running Hadoop in our setup, so my requests were rejected. Spoon didn't give a good message to this effect, but I could see it in the Hadoop logs. You might find these logs under ~/.hadoop/logs/hadoop-hduser-namenode-hdhbase01.example.com.log, where ~/ is the home directory of the user running Hadoop, “hduser” is the name of that user, and “hdhbase01.example.com” is the domain name of your Hadoop server.
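One possible workaround (an assumption on my part; I haven’t re-tested our setup with it): under Hadoop’s simple, non-Kerberos authentication, the client just reports its local user name, and Hadoop honors the HADOOP_USER_NAME environment variable as an override. Launching Spoon from a shell prepared like this should make it identify itself as “hduser”:

```shell
# Assumption: simple (non-Kerberos) auth, where Hadoop trusts whatever
# user name the client reports. Override it so Spoon acts as "hduser".
export HADOOP_USER_NAME=hduser
# ...then start Spoon from this same shell, e.g. ./spoon.sh
echo "Will identify to Hadoop as: $HADOOP_USER_NAME"
```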
Both the namenode and the job tracker had to have their locations set to hdhbase01:9000 and hdhbase01:9001 instead of localhost:9000 and localhost:9001, respectively. You’ll probably find these configurations in /usr/local/hadoop/conf/core-site.xml (or hdfs-site.xml, depending on your Hadoop version) and /usr/local/hadoop/conf/mapred-site.xml. I also had to copy /usr/local/hadoop/hadoop-core-*.jar and /usr/local/hadoop/lib/commons-configuration-*.jar from the Hadoop server into lib/pentaho and the HBase lib folder, replacing the old version of hadoop-core and adding the commons-configuration lib.
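For reference, the relevant properties look roughly like this (a sketch using Hadoop 1.x-era property names; your file layout and versions may differ):

```xml
<!-- core-site.xml (older versions: hadoop-site.xml): the namenode address -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://hdhbase01:9000</value>
</property>

<!-- mapred-site.xml: the job tracker address -->
<property>
  <name>mapred.job.tracker</name>
  <value>hdhbase01:9001</value>
</property>
```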
Errors encountered while attempting to get HBase to work with Spoon
I frequently got exceptions that were in some way cut off or under-descriptive, but I am unsure how much of this is HBase/Hive’s fault and how much of it is Spoon’s fault.
Error connecting to database [hive on hdhbase01] : org.pentaho.di.core.exception.KettleDatabaseException: Error occured while trying to connect to the database Error connecting to database: (using class org.apache.hadoop.hive.jdbc.HiveDriver) Invalid URL: jdbc:hive//hdhbase01:5678 org.pentaho.di.core.exception.KettleDatabaseException: Error o
Also, when I shortened the URL, I got the following:
Error connecting to database [hive on hdhbase01] : org.pentaho.di.core.exception.KettleDatabaseException: Error occured while trying to connect to the database Error connecting to database: (using class org.apache.hadoop.hive.jdbc.HiveDriver) Invalid URL: hdhbase01:5678 org.pentaho.di.core.exception.KettleDatabaseException: Error occured whil
It appears that their messages are limited to a certain number of characters, but I couldn’t find any more useful information in the logs.
If any of you have had (even marginally) better success connecting Pentaho to HBase, let me know in the comments below!