I recently attended Cloudera’s Hadoop Training for Administrators course in Columbia, MD. You can read my recap of the first day here. In the second day, we learned how to deploy a cluster, install and configure Hadoop on multiple machines, and manage and schedule MapReduce jobs.

1. Deploying your cluster:

  • There are 3 different deployment types:
    • Local (dev)
      • No daemons, data is stored on the local disk no HDFS, everything runs in a single JVM
    • Psuedo distributed (dev/debugging b4 production)
      • All daemons run locally individual JVMs
      • Can be thought of as a single machine cluster
    • Cluster (production):
      • Hadoop daemons run a cluster of machines.

2. Installing Hadoop on multiple machines:

  • Don’t manually install linux on every node.
  • Use some kind of automated deployment. (i.e. RedHat Kickstart)
  • Build a standard slave machine image:
    • Reimage a machine rather than troubleshooting software issues
    • CDH is avalaible in multiple formats (packages and tarballs). Packages are recommended because they include some features not in the tarball.
  • Startup scripts can also install the hadoop native libraries. They provide better performance for some Hadoop components.
    • Hadoop includes scripts called start-all.sh and stop-all.sh. These connect to TTs and DNs to start and stop daemons. Don’t use these scripts, as they require passwordless SSH access to slave nodes.
    • Hadoop does not use SSH for any internal communications.
    • You can verify installation with sample jobs shipped with Hadoop
    • Use CD Manager tool for easy deployment and configuration
  • Hadoop Configuration Files:
    • Each machine in the cluster has its own configuration files that reside in /etc/hadoop/conf
    • Primary files are written in XML
    • .20 onwards configurations have been separated out based on functionality
      • Core properties: core-site.xml
      • HDFS properties: hdfs-site.xml
      • MapReduce properties: mapred-site.xml
      • Hadoop-env.sh sets some environment variables.
    • A lot of the default values are performance stale because they are based on a couple year old hardware specs.
    • Best configuration practices are still in the works b/c of the age of Hadoop. Cloudera hopes to have a guide out sometime this year.
  • Rack awareness:
    • Distributes HDFS blocks based on a host’s location
    • Important for cluster reliability, scalability, and performance.
    • The default setting places all nodes in a single rack /default-rack
    • Script can use a flat file, database, etc.
    • A common scenario is the name your hosts back on rack location so a script can simple deconstruct the name to find the host’s location.
    • You can use IP addresses of names to indentify nodes in Hadoop’s configuration file.
    • Most people use names rather than IPs
    • Cluster daemons generally need to be restarted to read in configuration file changes
  • Configuration Management tools:
    • Use configuration management software to manage machine configuration changes at once
    • Start early; retrofitting these changes is always a pain.
    • There are alternatives to using Cloudera Manager like Puppet and Chef

3. Managing and Scheduling jobs

  • You can control MapReduce jobs at the command line or through the web interface. We went over two job schedulers:
    • FIFO is the default scheduler shipped with Hadoop. With FIFO, Jobs run in order submitted, but this is generally a bad idea when multiple people are using the cluster. The only exception to this is when the developer specifies the priority, but if all jobs are given the same priority then you get the idea.
    • Fair Scheduler is shipped with CDH. Fair scheduler is designed for multiple users to share the cluster simultaneously. MapReduce tasks are split up into different pools and are allocated based on pool configuration parameters.