I recently attended Cloudera’s Hadoop Training for Administrators course in Columbia, MD. Over the next few posts, I’ll discuss how it went and what I learned. All 28 chairs in the computer lab were full. There was another Cloudera employee sitting in to review and update the certification exam.
Our trainer was Ian Wrigley (@iwrigley). He has been with Cloudera for about 2 years. He is the current manager for class curriculum and certification materials. He is originally from the UK (still has the accent) and currently lives in LA. Previously, he's taught for about 20 years on several different technology topics:Oracle, Sun, mysql, systems administration, java, c, php. He is the contributing editor for UNIX and open source for PC Pro in the UK.
The class was primarily made up of Linux systems administrators with some systems architects, dbas, and developers sprinkled in. It seemed like a large number of the class worked for or contracted with the government (expected for the location). A couple of people mentioned using Hadoop in the cloud. Several mentioned the need to migrate data from RDBMS into Hadoop. Everyone expressed the need to adequately manage their clusters. Cluster sizes seemed to range widely. I heard sizes of 13, 128, 270, 2000. Ian introduced us what Cloudera offers: CDH distributions, training, consulting, support, enterprise management software.
I asked if they support the entire stack that is shipped with a CDH distribution. Ian told me that if a paying customer finds a bug, Cloudera’s engineers will fix the bug and submit it back to the community.
Why use Hadoop? It solves the fundamental problem of resource bottlenecks on a single machine by distributing workload across a cluster of commodity machines. Hadoop provides the framework, HDFS for distributed file storage and MapR for distributing tasks across a cluster. There was a brief review of the Hadoop history, Google, Doug Cutting, and Nutch.
Here are a few notes on Hadoop architecture and hardware I scribbled down during class.
- All metadata is held in RAM for fast response:
- Each items consumes 150-200 bytes of RAM
- 1 item is the file name metadata (name, permissions, etc.)
- 1 item for each file block.
- Larger file size = less RAM
- HDFS blocks are named blk_xxxxx:
- A checksum file is also created when a file block is written to disk
- Dealing with data corruption:
- As a Datanode reads a block is also calculates checksum.
- Live checksum is compared to old checksum
- If different, the client reads from the next DataNode in the list
- The NameNode is informed of the corrupt block and re-replicates it to another DataNode
- The DataNode also verifies the checksum for blocks every three weeks after the block was created.
- File permissions are very similar to Unix:
- X (execute) is required for directories or you can access files within
- Few people actually implement Keberos, they use firewalls or gateways
- Secondary NameNode isn’t a not a HA or failover NameNode:
- It’s only used to periodically combine the fsimage and edits log.
There are a couple of Hadoop Namenode enhancements in the upcoming CDH4 distribution. CDH4, once released, will add a federated Namenode, which allows two name nodes to answer requests, and HA Secondary Namenode to alleviate the current architecture where the Namenode is a single point of failure.
- In between the Map and Reduce is the shuffle and sort
- This the magic sauce; itsends data between Mappers and Reducers
- Each Mapper processes a single input split from HDFS.
- Often a HDFS block.
- Intermediate data is written to local file system disk.
- Output from Reducers is written back to HDFS.
- If any task fails to report in 10 minutes is assumed to have failed.
Planning your Hadoop cluster.
Ian is in the“Save the money, buy more nodes” camp, stressing quantity over quality of servers.
- On a small 4 node cluster, Master nodes’ daemons can run on a Slave Node
- If the cluster has 10-20 nodes, master noes should be placed on a separate node
- Anything > 20 nodes then they should be on separate boxes.
- Typical base config for a slave node 4 x 1TB HDs:
- JBOD, no spanning, infact avoid raid controllers all together
- 2 quad core CPUs
- 24-32GB Ram
- The factor here is really amount of HDFS blocks for name node RAM and the # of MapR commands that run on slave nodes.
- Multiples of 1 HD, 2 core, 6-8 GB of RAM work for many types of applications
- Hadoop nodes are seldom CPU-bound
- Typically disk and network bound
- Each map or reduce task will take 1GB to 2GB of RAM:
- Ensure you have enough RAM to run all tasks plus overhead for the DN and TT daemons
- Rule of thumb:
- Total number of MR tasks = 1.5x number of cores
- Not definitive, you have to consider RAM
- 3.5 disks, 7200 RPM
- More disk spindles over larger spindles
- Good practice is 24TB per slave max.
- No raid, no virtualization, no blades
- Master nodes (JT, NN) are single points of failure.
- Spend money on these nodes fully redundant.
- Hadoop is network intensive
- Don’t cheap out on hardware
- 1GB on nodes
- Top of rack switches should be 10Gb/sec or faster
- Choose an OS you’re confortable with:
- CentOs widely used
- Cloudera often sees a mixture of the 2.
- No Linux LVM.
- Test diski/o:
- Hd parm –t /dev/sda1
- Looking for greater than 70 MB/sec
- Use a Common directory structure:
- i.e. /data/<n/dfs/nn for the namenode.
- Mount disks with the noatime:
- Disables append file write created every time a file is touched.
- Reduce the swappines of the system:
- Vm.swappiness to 0 or 5 in /etc/sysctl.conf
- This prevents linux from hogging remaining memory for file system cache after all daemons have started causing virtual memory usage.
- Use ext3 or 4 file systems:
- Increase the nofile ulimit for mapred and hdfs users to atleast 32k open files at a time:
- Setting is in /etc/security/limits.conf
- This increases the number of open files to a user.
- Disable IPv6, SELinux
- Use NTP
- Use oracle JDK, 1.6