Hadoop & Flume: Log Crunching

Thursday, December 15th, 2011

I’ve recently been experimenting with log crunching. Specifically, taking the log files of a handful of Apache servers and stuffing them all in a single location, perchance to make a nicer interface for scanning them for notable errors. Wouldn’t that be nice?

One of the first steps in conquering this was to set up an area where all the logs would go. One could think of an NFS share or an ever-growing RAID volume running XFS or ZFS, but let's think: what if instead of a handful of Apache servers, we had a few thousand of them? That RAID setup would fill up very quickly! Enter Hadoop: this system supports MapReduce functionality, where a job is mapped out to multiple machines and the partial results from each are reduced back into a single answer, and HDFS (the Hadoop Distributed File System).

The Hadoop Elephant Logo

In particular, I was looking into HDFS for now; this fancy filesystem is distributed in that it spans multiple machines, and even multiple disks if you place it atop a RAID setup. Not only will this store an immense number of log files, it will also replicate them and allow MapReduce jobs to parse them. Awesome!

I started this by grabbing Cloudera's hadoop package. I've followed a few papers and articles done by Cloudera and they seem pretty dependable, especially since they have their own github page so the world can view their source. I snagged their hadoop binaries and extracted them on my linux box into /opt/hadoop/. I then set up a new hadoop user (via useradd -m in a root terminal) and changed the permissions so that this new hadoop user owned said directory (a standard chown -R hadoop:hadoop /opt/hadoop/).
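
Pulled straight out of that paragraph, the user setup amounts to just two commands from a root terminal (assuming, as above, that the tarball was extracted into /opt/hadoop/):

# create a dedicated hadoop user with a home directory
useradd -m hadoop

# give that user ownership of the extracted Hadoop tree
chown -R hadoop:hadoop /opt/hadoop/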

After this, it was surprisingly smooth sailing to a functioning single-node hadoop setup. I created environment variables for my hadoop user by editing its ~/.profile file and adding:

export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin

Afterwards, I sourced the file so that my changes took effect (source ~/.profile). This allowed me to run the hadoop binaries from my bash shell without having to hop around.
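
As a quick sanity check of my own (not part of the original steps, but handy), you can confirm the shell is picking everything up:

# should print /opt/hadoop
echo $HADOOP_HOME

# should print the installed Hadoop version if the PATH change took effect
hadoop version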

I then edited the “core-site.xml” file under the hadoop configuration directory (normally $HADOOP_HOME/conf/core-site.xml). This file might not exist yet, but no worries; just create it. Here’s what I put into mine:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://my.hostname.here:9000</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/mnt/hdfs/data</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/mnt/hdfs/name</value>
  </property>
</configuration>
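
A side note in case your Hadoop release differs from mine: in the split-configuration layout, the dfs.* properties conventionally live in hdfs-site.xml rather than core-site.xml. The setup above worked for me as written, but the equivalent placement would look like this (same values, different file):

<!-- $HADOOP_HOME/conf/hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/mnt/hdfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/mnt/hdfs/data</value>
  </property>
</configuration>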

After setting this up, I then had to create the directories that our HDFS storage would be using and give them the correct permissions, so from a root terminal:

mkdir -p /mnt/hdfs/data
mkdir -p /mnt/hdfs/name
chown -R hadoop:hadoop /mnt/hdfs

Alright! So now we have all of our filesystems and configuration files taken care of… now just to format the HDFS storage and start it up! So, switch to the hadoop user and run:

hadoop namenode -format

This will format our HDFS storage (more precisely, it initializes the namenode metadata directory, /mnt/hdfs/name in our setup). You can imagine that this is like formatting a hard disk for a filesystem. The last piece of the puzzle, of course, is starting the HDFS daemons so that our files can be replicated and all that good stuff, so again as the hadoop user:

start-dfs.sh

This script is stored in the $HADOOP_HOME/bin/ directory, just in case you are unable to find it.
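
With the daemons running, a quick way to poke at the new filesystem is the hadoop fs command (my own sanity check, with example paths, rather than part of the original steps):

# list the root of the distributed filesystem (probably empty at first)
hadoop fs -ls /

# make a directory for incoming logs, then copy a local file in and read it back
hadoop fs -mkdir /logs
hadoop fs -put /etc/hostname /logs/hostname-test
hadoop fs -cat /logs/hostname-test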

Now, onto Flume. This is where things started to crumble a bit, sadly. I won’t write up a guide for this part, since my attempt remains unsuccessful and I don’t want to lead anyone into a brick wall.

Flume’s core job is piping data around. My setup specifically had three layers: Apache logs -> aggregator -> hadoop storage. I’d have a few (let’s pretend tens of thousands, but in reality only a handful of) apache servers running, writing bits of information into their error and access logs. A flume process running on each apache server would simply tail the error logs at first (with the access logs added later, once errors were working) and pipe them to the aggregator. The aggregator flume process would, in turn, take all of these logs and drop them into the hadoop storage layer, allowing me to hoard them forever and ever; a single node would be dedicated to this task. Sounds pretty simple, right?

Well, not so much. Flume’s configuration, surprisingly, wasn’t too difficult once I wrapped my head around the idea of sources and sinks (sources being the origin of the data, sinks being the destination); you chain flume processes together by piping data from sources into sinks. The flume processes running on the Apache servers were happily sending data to the aggregator, which was perfect. However, I think I hit what appears to be a flume bug: FLUME-757, a race condition that makes Flume spit out null usage errors. This put a rut in my plan, although it might simply have been user error (however, I’m pretty sure it was the bug ;)).
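
For the curious, the source/sink wiring I was aiming for looked roughly like the following, written in the old Flume (OG) dataflow style. Treat it purely as an illustration of the source/sink idea and not as a working recipe: this is exactly the stage I never got past, and the node names, hostnames, port, and log paths here are placeholders of my own. On each apache server, tail the error log and forward it to the aggregator:

apache-node : tail("/var/log/apache2/error.log") | agentSink("aggregator.hostname.here", 35853)

And on the aggregator, collect those events and sink them into HDFS:

aggregator : collectorSource(35853) | collectorSink("hdfs://my.hostname.here:9000/logs/", "apache-")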

The next development will be trying out fluentd, a similar data-piping application, in place of flume to see if it provides the aggregation & piping functionality I need. Needless to say, I’ve been able to get my HDFS Hadoop layer working; now I simply have to either find the right tool or work out the bugs & kinks in the actual piping of the data. Hopefully I’ll be able to make another post, or an update to this one, with great success soon!