Hadoop & Flume: Log Crunching

December 15th, 2011

I’ve recently been experimenting with log crunching. Specifically, taking the log files of a handful of Apache servers and stuffing them all in a single location, perchance to make a nicer interface for scanning them for notable errors. Wouldn’t that be nice?

In any case, one of the first steps in conquering this was to set up an area where all the logs would go. One could think of an NFS share or an ever-growing RAID volume formatted with something like XFS or ZFS, but consider: what if instead of a handful of Apache servers, we had a few thousand of them? That RAID setup would fill up very quickly! Enter Hadoop – this system supports MapReduce, where work is mapped out to a cluster of machines and the partial results are reduced back into a combined answer, along with HDFS (the Hadoop Distributed File System).

The Hadoop Elephant Logo

Particularly, I was looking into HDFS for now; this fancy filesystem is distributed in that it spans multiple machines, not to mention multiple disks if you place it atop a RAID setup. Not only will this hold an immense number of log files, it will also replicate them and allow MapReduce jobs to run over them for parsing. Awesome!

I started this by grabbing Cloudera’s Hadoop package. I’ve followed a few papers and articles from Cloudera and they seem pretty dependable – they even have their own GitHub page, so the world can view their source. I snagged their Hadoop binaries and extracted them on my Linux box. I then set up a new hadoop user (via useradd -m in a root terminal) and changed the permissions so that this new hadoop user owned the install directory (a standard chown -R hadoop:hadoop /opt/hadoop/, since I extracted the Hadoop binary package into /opt/hadoop/).
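
For reference, that user-and-permissions dance boils down to a couple of commands run as root (the paths assume the same /opt/hadoop extraction I used):

useradd -m hadoop                     # create the hadoop user with a home directory
chown -R hadoop:hadoop /opt/hadoop/   # hand the extracted install over to that user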

After this, it was surprisingly smooth sailing to a functioning single-server Hadoop node. I created environment variables for my hadoop user by editing its ~/.profile file and adding:

export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin

Afterwards, I sourced the file so that my changes took effect (source ~/.profile). This allowed me to run the Hadoop binaries from my bash shell without having to hop around.
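
A quick way to confirm the PATH change stuck is to ask the binary for its version from any directory (the exact output will vary by distribution):

hadoop version    # should print the Hadoop release info if everything is wired up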

I then edited the “core-site.xml” file under the Hadoop configuration directory (normally $HADOOP_HOME/conf/core-site.xml). This file might not exist yet; if not, just create it. Here’s what I put into mine:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://my.hostname.here:9000</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/mnt/hdfs/data</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/mnt/hdfs/name</value>
  </property>
</configuration>

After setting this up, I then had to create the directories that our HDFS storage would be using and give them the correct permissions, so from a root terminal:

mkdir -p /mnt/hdfs/data
mkdir -p /mnt/hdfs/name
chown -R hadoop:hadoop /mnt/hdfs

Alright! So now we have all of our directories and configuration files taken care of… now just to format HDFS and start it up! So, switch to the hadoop user and run:

hadoop namenode -format

This formats our HDFS storage (namely, it initializes the namenode metadata under /mnt/hdfs/name). You can imagine that this is like formatting your hard disk with a filesystem. The last piece of the puzzle, of course, is starting the HDFS service so that our files can be replicated and all that good stuff, so again as the hadoop user:

start-dfs.sh

This script lives in the $HADOOP_HOME/bin/ directory, just in case you have trouble finding it.
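
As a quick sanity check, you can push a file into HDFS and list it back once the daemons are up; something along these lines should work (the paths are just examples):

hadoop fs -mkdir /logs                               # create a directory inside HDFS
hadoop fs -put /var/log/apache2/error.log /logs/     # copy a local log file into it
hadoop fs -ls /logs                                  # list it back to confirm the write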

Now, onto Flume. Sadly, this is where things started to crumble a bit. I won’t go through a guide for it since my attempt remains unsuccessful and I don’t want to lead anyone into a brick wall.

Flume’s core purpose is piping data around. My setup specifically had 3 layers: Apache logs -> aggregator -> Hadoop storage. I’d have a few (let’s pretend tens of thousands, but in reality only a handful) Apache servers running, writing bits of information into their error and access logs. A Flume process running on each Apache server would simply tail the error logs at first (with the access logs added later, once errors were working) and pipe them to the aggregator. The aggregator Flume process would, in turn, take all of these logs and drop them into the Hadoop storage layer, allowing me to hoard them forever and ever; a single node would be dedicated to this task. Sounds pretty simple, right?

Well, not so much. Flume’s configuration, surprisingly, wasn’t too difficult once you wrapped your head around the idea of sources and sinks (sources being the origin of the data and sinks being the destination). You could easily pipe sources and sinks of data together with Flume processes. The Flume processes running on the Apache servers were happily sending data to the aggregator, which was perfect. However, I think I hit what is a Flume bug: FLUME-757, a race condition that makes Flume spit out null usage errors. This put a rut in my plan, although it might have simply been user error (however, I am pretty sure it was the bug ;) ).

Further developments will involve trying out fluentd, a similar data-piping application, in place of Flume to see if it provides the aggregation & piping functionality I need. Needless to say, I’ve gotten my HDFS/Hadoop layer working; now I simply have to either find the right tool or work out the bugs & kinks in the actual piping of data. Hopefully I’ll be able to make another post, or an update to this one, with great success soon!

How to run ZFS on Linux via FUSE

July 10th, 2010

So today I decided it was time for me to research the mythical ZFS filesystem. My curiosity here is due to my interest in building a large multi-disk Linux system in the near future.

I started by creating a new virtual machine within VirtualBox, Oracle’s free virtualization application. I created 7 virtual disks: one 8 GB disk for the main OS and six 2 GB disks, which I would test ZFS on. Afterward, I proceeded to install a standard stable Debian system (sans the desktop environment) on the 8 GB disk. Once Debian booted up, it was time to get ZFS installed.

First step was to simply pull the ZFS FUSE module’s source down by doing the following:

wget http://zfs-fuse.net/releases/0.6.9/zfs-fuse-0.6.9.tar.bz2
tar -jxf zfs-fuse-0.6.9.tar.bz2
rm -rf zfs-fuse-0.6.9.tar.bz2

This provides a nice folder containing the ZFS FUSE module source code, amongst a few other things. Now, to take care of a few dependencies and required programs to build said module. I ran the following command to install glibc, zlib, fuse, aio, scons, libssl, and attr:

sudo aptitude install glibc-2.7-1 zlib1g-dev zlibc libfuse-dev libaio-dev scons libssl-dev attr-dev

Now that I finally had the dependencies and required programs for the module, I went about building it:

cd zfs-fuse-0.6.9/src/
scons
scons install

You can think of scons as being similar to make, so in this step I simply compiled the module, then installed it. Surprisingly simple. Make sure that you run at least the scons install command as a root (or sudo-ed) user.

Now, the only step left is to make sure that we automatically load the FUSE module and that the ZFS FUSE daemon automatically starts & mounts our ZFS pools on boot. To do this, I went through the following commands:

cd ../contrib/
echo "fuse" >> /etc/modules
cp zfs-fuse.initd.ubuntu /etc/init.d/zfs-fuse
update-rc.d zfs-fuse defaults

Keep in mind that all of these commands should be run as a root (or sudo-ed) user, save for the first. The first command simply changes directories, while the second adds the fuse module to the list loaded automatically at boot. The third copies the provided script that starts the ZFS FUSE daemon on Ubuntu – and since Ubuntu is based upon Debian, I figured it would work on a Debian system, which it did. The final command simply adds the ZFS FUSE init script to our boot process.
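
One more note: before creating any pools, the ZFS FUSE daemon itself has to be running, so either reboot or kick it off by hand using the init script we just installed:

/etc/init.d/zfs-fuse start    # start the daemon now rather than waiting for a reboot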

Now we get to the meat and potatoes: creating our ZFS pool. Run the following command to make a single logical volume from the 6 disks we created earlier:

zpool create tank raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg -m /tank

This creates a 6-disk logical volume named “tank” and mounts it as /tank (you can obviously go with almost any mount point or naming scheme you want). Notice that I used /dev/sdb and so on as my drives – these may differ depending on how you set up your virtual hardware. One special keyword you will see is raidz2; this means we are creating a logical volume using the RAIDZ2 technique, which keeps two blocks of parity per stripe so the pool can survive two failed disks. With the current version of ZFS, one can utilize RAIDZ1, RAIDZ2, and even RAIDZ3, the number specifying how many disks’ worth of parity are kept. Additionally, there is also basic mirroring and striping support.
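
For comparison, the other layouts mentioned above are created with the same command – only the redundancy keyword changes. A rough sketch (the device names are placeholders, and I haven’t tested these particular lines here):

zpool create tank mirror /dev/sdb /dev/sdc            # two-disk mirror
zpool create tank raidz1 /dev/sdb /dev/sdc /dev/sdd   # single parity, survives one failed disk
zpool create tank /dev/sdb /dev/sdc /dev/sdd          # plain striping, no redundancy at all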

With that single command, I had a working, thriving ZFS setup! I was floored at how simple the actual creation of the ZFS volume was after installing the module. I then checked the status of my ZFS pool to see each disk and the size of the logical volume:

zpool status tank

This command will show the status of each RAIDZ and disk.

zpool list tank

This command will show the size and usage of each pool. For the one I created, it displayed 11.9 GB (note that zpool list reports raw capacity, parity included, which is why it is close to 6 × 2 GB). I then went through a scenario: what if I had 3 disks in a RAIDZ2, then wanted to add 3 more? After a bit of research, it seems there is work toward letting ZFS expand existing RAIDZ configurations, but currently no such feature exists. Thus, a second RAIDZ must be added:

zpool create tank raidz2 /dev/sdb /dev/sdc /dev/sdd -m /tank
zpool add tank raidz2 /dev/sde /dev/sdf /dev/sdg

The first command of course creates a pool with 3 disks. The second command adds another RAIDZ2 set with 3 additional disks. Checking the status of the pool now shows two RAIDZ2 sets of 3 disks each, and checking the list again, the reported size remained the same (11.9 GB).

I then wanted to simulate a disaster – a destroyed disk:

/etc/init.d/zfs-fuse stop
dd if=/dev/zero of=/dev/sdc bs=1M
dd if=/dev/zero of=/dev/sdf bs=1M
/etc/init.d/zfs-fuse start

This basically nukes two drives (one per RAIDZ2 set). Checking the status shows each of those disks as “unavailable” due to corrupted data, which is expected. Now, since we know the two virtual drives are actually in working order, we can simply tell ZFS that we have replaced the “bad” drives with good ones by running the following:

zpool replace tank /dev/sdc
zpool replace tank /dev/sdf

This has ZFS start rebuilding the RAIDZ2 sets – perfect! Alternatively, you can have ZFS verify the pool’s data against its checksums and repair any inconsistencies by running the following:

zpool scrub tank

From these experiments, it seems ZFS is an excellent solution for software RAID – so much so that I am not sure I will be going back to mdadm anytime soon. On the other hand, the XFS filesystem is also said to handle large volumes well, but for now I can say that ZFS is simple, yet powerful.

Android Threads

March 3rd, 2010


For the past few weeks, I have been looking into development for the Android mobile device operating system. Development for this operating system requires the use of the Java programming language with additional libraries provided by the publicly available Android SDK. As such, I’ve ventured back into the world of Java, as well.

Hopping into Android development is very easy by utilizing the provided Android Tutorials. However, there are a few tricks here & there that one does not easily find, which brings me to the current blog post.

One would expect threading in Android to be identical to how one creates and handles threads in plain Java. However, there are a few rules one must obey in Android that are not required in Java. For example, the following Java class displays a simple text field that a thread updates with the current UNIX timestamp (in milliseconds):

import javax.swing.JFrame;
import javax.swing.JTextField;

import java.awt.event.WindowAdapter;
import java.awt.event.WindowEvent;

public class TestJava extends JFrame implements Runnable {

	private JTextField textField = null;
	private boolean runThread = true;

	public TestJava() {
		this.textField = new JTextField();
		this.add(this.textField);
		this.setTitle("Testing Java");
		this.addWindowListener(new WindowAdapter() {
			public void windowClosing(WindowEvent we) {
				runThread = false;
				System.exit(0);
				return;
			}
		});
		Thread counterThread = new Thread(this);
		counterThread.start();
		this.pack();
		this.setVisible(true);
		return;
	}

	public void run() {
		while(runThread) {
			this.textField.setText("" + System.currentTimeMillis());
		}
		return;
	}

	public static void main(String[] args) {
		TestJava test = new TestJava();
		return;
	}
}

This is pretty straightforward – we modify the JTextField from within the run function. In Android, however, only the thread that owns the UI can update it, which causes a few complications. The following Android class performs (roughly) the same as the prior class did with the standard Java libraries:

package com.andrewmkane.test;

import android.app.Activity;
import android.os.Bundle;
import android.os.Handler;
import android.os.Message;
import android.widget.TextView;

public class ThreadActivity extends Activity implements Runnable {

	private TextView textview = null;
	private boolean performThread = false;

	protected void onCreate(Bundle savedInstanceState) {
		super.onCreate(savedInstanceState);
		textview = new TextView(this);
		this.setContentView(textview);
		return;
	}

	protected void onStart() {
		super.onStart();
		this.performThread = true;
		Thread counterThread = new Thread(this);
		counterThread.start();
		return;
	}

	protected void onStop() {
		this.performThread = false;
		super.onStop();
		return;
	}

	public void run() {
		while(performThread) {
			handler.sendEmptyMessage(0);
		}
		return;
	}

	private Handler handler = new Handler() {
		public void handleMessage(Message msg) {
			textview.setText("" + System.currentTimeMillis());
			return;
		}
	};
}

In the class above, you’ll notice that instead of setting the text of the TextView (akin to a JTextField) directly, we send a message to the private Handler object, which sets the value. This is done because the run function executes on a separate thread, while the Handler object is bound to the primary UI thread.

This way, our secondary thread sends a message to the Handler living on the primary thread. The primary thread, which has access to the UI, is then able to modify the TextView without being denied access, whereas the secondary thread (within the run function) would have been. It’s a bit convoluted, but it makes sense – limiting the UI to a single thread lets developers run work in the background without accidentally mucking up the UI, all the while making sure the UI has a dedicated thread to keep performance decent.

Parsing QR Codes

April 16th, 2009

For a class I am currently taking, we started to explore various ways of ensuring that a user has visited, or is visiting, a certain location using a common cellular phone. Our first thought was to utilize the phone’s geolocation, but we quickly realized the number of legal issues that would probably arise from such an endeavor. We then thought of an alternative: pictures! Most modern cellular phones have a camera built in and, in the worst case, a user surely has a camera of some sort at their disposal. Our idea was to utilize QR Codes to ensure that a user had visited, or is currently visiting, a certain location.

Good Humorous Example of a QR Code

QR Codes contain data in a 2D matrix, unlike our everyday barcodes, which are simply 1D. QR Codes can therefore contain more data – and they simply look much cooler! Our plan is to encode a brief message into a QR Code and place it at the location in question; users with camera-enabled phones can take a picture and forward it to us via MMS or e-mail, while everyone else can photograph it with an ordinary camera and upload it later.
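
As for producing the codes in the first place, any encoder will do; for example, the qrencode command-line tool (just an illustration – not necessarily what we will end up using for the class) can bake a message straight into a PNG:

qrencode -o checkpoint.png "You were at the library"    # hypothetical filename and message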

The issue after this, however, was how to verify that their QR Codes were correct from a technology standpoint. I figured that my current LAMP server would serve as an ideal testing ground for this, and I slowly put together a sample QR Code parser. Oddly enough, I was completely unable to find a simple PHP QR Code decoder/parser. Instead of reinventing the wheel, so to speak, I found an excellent Japanese QR Code decoder written in Java. Utilizing this, I was able to rig my PHP script up for excellent QR Code parsing! The fancy script I ended up with is below.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
	<title>QR Code Testing</title>
	<meta http-equiv="content-type" content="text/html; charset=utf-8" />
</head>
<body>
	<div id="qrcode_text">
<?php
	if(isset($_FILES["qrcode"]) && isset($_FILES["qrcode"]["name"]) && strlen($_FILES["qrcode"]["name"]) > 0) {
		$img_file = mt_rand(0000, 9999) . $_FILES["qrcode"]["name"];
		if(move_uploaded_file($_FILES["qrcode"]["tmp_name"], "./temp/" . $img_file)) {
			$output = shell_exec("java -jar qrcode-cui.jar ./temp/" . $img_file . " 2>/dev/null");
			exec("rm -rf ./temp/" . $img_file);
			if(!$output || strlen($output) <= 0)
				echo "<b>Error</b><br/>It seems the QR Code was invalid or did not have any content";
			else
				echo "<b>QR Code Content</b><br/>" . $output;
		}
	} else if(isset($_POST["qrcode_url"])) {
			$qr_url = str_replace("`", "\`", str_replace("\\", "\\\\", str_replace(";", "", str_replace("&", "\&", $_POST["qrcode_url"]))));
			$output = shell_exec("java -jar qrcode-cui.jar \"" . $qr_url . "\" 2>/dev/null");
			if(!$output || strlen($output) <= 0)
				echo "<b>Error</b><br/>It seems the QR Code was invalid or did not have any content";
			else
				echo "<b>QR Code Content</b><br/>" . $output;
	}
?>
	</div>
	<div id="qrcode_form">
		<form action="#" method="post" enctype="multipart/form-data">
			<p style="display: none;"><input type="hidden" name="MAX_FILE_SIZE" value="2048000" /></p>
			<p><label for="qrcode">QR Code Image</label> <input type="file" id="qrcode" name="qrcode" /></p>
			<p><b>OR</b></p>
			<p><label for="qrcode_url">QR Code Image URL</label> <input type="text" id="qrcode_url" name="qrcode_url" /></p>
			<p><input type="submit" value="Check QR Code" /></p>
		</form>
	</div>
	<div>
		<p>Don't know what a QR Code is? Here's an example!</p>
		<img src="./qrcode_test.png" alt="A Sample QR Code" width="268" height="268" />
		<p>QR Codes are 2D barcodes! They hold much more information than our common 1 dimensional barcodes</p>
	</div>
</body>
</html>

So far, so good. If you have any questions or comments, feel free to comment below! You are also free to use the source code I have provided.

Google Maps hits a baby deer

January 29th, 2009

It seems that the images have since been removed from Google Maps, but while they were available they were luckily cached by the public, showing a Google Maps vehicle accidentally hitting a deer.

Judging by the mismatched markings on the road, the van pulled off to the side afterward. Horrible, but also a bit comical.

I’ve always been amused by the various things you can find on Google Maps, such as the large KFC ad:
http://www.youtube.com/watch?v=fsH4gws35ro

There are also a few car accidents, as well as breaks at the gas station, and my friend Zack.

E-Mail Transfer via Thunderbird

January 29th, 2009

Recently my website host, who will go unnamed for now, has been having immense trouble: their e-mail servers keep having issues and going down, and on top of that they had a security problem that forced every user’s password to be reset – incredibly annoying.

So, I decided it was time to move my website and e-mail from their servers to an alternate one. Backing up the website was no issue using an FTP client, of course, and backing up the MySQL tables wasn’t too much of an issue either (other than one MySQL table belonging to Jack that turned out to be enormous).

Then, it came to porting the e-mail from the current server to the new one, which is where our fun starts! I figured I might as well check to see if the raw mail files were available via FTP (as they were a few years ago). Turns out, not so much anymore; a tech support representative explained to me that root-level access would be required to obtain such files. Oh lovely! Therefore, it was up to me to figure out a way to transfer all of these e-mails.

I scoured the internet looking for ideas and possibilities, most of which featured fetchmail as a simple way to back mail up locally to an mbox file. In the end, it was my e-mail client, Thunderbird, that came to my rescue.

I created two accounts within Thunderbird – one for my current account and another for the account I will be switching to, both using IMAP (so that the e-mails are stored server-side as well as client-side, which I’ll get to). After this, I went to the inbox of my current account, selected all of the e-mails (Ctrl+A is extremely helpful), right-clicked, selected “Copy To”, and chose my new account’s inbox.

It takes quite a while, depending on the number of e-mails your current account has (in my case over 8,000, so it took quite a while). Amazingly, it worked flawlessly; all of my e-mails were moved very easily into the new account’s inbox and, because I was using IMAP for my new account, the e-mails were also placed on the server with the correct timestamps and senders!

Now that I am using a new e-mail server, I no longer have to worry about random downtime or other various issues and can simply use my e-mail like normal.

Pidgin Surveys Users

January 7th, 2009

The Pidgin Pigeon

Recently I noticed that the version of Pidgin I was using was mildly out of date – using version 2.5.2 instead of the current 2.5.3; not a major issue, but I figured there would be no harm in updating. Interestingly enough, when on the Pidgin website, I noticed that Pidgin was holding a survey of their current users.

Normally I am very apathetic towards most surveys, as they are usually marketing tools for pushing more of a product. Pidgin’s survey, however, contained questions about the current setup, layout, and configuration of Pidgin, as well as about additional features that users wanted.

I would highly recommend that everyone who uses Pidgin take the survey (located here) to help the developers decide how to continue with the application. I, for instance, voted strongly for the voice & video features, as those are the only two features I regularly switch to other messengers for (usually Skype).

In any case, I hope that the Pidgin developers use this survey to prioritize features and continue giving users the best instant messaging program by far.

Edit: Pidgin has posted survey results at http://pidgin.im/survey/results/

nForce’s RAID Disappointment

January 6th, 2009

When planning my new system over the summer, I was determined to utilize a RAID 5 setup. Having three 750 GB drives, I figured that a RAID 5 setup would yield plenty of storage space (two drives’ worth of capacity, approximately 1396 GB as reported) and also provide parity, protecting my data in case one drive failed. In addition, the RAID 5 array would provide faster read speeds, since the data is split between multiple drives. This setup worked flawlessly and was quite simple to use.

All was well until I decided to upgrade my storage capacity by purchasing an additional 750 GB drive (making a total of 4 drives). Utilizing 4 drives in the same RAID 5 setup would provide approximately 2095 GB of storage space. However, this is where nForce’s RAID began to turn on me.

Using the nVidia Control Panel’s RAID functionality, I figured I would be able to expand my RAID 5 array – simply add the new drive and let the RAID rebuild itself over the course of a day or so. This, however, could not be done; the only option I was given was to convert my array from RAID 5 to RAID 0+1 – not really what I was going for.

I contacted nVidia technical support to ask about this issue; their response was that I should contact MSI, since my nVidia nForce BIOS was too old (version 6, while the newest is version 9). In turn, I contacted MSI about the nForce BIOS, only to be told that no newer version had been provided for nForce. I was caught between two companies in a never-ending customer support referral loop.

In the end, I threw caution to the wind, backed up my plethora of data, and simply recreated the RAID array. However, this also caused additional problems, as nForce’s RAID setup only allows one to create a 2 TB array at maximum – lovely. I figured I’d give it a shot anyway and booted into Windows (thankfully installed on a separate drive). Amazingly, Windows saw all 2095 glorious GB of space and all was almost well.

After rebooting, I have discovered that every few reboots I have to recreate my RAID array by deleting it and recreating it in the MediaShield BIOS (which does not clear any data stored on the array); this somehow lets the MediaShield BIOS shrug off the 2 TB limit and allows Windows to utilize all available space on the RAID array.

In the end, my solution is definitely not perfect and is severely limited by nForce’s poor support for modern RAID arrays. My best option, for now, is to simply not reboot unless I am near my computer and able to recreate the RAID array “just in case”. I sincerely hope that nVidia releases new patches for the nForce chipsets to solve this issue in the near future.