Archive for the ‘RAID’ Category

Hadoop & Flume: Log Crunching

Thursday, December 15th, 2011

I’ve recently been experimenting with log crunching. Specifically, taking the log files of a handful of Apache servers and stuffing them all in a single location, perchance to make a nicer interface for scanning them for notable errors. Wouldn’t that be nice?

Nonetheless, one of the first steps in conquering this was to set up an area where all the logs would go. One could think of an NFS share or an ever-growing RAID volume running XFS or ZFS, but let’s think: what if instead of a handful of Apache servers, we had a few thousand of them? That RAID setup would fill up very quickly! Enter Hadoop. This system supports Map-Reduce, where a job is split (“mapped”) across many machines that each compute one part, and the partial results are then combined (“reduced”) into a final answer. It also provides HDFS (the Hadoop Distributed File System).

The Hadoop Elephant Logo

For now, I was looking particularly at HDFS. This fancy filesystem is distributed, in that it spans multiple machines, not just the multiple disks you would get by placing it atop a RAID setup. Not only will it store an immense number of log files, it will also replicate them and allow map-reduce jobs to parse them. Awesome!
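
To make the map-reduce idea a little more concrete, here is a single-machine analogy built from ordinary shell tools. It is not Hadoop code, only a sketch of the same map, group, and reduce steps. It assumes Apache's combined log format, where the HTTP status code is the ninth whitespace-separated field, and the function name is made up for illustration.

```shell
#!/bin/sh
# Single-machine analogy of map -> shuffle -> reduce; this is NOT Hadoop itself.
# Assumption: combined log format, HTTP status is the 9th whitespace-separated field.
count_status_codes() {
    awk '{ print $9 }' |         # "map": emit one key (the status code) per log line
        sort |                   # "shuffle": bring identical keys next to each other
        uniq -c |                # "reduce": count each group of identical keys
        awk '{ print $2, $1 }'   # tidy the output as "<status> <count>"
}
```

Feeding it an access log on standard input (for example `count_status_codes < access.log`, where the file name is a placeholder) prints one line per status code with its count. Hadoop does the same kind of work, except that the map and reduce steps run on many machines at once.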

I started this by grabbing Cloudera’s Hadoop package. I’ve followed a few papers and articles by Cloudera and they seem pretty dependable, especially since they even have their own GitHub page so the world can view their source. So I snagged their Hadoop binaries and extracted them on my Linux box. I then set up a new hadoop user (via useradd -m hadoop in a root terminal) and changed the permissions so that this new hadoop user owned said directory (standard chown -R hadoop:hadoop /opt/hadoop/, since I extracted the Hadoop binary package into /opt/hadoop/).

After this, it was surprisingly smooth sailing to a functioning single-server Hadoop node. I created environment variables for my hadoop user by editing its ~/.profile file and adding:

export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin

Afterwards, I sourced the file again so that my changes took effect (source ~/.profile). This allowed me to run the Hadoop binaries from my bash shell without having to hop around.
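
A quick way to confirm that the profile change took effect is to check that the Hadoop bin directory really is on the PATH. The sketch below is only one way of doing that; both function names are hypothetical helpers, not part of Hadoop.

```shell
#!/bin/sh
# Returns 0 if directory $1 appears as an entry in the colon-separated list $2.
path_contains() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Hypothetical helper: report whether the Hadoop bin directory is reachable.
check_hadoop_env() {
    if [ -z "${HADOOP_HOME:-}" ]; then
        echo "HADOOP_HOME is not set"
        return 1
    fi
    if ! path_contains "$HADOOP_HOME/bin" "$PATH"; then
        echo "$HADOOP_HOME/bin is not on PATH"
        return 1
    fi
    echo "ok: $HADOOP_HOME/bin is on PATH"
}
```

Running check_hadoop_env as the hadoop user after sourcing ~/.profile should print the "ok" line; anything else suggests the exports were not picked up.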

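I then edited the “core-site.xml” file under the Hadoop configuration directory (normally $HADOOP_HOME/conf/core-site.xml). This file might not exist, but no worries; just create it. Here’s what I put into mine (the two dfs.* properties are conventionally kept in hdfs-site.xml in the same directory, but this is the layout I used):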

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://my.hostname.here:9000</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/mnt/hdfs/data</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/mnt/hdfs/name</value>
  </property>
</configuration>
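
Since a typo in this file is easy to miss, it can help to read a value back before formatting anything. The helper below is a rough sketch rather than a real XML parser: it assumes each <name> and <value> sits on its own line, as in the file above, and the function name is made up.

```shell
#!/bin/sh
# Hypothetical helper: print the <value> that follows a given <name>
# in a Hadoop-style XML config file.
# Assumption: every <name> and <value> element is on its own line.
get_hadoop_property() {
    # $1 = property name, $2 = path to the XML file
    awk -v want="$1" '
        /<name>/  { name = $0;  gsub(/.*<name>|<\/name>.*/, "", name) }
        /<value>/ { value = $0; gsub(/.*<value>|<\/value>.*/, "", value)
                    if (name == want) { print value; exit } }
    ' "$2"
}
```

For example, `get_hadoop_property dfs.name.dir "$HADOOP_HOME/conf/core-site.xml"` should print /mnt/hdfs/name with the configuration shown above.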

After setting this up, I had to create the directories that our HDFS storage would be using and give them the correct ownership, so from a root terminal:

mkdir -p /mnt/hdfs/data
mkdir -p /mnt/hdfs/name
sudo chown -R hadoop:hadoop /mnt/hdfs

Alright! So now we have all of our directories and configuration files taken care of… now just to format HDFS and start it up! So, switch to the hadoop user and run:

hadoop namenode -format

This will format our HDFS storage (namely, the namenode metadata under /mnt/hdfs/). You can imagine that this is like formatting your hard disk for a filesystem. The last piece of the puzzle, of course, is running the HDFS service so that our files can be stored, replicated and all that good stuff, so again as the hadoop user:

start-dfs.sh

This script is stored in the $HADOOP_HOME/bin/ directory, in case you are unable to find it.
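
Once the daemons are up, a small round trip is a reasonable sanity check: create a directory in HDFS, upload a file, and list it back. The sketch below uses the standard hadoop fs -mkdir, -put, and -ls subcommands; the function name, the HADOOP_CMD override, and the /smoke-test path are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical smoke test: write a small file into HDFS and list it back.
# HADOOP_CMD defaults to "hadoop"; /smoke-test is an arbitrary example path.
hdfs_smoke_test() {
    hadoop_cmd=${HADOOP_CMD:-hadoop}
    tmpfile=$(mktemp) || return 1
    echo "hello hdfs" > "$tmpfile"
    # Stop at the first step that fails and remember its exit status.
    "$hadoop_cmd" fs -mkdir /smoke-test &&
        "$hadoop_cmd" fs -put "$tmpfile" /smoke-test/hello.txt &&
        "$hadoop_cmd" fs -ls /smoke-test
    status=$?
    rm -f "$tmpfile"
    return "$status"
}
```

Run hdfs_smoke_test as the hadoop user. If the listing shows hello.txt, the namenode and datanode are talking to each other.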

Now, onto Flume. This is where things started to crumble a bit, sadly. I won’t go through a guide for this since my attempt remains unsuccessful and I don’t want to guide anyone into a brick wall.

Flume’s core is that it is able to pipe data. My setup specifically had 3 layers: Apache logs -> aggregate -> Hadoop storage. I’d have a few (let’s pretend tens of thousands, but in reality only a handful) Apache servers running, each writing to its error and access logs. A Flume process running on each Apache server would simply tail the error logs at first (with the access logs added later, once I got errors working) and pipe that into the aggregate. The aggregate Flume process, running on a single node dedicated to this task, would in turn take all of these logs and drop them into the Hadoop storage layer, allowing me to hoard them forever and ever. Sounds pretty simple, right?

Well, not so much. Flume’s configuration, surprisingly, wasn’t too difficult once you wrapped your head around the idea of sources and sinks (sources being the origin of the data and sinks being the destination). You could easily pipe sources and sinks of data together with Flume processes. The Flume processes running on the Apache servers were happily sending data to the aggregate, which was perfect. However, I think I hit a Flume bug: FLUME-757, where a race condition makes Flume spit out null usage errors. This put a dent in my plan, although it might have simply been user error (however I am pretty sure it was the bug ;)).

Next, I will be trying out fluentd, a similar data piping application, instead of Flume to see if it provides the aggregation & piping functionality I need. In short, my HDFS layer is working; now I simply have to either find the right solution or work out the bugs & kinks in the actual piping of data. Hopefully I’ll be able to make another post or an update to this with great success soon!

How to run ZFS on Linux via FUSE

Saturday, July 10th, 2010

So today I decided it was time for me to look into the mythical ZFS filesystem. My curiosity is due to my interest in building a large multi-disk Linux system in the near future.

I started by creating a new virtual machine within VirtualBox, which is a free virtual machine application from Oracle (formerly Sun). I created 7 virtual disks: one 8 GB disk for the main OS and 6x 2 GB disks, which I would test ZFS on. Afterward, I proceeded to install a standard stable Debian system (sans the desktop environment) on the 8 GB disk. Once Debian booted up, it was time to get ZFS installed.

The first step was to simply pull the ZFS FUSE module’s source down by doing the following:

wget http://zfs-fuse.net/releases/0.6.9/zfs-fuse-0.6.9.tar.bz2
tar -jxf zfs-fuse-0.6.9.tar.bz2
rm -rf zfs-fuse-0.6.9.tar.bz2

This provides a nice folder containing the ZFS FUSE module source code, amongst a few other things. Now, to take care of a few dependencies and required programs to build said module. I ran the following command to install glibc, zlib, fuse, aio, scons, libssl, and attr:

sudo aptitude install glibc-2.7-1 zlib1g-dev zlibc libfuse-dev libaio-dev scons libssl-dev attr-dev

Now that I finally had the dependencies and required programs for the module, I went about building it:

cd zfs-fuse-0.6.9/src/
scons
scons install

You can think of scons as being similar to make, so in this step, I simply compiled the module, then installed it. Surprisingly simple. Make sure that you run at least the scons install command as root (or via sudo).

Now, the only step left is to make sure that we automatically load the FUSE module and that the ZFS FUSE daemon automatically starts & mounts our ZFS pools on boot. To do this, I went through the following commands:

cd ../contrib/
echo "fuse" >> /etc/modules
cp zfs-fuse.initd.ubuntu /etc/init.d/zfs-fuse
update-rc.d zfs-fuse defaults

Keep in mind that all of these commands should be run as a root (or sudo-ed) user, save for the first. The first command simply changes the folder, while the second command adds the fuse module to be automatically loaded. The third command copies the provided script that automatically starts the ZFS FUSE daemon in Ubuntu and since Ubuntu is based upon Debian, I figured it would work for a Debian system – and it did. The final command, then, simply adds the ZFS FUSE daemon auto-start script to our boot process.
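
One small caveat with the echo "fuse" >> /etc/modules line: running it twice adds the entry twice. A sketch of an append that is safe to repeat could look like this; the function name is made up, and writing to /etc/modules still requires root.

```shell
#!/bin/sh
# Append a line to a file only if that exact line is not already present.
# $1 = line to add, $2 = target file (e.g. /etc/modules, which needs root).
append_line_once() {
    # -x matches whole lines only, -F treats the text literally, -q stays quiet.
    grep -qxF -- "$1" "$2" 2>/dev/null || printf '%s\n' "$1" >> "$2"
}
```

With this in place, `append_line_once fuse /etc/modules` can be run as often as needed without duplicating the entry.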

Now we get to the meat and potatoes: creating our ZFS pool. Run the following command to make a single logical volume from the 6 disks we created earlier:

zpool create -m /tank tank raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg

This creates a 6-disk logical volume named “tank” and mounts it at /tank (you can obviously go with almost any mount point or naming scheme you want). Notice that I used /dev/sdb and so on as my drives; these may differ depending on how you set up your virtual hardware. One special keyword you will see is raidz2; this means that we are creating a logical volume using RAIDZ2, which stores two disks’ worth of parity distributed across all the disks, so the pool survives the loss of any two of them. With the current version of ZFS, one can use RAIDZ1, RAIDZ2, and even RAIDZ3, each number specifying how many disks’ worth of parity is kept. Additionally, there is also basic mirroring and striping support.

With that single command, I had a working, thriving ZFS setup! I was floored at how simple the actual creation of the ZFS volume was after installing the module. I then checked on my ZFS pool to see the status of each disk and the size of the logical volume:

zpool status tank

This command will show the status of each RAIDZ and disk.

zpool list tank

This command will show the size and usage of each pool. For the one I created, it displayed 11.9 GB available. I then went through a scenario: what if I had 3 disks in a RAIDZ2, then wanted to add 3 more? After a bit of research, it seems that there is some work under way to let ZFS expand RAIDZ configurations, but currently no such feature exists. Thus, a secondary RAIDZ must be added (after first removing the 6-disk pool with zpool destroy tank):

zpool create -m /tank tank raidz2 /dev/sdb /dev/sdc /dev/sdd
zpool add tank raidz2 /dev/sde /dev/sdf /dev/sdg

The first command of course creates a pool with 3 disks. The second command adds another RAIDZ2 set of 3 additional disks to it. Checking the status of the pool now will show that there are two RAIDZ2 sets of 3 disks each. Running zpool list again, it seems the reported size remained the same (11.9 GB).
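
The unchanged 11.9 GB is worth a second look. For RAIDZ sets, zpool list reports the raw size of the pool including the space used for parity, so both layouts show the same number even though they leave different amounts of room for data. The arithmetic below is a rough sketch: it ignores metadata and padding overhead, treats each virtual disk as exactly 2 GB, and uses a made-up function name.

```shell
#!/bin/sh
# Rough usable capacity of one RAID-Z set: (disks - parity) * disk size.
# Ignores metadata and padding overhead, so real numbers will be a little lower.
raidz_usable_gb() {
    # $1 = number of disks, $2 = parity level (1, 2, or 3), $3 = size of each disk in GB
    echo $(( ($1 - $2) * $3 ))
}

one_set=$(raidz_usable_gb 6 2 2)              # a single 6-disk raidz2
two_sets=$(( 2 * $(raidz_usable_gb 3 2 2) ))  # two 3-disk raidz2 sets
echo "6-disk raidz2:     ~${one_set} GB usable"
echo "two 3-disk raidz2: ~${two_sets} GB usable"
```

By this estimate the single 6-disk set leaves roughly 8 GB for data while the two 3-disk sets leave roughly 4 GB, because each set pays for its own two disks of parity.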

I then wanted to simulate a disaster, namely a destroyed disk:

/etc/init.d/zfs-fuse stop
dd if=/dev/zero of=/dev/sdc bs=1M
dd if=/dev/zero of=/dev/sdf bs=1M
/etc/init.d/zfs-fuse start

This basically nukes two drives (one per RAIDZ2 set). Checking the status shows that each of those disks is “unavailable” due to corrupted data, as expected. Now, since we know the two virtual drives are in working order, we can simply notify ZFS that we have replaced the “bad” drives with good ones by running the following:
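
Since dd will happily wipe whatever device it is pointed at, it may be worth rehearsing this kind of test on something more disposable than /dev/sdX. zpool also accepts regular files (given by absolute path) in place of disks, so a throwaway pool can be built from sparse files. The sketch below only creates the backing files; the function name and the file naming scheme are assumptions, and it relies on the truncate tool from GNU coreutils.

```shell
#!/bin/sh
# Create N sparse backing files that can stand in for disks in a throwaway test pool.
# $1 = directory (use an absolute path), $2 = number of files,
# $3 = size as understood by truncate (e.g. 2G)
make_backing_files() {
    mkdir -p "$1" || return 1
    i=1
    while [ "$i" -le "$2" ]; do
        truncate -s "$3" "$1/disk$i.img" || return 1
        echo "$1/disk$i.img"    # print each path so it can be handed to zpool
        i=$((i + 1))
    done
}
```

The printed paths can then be passed to zpool create in place of the /dev/sdX names, and one of the files can be overwritten with dd to play the part of the failed disk.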

zpool replace tank /dev/sdc
zpool replace tank /dev/sdf

This has ZFS start rebuilding the RAIDZ2 sets. Perfect! You can also have ZFS walk the whole pool, verifying its data and repairing anything damaged, by running the following:

zpool scrub tank

From these experiments, it seems ZFS is an excellent solution for software RAID, so much so that I am not sure I will be going back to mdadm anytime soon. On the other hand, the XFS filesystem is also said to work well on top of logical volumes, but for now I can say that ZFS is simple, yet powerful.

nForce’s RAID Disappointment

Tuesday, January 6th, 2009

When planning my new system over the summer, I was determined to use a RAID 5 setup. Having three 750 GB drives, I figured that a RAID 5 setup would give me plenty of storage space (approximately 1396 GB) and also provide me with parity, to protect my data in case one drive failed. In addition, the RAID 5 array would provide me with faster read speeds, due to the data being split between multiple drives. This setup worked flawlessly and was quite simple to use.

All was well until I decided to upgrade my storage capacity by purchasing an additional 750 GB drive (thus making a total of 4 drives). Using 4 drives in the same RAID 5 setup would provide me with approximately 2095 GB of storage space. However, this is where nForce’s RAID began to turn on me.
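
For anyone wondering where 1396 GB and 2095 GB come from: RAID 5 spends one drive’s worth of space on parity, and the figures are what you get when the drives’ decimal gigabytes (10^9 bytes) are expressed in units of 2^30 bytes. A small sketch of that arithmetic, with a made-up function name:

```shell
#!/bin/sh
# RAID 5 keeps one drive's worth of parity, so usable space is (drives - 1) * drive size.
# Drive makers count 1 GB as 10^9 bytes; the result is shown in units of 2^30 bytes.
raid5_usable_gib() {
    # $1 = number of drives, $2 = size of each drive in decimal GB
    echo $(( ($1 - 1) * $2 * 1000000000 / 1073741824 ))
}

raid5_usable_gib 3 750   # three drives, prints 1396
raid5_usable_gib 4 750   # four drives, prints 2095
```

Note that the four-drive figure is slightly larger than 2048, which is the number of 2^30-byte units in 2 TB when that limit is counted in binary units.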

Using the nVidia Control Panel’s RAID functionality, I figured I would be able to expand my RAID 5 array by simply adding the new drive and letting the RAID rebuild itself over the period of a day or so. This, however, could not be done. The only option I was given was to convert my array from RAID 5 to RAID 0+1, which was not really what I was going for.

I contacted nVidia Technical Support to ask them about this issue; their response was that I should contact MSI, since my nVidia nForce BIOS was too old (version 6) while their newest version is version 9. In turn, I contacted MSI about the nVidia nForce BIOS. I was then told by MSI that no newer versions have been provided for nForce. I was caught between two companies in a never-ending customer support referral.

In the end, I threw caution to the wind, backed up my plethora of data and simply recreated my RAID array. However, this caused additional problems, as nForce’s RAID setup only allows one to create a RAID of 2 TB at maximum. Lovely. I figured I’d give it a shot anyway and booted into Windows (thankfully installed on a separate drive). Amazingly, Windows saw all 2095 glorious GB of space and all was almost well.

Since then, I have discovered that every few reboots I have to recreate my RAID array by deleting the array and recreating it in the MediaShield BIOS (which does not clear any data stored on the array); this somehow allows the MediaShield BIOS to shrug off the 2 TB limit and allows Windows to use all available space on the RAID array.

In the end, my solution is definitely not perfect and is held back by nForce’s limited support for modern RAID arrays. My best judgment, for now, is to simply not reboot unless I am near my computer and able to recreate the RAID array “just in case”. I sincerely hope that nVidia releases new patches for the nForce chipsets to solve this issue in the near future.