Tuesday, May 17, 2011

GlusterFS: replace-brick

GlusterFS has a volume command called replace-brick; as the name suggests, it replaces one brick with another. However, the way it works is not entirely intuitive and requires some understanding before actually trying it out.



Let us say the cluster looks like this:

# gluster volume info

Volume Name: rb-test
Type: Distribute
Status: Started
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: 192.168.1.95:/data/distributestore/store-1
Brick2: 192.168.1.95:/data/distributestore/store-2
Brick3: 192.168.1.96:/data/distributestore/store-1
Brick4: 192.168.1.96:/data/distributestore/store-2

To replace the brick 192.168.1.95:/data/distributestore/store-2 with 192.168.1.77:/data/distributestore/store-2 we run the following command.

# gluster volume replace-brick rb-test \
    192.168.1.95:/data/distributestore/store-2 \
    192.168.1.77:/data/distributestore/store-2 start

This command migrates the data from 192.168.1.95:/data/distributestore/store-2 to 192.168.1.77:/data/distributestore/store-2.

However, the brick is not actually replaced on the volume yet. After the above command completes, the data is present on both 192.168.1.95:/data/distributestore/store-2 and 192.168.1.77:/data/distributestore/store-2.

To attach the brick 192.168.1.77:/data/distributestore/store-2 to the cluster in place of the old one, run:

# gluster volume replace-brick rb-test \
    192.168.1.95:/data/distributestore/store-2 \
    192.168.1.77:/data/distributestore/store-2 commit

After the above command the volume info looks like this:

[root@centos5 store-2]# gluster volume info rb-test

Volume Name: rb-test
Type: Distribute
Status: Started
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: 192.168.1.95:/data/distributestore/store-1
Brick2: 192.168.1.77:/data/distributestore/store-2
Brick3: 192.168.1.96:/data/distributestore/store-1
Brick4: 192.168.1.96:/data/distributestore/store-2

If you just need to create a backup of a brick, simply run `gluster volume replace-brick <VOLNAME> <BRICK> <NEW-BRICK> start' ;-).

The `replace-brick' volume command has other sub-commands, viz. pause, abort and status, which are fairly intuitive. See http://goo.gl/F3Lfw for more details on them.
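
For example, while the data migration kicked off by `start' is still running, the status sub-command reports its progress. A quick sketch, reusing the bricks from above:

# gluster volume replace-brick rb-test \
    192.168.1.95:/data/distributestore/store-2 \
    192.168.1.77:/data/distributestore/store-2 status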

Sunday, May 15, 2011

Shell: Reading directory with spaces in filename

A friend asked me how to read a directory which has spaces in the file names and assign each name to a variable.

For example, if my directory test has a file `foo bar.txt',

for file in `ls`; do
    echo "$file"
done


does not work; the variable file is assigned `foo' and then `bar.txt'.


There is more than one way to solve this problem.


1. This works on most systems.


for file in *; do
    echo "$file"
done
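
A bash-specific caveat: by default the glob skips dot-files, and in an empty directory the literal `*' is passed through to the loop. Both can be adjusted with shopt; a small sketch, assuming bash:

# include dot-files in the glob, and expand to nothing when there are no matches
shopt -s dotglob nullglob
for file in *; do
    echo "$file"
done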


2. Setting IFS to a newline will definitely work.


IFS='
'
for file in `ls`; do
    echo "$file"
done


or


IFS=`echo -en "\n\b"`
for file in `ls`; do
    echo "$file"
done


3. One of my favorites, and the most efficient.


find . -type f -print0 | while IFS= read -r -d '' file; do
    echo "$file"
done



Wednesday, May 11, 2011

Sparse files: du vs. ls -h

When the disk usage of a sparse file is checked with `ls -lh' and `du -h', they report different sizes. This is not because of a bug in du, ls or the filesystem itself, but because of the way du and ls calculate the file size.

du uses fts(3) to walk the filesystem and calculates the sizes of the files (this is on the BSDs; I haven't checked the GNU version of du, but it will be more or less the same).

du calculates the size by looking at fts_statp->st_blocks in the FTSENT structure, which is the number of allocated blocks in units of 512 bytes. Hence the size is calculated as fts_statp->st_blocks * 512; fts_statp points to the struct stat filled in by the stat(2) call.


On the other hand, `ls -l' uses the same method as `du' but looks at the st_size field of the stat structure to determine the file size. This gives the logical size of the file rather than the actually allocated blocks.
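
A quick way to see the difference, assuming the truncate(1) utility from GNU coreutils is available (the file name is just an example):

# create a 1GB file without allocating any data blocks
truncate -s 1G sparse.img

ls -lh sparse.img    # reports st_size, i.e. 1.0G
du -h sparse.img     # reports st_blocks * 512, close to 0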


References: 

http://cvsweb.netbsd.org/bsdweb.cgi/src/usr.bin/du/du.c?rev=1.34&content-type=text/x-cvsweb-markup&only_with_tag=MAIN


http://cvsweb.netbsd.org/bsdweb.cgi/src/bin/ls/ls.c?rev=1.68&content-type=text/x-cvsweb-markup&only_with_tag=MAIN

Saturday, May 7, 2011

ReiserFS doesn't like GlusterFS?

ReiserFS comes as the default filesystem on my Slackware box, and somehow ReiserFS does not seem to like GlusterFS. The system locks up while doing xattr operations, and dmesg reports a bunch of warnings before the lock-up:

[ 229.328509] REISERFS warning (device sda6): jdm-20001 reiserfs_xattr_get: Invalid magic for xattr (trusted.gfid) associated with [10723 11732 0x0 SD]
[ 229.328973] REISERFS warning (device sda6): jdm-20001 reiserfs_xattr_get: Invalid magic for xattr (trusted.gfid) associated with [10723 11732 0x0 SD]
[ 229.330871] REISERFS warning (device sda6): jdm-20001 reiserfs_xattr_get: Invalid magic for xattr (trusted.gfid) associated with [10723 11732 0x0 SD]
[ 229.332178] REISERFS warning (device sda6): jdm-20001 reiserfs_xattr_get: Invalid magic for xattr (trusted.gfid) associated with [10723 11732 0x0 SD]
[ 229.333257] REISERFS warning (device sda6): jdm-20001 reiserfs_xattr_get: Invalid magic for xattr (trusted.gfid) associated with [10723 11732 0x0 SD]
[ 229.334390] REISERFS warning (device sda6): jdm-20001 reiserfs_xattr_get: Invalid magic for xattr (trusted.gfid) associated with [10723 11732 0x0 SD]
[ 229.335931] REISERFS warning (device sda6): jdm-20001 reiserfs_xattr_get: Invalid magic for xattr (trusted.gfid) associated with [10723 11732 0x0 SD]

This has happened twice; I am not sure whether it is a coincidence, so I will try to reproduce it more definitively before drawing conclusions. It takes a very low priority for two reasons:

a. GlusterFS is not recommended with ReiserFS.
b. ReiserFS itself is no longer being developed or supported.

I have used ReiserFS as a backend for GlusterFS before and this did not seem to happen. Or maybe the usage pattern was different; I'm not sure.

Monday, May 2, 2011

GlusterFS defaults

I recently came across a question on the Gluster mailing list about the default options set for the translators. The command `gluster volume info' lists the options that have been reconfigured, but there is no way (as of now) to know the default configuration other than checking the volume files and the code.
I have listed the default options below; a short example of changing one of these defaults follows the lists. All the options are as of 3.2.x and also apply to 3.1.x.
GlusterFS servers by default have the following translators enabled:
  • features/access-control
  • features/locks
  • performance/io-threads
  • features/marker
  • debug/io-stats
The client has the following performance translators enabled by default:
  • performance/write-behind
  • performance/read-ahead
  • performance/io-cache
  • performance/quick-read
  • performance/stat-prefetch
  • debug/io-stats
Below, I've listed the default options that are set for the above translators.
Server translator defaults:
features/locks
  • trace off
performance/io-threads
  • thread-count 16 (minimum 1, maximum 64)
  • idle-time 1
features/marker
  • volume-uuid - UUID of the volume
  • timestamp-file - path
  • xtime - off
  • quota - off
debug/io-stats
  • dump-fd-stats off
  • latency-measurement off
  • count-fop-hits off
  • log-level INFO
Client translator defaults:
performance/write-behind
  • flush-behind on
  • cache-size/window-size 1MB (min 512KB, max 1GB)
  • disable-for-first-nbytes 0
  • enable-O_SYNC disabled by default
  • enable-trickling-writes on
performance/read-ahead
  • force-atime-update false
  • page-count 4 (min 1, max 16)
performance/io-cache
  • priority 1
  • cache-timeout/force-revalidate-timeout 1sec (min 0, max 60)
  • cache-size 32MB (min 4MB, max 6GB)
  • min-file-size 0
  • max-file-size -1 (limited by the underlying FS)
performance/quick-read
  • priority 1
  • cache-size 128MB (min 0, max 6GB)
  • cache-timeout 1 (min 1, max 60)
  • max-file-size 64KB (min 0, max 1MB)
performance/stat-prefetch
This translator does not have any options to configure.
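
As mentioned above, `gluster volume info' only shows the options that have been reconfigured. Any of these defaults can be overridden per volume with `gluster volume set' and reverted with `gluster volume reset'; a small sketch (the volume name and value are only examples):

# bump the io-cache size for the volume rb-test
gluster volume set rb-test performance.cache-size 64MB

# revert all reconfigured options back to the defaults listed above
gluster volume reset rb-test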

Tuesday, April 12, 2011

Using loop devices to test GlusterFS

I came across an issue when one of our users had questions like: does X happen when one of the GlusterFS nodes is almost full? Does Y happen if one of the nodes is full? Does GlusterFS work at all if a couple of nodes are full?

Though the answer was straightforward, I thought it would be better to test the functionality under those conditions before answering the obvious.

Initially I thought about launching a few VMs for a quick test, but the partition sizes were far too big for my tests; it was going to be a long wait before the nodes filled up. The alternative was to create smaller partitions, which involves fdisk et al. and then working backwards to restore the original disk layout (if necessary).

A better solution for this type of test is to create a few large files with the `dd' command and use them (loop-mounted) as Gluster exports.

For example:

sac@odin:/data/disks $ for i in {1..4}; do
> dd if=/dev/zero of=disk$i bs=256k count=1000
> done
sac@odin:/data/disks $

Create a filesystem on the data files.

root@odin:/root # for i in {1..4}; do
> mkfs.ext3 /data/disks/disk$i
> done
root@odin:/root #

Mount the filesystems via loop devices.

root@odin:/root # mkdir /mnt/{1..4}
root@odin:/root # for i in {1..4}; do
> mount /data/disks/disk$i /mnt/$i -o loop
> done
root@odin:/root #

Now we have four partitions of the sizes we want, pretty cheaply, without needing multiple servers or real disk partitions.

root@odin:/root # df -h /mnt/*
Filesystem            Size  Used Avail Use% Mounted on
/dev/loop0            2.9G   69M  2.7G   3% /mnt/1
/dev/loop1            2.9G   70M  2.7G   3% /mnt/2
/dev/loop2            2.9G   70M  2.7G   3% /mnt/3
/dev/loop3            2.9G   70M  2.7G   3% /mnt/4
root@odin:/root #

These mount points are then used as export directories, and one can play around with them to understand Gluster behavior when one of the partitions fills up, or to observe performance when the filesystems are built with various flags.
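
For instance, a plain distribute volume can be created straight from these mounts. A rough sketch, assuming the server is reachable by the hostname odin (the volume name and the fill-up trick below are only examples):

# use the loop-backed mounts as bricks of a test volume
gluster volume create looptest odin:/mnt/1 odin:/mnt/2 odin:/mnt/3 odin:/mnt/4
gluster volume start looptest

# fill up one brick to observe the behavior when a node runs out of space
dd if=/dev/zero of=/mnt/1/filler bs=1M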

Conclusion:

This is a fast and cheap way to test GlusterFS functionality on various filesystems without having to bother about getting disks and creating partitions. The advantage is that we need not repartition disks to get different sized partitions; we can simply delete a file and create a new one of a different size. It is good for functionality testing, though performance suffers. Gluster behavior can be quickly checked over various filesystems before setting up dedicated disks for extensive testing, and building filesystems with various options and tunings to observe GlusterFS behavior becomes very easy.

Sunday, April 10, 2011

Gluster 3.0.x to 3.1.x migration

Migrating from GlusterFS 3.0.x to 3.1.x is explained in http://bit.ly/ibgF6K; however, this migration process leaves room for errors, and extra precaution is necessary while migrating between these major versions.

I have listed below a few steps which have to be followed during the 3.0.x to 3.1.x migration, the errors one might encounter due to a faulty migration, and the steps to recover from them.

One of the new things that came in 3.1 is a concept called gfid. The gfid is an extended attribute that gets set on every file and directory of a GlusterFS filesystem; essentially, after migrating to 3.1, every file that is accessed from the mount point thereafter is assigned this new extended attribute.
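
On a backend brick the gfid can be inspected with getfattr; a small example, where the brick path and file name are hypothetical:

# the gfid is a 16-byte value stored in the trusted.gfid xattr, shown hex-encoded
getfattr -d -m trusted.gfid -e hex /data/brick/somefile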

The first step after migrating from 3.0.x to 3.1.x is to mount the volume from a `single' client and run a recursive stat on the mount point, e.g. `ls -lR >/dev/null'. Double-check that no other clients are accessing the cluster, and shut them down if they are.

If more than one client accesses the cluster right after the upgrade, there is a possibility that directories on the backends end up with different gfids; I have illustrated this in the example below. In such cases directory or file removal fails and you may see other unexpected behavior. Below is an error due to a gfid mismatch...

root@odin:/mnt/distribute1# rm -rf glusterfs-3.*
rm: cannot remove `glusterfs-3.0.5/extras/volgen': Directory not empty


The fix is to identify such directories, remove the extended attribute trusted.gfid on the backend, and run stat from the mount point. Make sure no other clients are accessing these directories at the same time.

An illustration of how it looks:

root@odin:/mnt/distribute1# rm -rf glusterfs-3.2.0qa8/
rm: cannot remove `glusterfs-3.2.0qa8/': Directory not empty
root@odin:/mnt/distribute1#

Examining the backend, I see that libglusterfs is a directory within glusterfs-3.2.0qa8/; examining further, you see...

root@odin:/media# find /media/ -type d -name 'libglusterfs' | \
xargs -d'\n' getfattr -d -m trusted.gfid -e hex

getfattr: Removing leading '/' from absolute path names
# file: media/5/glusterfs-3.2.0qa8/libglusterfs
trusted.gfid=0x9c3986db772d413a97ba79549b57370f

# file: media/4/glusterfs-3.2.0qa8/libglusterfs
trusted.gfid=0x8ae60902d0894c7ea52ad1061ee1e158

# file: media/1/glusterfs-3.2.0qa8/libglusterfs
trusted.gfid=0x8ae60902d0894c7ea52ad1061ee1e158

# file: media/3/glusterfs-3.2.0qa8/libglusterfs
trusted.gfid=0x8ae60902d0894c7ea52ad1061ee1e158

# file: media/2/glusterfs-3.2.0qa8/libglusterfs
trusted.gfid=0x9c3986db772d413a97ba79549b57370f


Notice that the gfids are not the same on all the backends, even though they are required to match.

Solution:

On the backend, remove the extended attribute trusted.gfid from the problem directory:

root@odin:/media# find /media/ -type d -name 'libglusterfs' | \
xargs -d'\n' setfattr -x trusted.gfid
root@odin:/media# find /media/ -type d -name 'libglusterfs' | \
xargs -d'\n' getfattr -d -m trusted.gfid -e hex
root@odin:/media#

No attributes, as expected. Now run a stat on the mount point to fix the gfid.

root@odin:/mnt/distribute1# stat glusterfs-3.2.0qa8/libglusterfs

File: `glusterfs-3.2.0qa8/libglusterfs'
Size: 20480     Blocks: 80     IO Block: 131072     directory
Device: 16h/22d     Inode: 39460     Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2011-04-11 17:25:52.000000000 +0530
Modify: 2011-04-11 16:50:33.000000000 +0530
Change: 2011-04-11 17:27:00.000000000 +0530
root@odin:/mnt/distribute1#

On the backend, the directory should now have the same gfid on all the nodes.

root@odin:/media# find /media/ -type d -name 'libglusterfs' | \
xargs -d'\n' getfattr -d -m trusted.gfid -e hex
getfattr: Removing leading '/' from absolute path names
# file: media/5/glusterfs-3.2.0qa8/libglusterfs
trusted.gfid=0xcfeeacae15b54738b8fc6d60bd1ff05c

# file: media/4/glusterfs-3.2.0qa8/libglusterfs
trusted.gfid=0xcfeeacae15b54738b8fc6d60bd1ff05c

# file: media/1/glusterfs-3.2.0qa8/libglusterfs
trusted.gfid=0xcfeeacae15b54738b8fc6d60bd1ff05c

# file: media/3/glusterfs-3.2.0qa8/libglusterfs
trusted.gfid=0xcfeeacae15b54738b8fc6d60bd1ff05c

# file: media/2/glusterfs-3.2.0qa8/libglusterfs
trusted.gfid=0xcfeeacae15b54738b8fc6d60bd1ff05c

root@odin:/mnt/distribute1# rm -rf glusterfs-3.2.0qa8/
root@odin:/mnt/distribute1#

The directory can then be removed from the mount point, provided the layouts of the directory and its sub-directories have been fixed.

This problem was first spotted at one of our customers; thanks to Avati for the extensive debugging that figured out the root cause and the solution.