Friday, January 26, 2007

Matz-san - buff that Ruby

Okay

When I started the new job, I was really excited about using Ruby. My boss is a quant and he really knows his stuff. He loves to program in SAS. SAS is a very expensive, subscription-based product. Any normal programmer would run away from SAS after seeing the syntax. Actually, it is more like 'JCL'. [ If you don't know JCL - good for you ]. He does not care if I program in Perl, Python, or Ruby. So I decided to use Ruby.

I digress a little bit. Back to our gem.

I am going to get into trouble for saying this but Ruby is slow. Don't get me wrong - I love the language. However, sometimes, it is so slow that it drives me nuts.

For example, 'Date'. Printing a 'Date' object is very expensive. VERY EXPENSIVE!

I started to profile, pored over the profile output, and found out that 'gcd' is called numerous times and it is very slow.

After struggling, I wrote a "C" function to deal with it. But I am still troubled by the fact that it is using a 'Rational' number to represent 'Date'. It is beyond my brain why you would want to use a 'Rational' number for a date.

However, Date#to_s is still killing me. Further digging into it, I found out that #to_s calls "strftime".

Guess what strftime is doing - it recursively parses the format and prints out each individual component of 'Date'. Nothing wrong with that, but you are parsing the format every single time you execute Date#to_s.

What - parsing '%Y-%m-%d' - recursively calling it. You've got to be kidding me!!!!

So one call to Date#to_s will result in a scan of "%Y-%m-%d", which translates into three recursive function calls.

Someone please hand me a gun.

So I changed it - if you don't pass in any format, it will use 'sprintf'.

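class Date
  # Keep the original implementation around for explicit formats.
  alias old_strftime strftime

  # With no format given, build the ISO date directly with sprintf
  # instead of parsing a format string on every call.
  def strftime(fmt = nil)
    if fmt.nil?
      sprintf("%.4d-%02d-%02d", year, mon, mday)
    else
      old_strftime(fmt)
    end
  end
end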


So now, Date#to_s is faster...

If I have time, I am going to hack 'Date' and find out why it is using 'Rational'.

rant mode off.

Having some 'tee' and smoking 'pipe'

Okay - I am digressing here. This one is about the Unix shell, not Ruby.

Building a data warehouse means you are shuffling data from one depot to another. That means a lot of scripting. [ No - I am not going to use an ETL tool - I refuse to use a product where mouse clicking is the only means of directing what needs to be done ]

These batch jobs usually run at night when I am getting some 'zzz' or 'vodka' :) So they have to log and be bulletproof. That means you want your script to automatically generate a log file and store it.

Wait - if I redirect my output to a file, I won't see anything on the screen while running it from the shell. This is very annoying when you are testing.

So why don't you have some 'tee'?

Straight from our 'man', tee 'reads from standard input and writes to standard output and files'. [please google 'tee' for more info ]

Here is what I do with my script


#!/bin/bash

LOGFILE=/a/log/file

(
    cmd1
    cmd2
    cmd3
) 2>&1 | tee -a "$LOGFILE"

Woo - now we are piping to tee. Basically, we are redirecting STDOUT and STDERR to 'tee'. 'tee' dutifully directs its STDIN to both $LOGFILE and STDOUT.

This is useful when you run the script from a terminal. You can watch the output while it is running, and the log file is also created and saved automatically.
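
If you want to see it for yourself, here is a small sketch you can paste into a Bash script. The temp file from mktemp and the two echo lines are stand-ins I made up for the demo - they are not real batch steps.

```shell
#!/bin/bash
# Hypothetical demo: LOGFILE is a throwaway temp file, and the echo
# commands stand in for real batch steps.
LOGFILE=$(mktemp)

(
    echo "step 1: normal output"
    echo "step 2: a warning" >&2    # written to STDERR
) 2>&1 | tee -a "$LOGFILE"

# Both lines showed up on the terminal, and both are in the log too.
LOG_LINES=$(grep -c '' "$LOGFILE")
echo "lines captured in log: $LOG_LINES"
```

Because of the 2>&1, the STDERR line goes through the pipe as well, so it lands in the log next to the normal output.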

Alas - we think we are done - no, my friend.

There is an unfortunate side effect of using "|". STATUS of execution.

$?
You should always check $? after executing the command.

For example,

execute_some_program

if [ $? -ne 0 ]; then
    echo "some error message"
    exit 1
fi
However, when you use "|", you only get the exit status of the last command in the chain.

For example
cmd1 | cmd2 | cmd3
$? is that of "cmd3"

So what has this got to do with our 'tee' and script?

Remember, all the hard work is done before 'tee'. If there is any error, you want your script to exit with a non-zero value to indicate there is a problem. However, our script will return the status of 'tee' instead, and that is the problem.

Let's translate this. Even though there is a problem and you dutifully exit with non-zero, tee will hijack the status code and replace it with its own.
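
Here is a minimal sketch of the hijacking, assuming a throwaway log file from mktemp and a made-up subshell that fails with status 3. Neither comes from my real jobs.

```shell
#!/bin/bash
LOGFILE=$(mktemp)    # hypothetical throwaway log file for the demo

(
    echo "doing the hard work..."
    echo "something broke" >&2
    exit 3             # the subshell dutifully reports failure
) 2>&1 | tee -a "$LOGFILE"

HIJACKED=$?            # this is the status of 'tee', not of the subshell
echo "status seen after the pipeline: $HIJACKED"
```

The error message makes it into the log, but as long as 'tee' itself succeeds, the status you get back is zero and the caller thinks everything went fine.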

No, this is not a bug. $? is one variable. The Unix creators had to decide which exit status $? should report, and they picked the last one.

But there is a trick. Actually, I have to confess that we said 'Unix' shell at the beginning. It is actually 'Bash'.

Bash decided that this is common and created another variable - $PIPESTATUS. This is an array "containing a list of exit status values from the processes in the most-recently-executed foreground pipeline (which may contain only a single command)".
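
A quick sketch of what the array looks like, using three made-up stages with different exit codes. One thing to watch: Bash rewrites PIPESTATUS after every command, so copy it right away if you need more than one element.

```shell
#!/bin/bash
# Three hypothetical stages: the first exits 3, the second exits 5,
# the third succeeds.
( exit 3 ) | ( exit 5 ) | true

# Copy the array immediately - the very next command overwrites it.
STATUSES=("${PIPESTATUS[@]}")

echo "number of stages: ${#STATUSES[@]}"
echo "first stage:      ${STATUSES[0]}"
echo "all stages:       ${STATUSES[*]}"
```

Element 0 is the leftmost command in the pipe, which in our logging script is the subshell doing the real work.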

So we can improve our program by doing this.

#!/bin/bash

LOGFILE=/a/log/file

(
    cmd1
    cmd2
    cmd3
) 2>&1 | tee -a "$LOGFILE"

# Exit with the status of the subshell, not the status of 'tee'.
exit "${PIPESTATUS[0]}"
So now you have had some 'tee' and smoked a 'pipe'.

Enjoy.

Tuesday, January 16, 2007

Hello Everyone

I've been tinkering with Ruby for the past 6 months.

My interest started out with Ruby on Rails. Yes, I was caught up in the "Java sucks! Ruby Rulez!" mantra.

I installed Ruby on my PC, Mac and Linux boxes at home and have been tinkering with it.

When I changed jobs last August, I decided that Ruby was going to be the main platform for doing 'data warehouse' work.

You heard me right, 'data warehouse'. Let me say it again, Ruby for a database project on top of IBM DB2 8.2 [ No, I refuse to call it UDB - it smells like DB2 - It quacks like DB2 - it is DB2 ]

Little did I know that there would be many ups and downs with this crazy idea.

As I progress with my adventure - I want to thank Matz-san for his rubyism - Ruby made me drink a bucket load of sake - oh - don't get me started - Ruby is not perfect but it is nice to work with.

So sit tight, I will go through the details of Ruby and try to do some brain dump so you don't have to go through my pain.