Friday, March 30, 2007

Access Lotus Notes using Perl via Win32::OLE

Are you a Lotus Notes user? Did you ever wonder how you can extract text out of Lotus Notes using Perl? Here's a script that I wrote to extract notification emails:

use Win32::OLE;

my $server;
my $database;
my $folder;

my $file = "temp_file.csv";

my $ini = "get_emails.ini";
if (-e $ini)
{
    open (INI, $ini) || die "Not able to open $ini: $!\n";
    chomp ($server   = <INI>);
    chomp ($database = <INI>);
    chomp ($folder   = <INI>);
    close (INI);
}
else
{
    print "\nEnter Notes Server: ";
    chomp ($server = <STDIN>);
    print "\nEnter Notes Database: ";
    chomp ($database = <STDIN>);
    print "\nEnter Folder you want to access: ";
    chomp ($folder = <STDIN>);
}

#connect to the Notes database
my $Notes = Win32::OLE->new('Notes.NotesSession')
    or die "Cannot start Lotus Notes Session object: ", Win32::OLE->LastError(), "\n";
my $Database = $Notes->GetDatabase($server, $database);

#Fetch contents of the folder
my $Response = $Database->GetView($folder);
my $Count = $Response->TopLevelEntryCount;
my $Index = $Count;

open (OUT, ">$file") or die "Not able to write $file: $!\n";

#loop through all emails
for (1..$Count)
{
    my $Document = $Response->GetNthDocument($Index--);   # walk the view from the last entry back to the first
    my $subject  = $Document->GetFirstItem('Subject')->{Text};
    my $body     = $Document->GetFirstItem('Body')->{Text};
    print OUT "Subject: $subject\n",
              "Body: $body\n";
}

close (OUT);

`start excel.exe $file`;

This works only if you have the Lotus Notes client installed on your machine, since, as you can see, the script uses OLE to access your mail database through the client. There is also a way to connect to a Lotus Notes database without having to install Lotus Notes: through the ever-popular Perl DBI module. The latter is the approach I use to synchronize Lotus Notes databases to Oracle, MySQL, and SQL Server. Will discuss more about that... stay tuned.
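For a taste of that second approach, here is a minimal sketch using DBI over ODBC. I am assuming the Lotus NotesSQL ODBC driver is installed and a DSN has been set up for the target database; the DSN name, credentials, and view name below are placeholders, not real values:

use strict;
use warnings;
use DBI;

# Hypothetical DSN created with the Lotus NotesSQL ODBC driver;
# the DSN name, credentials, and view name are placeholders.
my $dbh = DBI->connect('dbi:ODBC:NotesDSN', 'user', 'password',
                       { RaiseError => 1 });

# NotesSQL presents Notes views and forms as relational tables,
# so a mail view can be queried with plain SQL.
my $sth = $dbh->prepare('SELECT Subject, Body FROM Inbox');
$sth->execute;

while ( my ($subject, $body) = $sth->fetchrow_array ) {
    print "Subject: $subject\nBody: $body\n";
}

$dbh->disconnect;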

Wednesday, March 07, 2007

Notes on Business Intelligence Solution

These are some practical tips taken from the book Practical Business Intelligence with SQL Server 2005.

Suggested Approach: Build a consistent relational data warehouse with a dimensional schema optimized for queries.


Choose between Star or Snowflake for a Dimension? Answer: It Depends

  • Choose a snowflake when the dimension's attributes come from different sources.
  • Choose a snowflake when the dimension has a strong natural hierarchy; it is easier to manage the ETL process in this case (see the sketch below).
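To make the snowflake idea concrete, here is a minimal sketch of a snowflaked Product dimension built through Perl DBI. The DSN, table, and column names are my own invention for illustration, not the book's:

use strict;
use warnings;
use DBI;

# Hypothetical warehouse connection; substitute your own DSN and credentials.
my $dbh = DBI->connect('dbi:ODBC:Warehouse', 'user', 'password',
                       { RaiseError => 1 });

# Each level of the Product hierarchy gets its own table, linked by
# surrogate keys, so each source system can feed its own table.
$dbh->do('CREATE TABLE DimCategory (
              CategoryKey  INT PRIMARY KEY,
              CategoryName VARCHAR(50))');

$dbh->do('CREATE TABLE DimSubcategory (
              SubcategoryKey  INT PRIMARY KEY,
              SubcategoryName VARCHAR(50),
              CategoryKey     INT REFERENCES DimCategory (CategoryKey))');

$dbh->do('CREATE TABLE DimProduct (
              ProductKey     INT PRIMARY KEY,
              ProductSKU     VARCHAR(20),  -- business key from the source system
              ProductName    VARCHAR(50),
              SubcategoryKey INT REFERENCES DimSubcategory (SubcategoryKey))');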
Use a surrogate key for every dimension table. Surrogate keys are the primary identifiers for all dimension tables in the data warehouse, and every fact table record that refers to a dimension always uses the surrogate key rather than the business key (the natural identifier from the source system). All relationships in the data warehouse use surrogate keys, including the relationships between different dimension tables in a snowflake structure. Because the data warehouse uses surrogate keys while the source systems use business keys, one important step in the ETL process is to translate the business keys on incoming transaction records into data warehouse surrogate keys before inserting the new fact records.
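Here is a minimal sketch of that key-translation step with Perl DBI, assuming the DimProduct table from the previous sketch; get_next_transaction() is a hypothetical stand-in for whatever reads your source feed:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:ODBC:Warehouse', 'user', 'password',
                       { RaiseError => 1 });

# Build a business-key => surrogate-key lookup for the Product dimension.
my %product_key = @{ $dbh->selectcol_arrayref(
    'SELECT ProductSKU, ProductKey FROM DimProduct',
    { Columns => [1, 2] },
) };

my $insert = $dbh->prepare(
    'INSERT INTO FactSales (ProductKey, SalesAmount) VALUES (?, ?)');

# get_next_transaction() is hypothetical: whatever reads your source feed.
while ( my ($sku, $amount) = get_next_transaction() ) {
    my $surrogate = $product_key{$sku}
        or die "No surrogate key found for business key $sku\n";
    $insert->execute($surrogate, $amount);
}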

Making a List of Candidate Attributes and Dimensions

When you are reviewing the information you have collected, look for terms that represent different ways of looking at the data. A useful rule of thumb is to look for words such as "by" (as in, "I need to see profitability by product category"). If you keep a list of these candidate attributes as you find them, you can start to group them into probable dimensions such as Product or Customer.

One thing to be careful of is synonyms: People often have many different ways of naming the same thing, and it is rare that everyone will agree on the definition of every term. Similarly, people in different parts of the business could be using the same term to mean different things. An important job during the modeling process is to identify these synonyms and imprecise names and to drive the business users toward consensus on what terms will mean in the data warehouse. A useful by-product of this process can be a data dictionary that documents these decisions as they are made.

Making a List of Candidate Measures

At the same time that you are recording the attributes you have found, you will be looking for numeric measures. Many of the candidate measures you find will turn out to be derived from a smaller set of basic measures, but keep track of all of them, because they might turn out to be useful calculations that you can add to the OLAP cube later. The best candidates for measures are additive and atomic; that is, they can be added up across all the dimensions, including time, and they are not composed from other measures.

Grouping the Measures with the Same Grain into Fact Tables

Figuring out how to group measures into fact tables is a much more structured process than grouping related attributes into dimension tables. The key concept that determines which measures end up in a fact table is that every fact table has only one grain. After you have your list of candidate measures, you can set up a spreadsheet, as shown in the table below, with the candidate dimensions on the columns and the candidate measures on the rows.

Measure          Product     Customer    Date
Sales Amount     SKU         Customer    Day
Quantity         SKU         Customer    Day
Budget Amount    Category    N/A         Month

For each measure, you need to figure out the grain, or level of detail, you have available. For example, for a specific sales amount from a sales transaction, you can figure out the customer it was sold to, the product SKU they bought, and the day they made the purchase, so the granularity of the Sales Amount measure is Product SKU by Customer by Day. For Budget Amount, the business only produces monthly budgets for each product category, so the granularity is Product Category by Month.

From the example in the table, we end up with two different fact tables. Because the Sales Amount and Quantity measures have the same granularity, they both go on the Sales fact table, which also includes Product, Customer, and Date dimension keys. A separate Budget fact table has Product Category and Date dimension keys and a Budget Amount measure.
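In DDL terms, those two fact tables might look something like this sketch (again with invented names; MONEY is SQL Server's currency type):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:ODBC:Warehouse', 'user', 'password',
                       { RaiseError => 1 });

# Sales facts at Product SKU x Customer x Day grain.
$dbh->do('CREATE TABLE FactSales (
              ProductKey  INT NOT NULL,
              CustomerKey INT NOT NULL,
              DateKey     INT NOT NULL,
              SalesAmount MONEY,
              Quantity    INT)');

# Budget facts at Product Category x Month grain -- a separate
# fact table, because its grain differs from the sales grain.
$dbh->do('CREATE TABLE FactBudget (
              ProductCategoryKey INT NOT NULL,
              MonthKey           INT NOT NULL,
              BudgetAmount       MONEY)');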

Identifying the granularity of every measure sounds simple in principle but often turns out to be difficult in practice. So how do you know when you have made a mistake? One common sign is ending up with fact table records that have values for one set of measures and nulls for the remainder. Depending on how you load the data, you may also see a given numeric quantity repeated across multiple records in a fact table. This usually happens when a measure has a coarser granularity (such as Product Category rather than Product SKU) than the fact table's grain.
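A quick sanity check for that first symptom, run against the hypothetical FactSales table from the sketch above:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:ODBC:Warehouse', 'user', 'password',
                       { RaiseError => 1 });

# Count fact rows where one measure is populated and the other is null --
# a telltale sign that measures of different grains were mixed together.
my ($suspects) = $dbh->selectrow_array(q{
    SELECT COUNT(*)
      FROM FactSales
     WHERE (SalesAmount IS NULL AND Quantity IS NOT NULL)
        OR (SalesAmount IS NOT NULL AND Quantity IS NULL)
});
print "Rows with partially null measures: $suspects\n";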


Most important dimensional modeling lesson: A single fact table must never contain measures at different levels of granularity.

Monday, March 05, 2007

Baby Ivana Visits Us!

(Flickr photo set: Ivana & Ara - Mar 2007)
Cute little baby Ivana, our friends' first baby. Surely a delight to have her visit. My daughter didn't know how to react at first, since she's always been the youngest one in the house. Ivana's really a sweetie: she smiles on a whim, she likes my voice, and she laughs at my talking. She's just adorable!

Powerful Regular Expressions in Perl

Perl is rich with regular expressions. Most programs written in Perl have at least one regex for pattern matching. Because regular expressions are ingrained directly into the language's core, as well as into the minds of Perl programmers, most tasks involving string manipulation are easily done in Perl.

Well, last week I had an opportunity to answer an interesting question posted on the ActivePerl forum. The question has to do with parsing a DNA sequence.

Here is the question:

Given the string $temp = "XXXXAAAZZZZBBBSSSSCCCGGGGBBBVVVVVBBB"

Write a regex for filtering out the string between...
AAA
BBB
CCC
so in the above case, the output should be:
AAAZZZZBBB
BBBSSSSCCC
CCCGGGGBBB
BBBVVVVVBBB
All combinations of start and end for AAA BBB CCC.

My solution:

#!/usr/bin/perl

use strict;
use warnings;

my $dna_sequence  = 'XXXXAAAZZZZBBBSSSSCCCGGGGBBBVVVVVBBB';
my @dna_tags      = ('AAA',
                     'BBB',
                     'CCC',
                    );
my $joined_tags   = join "|", @dna_tags;
my $tag_pattern   = qr($joined_tags); #regexp quoting mechanism
foreach ( $dna_sequence =~ /(?=($tag_pattern.*?$tag_pattern))/g ) {
    #do whatever on the captured string
    #in this case, I just want to print it out
    print $_, "\n";
}

The trick here is the lookaround construct: a positive lookahead, written with the special sequence (?=...), as in (?=($tag_pattern.*?$tag_pattern)) above. The parentheses inside the construct capture the matched string. The lookahead is zero-width, so the match engine has not advanced at all when it succeeds. But because of the /g modifier, the engine notices that it has not moved forward, advances one character position, and tries the same match again, repeating until it reaches the end of the string. The net effect is overlapping matches.
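To see why the lookahead matters, compare it with the same pattern minus the (?=...) wrapper. Each match now consumes the text it matched, so the engine resumes after the closing tag and the overlapping fragments are lost:

#!/usr/bin/perl

use strict;
use warnings;

my $dna_sequence = 'XXXXAAAZZZZBBBSSSSCCCGGGGBBBVVVVVBBB';
my $tag_pattern  = qr(AAA|BBB|CCC);

# Without the lookahead the matches cannot overlap,
# so this prints only AAAZZZZBBB and CCCGGGGBBB.
while ( $dna_sequence =~ /($tag_pattern.*?$tag_pattern)/g ) {
    print $1, "\n";
}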

There is an excellent article about these same regular expression constructs on perl.com, in this post.

I would recommend reading these two books: Mastering Regular Expressions and the Perl Cookbook. The Perl Cookbook is actually where I got the basic solution to this problem.