Monday, March 05, 2007

Powerful Regular Expressions in Perl

Perl is rich in regular expression support. Most programs written in Perl contain at least one regex for pattern matching. Because regular expressions are built directly into the language's core, and into the minds of Perl programmers, most string manipulation tasks are easy to do in Perl.

Last week I had the opportunity to answer an interesting question posted on the ActivePerl forum. The question apparently has something to do with parsing a DNA sequence.

Here is the question:

Given the string $temp= "XXXXAAAZZZZBBBSSSSCCCGGGGBBBVVVVVBBB"

Write a regex for filtering out the string between...
AAA
BBB
CCC
so in the above case, the output should be:
AAAZZZZBBB
BBBSSSSCCC
CCCGGGGBBB
BBBVVVVVBBB
All combinations of start and end for AAA BBB CCC.

My solution:

#!/usr/bin/perl

use strict;
use warnings;

my $dna_sequence  = 'XXXXAAAZZZZBBBSSSSCCCGGGGBBBVVVVVBBB';
my @dna_tags      = ('AAA',
                     'BBB',
                     'CCC',
                    );
my $joined_tags   = join "|", @dna_tags;
my $tag_pattern   = qr($joined_tags); #regexp quoting mechanism
foreach ( $dna_sequence =~ /(?=($tag_pattern.*?$tag_pattern))/g ) {
    #do whatever on the captured string
    #in this case, I just want to print it out
    print $_, "\n";
}

The trick here is the lookaround construct, specifically positive lookahead, written with the special sequence (?=...). In the snippet above it appears as (?=($tag_pattern.*?$tag_pattern)). The parentheses inside the construct capture the matched string. This is called a zero-width lookahead because a successful match consumes no characters: the match engine has not advanced at all. Because of the /g modifier, the engine notices that the match was empty, so it moves forward one character position and tries again, repeating until it reaches the end of the string. The effect is overlapping matches: the closing tag of one captured string can also be the opening tag of the next.
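To see why the lookahead matters, here is a minimal sketch that runs the same pattern with and without it on the string from the question. The variable names @consuming and @overlapping are mine, chosen for illustration, and the tag pattern is written out literally instead of being joined from a list.

```perl
#!/usr/bin/perl

use strict;
use warnings;

my $dna_sequence = 'XXXXAAAZZZZBBBSSSSCCCGGGGBBBVVVVVBBB';
my $tag_pattern  = qr/AAA|BBB|CCC/;

# Ordinary match: each match consumes its closing tag, so that tag
# cannot also serve as the opening tag of the next match.
my @consuming = $dna_sequence =~ /($tag_pattern.*?$tag_pattern)/g;

# Lookahead match: nothing is consumed, so every tag in the string
# gets a chance to open a match of its own.
my @overlapping = $dna_sequence =~ /(?=($tag_pattern.*?$tag_pattern))/g;

print "consuming:   @consuming\n";
print "overlapping: @overlapping\n";
```

Without the lookahead only AAAZZZZBBB and CCCGGGGBBB are returned, because the BBB that closes the first match is already used up when the engine looks for the next one. With the lookahead all four strings come back: AAAZZZZBBB, BBBSSSSCCC, CCCGGGGBBB and BBBVVVVVBBB.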

There is an excellent article about these same regular expression constructs on perl.com.

I would recommend reading these two books: Mastering Regular Expressions and Perl Cookbook. Perl Cookbook is actually where I got the basic solution to this problem.
