comment on

Thanks again Tye!

I incorporated your changes: bsearch() is no longer accessing global variables, output is adjusted for a 'no match' case, and Modern::Perl does turn on warnings by default.

#!/usr/bin/perl
use Modern::Perl '2011';
use autodie;

# pattern searching using suffix arrays and binary search

open my $fh, '<', shift;
chomp( my( $pattern, $str ) = <$fh> );
$str .= '$';
my @suff; # hold suffixes
while ( 1 < length $str ) { # condition excludes '$' from @suff
    push @suff, $str;
    substr ( $str, 0, 1, '' );
}

# lexically ordered suffix array
my @indices = sort { $suff[$a] cmp $suff[$b] } 0 .. $#suff;
#for ( 0 .. $#indices ) {
#    say $indices[$_]+1, ": ", $suff[ $indices[$_] ];
#}

my $start = bsearch( \ @indices, $pattern, \ @suff );
# get consecutive positions, if any,
# where pattern matches first n chars of suffix.
my @positions;
for my $index ( $start .. $#indices ) {
    last 
        if $suff[ $indices[ $index ] ] !~ /^$pattern/;
    push @positions, $indices[ $index ] + 1; # omit +1 if 0-based inde
+xing
}
# print results
if ( @positions ) { # check if a match exists at all
    say "\nPattern \"$pattern\" found at position:";
    
    for my $pos ( sort{ $a <=> $b } @positions ) {
        #chop( my $s = $suff[ $pos - 1] ); #suffix
        print "$pos ";
    }
}
else {
    say "\nPattern \"$pattern\" not found in string";
}
print "\n";

# binary search.
# find first potential match, ie first suffix after or equal to patter
+n,
# such that pattern potentially matches first n characters of suffix. 
sub bsearch {
    my ( $indref, $pat, $sufref ) = @_;
    my $mid;
    my ( $lo, $hi ) = ( 0, $#$indref );
    
    while ( 1 ) {
        $mid = int( ( $lo + $hi ) / 2 );
        return $mid if $hi == $lo;
        
        if( ( $pattern cmp $$sufref[ $$indref[ $mid ] ] ) < 0 ) {
            $hi = $mid;
        }
        else {
            $lo = $mid + 1;
        }
    }
}
[download]

I did a simple benchmark relative to what you would call a 'naive' pattern search, after checking that both give the same results, with a 9-char pattern and 8569-char string. I like the speed up.

#!/usr/bin/perl
use Modern::Perl '2011';
use autodie;
use Benchmark qw/ timethese /;

# benchmark pattern searching using naive approach vs suffix arrays 

open my $fh, '<', shift;
chomp( my( $pattern, $str ) = <$fh> );

my $naive = sub {
    my @positions;
    while ( $str =~ /(?=($pattern))/g ) {
            push @positions, $-[0];;
        }
    #say "@positions";
};

my $sufbin = sub {
    $str .= '$';
    my @suff; # hold suffixes
    while ( 1 < length $str ) { # condition excludes '$' from @suff
        push @suff, $str;
        substr ( $str, 0, 1, '' );
    }

    # lexically ordered suffix array
    my @indices = sort { $suff[$a] cmp $suff[$b] } 0 .. $#suff;
    #for ( 0 .. $#indices ) {
    #    say $indices[$_]+1, ": ", $suff[ $indices[$_] ];
    #}

    my $start = bsearch( \ @indices, $pattern, \ @suff );
    # get consecutive positions, if any,
    # where pattern matches first n chars of suffix.
    my @positions;
    for my $index ( $start .. $#indices ) {
        last 
            if $suff[ $indices[ $index ] ] !~ /^$pattern/;
        push @positions, $indices[ $index ];# + 1; # omit +1 if 0-base
+d indexing
    }

    # binary search.
    # find first potential match, ie first suffix after or equal to pa
+ttern,
    # such that pattern potentially matches first n characters of suff
+ix. 
    sub bsearch {
        my ( $indref, $pat, $sufref ) = @_;
        my $mid;
        my ( $lo, $hi ) = ( 0, $#$indref );
        
        while ( 1 ) {
            $mid = int( ( $lo + $hi ) / 2 );
            return $mid if $hi == $lo;
            
            if( ( $pattern cmp $$sufref[ $$indref[ $mid ] ] ) < 0 ) {
                $hi = $mid;
            }
            else {
                $lo = $mid + 1;
            }
        }
    }
};

timethese( -5, { 
        Suffixbinary => $sufbin,
        Naive         => $naive,
    } );
#output
abualiga:~$ ./benchmarkPatternSearch.pl patternSearchData.txt 
Benchmark: running Naive, Suffixbinary for at least 5 CPU seconds...
     Naive:  6 wallclock secs ( 5.32 usr +  0.00 sys =  5.32 CPU) @ 92
+5.75/s (n=4925)
Suffixbinary:  6 wallclock secs ( 5.33 usr +  0.00 sys =  5.33 CPU) @ 
+232758.91/s (n=1240605)
[download]

In reply to Re^4: searching for a pattern using suffix arrays (better) by abualiga
in thread searching for a pattern using suffix arrays by abualiga

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.