Hi there,

I am learning to use suffix arrays for pattern matching, by reading Gusfield's 'Algorithms on Strings, Trees, and Sequences' - p149 "7.14 APL13: Suffix arrays - more space reduction" and p151 "7.14.2 How to search for a pattern using a suffix array". Basically, once a suffix array is populated in lexical order, a binary search is used to find the smallest suffix index such that the pattern matches the first n characters of the suffix. I am starting with the simplest approach, without any advanced accelerants. Based on my understanding, I wrote the below script, and ask for your insight.

1. Am I understanding this correctly?
2. Is this the correct implementation, ie are the results accidentally correct?

I know there are plenty tools that will do what I want, but I would like to gain better understanding of what is going on. This is not homework, or an interview question, but a growing appreciation of bioinformatics.

many thanks!

#!/usr/bin/perl use Modern::Perl '2011'; use autodie; my $str = 'mississippi$'; my @suff; # hold suffixes while ( $str ) { push @suff, $str; substr ( $str, 0, 1, '' ); } # lexically ordered suffix array my @indices = sort { $suff[$a] cmp $suff[$b] } 0 .. $#suff; shift @indices; # remove $ for ( 0 .. $#indices ) { say $indices[$_]+1, ": ", $suff[ $indices[$_] ]; } my $pattern = 'issi'; # find smallest index where pattern matches first n characters of suff +ix my $start = bsearch( \@indices, $pattern ); # get consecutive positions where pattern matches first n chars of suf +fix my @positions; for my $index ( $start .. $#indices ) { push @positions, $indices[ $index ] + 1 if $suff[ $indices[ $index + ] ] =~ /^$pattern/; } # print results say "\nPattern \"$pattern\" found at (position: suffix):"; for my $pos ( sort{ $a <=> $b } @positions ) { chop( my $s = $suff[ $pos - 1] ); say "$pos: $s"; } sub bsearch { my $mid; my ( $low, $high ) = ( 0, $#indices ); while ( 1 ) { $mid = int( ( $low + $high ) / 2 ); if( $suff[ $indices[ $mid ] ] =~ /^$pattern/ ) { return $mid; } if( $pattern lt $suff[ $indices[ $mid ] ] ) { $high = $mid - 1; } elsif( $pattern gt $suff[ $indices[ $mid ] ] ) { $low = $mid + 1; } return unless $low <= $high; } return "pattern not found in string!"; } #output abualiga:~$ ./suffixArray.pl 11: i$ 8: ippi$ 5: issippi$ 2: ississippi$ 1: mississippi$ 10: pi$ 9: ppi$ 7: sippi$ 4: sissippi$ 6: ssippi$ 3: ssissippi$ Pattern "issi" found at (position: suffix): 2: ississippi 5: issippi

In reply to searching for a pattern using suffix arrays by abualiga

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.