Re: Re: Re: Progressive pattern matching

To clarify what I think you want, let me construct an example :

Input string :
GATTACA
File :
ATTACGATTACAAA
GATT
ZZGATTZZ
asdghckasdlkj
TTACA
Output :

On line 1 :
  GATTACA,GATTAC,ATTACA,TTACA,GATTA,ATTAC,TTAC,TACA,
  GATT,ATTA,TTA,TAC,GAT,ATT,ACA,TT,TA,GA,CA,AT,AC,T,G,C,A
On line 2 :
  GATT,GAT,ATT,TT,GA,AT,T,G,A
On line 3 :
  GATTA,GATT,ATTA,TTA,GAT,ATT,TT,TA,GA,AT,T,G,A
On line 4 :
  GATTACA,GATTAC,ATTACA,TTACA,GATTA,ATTAC,TTAC,TACA,
  GATT,ATTA,TTA,TAC,GAT,ATT,ACA,TT,TA,GA,CA,AT,AC,T,G,C,A

To achieve this, you want to find every substring of the search string that occurs on a line of the file, and report the matches longest first. As a first approach, which is surely suboptimal, look at the following code, which uses brute force :

use strict;
use warnings;

my $searchString = "GATTACA";
my %subStrings = ();   # an empty list; {} would store a stray hash reference as a key
my @subStrings = ();

sub populate {
  # Fills the hash subStrings with all "allowed" substrings
  # of the argument. Duplicates are avoided by
  # filling a hash instead of an array.
  my ($string) = @_;
  
  return if $string eq "";
  
  my $line = "";
  
  foreach (split "", $string) {
    $line .= $_;
    #print "Added $line\n";
    $subStrings{$line} = "1";
  };
  
  populate( substr( $string, 1 ));
};

populate( $searchString );

# We are only interested in the keys of our hash, 
# longest matches first :
@subStrings = reverse 
              sort { 
               length($a) <=> length($b) # Sort by string length
            || $a cmp $b                 # and then by string content
             } keys %subStrings;

# We read the file line by line :
my ( $line, $substring );
while ($line = <DATA>) {
  my @MatchedSubstrings = ();
  foreach $substring (@subStrings) {
    if ($line =~ /\Q$substring\E/) {   # \Q...\E matches the substring literally
      push @MatchedSubstrings, $substring;
    };    
  };
  if ($#MatchedSubstrings != -1) {
    print "On line $. : ", join(",", @MatchedSubstrings ),"\n";
  };
};

__DATA__
AGATTACAAA
ZZGATTZZ
GATTAZZ
GATGATTACAZZ
asdfgh
gattaca
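The same brute-force idea can also be written without recursion. The following is only a sketch, not a better algorithm: the helper names all_substrings and matched_substrings are made up for this illustration, and index() is swapped in for the regex match so that the substring is always treated as literal text.

```perl
use strict;
use warnings;

# Hypothetical helper: every distinct substring of $string, longest first
# (ties are broken by reverse string order, as in the code above).
sub all_substrings {
    my ($string) = @_;
    my %seen;
    for my $start (0 .. length($string) - 1) {
        for my $len (1 .. length($string) - $start) {
            $seen{ substr($string, $start, $len) } = 1;
        }
    }
    return sort { length($b) <=> length($a) || $b cmp $a } keys %seen;
}

# Hypothetical helper: the candidates that occur literally in $line.
sub matched_substrings {
    my ($line, @candidates) = @_;
    return grep { index($line, $_) >= 0 } @candidates;
}

my @candidates = all_substrings("GATTACA");
my @hits       = matched_substrings("ZZGATTZZ", @candidates);
print join(",", @hits), "\n";   # prints GATT,GAT,ATT,TT,GA,AT,T,G,A
```

For a search string of length n this still builds on the order of n*n candidates and tests each one against every line, so it is only reasonable for short search strings.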

Note that there are already many Perl modules for bioinformatics. A search of CPAN (http://www.cpan.org) should give you interesting results, as should a Google search for Perl and DNA, I guess.

perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ;    # The  
$d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider
($c = $d->accept())->get_request(); $c->send_response( new   #in the
HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' #  web

Re: Re: Re: Re: Progressive pattern matching by tfrayner (Curate) on Oct 15, 2001 at 19:35 UTC
I don't know whether it has the precise methods required, but see bioperl.org for the Bio::Perl homepage. I would have checked myself, but I was too busy reinventing the wheel (maybe), below :-)

Tim