imlou has asked for the wisdom of the Perl Monks concerning the following question:

I need to search for a specific string, "TTGACA", inside another string that is very long and has no spaces between characters. The search string occurs multiple times, and it must match in that exact order. This is what I have so far, but it's not giving me the correct values.
    #!/usr/bin/perl
    @ARGV = ('test.txt');
    open FILE, $ARGV[0] or die "can't open file\n";
    $string = '';
    @file = <FILE>;
    foreach $line (@file) {
        # the first line is useless, so skip it
        if ($line =~ /^>/) { next; }
        else { $string .= $line; }
    }
    close FILE;
    my $search = 'TTGACA';
    $string =~ s/\s//g;    # getting rid of whitespace
    $n = ($string =~ tr/$search//);
    print "String $search occured $n times\n\n";

Replies are listed 'Best First'.
Re: search for a sequence of chars in a string
by Zaxo (Archbishop) on Sep 21, 2003 at 17:40 UTC

    To find an exact match of a string, index is preferred.

        my ($here, @places) = 0;
        while (($here = index $string, $search, $here) != -1) {
            push @places, $here++;
        }
        printf "String %s occured %d times\n\n", $search, scalar @places;
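One detail of the index loop worth seeing in isolation: because the search resumes one character past each hit, overlapping occurrences are counted too. The following is a minimal sketch, not part of the original reply; the helper name `count_with_index` and its `$overlap` flag are made up for illustration.

```perl
use strict;
use warnings;

# Count occurrences of $needle in $haystack using index().
# $overlap true  -> resume one character past each hit (overlaps counted)
# $overlap false -> resume after the end of each hit (no overlaps)
sub count_with_index {
    my ($haystack, $needle, $overlap) = @_;
    return 0 if $needle eq '';    # an empty needle would loop forever
    my $step  = $overlap ? 1 : length($needle);
    my $count = 0;
    my $pos   = 0;
    while (($pos = index($haystack, $needle, $pos)) != -1) {
        $count++;
        $pos += $step;
    }
    return $count;
}

print count_with_index('TTGACAGGTTGACA', 'TTGACA', 1), "\n";
print count_with_index('AAAA', 'AA', 1), "\n";    # overlapping hits at 0, 1, 2
print count_with_index('AAAA', 'AA', 0), "\n";    # non-overlapping hits at 0, 2
```

For a pattern like TTGACA, which cannot overlap itself, both modes give the same count; the difference only matters for self-overlapping patterns such as AA.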

    After Compline,
    Zaxo

Re: search for a sequence of chars in a string
by tachyon (Chancellor) on Sep 21, 2003 at 17:35 UTC
        #!/usr/bin/perl
        my $search = 'TTGACA';
        open FILE, $ARGV[0] or die "can't open file\n";
        local $/;
        $data = <FILE>;
        $data =~ s/[^GNUCAT]//g;    # or similar ;-)
        close FILE;
        my $count = () = $data =~ m/$search/g;
        print "String $search occured $count times\n\n";

    Oh, you should get into the habit of closing your files after you open them. Also, use strict and use warnings are good habits; see "Use strict warnings and diagnostics or die". They are hard to justify for trivial cases like this, but add another 50 lines and you will be glad you did.
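As a rough illustration of those habits, here is a sketch of the same task with use strict, use warnings, a lexical filehandle, three-argument open, and error checks that report $!. The sub names `read_sequence` and `count_matches` and the sample data are assumptions made for this example, and an in-memory string stands in for test.txt so the sketch runs on its own.

```perl
use strict;
use warnings;

# Read FASTA-style text from a filehandle, skip header lines that start
# with '>', and return the sequence with all whitespace removed.
sub read_sequence {
    my ($fh) = @_;
    my $seq = '';
    while (my $line = <$fh>) {
        next if $line =~ /^>/;
        $line =~ s/\s+//g;
        $seq .= $line;
    }
    return $seq;
}

# Count non-overlapping literal matches; \Q...\E quotes any metacharacters.
sub count_matches {
    my ($seq, $search) = @_;
    my $count = () = $seq =~ /\Q$search\E/g;
    return $count;
}

# Hypothetical sample data standing in for the real input file.
my $sample = ">header line\nTTGACAGG\nTTGACA\n";
open my $fh, '<', \$sample or die "can't open sample data: $!\n";
my $seq = read_sequence($fh);
close $fh or die "can't close sample data: $!\n";

my $search = 'TTGACA';
printf "String %s occurred %d times\n", $search, count_matches($seq, $search);
```

With a real file, the open line would instead take the file name, for example `open my $fh, '<', $ARGV[0] or die "can't open $ARGV[0]: $!\n";`.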

    Update

    Zaxo's index method is a far more memory-efficient and almost certainly faster way to do this (for an exact match).

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: search for a sequence of chars in a string
by blokhead (Monsignor) on Sep 21, 2003 at 17:39 UTC
    The left hand side of tr/// doesn't interpolate, and it also doesn't do what you want with multi-character strings. It's OK (and fast) for counting the number of a single type of character in a string, but not for a multi-character substring. You want:
        ## not this
        ## $n = ($string =~ tr/$search//);
        $n = () = $string =~ m/$search/g;
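To make the difference concrete, here is a small self-contained comparison. The sample sequence is made up for illustration; the variable names follow the question.

```perl
use strict;
use warnings;

my $string = 'TTGACAGGTTGACA';
my $search = 'TTGACA';

# tr/// takes its left side as a literal character list, here the
# characters '$', 's', 'e', 'a', 'r', 'c', 'h'. None of them occur in $string.
my $tr_literal = ($string =~ tr/$search//);

# Given a real character list, tr counts single characters, not substrings.
my $tr_chars = ($string =~ tr/ACGT//);

# A global match in list context, assigned through an empty list,
# yields the number of non-overlapping matches of the whole substring.
my $matches = () = $string =~ /\Q$search\E/g;

print "tr/\$search// counted $tr_literal\n";
print "tr/ACGT// counted $tr_chars\n";
print "m//g counted $matches\n";
```

The \Q...\E around $search is an addition here; it is not needed for TTGACA but keeps the match literal if the search string ever contains regex metacharacters.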
    Update: But if your DNA sequence is really huge, loading it all into memory is probably very inefficient. You could use something like tachyon's approach given above, although it will probably miss matches that straddle a \n.
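One way to read the file line by line without missing matches that straddle a newline is to carry the last length($search) - 1 characters of each buffer over to the next line. This is a sketch under assumptions, not something posted in the thread: the sub name `count_streaming` and the sample data are invented, and, like the code in the question, it treats everything after header lines as one continuous sequence.

```perl
use strict;
use warnings;

# Count occurrences of $search in a FASTA-style stream while holding only
# one line plus a short carried-over tail in memory.
sub count_streaming {
    my ($fh, $search) = @_;
    return 0 if $search eq '';           # an empty search string would loop forever
    my $keep  = length($search) - 1;     # longest tail that can begin a straddling match
    my $carry = '';
    my $count = 0;
    while (my $line = <$fh>) {
        next if $line =~ /^>/;           # skip header lines
        $line =~ s/\s+//g;               # drop the newline and any other whitespace
        my $buf = $carry . $line;
        my $pos = 0;
        while (($pos = index($buf, $search, $pos)) != -1) {
            $count++;
            $pos++;
        }
        # The carried tail is shorter than $search, so it can never hold a
        # complete match and nothing is counted twice.
        if    ($keep == 0)           { $carry = '' }
        elsif (length($buf) > $keep) { $carry = substr($buf, -$keep) }
        else                         { $carry = $buf }
    }
    return $count;
}

# Hypothetical sample: both occurrences of TTGACA are split across lines.
my $sample = ">header\nGGTTG\nACATT\nGACA\n";
open my $fh, '<', \$sample or die "can't open sample data: $!\n";
print count_streaming($fh, 'TTGACA'), "\n";
close $fh;
```

Memory use is bounded by the longest line plus the carried tail, at the cost of a little extra copying per line.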

    blokhead