PD has asked for the wisdom of the Perl Monks concerning the following question:

Dear all,

My quest is simple, but I am getting into dark and deep waters.

I have 2 files. Both consist of gene sequence lists. One contains whole sequences (a sequence can have 16,000 characters (bases)). The second file contains a list of sequences each with 25 bases (characters). I am trying to find matches of the little sequences inside any part of the big sequences. The first file has about 600 lines. The second has 6680 lines. Each line in both files corresponds to a sequence.

Here is the part of my code that is choking:

while(<F1>) { # this is the "Biiiiigggg Sequence"
    $targetseq = $_;
    print "$targetseq\n";
    chomp($targetseq);
    open(F2,"<$file2") or die "Error opening $file2: $!";
    while(<F2>) { # probe sequences (25 base)
        substr($_,0, 25);
        $probe = $_;
        chomp($probe);
        if ($targetseq=~ /.*$probe.*/) { # *this is Line 29
            $start = index($targetseq, $probe);
            push(@matchregion,$start);
        }
    }
    $indexes = @matchregion;

I get the following error:

Unmatched ( in regex; marked by <-- HERE in /.*AGCTCAAAACTCTCAAAGAGGAGG MMUS00S00000022 AJ242777 1427689_a_at "Mus musculus mRNA for ABINs, ( <-- HERE A20-binding inhibitor of NF-kappa B activation (small).".*/ at sequence_coverage.pl line 29, <F2> line 1073.

I am trying to match the second file string to any part of the first file string. I know I am doing something wrong. Any help will be humbly accepted. Thanks!

Replies are listed 'Best First'.
Re: Opening files, comparing strings. Should be simple!?
by ikegami (Patriarch) on Jan 04, 2005 at 17:39 UTC

    $targetseq =~ /.*$probe.*/
    simplifies to
    $targetseq =~ /$probe/
    but you probably want to escape the stuff in $probe (which should fix the problem), leaving you with
    $targetseq =~ /\Q$probe\E/
    which is the same thing as
    index($targetseq, $probe) != -1

    So

    if ($targetseq=~ /.*$probe.*/) {
        $start = index($targetseq, $probe);
        push(@matchregion,$start);
    }

    becomes

    $start = index($targetseq, $probe);
    push(@matchregion, $start) unless $start == -1;

    I didn't look at improving your algorithm, just fixing the problem you mentioned (the contents of $probe being treated as a regexp).
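The difference between the unescaped match, the \Q...\E match, and index() can be seen in a small sketch. The data here is hypothetical, chosen only so that the probe contains an unbalanced "(" like the annotated line in the error message; it is not taken from the original files.

```perl
use strict;
use warnings;

# Hypothetical data: the probe text contains an unbalanced "(".
my $targetseq = 'xxAGCT (small) yy';
my $probe     = 'AGCT (small';

# Interpolated unescaped, "(" is a regex metacharacter, so the pattern
# fails to compile at run time; eval traps the fatal error.
my $compiled = eval { my $m = $targetseq =~ /$probe/; 1 } ? 1 : 0;
print "unescaped pattern: ", ($compiled ? "compiled" : "failed to compile"), "\n";

# \Q...\E escapes the metacharacters, so the probe is matched literally.
my $regex_hit = $targetseq =~ /\Q$probe\E/ ? 1 : 0;

# index() performs the same literal search and also returns the offset
# (or -1 when the probe is absent).
my $start = index($targetseq, $probe);

print "regex_hit=$regex_hit start=$start\n";
```

Since the offset is needed anyway, calling index() once avoids searching the target twice.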

      Thanks for your help. It works.
Re: Opening files, comparing strings. Should be simple!?
by johnnywang (Priest) on Jan 04, 2005 at 18:28 UTC
    Besides what has been pointed out, the following:
    substr($_,0,25);
    doesn't do anything unless you assign its result to something. I assume you're trying to extract the first 25 characters from each line, in which case you want:
    $probe = substr($_,0,25);
    Can't tell whether you're using "strict" from your post (the variable $probe indicates not).
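A minimal sketch of the point about substr, written under strict and warnings. The sample line is hypothetical (25 bases followed by annotation text, loosely modelled on the line in the error message). With warnings enabled, Perl will typically also flag a bare substr(...) statement as a useless use in void context, which is one way to catch this kind of slip.

```perl
use strict;
use warnings;

# Hypothetical probe line: 25 bases followed by annotation text.
my $line = "AGCTCAAAACTCTCAAAGAGGAGGA MMUS00S00000022 AJ242777\n";

# substr() returns the extracted piece and leaves $line unchanged,
# so the return value has to be captured. Under strict, the variable
# must be declared with "my".
my $probe = substr($line, 0, 25);

print "probe:  $probe\n";
print "length: ", length($probe), "\n";
```

Because only the first 25 characters are kept, the trailing annotation and the newline never reach $probe, so no chomp is needed for it.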
      I had fixed this but forgot to correct it in my question. Thank you ... anyway.
Re: Opening files, comparing strings. Should be simple!?
by TedPride (Priest) on Jan 04, 2005 at 19:35 UTC
    I would also like to note that with files that small, it should be possible to load the entire contents of both into memory. Processing them as arrays, instead of reopening and rereading the second file for every line of the first, should significantly speed up your search.
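One way the in-memory approach might look is sketched below. The helper name slurp_lines, the @matches layout, and the sample data are all assumptions made for illustration; in-memory file handles stand in for the two real files so the sketch runs on its own, and the probes are shorter than 25 bases only for readability.

```perl
use strict;
use warnings;

# Read every line from a file handle into a list, newlines removed.
# (slurp_lines is a hypothetical helper name.)
sub slurp_lines {
    my ($fh) = @_;
    chomp(my @lines = <$fh>);
    return @lines;
}

# In-memory stand-ins for the two files (hypothetical data). With real
# files, open them by name: open(my $fh, '<', $file) or die "...: $!";
my $targets_text = "TTTTACGTACGTTTTT\nGGGGGGGGGGGG\n";
my $probes_text  = "ACGTACGT\nCCCC\nGGGG\n";

open(my $f1, '<', \$targets_text) or die "Error opening targets: $!";
open(my $f2, '<', \$probes_text)  or die "Error opening probes: $!";
my @targets = slurp_lines($f1);
my @probes  = slurp_lines($f2);
close($f1);
close($f2);

# Each file is read exactly once; the nested loop runs over arrays.
my @matches;   # each entry: [target number, probe number, start offset]
for my $t (0 .. $#targets) {
    for my $p (0 .. $#probes) {
        my $start = index($targets[$t], $probes[$p]);
        push @matches, [$t, $p, $start] unless $start == -1;
    }
}

print join(' ', @$_), "\n" for @matches;
```

Recording the target and probe numbers alongside the offset keeps it clear which pair each match belongs to, which a single flat list of offsets would not.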