mind_frame has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I am a beginner. I have two files: the first (file1) contains several regexes, while the other (file2) contains FASTA sequences. I want to check whether the regexes in file1 match any FASTA sequences in file2, and print every regex that matches at least one sequence, along with the number of sequences it matches.

file1 is structured so that each line has an ID, followed by '>>', then the regex, e.g.
FGER_HWW_PRT >> ..DW[ALK]..[^P]..[VI]{2,4}
TKAR_GLW_NQW >> [^VKR]{0,2}..FP[D].T.N.Q.
etc...
file2 has an identifier on one line and the sequence on the next, e.g.
>lac9_B: details details
GFVTSDRWPALKMSRWSLEMVWASRGYPLVNDRMWSWSDDDP
>serP_A: otherdetails details2
GFVLSDPPPPALKMSRWSLEMVWASRGYPLVNDPWQRTKRKRKDRTCWASNYIHDRP
Many thanks.

Replies are listed 'Best First'.
Re: How do i use regexes in one file to match FASTA sequences in another file
by lune (Pilgrim) on Nov 22, 2013 at 14:46 UTC
    Basically there is nothing special in reading regexes from a file in contrast to using predefined ones.
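
    To illustrate that point, here is a minimal sketch. The pattern strings and the test text are made up for the example; they stand in for lines that have already been read from a file and chomped. Compiling each string with qr// is one common way to do this, not the only one.

```perl
use strict;
use warnings;

# Hypothetical pattern strings, as they would look after reading
# and chomping the lines of a file.
my @from_file = ('^a', 'h$', '[a-z]{9,10}');

# qr// compiles each string once; the result is used like any other regex.
my @compiled = map { qr/$_/ } @from_file;

my $text = 'abcdefg';

# Collect the indexes of the patterns that match the text.
my @hits = grep { $text =~ $compiled[$_] } 0 .. $#compiled;

print "pattern '$from_file[$_]' matches $text\n" for @hits;
```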

    It boils down to the question of how to represent the matches and the number of matches efficiently.

    I created some files with simplified test input to concentrate on the problem.

    regexes.txt:

    ID1>>^a
    ID2>>h$
    ID3>>b
    ID4>>[a-z]{9,10}
    ID5>>[ah]

    lines.txt:

    id_A: abcdefg
    id_B: bcdefgh
    id_C: cdefghijk

    You will probably have to change the split statements to match the format of your input.

    I am storing the matches in a hash that uses the regexes as keys and array references of the matched lines as values.

    #!/usr/bin/perl -w
    use strict;
    use autodie;

    # Read "ID>>regex" lines into a hash: ID => regex.
    open(my $regexefile, "<", "regexes.txt");
    my @regexes = <$regexefile>;
    chomp @regexes;
    my %regexes = map { split(/>>/, $_, 2) } @regexes;

    my %matches;    # regex => array reference of the lines it matched

    open(my $inputfile, "<", "lines.txt");
    while (<$inputfile>) {
        chomp;
        # Drop the leading "id_X:" field and keep the text after it.
        my (undef, $line) = split(/ /, $_);
        while (my ($id, $regex) = each(%regexes)) {
            if ($line =~ /$regex/) {
                # push on a bare reference fails on Perl 5.24 and later,
                # so dereference; autovivification creates the array.
                push(@{ $matches{$regex} }, $line);
            }
        }
    }

    # Only regexes with at least one match have an entry in %matches.
    while (my ($regex, $matches) = each(%matches)) {
        print "$regex: No of matches " . scalar @$matches . "\n";
        foreach my $match (@$matches) {
            print "matched $match\n";
        }
    }

    Update: added autodie; warnings are already active because of -w.
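
    For the format described in the question, the same idea could be adapted roughly as follows. This is only a sketch under stated assumptions: each FASTA record is one ">" header line followed by exactly one sequence line, the ID and regex are separated by ">>" with optional spaces, and the two files are replaced by in-memory strings so the example runs on its own. The sub names and the HYPOTHETICAL_ID pattern are invented for illustration; the other two patterns and both sequences are the samples from the question.

```perl
use strict;
use warnings;

# Parse "ID >> regex" lines into a list of [id, compiled regex] pairs.
sub read_patterns {
    my ($fh) = @_;
    my @patterns;
    while (my $line = <$fh>) {
        chomp $line;
        next unless $line =~ /\S/;                     # skip blank lines
        my ($id, $re) = split /\s*>>\s*/, $line, 2;    # tolerate spaces around >>
        next unless defined $re && length $re;
        push @patterns, [ $id, qr/$re/ ];
    }
    return @patterns;
}

# Count, for each pattern ID, how many sequences it matches.
# Assumption: every sequence sits on a single line after its ">" header.
sub count_matches {
    my ($fh, @patterns) = @_;
    my %count;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^>/ || $line !~ /\S/;        # skip headers and blanks
        for my $p (@patterns) {
            my ($id, $re) = @$p;
            $count{$id}++ if $line =~ $re;
        }
    }
    return %count;
}

# In-memory stand-ins for file1 and file2.
my $file1 = <<'END';
FGER_HWW_PRT >> ..DW[ALK]..[^P]..[VI]{2,4}
TKAR_GLW_NQW >> [^VKR]{0,2}..FP[D].T.N.Q.
HYPOTHETICAL_ID >> ALKMSRW
END

my $file2 = <<'END';
>lac9_B: details details
GFVTSDRWPALKMSRWSLEMVWASRGYPLVNDRMWSWSDDDP
>serP_A: otherdetails details2
GFVLSDPPPPALKMSRWSLEMVWASRGYPLVNDPWQRTKRKRKDRTCWASNYIHDRP
END

open my $pat_fh, '<', \$file1 or die "cannot open patterns: $!";
open my $seq_fh, '<', \$file2 or die "cannot open sequences: $!";

my @patterns = read_patterns($pat_fh);
my %count    = count_matches($seq_fh, @patterns);

# Only patterns that matched at least one sequence have an entry.
print "$_: $count{$_}\n" for sort keys %count;
```

    To run it against real files, open file1 and file2 by name instead of the two string references. Keying the counts by ID rather than by the regex text keeps the report short; if two IDs share the same regex they are still counted separately.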

      Nice logic to share with OP. Consider, however, adding use warnings; use autodie;, the latter to handle open errors.

Re: How do i use regexes in one file to match FASTA sequences in another file
by Random_Walk (Prior) on Nov 22, 2013 at 13:36 UTC

    Please add <code> tags around your code, and have a quick read of Writeup Formatting Tips

    Cheers,
    R.

    Pereant, qui ante nos nostra dixerunt!
Re: How do i use regexes in one file to match FASTA sequences in another file
by choroba (Cardinal) on Nov 22, 2013 at 13:46 UTC
    Crossposted at StackOverflow. It is considered polite to inform about crossposting, so the hackers not attending both sites do not waste their time solving an issue already closed at the other end of the internets.
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ