How do i use regexes in one file to match FASTA sequences in another file

mind_frame has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I am a beginner. I have two files: the first (file1) contains several regexes, while the other (file2) contains FASTA sequences. My intention is to use the regexes in file1 to check whether they match any FASTA sequences in file2, and to print every regex that matches at least one sequence, along with the number of sequences it matches.

file1 is structured so that each line has an ID, followed by '>>', then the regex:

e.g.

    FGER_HWW_PRT >> ..DW[ALK]..[^P]..[VI]{2,4}
    TKAR_GLW_NQW >> [^VKR]{0,2}..FP[D].T.N.Q.

    etc...

file2 has an identifier on one line and the sequence on the next:

e.g.

    >lac9_B: details details
    GFVTSDRWPALKMSRWSLEMVWASRGYPLVNDRMWSWSDDDP
    >serP_A: otherdetails details2
    GFVLSDPPPPALKMSRWSLEMVWASRGYPLVNDPWQRTKRKRKDRTCWASNYIHDRP

Many thanks.


Replies are listed 'Best First'.
Re: How do i use regexes in one file to match FASTA sequences in another file by lune (Pilgrim) on Nov 22, 2013 at 14:46 UTC
Basically, reading regexes from a file is no different from using predefined ones. It boils down to the question of how to represent the matches and the number of matches efficiently. I created some files with simplified test input to concentrate on the problem:

regexes.txt

    ID1>>^a
    ID2>>h$
    ID3>>b
    ID4>>[a-z]{9,10}
    ID5>>[ah]

lines.txt

    id_A: abcdefg
    id_B: bcdefgh
    id_C: cdefghijk

You will probably have to change the "split" statements to match the format of your input. I store the matches in a hash that uses the regex expressions as keys and array references of matching lines as values.

    #!/usr/bin/perl -w
    use strict;
    use autodie;

    # Read "ID>>regex" lines into a hash: ID => regex.
    open(my $regexefile, "<", "regexes.txt");
    my @regexes = <$regexefile>;
    chomp @regexes;
    my %regexes = map { split(/>>/, $_) } @regexes;

    my %matches;    # regex => array ref of matching lines
    open(my $inputfile, "<", "lines.txt");
    while (<$inputfile>) {
        chomp;
        my (undef, $line) = split(/ /, $_);    # drop the leading "id_X:" field
        while (my ($id, $regex) = each(%regexes)) {
            if ($line =~ /$regex/) {
                # Dereference explicitly: push on a bare reference
                # was removed in Perl 5.24.
                push(@{ $matches{$regex} }, $line);
            }
        }
    }

    while (my ($regex, $matches) = each(%matches)) {
        print "$regex: No of matches " . scalar @$matches . "\n";
        foreach my $match (@$matches) {
            print "matched $match\n";
        }
    }

Update: added autodie; warnings are already active because of -w.
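As a minimal sketch, the same idea could be adapted to the formats in the question (`ID >> regex` lines and FASTA records) roughly as follows. The sub names `read_patterns`, `read_fasta` and `count_matches` are made up for illustration, the two `DEMO_` patterns are hypothetical additions so that something matches the sample sequences, and the in-memory filehandles stand in for file1 and file2. It also assumes a sequence may wrap over several lines, which the sample data does not show.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse "ID >> regex" lines from a filehandle into [id, compiled regex]
# pairs, preserving the order in the file.
sub read_patterns {
    my ($fh) = @_;
    my @patterns;
    while (my $line = <$fh>) {
        $line =~ s/^\s+|\s+$//g;                 # trim; also removes the newline
        my ($id, $regex) = split /\s*>>\s*/, $line, 2;
        next unless defined $regex && length $regex;
        push @patterns, [ $id, qr/$regex/ ];     # compile each regex once
    }
    return @patterns;
}

# Read FASTA records: a header line starting with '>' followed by one or
# more sequence lines. Returns a list of [header, sequence] pairs.
sub read_fasta {
    my ($fh) = @_;
    my @records;
    while (my $line = <$fh>) {
        $line =~ s/^\s+|\s+$//g;
        next unless length $line;
        if ($line =~ /^>(.*)/) {
            push @records, [ $1, '' ];           # start a new record
        }
        elsif (@records) {
            $records[-1][1] .= $line;            # append (possibly wrapped) sequence
        }
    }
    return @records;
}

# For each pattern, count the sequences it matches; keep only patterns
# that match at least one sequence.
sub count_matches {
    my ($patterns, $records) = @_;
    my @report;
    for my $p (@$patterns) {
        my ($id, $re) = @$p;
        my $n = grep { $_->[1] =~ $re } @$records;   # grep in scalar context counts
        push @report, [ $id, $n ] if $n > 0;
    }
    return @report;
}

# Sample data from the question; the DEMO_ lines are hypothetical.
my $pattern_text = <<'END';
FGER_HWW_PRT >> ..DW[ALK]..[^P]..[VI]{2,4}
TKAR_GLW_NQW >> [^VKR]{0,2}..FP[D].T.N.Q.
DEMO_ALK >> [PW]ALKMSR
DEMO_TAIL >> W[AS]SNY
END

my $fasta_text = <<'END';
>lac9_B: details details
GFVTSDRWPALKMSRWSLEMVWASRGYPLVNDRMWSWSDDDP
>serP_A: otherdetails details2
GFVLSDPPPPALKMSRWSLEMVWASRGYPLVNDPWQRTKRKRKDRTCWASNYIHDRP
END

# In-memory filehandles stand in for file1 and file2.
open my $pattern_fh, '<', \$pattern_text or die "Cannot open patterns: $!";
open my $fasta_fh,   '<', \$fasta_text   or die "Cannot open FASTA: $!";

my @patterns = read_patterns($pattern_fh);
my @records  = read_fasta($fasta_fh);
my @report   = count_matches(\@patterns, \@records);

printf "%s: matches %d sequence(s)\n", @$_ for @report;
```

With this sample data, neither of the two regexes from the question matches either sequence, so only the `DEMO_` patterns are reported (`DEMO_ALK` for two sequences, `DEMO_TAIL` for one). Matching is case-sensitive here; for real files, replace the in-memory handles with `open my $fh, '<', $filename`.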
Re^2: How do i use regexes in one file to match FASTA sequences in another file by Kenosis (Priest) on Nov 22, 2013 at 19:30 UTC
Nice logic to share with OP. Consider, however, adding `use warnings; use autodie;`, the latter to handle `open` errors.
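To illustrate the suggestion, here is a small sketch of what `use autodie;` changes: a failed `open` throws an exception instead of returning false silently. The path `no_such_dir/regexes.txt` is a hypothetical name assumed not to exist, and the `eval` is only there so the example can report the error instead of terminating.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use autodie;    # open, close, etc. now die on failure

# Hypothetical path, assumed not to exist.
my $missing = 'no_such_dir/regexes.txt';

# Without autodie this open would just return false; with autodie it
# throws, so eval returns undef and the exception lands in $@.
my $ok  = eval { open(my $fh, '<', $missing); 1 };
my $err = $@;

if ($ok) {
    print "opened $missing\n";
}
else {
    print "open failed: $err\n";
}
```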
Re: How do i use regexes in one file to match FASTA sequences in another file by Random_Walk (Prior) on Nov 22, 2013 at 13:36 UTC
Please add `<code>` tags around your code, and have a quick read of Writeup Formatting Tips.

Cheers,
R.

Pereant, qui ante nos nostra dixerunt!
Re: How do i use regexes in one file to match FASTA sequences in another file by choroba (Cardinal) on Nov 22, 2013 at 13:46 UTC
Crossposted at StackOverflow. It is considered polite to inform about crossposting, so the hackers not attending both sites do not waste their time solving an issue already closed at the other end of the internets.

لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ