Help with regex

rocketperl has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monkers, I am trying to extract certain IDs from a file. The file looks like this:


Query= sp|P30443|1A01_HUMAN HLA class I histocompatibility antigen,
A-1 alpha chain OS=Homo sapiens GN=HLA-A PE=1 SV=1
         (365 letters)



                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value

tr|G1KTN1|G1KTN1_ANOCA Uncharacterized protein OS=Anolis carolin...   242   1e-77
tr|L7MZX2|L7MZX2_ANOCA Uncharacterized protein OS=Anolis carolin...   239   2e-76
tr|H9GR57|H9GR57_ANOCA Uncharacterized protein (Fragment) OS=Ano...   236   4e-75
tr|L7MZP5|L7MZP5_ANOCA Uncharacterized protein OS=Anolis carolin...   233   3e-74
tr|H9G3Y5|H9G3Y5_ANOCA Uncharacterized protein OS=Anolis carolin...   231   1e-73
tr|H9GBT0|H9GBT0_ANOCA Uncharacterized protein (Fragment) OS=Ano...   232   2e-73
tr|H9GTB3|H9GTB3_ANOCA Uncharacterized protein (Fragment) OS=Ano...   220   3e-69
tr|H9GSQ9|H9GSQ9_ANOCA Uncharacterized protein OS=Anolis carolin...   218   2e-68
tr|L7MZR7|L7MZR7_ANOCA Uncharacterized protein (Fragment) OS=Ano...   213   4e-66
tr|H9GRY4|H9GRY4_ANOCA Uncharacterized protein (Fragment) OS=Ano...   209   2e-65
tr|H9GBL3|H9GBL3_ANOCA Uncharacterized protein OS=Anolis carolin...   206   5e-64
>tr|G1KTN1|G1KTN1_ANOCA Uncharacterized protein OS=Anolis
           carolinensis PE=3 SV=2
          Length = 358

 Score =  242 bits (618), Expect = 1e-77,   Method: Composition-based stats.
 Identities = 131/280 (46%), Positives = 175/280 (62%), Gaps = 8/280 (2%)

Query: 24  AGSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQKMEPRAPWI---EQEG 80
           + SHSMRYF TSVS PG+  P+F  VGYVDD +FV ++  A++++  P+ PWI   E+
Sbjct: 25  SSSHSMRYFVTSVSEPGQQVPQFSYVGYVDDQEFVSYN--ASTRRYLPKVPWISKVEKND 82

Query: 81  PEYWDQETRNMKAHSQTDRANLGTLRGYYNQSEDGSHTIQIMYGCDVGPDGRFLRGYRQD 140
           P+YW++ T   + H ++ R +L TL  YYNQS  G HT Q MYGC++  D     GY Q
Sbjct: 83  PDYWERNTLYAQGHERSFRDHLATLAEYYNQS-GGLHTFQWMYGCELRNDWS-KGGYYQY 14


>tr|L7MZX2|L7MZX2_ANOCA Uncharacterized protein OS=Anolis
           carolinensis GN=LOC100559978 PE=3 SV=1
          Length = 364

 Score =  239 bits (611), Expect = 2e-76,   Method: Composition-based stats.
 Identities = 130/274 (47%), Positives = 176/274 (64%), Gaps = 8/274 (2%)

Query: 30  RYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQKMEPRAPWI---EQEGPEYWDQ 86
           RY +TSVS PG+ EP+F +VGYVD+ +FV +DS A  ++  P  PWI   E+E PEYW+Q
Sbjct: 32  RYVYTSVSEPGQQEPQFFSVGYVDEQEFVSYDSKA--KRRFPAVPWIRKVEEEDPEYWEQ 89

I would like to extract all the alpha numeric characters after "Query=" and the first ID that comes after the ">". Each regex works fine on its own, extracting the ID after "Query=" or the ID after ">". But I want to print the "Query=" ID and then print only the first ID that comes after ">". Then the program should find the next "Query=" and so on. My code works fine for the first regex, but nothing is printed for the second regex. This question may be silly, but I'm relatively new to Perl. My code is below:

#!/usr/bin/perl
use warnings;
use strict;
use diagnostics;
my $file=$ARGV[0];
open (FILE,$file);
while(<FILE>)
{
my @query=$_;
foreach my $a (@query)
{
    next until $a=~/^Query=.*$/;
    if($a=~/^Query=\s([^\s]+)\s.*$/)
    {
    print "$1\t";
    next until $a=~/^>.+$/;
    if($a=~/^>([^\s]+)\s.*$/)
    {
    print "$1\n";
    }
    }    
}
}


Replies are listed 'Best First'.
Re: Help with regex by kcott (Archbishop) on May 07, 2014 at 15:39 UTC
G'day rocketperl,

You need to be more precise about what you want: your text says 'extract all the alpha numeric characters after "Query="'; your code says 'extract all non-whitespace characters after "Query= "'. In the code below, I've guessed the code is correct.

Here's the basic code you need to achieve what you want:

{
    local $/ = 'Query= ';
    while (<$filehandle>) {
        print "$1\t$2\n" if / \A (\S+) .*? ^ > (\S+) /msx;
    }
}

I suggest you read the open documentation for a better way to open your files (i.e. 3-argument form with a lexical filehandle). You should also check that your I/O has worked: either hand-craft messages as shown in the open documentation or, and this is my general preference, let Perl do this for you by using the autodie pragma.

My test code, test data and output are in the spoiler.

-- Ken
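For anyone who wants to try the record-separator idea without a file on disk, here is a minimal sketch that reads from an in-memory filehandle. The sub name `first_hits` and the trimmed sample text are illustrative assumptions and not part of the reply above; the regex is the one quoted in the reply, and `autodie` stands in for hand-written error checks.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use autodie;    # open() and close() die with a useful message on failure

# Hypothetical helper: returns [query_id, first_hit_id] pairs read from $fh.
sub first_hits {
    my ($fh) = @_;
    my @pairs;

    # Each read now ends at the next 'Query= ', so one read is one record.
    local $/ = 'Query= ';
    while (my $record = <$fh>) {
        # $1: first token of the record; $2: token after the first '>' at line start
        push @pairs, [$1, $2]
            if $record =~ / \A (\S+) .*? ^ > (\S+) /msx;
    }
    return @pairs;
}

# Trimmed sample in the shape of the BLAST report (assumed data).
my $sample = <<'END';
Query= sp|P30443|1A01_HUMAN HLA class I histocompatibility antigen,
         (365 letters)

tr|G1KTN1|G1KTN1_ANOCA Uncharacterized protein   242   1e-77
>tr|G1KTN1|G1KTN1_ANOCA Uncharacterized protein OS=Anolis
>tr|L7MZX2|L7MZX2_ANOCA Uncharacterized protein OS=Anolis
END

open my $fh, '<', \$sample;    # 3-argument open on an in-memory string
print join("\t", @$_), "\n" for first_hits($fh);
close $fh;
```

The first read returns only the leading 'Query= ' separator, which has no line starting with '>', so it is skipped; every later record begins with the query ID. The lazy `.*?` is what limits the match to the first '>' line in each record.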
Re: Help with regex by InfiniteSilence (Curate) on May 07, 2014 at 15:06 UTC
A couple of previous responses seemed to interpret your requirements as looking for anything that is not white space that appears after your 'Query=' but that is not how I read it. An alphanumeric character means that you only want alphabetical and/or numeric characters. A pipe '|' doesn't fit that rule.

perl -ne 'if(m/Query\=\s+(\S+)/){print $1};' monks1085293.data
sp|P30443|1A01_HUMAN

Pipes...

perl -ne 'if(m/Query\=\s+([^\s]+)/){print $1};' monks1085293.data
sp|P30443|1A01_HUMAN

Pipes...

perl -ne 'if(m/(?:Query\=|\>)\s?([A-Za-z0-9]+)/){print qq|$1\t|};' monks1085293.data
sp tr tr

Now this selects exactly what you are looking for...alphanumeric characters that follow either 'Query=' or '>'.

Celebrate Intellectual Diversity
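The difference between the two readings can be seen side by side in a short sketch. The header line is copied from the question's data; the variable names are only illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample header line taken from the question's data.
my $line = 'Query= sp|P30443|1A01_HUMAN HLA class I histocompatibility antigen,';

# \S+ takes everything up to the next whitespace, pipes included.
my ($non_space) = $line =~ /^Query=\s+(\S+)/;

# [A-Za-z0-9]+ stops at the first character that is not a letter or digit.
my ($alnum) = $line =~ /^Query=\s+([A-Za-z0-9]+)/;

print "non-whitespace: $non_space\n";    # sp|P30443|1A01_HUMAN
print "alphanumeric:   $alnum\n";        # sp
```

Which one is wanted depends on whether the whole `sp|P30443|1A01_HUMAN` token or only the leading `sp` counts as the ID.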
Re: Help with regex by AppleFritter (Vicar) on May 07, 2014 at 10:20 UTC
Here's how I'd rewrite your program:

#!/usr/bin/perl
use warnings;
use strict;
use diagnostics;

my $found_query = 0;
while(<>) {
    if($found_query) {
        m/^>([^\s]+)/ and do {
            print "$1\n";
            $found_query = 0;
        }
    } else {
        m/^Query=\s([^\s]+)/ and do {
            print "$1\t";
            $found_query = 1;
        }
    }
}

Does this do what you want?
Re: Help with regex by trizen (Hermit) on May 07, 2014 at 10:18 UTC
Try this:

#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;

my $file = $ARGV[0];
open(my $fh, '<', $file) or die "$file: $!";

my $query;
while (<$fh>) {
    if (/^Query=\s(\S+)/) {
        $query = $1;
    }
    if (defined($query) && /^>(\S+)/) {
        print "$query\t$1\n";
        undef $query;
    }
}
Re^2: Help with regex by rocketperl (Sexton) on May 08, 2014 at 13:42 UTC
Thanks a lot! Worked like a charm!