Hi Monkers, I am trying to extract certain IDs from a file. The file looks like this:


Query= sp|P30443|1A01_HUMAN HLA class I histocompatibility antigen,
A-1 alpha chain OS=Homo sapiens GN=HLA-A PE=1 SV=1
         (365 letters)



                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value

tr|G1KTN1|G1KTN1_ANOCA Uncharacterized protein OS=Anolis carolin...   242   1e-77
tr|L7MZX2|L7MZX2_ANOCA Uncharacterized protein OS=Anolis carolin...   239   2e-76
tr|H9GR57|H9GR57_ANOCA Uncharacterized protein (Fragment) OS=Ano...   236   4e-75
tr|L7MZP5|L7MZP5_ANOCA Uncharacterized protein OS=Anolis carolin...   233   3e-74
tr|H9G3Y5|H9G3Y5_ANOCA Uncharacterized protein OS=Anolis carolin...   231   1e-73
tr|H9GBT0|H9GBT0_ANOCA Uncharacterized protein (Fragment) OS=Ano...   232   2e-73
tr|H9GTB3|H9GTB3_ANOCA Uncharacterized protein (Fragment) OS=Ano...   220   3e-69
tr|H9GSQ9|H9GSQ9_ANOCA Uncharacterized protein OS=Anolis carolin...   218   2e-68
tr|L7MZR7|L7MZR7_ANOCA Uncharacterized protein (Fragment) OS=Ano...   213   4e-66
tr|H9GRY4|H9GRY4_ANOCA Uncharacterized protein (Fragment) OS=Ano...   209   2e-65
tr|H9GBL3|H9GBL3_ANOCA Uncharacterized protein OS=Anolis carolin...   206   5e-64
>tr|G1KTN1|G1KTN1_ANOCA Uncharacterized protein OS=Anolis
           carolinensis PE=3 SV=2
          Length = 358

 Score =  242 bits (618), Expect = 1e-77,   Method: Composition-based stats.
 Identities = 131/280 (46%), Positives = 175/280 (62%), Gaps = 8/280 (2%)

Query: 24  AGSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQKMEPRAPWI---EQEG 80
           + SHSMRYF TSVS PG+  P+F  VGYVDD +FV ++  A++++  P+ PWI   E+
Sbjct: 25  SSSHSMRYFVTSVSEPGQQVPQFSYVGYVDDQEFVSYN--ASTRRYLPKVPWISKVEKND 82

Query: 81  PEYWDQETRNMKAHSQTDRANLGTLRGYYNQSEDGSHTIQIMYGCDVGPDGRFLRGYRQD 140
           P+YW++ T   + H ++ R +L TL  YYNQS  G HT Q MYGC++  D     GY Q
Sbjct: 83  PDYWERNTLYAQGHERSFRDHLATLAEYYNQS-GGLHTFQWMYGCELRNDWS-KGGYYQY 14


>tr|L7MZX2|L7MZX2_ANOCA Uncharacterized protein OS=Anolis
           carolinensis GN=LOC100559978 PE=3 SV=1
          Length = 364

 Score =  239 bits (611), Expect = 2e-76,   Method: Composition-based stats.
 Identities = 130/274 (47%), Positives = 176/274 (64%), Gaps = 8/274 (2%)

Query: 30  RYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQKMEPRAPWI---EQEGPEYWDQ 86
           RY +TSVS PG+ EP+F +VGYVD+ +FV +DS A  ++  P  PWI   E+E PEYW+Q
Sbjct: 32  RYVYTSVSEPGQQEPQFFSVGYVDEQEFVSYDSKA--KRRFPAVPWIRKVEEEDPEYWEQ 89

I would like to extract all the alphanumeric characters after "Query=" and the first ID that comes after a ">". Each regex works fine on its own, for extracting either the ID after "Query=" or the ID after ">". What I want, though, is to print the "Query=" ID followed by only the first ID that comes after a ">". Then the program should find the next "Query=" and so on. My code prints the match for the first regex, but nothing is printed for the second regex. This question may be silly, but I'm relatively new to Perl. My code is below:

#!/usr/bin/perl
use warnings;
use strict;
use diagnostics;
my $file=$ARGV[0];
open (FILE,$file);
while(<FILE>)
{
    my @query=$_;
    foreach my $a (@query)
    {
        next until $a=~/^Query=.*$/;
        if($a=~/^Query=\s([^\s]+)\s.*$/)
        {
            print "$1\t";
            next until $a=~/^>.+$/;
            if($a=~/^>([^\s]+)\s.*$/)
            {
                print "$1\n";
            }
        }
    }
}
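
A note on why the second regex never prints anything: each pass of the while loop holds exactly one line, so when the second `next until` runs, `$a` is still the "Query=" line. It cannot match `/^>.+$/`, and the `next` skips the rest of the loop body. One possible way around this is to remember the query ID across lines and print it when the first ">" line turns up. The following is only a sketch under that assumption, not a definitive fix; the helper name `first_hits` is made up for illustration, and the regexes are slightly loosened versions of the ones above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): reads a BLAST
# report from a filehandle and returns [query_id, first_hit_id] pairs.
sub first_hits {
    my ($fh) = @_;
    my @pairs;
    my $query;    # ID of the current "Query=" line; undef once its first hit is stored

    while (my $line = <$fh>) {
        if ($line =~ /^Query=\s*(\S+)/) {
            # A new query starts: remember its ID and look for a hit again.
            $query = $1;
        }
        elsif (defined $query && $line =~ /^>(\S+)/) {
            # First ">" line after the query: store the pair, then ignore
            # further ">" lines until the next "Query=" line.
            push @pairs, [ $query, $1 ];
            undef $query;
        }
    }
    return @pairs;
}

# Run only when a file name is given on the command line.
if (@ARGV) {
    my ($file) = @ARGV;
    open my $fh, '<', $file or die "Cannot open '$file': $!";
    print join("\t", @$_), "\n" for first_hits($fh);
    close $fh;
}
```

Run against the excerpt above, this should print a single line containing `sp|P30443|1A01_HUMAN` and `tr|G1KTN1|G1KTN1_ANOCA`, separated by a tab. A query with no ">" line before the next "Query=" produces no output in this sketch.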

In reply to Help with regex by rocketperl
