rocketperl has asked for the wisdom of the Perl Monks concerning the following question:
I would like to extract all the alphanumeric characters after "Query=" and the first ID that comes after a ">". Each regex works fine when I extract the "Query=" ID or the ">" ID on its own. But I want to print the "Query=" ID followed by only the first ID that comes after ">", and then have the program find the next "Query=" and so on. My code works for the first regex, but nothing is printed for the second. This question may be silly, but I'm relatively new to Perl. My sample input and code are below.

Query= sp|P30443|1A01_HUMAN HLA class I histocompatibility antigen, A-1 alpha chain OS=Homo sapiens GN=HLA-A PE=1 SV=1
(365 letters)

Score E
Sequences producing significant alignments: (bits) Value

tr|G1KTN1|G1KTN1_ANOCA Uncharacterized protein OS=Anolis carolin... 242 1e-77
tr|L7MZX2|L7MZX2_ANOCA Uncharacterized protein OS=Anolis carolin... 239 2e-76
tr|H9GR57|H9GR57_ANOCA Uncharacterized protein (Fragment) OS=Ano... 236 4e-75
tr|L7MZP5|L7MZP5_ANOCA Uncharacterized protein OS=Anolis carolin... 233 3e-74
tr|H9G3Y5|H9G3Y5_ANOCA Uncharacterized protein OS=Anolis carolin... 231 1e-73
tr|H9GBT0|H9GBT0_ANOCA Uncharacterized protein (Fragment) OS=Ano... 232 2e-73
tr|H9GTB3|H9GTB3_ANOCA Uncharacterized protein (Fragment) OS=Ano... 220 3e-69
tr|H9GSQ9|H9GSQ9_ANOCA Uncharacterized protein OS=Anolis carolin... 218 2e-68
tr|L7MZR7|L7MZR7_ANOCA Uncharacterized protein (Fragment) OS=Ano... 213 4e-66
tr|H9GRY4|H9GRY4_ANOCA Uncharacterized protein (Fragment) OS=Ano... 209 2e-65
tr|H9GBL3|H9GBL3_ANOCA Uncharacterized protein OS=Anolis carolin... 206 5e-64

>tr|G1KTN1|G1KTN1_ANOCA Uncharacterized protein OS=Anolis carolinensis PE=3 SV=2
Length = 358
Score = 242 bits (618), Expect = 1e-77, Method: Composition-based stats.
Identities = 131/280 (46%), Positives = 175/280 (62%), Gaps = 8/280 (2%)

Query: 24 AGSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQKMEPRAPWI---EQEG 80
+ SHSMRYF TSVS PG+ P+F VGYVDD +FV ++ A++++ P+ PWI E+
Sbjct: 25 SSSHSMRYFVTSVSEPGQQVPQFSYVGYVDDQEFVSYN--ASTRRYLPKVPWISKVEKND 82

Query: 81 PEYWDQETRNMKAHSQTDRANLGTLRGYYNQSEDGSHTIQIMYGCDVGPDGRFLRGYRQD 140
P+YW++ T + H ++ R +L TL YYNQS G HT Q MYGC++ D GY Q
Sbjct: 83 PDYWERNTLYAQGHERSFRDHLATLAEYYNQS-GGLHTFQWMYGCELRNDWS-KGGYYQY 14

>tr|L7MZX2|L7MZX2_ANOCA Uncharacterized protein OS=Anolis carolinensis GN=LOC100559978 PE=3 SV=1
Length = 364
Score = 239 bits (611), Expect = 2e-76, Method: Composition-based stats.
Identities = 130/274 (47%), Positives = 176/274 (64%), Gaps = 8/274 (2%)

Query: 30 RYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQKMEPRAPWI---EQEGPEYWDQ 86
RY +TSVS PG+ EP+F +VGYVD+ +FV +DS A ++ P PWI E+E PEYW+Q
Sbjct: 32 RYVYTSVSEPGQQEPQFFSVGYVDEQEFVSYDSKA--KRRFPAVPWIRKVEEEDPEYWEQ 89
#!/usr/bin/perl
use warnings;
use strict;
use diagnostics;

my $file=$ARGV[0];
open (FILE,$file);
while(<FILE>)
{
    my @query=$_;
    foreach my $a (@query)
    {
        next until $a=~/^Query=.*$/;
        if($a=~/^Query=\s([^\s]+)\s.*$/)
        {
            print "$1\t";
            next until $a=~/^>.+$/;
            if($a=~/^>([^\s]+)\s.*$/)
            {
                print "$1\n";
            }
        }
    }
}
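A likely reason the second regex never prints: inside the loop, $a holds exactly one input line, so once that line has matched /^Query=/ it cannot also match /^>/, and the second "next until" simply skips to the next line. One way around this is to remember the query ID across lines and pair it with the first ">" line that follows. The sketch below is only an illustration of that idea, not code from any of the replies; the helper name first_hits and the shortened in-memory sample are assumptions made for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): reads an open
# filehandle and returns a list of [query_id, first_hit_id] pairs.
sub first_hits {
    my ($fh) = @_;
    my @pairs;
    my $query;    # ID of the current "Query=" record still waiting for its first ">"

    while (my $line = <$fh>) {
        if ($line =~ /^Query=\s*(\S+)/) {
            # A new record starts: remember its ID until a ">" line shows up.
            $query = $1;
        }
        elsif (defined $query && $line =~ /^>\s*(\S+)/) {
            # First ">" after the current "Query=": store the pair, then
            # forget the query so later ">" lines in this record are ignored.
            push @pairs, [ $query, $1 ];
            undef $query;
        }
    }
    return @pairs;
}

# Shortened sample taken from the post, kept in memory so the sketch
# runs on its own without an input file.
my $sample = <<'END';
Query= sp|P30443|1A01_HUMAN HLA class I histocompatibility antigen
tr|G1KTN1|G1KTN1_ANOCA Uncharacterized protein OS=Anolis carolin... 242 1e-77
>tr|G1KTN1|G1KTN1_ANOCA Uncharacterized protein OS=Anolis carolinensis PE=3 SV=2
Length = 358
>tr|L7MZX2|L7MZX2_ANOCA Uncharacterized protein OS=Anolis carolinensis GN=LOC100559978 PE=3 SV=1
Length = 364
END

open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
printf "%s\t%s\n", @$_ for first_hits($fh);
close $fh;
```

For a real report, the in-memory handle could be swapped for something like open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!"; the rest stays the same. A query with no ">" line before the next "Query=" is silently dropped in this sketch, which may or may not be what is wanted.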
Replies are listed 'Best First'.

Re: Help with regex
by kcott (Archbishop) on May 07, 2014 at 15:39 UTC

Re: Help with regex
by InfiniteSilence (Curate) on May 07, 2014 at 15:06 UTC

Re: Help with regex
by AppleFritter (Vicar) on May 07, 2014 at 10:20 UTC

Re: Help with regex
by trizen (Hermit) on May 07, 2014 at 10:18 UTC
by rocketperl (Sexton) on May 08, 2014 at 13:42 UTC