rocketperl has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monkers, I am trying to extract certain IDs from a file. The file looks like this:
Query= sp|P30443|1A01_HUMAN HLA class I histocompatibility antigen, A-1 alpha chain OS=Homo sapiens GN=HLA-A PE=1 SV=1
         (365 letters)

                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value

tr|G1KTN1|G1KTN1_ANOCA Uncharacterized protein OS=Anolis carolin...   242   1e-77
tr|L7MZX2|L7MZX2_ANOCA Uncharacterized protein OS=Anolis carolin...   239   2e-76
tr|H9GR57|H9GR57_ANOCA Uncharacterized protein (Fragment) OS=Ano...   236   4e-75
tr|L7MZP5|L7MZP5_ANOCA Uncharacterized protein OS=Anolis carolin...   233   3e-74
tr|H9G3Y5|H9G3Y5_ANOCA Uncharacterized protein OS=Anolis carolin...   231   1e-73
tr|H9GBT0|H9GBT0_ANOCA Uncharacterized protein (Fragment) OS=Ano...   232   2e-73
tr|H9GTB3|H9GTB3_ANOCA Uncharacterized protein (Fragment) OS=Ano...   220   3e-69
tr|H9GSQ9|H9GSQ9_ANOCA Uncharacterized protein OS=Anolis carolin...   218   2e-68
tr|L7MZR7|L7MZR7_ANOCA Uncharacterized protein (Fragment) OS=Ano...   213   4e-66
tr|H9GRY4|H9GRY4_ANOCA Uncharacterized protein (Fragment) OS=Ano...   209   2e-65
tr|H9GBL3|H9GBL3_ANOCA Uncharacterized protein OS=Anolis carolin...   206   5e-64

>tr|G1KTN1|G1KTN1_ANOCA Uncharacterized protein OS=Anolis carolinensis PE=3 SV=2
Length = 358

Score = 242 bits (618), Expect = 1e-77, Method: Composition-based stats.
Identities = 131/280 (46%), Positives = 175/280 (62%), Gaps = 8/280 (2%)

Query: 24 AGSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQKMEPRAPWI---EQEG 80
          + SHSMRYF TSVS PG+  P+F  VGYVDD +FV ++  A++++  P+ PWI   E+
Sbjct: 25 SSSHSMRYFVTSVSEPGQQVPQFSYVGYVDDQEFVSYN--ASTRRYLPKVPWISKVEKND 82

Query: 81 PEYWDQETRNMKAHSQTDRANLGTLRGYYNQSEDGSHTIQIMYGCDVGPDGRFLRGYRQD 140
          P+YW++ T   + H ++ R +L TL  YYNQS  G HT Q MYGC++  D     GY Q
Sbjct: 83 PDYWERNTLYAQGHERSFRDHLATLAEYYNQS-GGLHTFQWMYGCELRNDWS-KGGYYQY 14

>tr|L7MZX2|L7MZX2_ANOCA Uncharacterized protein OS=Anolis carolinensis GN=LOC100559978 PE=3 SV=1
Length = 364

Score = 239 bits (611), Expect = 2e-76, Method: Composition-based stats.
Identities = 130/274 (47%), Positives = 176/274 (64%), Gaps = 8/274 (2%)

Query: 30 RYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQKMEPRAPWI---EQEGPEYWDQ 86
          RY +TSVS PG+ EP+F +VGYVD+ +FV +DS A  ++  P  PWI   E+E PEYW+Q
Sbjct: 32 RYVYTSVSEPGQQEPQFFSVGYVDEQEFVSYDSKA--KRRFPAVPWIRKVEEEDPEYWEQ 89

I would like to extract all the alphanumeric characters after "Query=" and the first ID that comes after the ">". The regexes work fine when I extract the ID after "Query=" and the ID after ">" individually. But I want to print the "Query=" ID followed by only the first ID that comes after ">"; then the program should find the next "Query=", and so on. My code works for the first regex, but nothing is printed for the second. This question may be silly, but I'm relatively new to Perl. My code is below:
#!/usr/bin/perl
use warnings;
use strict;
use diagnostics;

my $file=$ARGV[0];
open (FILE,$file);
while(<FILE>) {
    my @query=$_;
    foreach my $a (@query) {
        next until $a=~/^Query=.*$/;
        if($a=~/^Query=\s([^\s]+)\s.*$/) {
            print "$1\t";
            next until $a=~/^>.+$/;
            if($a=~/^>([^\s]+)\s.*$/) {
                print "$1\n";
            }
        }
    }
}

Replies are listed 'Best First'.
Re: Help with regex
by kcott (Archbishop) on May 07, 2014 at 15:39 UTC

    G'day rocketperl,

    You need to be more precise about what you want: your text says 'extract all the alpha numeric characters after "Query="'; your code says 'extract all non-whitespace characters after "Query= "'. In the code below, I've guessed the code is correct.

    Here's the basic code you need to achieve what you want:

    {
        local $/ = 'Query= ';
        while (<$filehandle>) {
            print "$1\t$2" if / \A (\S+) .*? ^ > (\S+) /msx;
        }
    }
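
A minimal, self-contained sketch of the record-separator approach above, for readers who want to try it without a data file. The helper name `first_hits` and the trimmed sample text are assumptions made for illustration; they are not part of Ken's reply. The sample is read through an in-memory filehandle.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns [query_id, first_hit_id] pairs.
sub first_hits {
    my ($fh) = @_;
    my @pairs;
    local $/ = 'Query= ';    # each read now ends at the next "Query= " marker
    while (my $record = <$fh>) {
        # /m lets ^ match at line starts, /s lets . cross newlines,
        # /x allows the spaces inside the pattern.
        push @pairs, [$1, $2]
            if $record =~ / \A (\S+) .*? ^ > (\S+) /msx;
    }
    return @pairs;
}

# Trimmed sample in the shape of the question's data (descriptions shortened).
my $sample = <<'END';
Query= sp|P30443|1A01_HUMAN HLA class I histocompatibility antigen
tr|G1KTN1|G1KTN1_ANOCA Uncharacterized protein   242   1e-77
>tr|G1KTN1|G1KTN1_ANOCA Uncharacterized protein
Length = 358
>tr|L7MZX2|L7MZX2_ANOCA Uncharacterized protein
Length = 364
END

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print join("\t", @$_), "\n" for first_hits($fh);
close $fh;
```

The very first record is just the leading "Query= " marker, which contains no ">" line and is therefore skipped. Each later record starts with the query ID, so `\A (\S+)` captures it and the lazy `.*?` stops at the first line beginning with ">".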

    I suggest you read the open documentation for a better way to open your files (i.e. 3-argument form with a lexical filehandle). You should also check that your I/O has worked: either hand-craft messages as shown in the open documentation or, and this is my general preference, let Perl do this for you by using the autodie pragma.
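
As a rough illustration of that advice, the sketch below uses the 3-argument form of `open` with a lexical filehandle and lets the `autodie` pragma raise the error. The helper name `count_queries` and the path `/no/such/file.blast` are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use autodie;    # open/close now throw an exception on failure

# Hypothetical helper (name is illustrative): count lines starting with "Query=".
sub count_queries {
    my ($path) = @_;
    open my $fh, '<', $path;    # 3-argument form, lexical filehandle
    my $count = 0;
    while (my $line = <$fh>) {
        $count++ if $line =~ /^Query=/;
    }
    close $fh;
    return $count;
}

# A failed open is raised as an exception instead of passing silently.
# The path below is an assumption chosen so that the open fails.
my $ok = eval { count_queries('/no/such/file.blast'); 1 };
print "open failed: $@" unless $ok;
```

With the original 2-argument `open (FILE,$file);` and no check, a missing file would only surface later as reads that return nothing.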

    My test code, test data and output are in the spoiler:

    -- Ken

Re: Help with regex
by InfiniteSilence (Curate) on May 07, 2014 at 15:06 UTC

    A couple of previous responses seemed to interpret your requirements as looking for anything that is not white space that appears after your 'Query=' but that is not how I read it. An alphanumeric character means that you only want alphabetical and/or numeric characters. A pipe '|' doesn't fit that rule.

    perl -ne 'if(m/Query\=\s+(\S+)/){print $1};' monks1085293.data
    sp|P30443|1A01_HUMAN
    Pipes...
    perl -ne 'if(m/Query\=\s+([^\s]+)/){print $1};' monks1085293.data
    sp|P30443|1A01_HUMAN
    Pipes...
    perl -ne 'if(m/(?:Query\=|\>)\s?([A-Za-z0-9]+)/){print qq|$1\t|};' monks1085293.data
    sp tr tr

    Now this selects exactly what you are looking for...alphanumeric characters that follow either 'Query=' or '>'.
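
The difference between the two readings can be seen side by side in a small sketch. The helper `query_id` is hypothetical and exists only for this comparison; the input line is taken from the question's data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: capture the ID after "Query=" using the given character class.
sub query_id {
    my ($line, $class) = @_;
    return $line =~ /Query=\s+($class+)/ ? $1 : undef;
}

my $line = 'Query= sp|P30443|1A01_HUMAN HLA class I histocompatibility antigen';

print query_id($line, qr/\S/), "\n";            # whole non-whitespace token
print query_id($line, qr/[A-Za-z0-9]/), "\n";   # stops at the first pipe
```

The first call prints `sp|P30443|1A01_HUMAN` and the second prints `sp`, so the choice of character class decides whether the pipes and the accession are kept.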

    Celebrate Intellectual Diversity

Re: Help with regex
by AppleFritter (Vicar) on May 07, 2014 at 10:20 UTC

    Here's how I'd rewrite your program:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use diagnostics;

    my $found_query = 0;
    while(<>) {
        if($found_query) {
            m/^>([^\s]+)/ and do {
                print "$1\n";
                $found_query = 0;
            }
        } else {
            m/^Query=\s([^\s]+)/ and do {
                print "$1\t";
                $found_query = 1;
            }
        }
    }

    Does this do what you want?

Re: Help with regex
by trizen (Hermit) on May 07, 2014 at 10:18 UTC
    Try this:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use diagnostics;

    my $file = $ARGV[0];
    open(my $fh, '<', $file) or die "$file: $!";

    my $query;
    while (<$fh>) {
        if (/^Query=\s(\S+)/) {
            $query = $1;
        }
        if (defined($query) && /^>(\S+)/) {
            print "$query\t$1\n";
            undef $query;
        }
    }
      Thanks a lot! Worked like a charm!