Re: Help with regex

G'day rocketperl,

You need to be more precise about what you want: your text says 'extract all the alpha numeric characters after "Query="'; your code says 'extract all non-whitespace characters after "Query= "'. In the code below, I've assumed that your code, rather than your text, describes what you actually want.
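To show why the distinction matters, here's a minimal sketch that runs both interpretations against the first line of the sample data. The variable names and the [A-Za-z0-9] character class are my own choices for illustration; your idea of "alpha numeric" may differ (e.g. \w also allows underscores).

```perl
use strict;
use warnings;

my $line = 'Query= sp|P30443|1A01_HUMAN HLA class I histocompatibility antigen,';

# What your code asks for: the run of non-whitespace after "Query= ".
my ($non_space) = $line =~ /Query= (\S+)/;

# What your text asks for: the run of alphanumerics after "Query= ".
# It stops at the first '|', which is not alphanumeric.
my ($alnum) = $line =~ /Query= ([A-Za-z0-9]+)/;

print "non-whitespace: $non_space\n";   # sp|P30443|1A01_HUMAN
print "alphanumeric:   $alnum\n";       # sp
```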

Here's the basic code you need to achieve what you want:

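{
    # Read one record at a time, using 'Query= ' as the record separator.
    local $/ = 'Query= ';

    while (<$filehandle>) {
        # $1: first token of the record; $2: first token after the first line-leading '>'.
        print "$1\t$2" if / \A (\S+) .*? ^ > (\S+) /msx;
    }
}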

I suggest you read the open documentation for a better way to open your files (i.e. 3-argument form with a lexical filehandle). You should also check that your I/O has worked: either hand-craft messages as shown in the open documentation or, and this is my general preference, let Perl do this for you by using the autodie pragma.
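As a rough sketch of those suggestions combined with the loop above: the sub name extract_pairs and the temporary file are hypothetical, used only so the example runs on its own, and the sample text is a trimmed copy of my test data. Adapt the path handling to your own script.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;                  # open and close die with a useful message on failure
use File::Temp qw(tempfile);

# Hypothetical helper: returns [query_id, first_hit_id] pairs from a report file.
sub extract_pairs {
    my ($path) = @_;
    my @pairs;

    open my $fh, '<', $path;  # 3-argument open with a lexical filehandle
    {
        local $/ = 'Query= '; # read one query record per iteration
        while (<$fh>) {
            push @pairs, [$1, $2] if / \A (\S+) .*? ^ > (\S+) /msx;
        }
    }
    close $fh;

    return @pairs;
}

# Demo only: write a small sample report to a temporary file, then parse it.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
print {$tmp_fh} <<'END_REPORT';

Query= sp|P30443|1A01_HUMAN HLA class I histocompatibility antigen,
         (365 letters)

>tr|G1KTN1|G1KTN1_ANOCA Uncharacterized protein OS=Anolis
          Length = 358
END_REPORT
close $tmp_fh;

print join("\t", @$_), "\n" for extract_pairs($tmp_path);
```

With autodie in effect there's no need for an explicit "or die ..." after open: if the file can't be opened, the script stops with a message naming the file and the reason.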

My test code, test data and output are below:

#!/usr/bin/env perl -l

use strict;
use warnings;

{
    local $/ = 'Query= ';

    while (<DATA>) {
        print "$1\t$2" if / \A (\S+) .*? ^ > (\S+) /msx;
    }
}

__DATA__

Query= sp|P30443|1A01_HUMAN HLA class I histocompatibility antigen,
A-1 alpha chain OS=Homo sapiens GN=HLA-A PE=1 SV=1
         (365 letters)

...
>tr|G1KTN1|G1KTN1_ANOCA Uncharacterized protein OS=Anolis
           carolinensis PE=3 SV=2
          Length = 358
...
>tr|L7MZX2|L7MZX2_ANOCA Uncharacterized protein OS=Anolis
           carolinensis GN=LOC100559978 PE=3 SV=1
          Length = 364
...

Query= Xsp|P30443|1A01_HUMAN HLA class I histocompatibility antigen,
A-1 alpha chain OS=Homo sapiens GN=HLA-A PE=1 SV=1
         (365 letters)

...
>Xtr|G1KTN1|G1KTN1_ANOCA Uncharacterized protein OS=Anolis
           carolinensis PE=3 SV=2
          Length = 358
...
>Xtr|L7MZX2|L7MZX2_ANOCA Uncharacterized protein OS=Anolis
           carolinensis GN=LOC100559978 PE=3 SV=1
          Length = 364
...

Output:

sp|P30443|1A01_HUMAN    tr|G1KTN1|G1KTN1_ANOCA
Xsp|P30443|1A01_HUMAN   Xtr|G1KTN1|G1KTN1_ANOCA

-- Ken
