in reply to Help with regex
G'day rocketperl,
You need to be more precise about what you want: your text says 'extract all the alpha numeric characters after "Query="'; your code says 'extract all non-whitespace characters after "Query= "'. In the code below, I've guessed the code is correct.
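To show why the distinction matters, here's a small sketch using an ID from the test data further down. The pattern here is only for illustration and assumes "alphanumeric" would be written as \w+ (which also matches underscore):

```perl
use strict;
use warnings;

my $line = 'Query= sp|P30443|1A01_HUMAN HLA class I histocompatibility antigen';

# "Alphanumeric" read as \w+ stops at the first '|'.
my ($alnum) = $line =~ /Query=\s*(\w+)/;

# "Non-whitespace" (\S+) captures the whole ID up to the first space.
my ($nonws) = $line =~ /Query=\s*(\S+)/;

print "$alnum\n";
print "$nonws\n";
```

The first print shows only the leading part of the ID; the second shows the full pipe-delimited ID, which is what the posted code appears to want.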
Here's the basic code you need to achieve what you want:
{
    local $/ = 'Query= ';
    while (<$filehandle>) {
        print "$1\t$2" if / \A (\S+) .*? ^ > (\S+) /msx;
    }
}
I suggest you read the open documentation for a better way to open your files (i.e. 3-argument form with a lexical filehandle). You should also check that your I/O has worked: either hand-craft messages as shown in the open documentation or, and this is my general preference, let Perl do this for you by using the autodie pragma.
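As a rough sketch of that advice (not your actual script): the following combines 3-argument open, a lexical filehandle and autodie with the record-reading loop from above. The extract_pairs helper name is my own invention for illustration, and the input files are assumed to be named on the command line.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;    # open and close now die with a descriptive message on failure

# Hypothetical helper (not from the original post): returns a list of
# [query_id, first_hit_id] pairs read from an already opened filehandle.
sub extract_pairs {
    my ($fh) = @_;
    my @pairs;
    local $/ = 'Query= ';    # read one "Query= " record at a time
    while (<$fh>) {
        push @pairs, [$1, $2] if / \A (\S+) .*? ^ > (\S+) /msx;
    }
    return @pairs;
}

# Process each file named on the command line (does nothing if none given).
for my $path (@ARGV) {
    open my $fh, '<', $path;    # 3-argument form, lexical filehandle
    print join("\t", @$_), "\n" for extract_pairs($fh);
    close $fh;
}
```

With autodie in effect there's no need for an explicit "or die ..." after open or close; a missing or unreadable file stops the script with a message naming the file and the reason.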
My test code, test data and output are in the spoiler:
#!/usr/bin/env perl -l
use strict;
use warnings;

{
    local $/ = 'Query= ';
    while (<DATA>) {
        print "$1\t$2" if / \A (\S+) .*? ^ > (\S+) /msx;
    }
}

__DATA__
Query= sp|P30443|1A01_HUMAN HLA class I histocompatibility antigen, A-1 alpha chain OS=Homo sapiens GN=HLA-A PE=1 SV=1
(365 letters)
...
>tr|G1KTN1|G1KTN1_ANOCA Uncharacterized protein OS=Anolis carolinensis PE=3 SV=2
Length = 358
...
>tr|L7MZX2|L7MZX2_ANOCA Uncharacterized protein OS=Anolis carolinensis GN=LOC100559978 PE=3 SV=1
Length = 364
...
Query= Xsp|P30443|1A01_HUMAN HLA class I histocompatibility antigen, A-1 alpha chain OS=Homo sapiens GN=HLA-A PE=1 SV=1
(365 letters)
...
>Xtr|G1KTN1|G1KTN1_ANOCA Uncharacterized protein OS=Anolis carolinensis PE=3 SV=2
Length = 358
...
>Xtr|L7MZX2|L7MZX2_ANOCA Uncharacterized protein OS=Anolis carolinensis GN=LOC100559978 PE=3 SV=1
Length = 364
...
Output:
sp|P30443|1A01_HUMAN	tr|G1KTN1|G1KTN1_ANOCA
Xsp|P30443|1A01_HUMAN	Xtr|G1KTN1|G1KTN1_ANOCA
-- Ken