Regex perplexity

eweaverp has asked for the wisdom of the Perl Monks concerning the following question:

Hola Monks,

I am trying to extract the first and last Query and Subject numbers from data like the following:

><a name = 32169281></a><a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=32169281&dopt=GenBank" >emb|AJ544060.1|GGA544060</a> Gallus gallus mRNA for granzyme A precursor (GZMA gene)
          Length = 1100

 Score =  721 bits (360), Expect = 0.0
 Identities = 360/360 (100%)
 Strand = Plus / Plus


Query: 11  catgggtgtttttttcactctgtccacctctgctgccatcgttctcctgatacttcctgg 70
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1   catgggtgtttttttcactctgtccacctctgctgccatcgttctcctgatacttcctgg 60


Query: 71  agatttgtgcgtggatatcattggaggacatgaagtagcaccacactcaagaccatttat 130
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 61  agatttgtgcgtggatatcattggaggacatgaagtagcaccacactcaagaccatttat 120


Query: 131 ggccatgctcaaaggaaaagaattttgtggaggagctttgatcaagccaagctgggtgtt 190
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 121 ggccatgctcaaaggaaaagaattttgtggaggagctttgatcaagccaagctgggtgtt 180


Query: 191 aacagctgctcattgcaatctgaagggaggcagagttattcttggagcccattcacggac 250
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 181 aacagctgctcattgcaatctgaagggaggcagagttattcttggagcccattcacggac 240


Query: 251 aaaaagagaagaagaagaacaggttattgagattgcagaagaaattcgctacccagacta 310
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 241 aaaaagagaagaagaagaacaggttattgagattgcagaagaaattcgctacccagacta 300


Query: 311 ctgtcccgaaagaaaggaacatgacattatgctgttgaagcttaagaaaagagcaaaaat 370
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 301 ctgtcccgaaagaaaggaacatgacattatgctgttgaagcttaagaaaagagcaaaaat 360


><a name = 16647294></a><a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=16647294&dopt=GenBank" >gb|AC016999.8|</a> Homo sapiens BAC clone RP11-40B20 from 2, complete sequence
          Length = 93778

 Score = 46.6 bits (23), Expect = 0.028
 Identities = 23/23 (100%)
 Strand = Plus / Minus

Query: 249   acaaaaagagaagaagaagaaca 271
             |||||||||||||||||||||||
Sbjct: 64265 acaaaaagagaagaagaagaaca 64243

The results should be 249 64265 271 64243 (for the second, smaller block) and 11 1 370 360 (for the first block). I have tried this, where $id is the <a name=> number:

    my @positions = $string =~ m/<a name = $id>.*?Query: (\d+).*?Sbjct: (\d+).*?<\/pre>/s;
    push (@positions, $string =~ m/<a name = $id>.*?Query: \d+\s+[a-z]+ (\d+)\n.*?\nSbjct: \d+\s+[a-z]+ (\d+)\n<\/pre>/s);

But this doesn't grab the last query properly. What can I do to avoid getting the wrong 'last' numbers? I need them from the same number block, and combinations of greedy/non-greedy matching aren't working out for me: I either get partial results (not the true last ones) or the last ones from the last block in the page.

Help!

Thanks,

Evan


Replies are listed 'Best First'.
Re: Regex perplexity by pzbagel (Chaplain) on Jun 30, 2003 at 19:14 UTC
Needs a little more than a regex, I think. Here's my take on it:

    #!/usr/bin/perl -w
    use strict;

    my ($firstQ, $firstS);
    my ($lastQ, $lastS);

    # Read until the first Query and the first Sbjct line have both been seen.
    while (!(defined($firstQ) && defined($firstS))) {
        $_=<>;
        $firstQ=$1 if(/^Query:\s+?(\d+)[\sgcat]+(\d+)/) && do{$lastQ=$2};
        $firstS=$1 if(/^Sbjct:\s+?(\d+)[\sgcat]+(\d+)/) && do{$lastS=$2};
    }
    # Keep overwriting the end positions; the last matching lines win.
    while(<>) {
        $lastQ=$2 if (/^Query:\s+?(\d+)[\sgcat]+(\d+)/);
        $lastS=$2 if (/^Sbjct:\s+?(\d+)[\sgcat]+(\d+)/);
    }
    print "First:\t$firstQ\t$firstS\n";
    print "Last:\t$lastQ\t$lastS\n";

    ## Output of large dataset:
    #############
    First: 11 1
    Last: 370 360

    ## Output of small dataset:
    #############
    First: 249 64265
    Last: 271 64243

Code works with either a single data item or multiple data items spread across several lines. Now, you can make the regexes in the second while loop more efficient by taking out the first capturing parentheses, but I wanted to keep them the same as in the first while loop so you could see the symmetry and what I was doing with the regex.

HTH
Re: Re: Regex perplexity by eweaverp (Scribe) on Jun 30, 2003 at 22:24 UTC
Thanks. I wasn't entirely clear on the input format: each file has a number of those data elements in it (i.e. the example is one file), which I need to extract by ID number. I have altered your code to this (where $id_flag is triggered once we come across the appropriate ID number):

    use FileHandle;

    sub FindPositions {
        my $string = $_[0];
        my $id = $_[1];
        my ($firstQ, $firstS);
        my ($lastQ, $lastS);
        my $id_flag;
        my $line;

        # pipe-ize the string
        my $string_pipe = new FileHandle("echo \'$string\' \|") or die;

        while (!(defined($id_flag) && defined($firstQ) && defined($firstS))) {
            $line = <$string_pipe>;
            $id_flag = 1 if ($line =~ /<a name = $id>/);
            $firstQ = $1 if ($line =~ /^Query:\s+?(\d+)[\sgcat]+(\d+)/) && do{$lastQ=$2};
            $firstS = $1 if ($line =~ /^Sbjct:\s+?(\d+)[\sgcat]+(\d+)/) && do{$lastS=$2};
        }
        foreach $line (<$string_pipe>) {
            $lastQ = $2 if ($line =~ /^Query:\s+?(\d+)[\sgcat]+(\d+)/);
            $lastS = $2 if ($line =~ /^Sbjct:\s+?(\d+)[\sgcat]+(\d+)/);
        }
        return ($firstQ, $firstS, $lastQ, $lastS);
    }

But I'm not sure how to make it grab the appropriate ending values. As is, it grabs the last ones in the file.

Thanks
Re: Re: Re: Regex perplexity by eweaverp (Scribe) on Jun 30, 2003 at 22:30 UTC
Never mind; I'm dumb. I just added this line:

    # slice out the appropriate part of the string
    ($string) = $string =~ /(><a name = $id>.*?<\/pre>)/s;

to the beginning of the subroutine and removed the $id_flag weirdness, and it works great. Thanks for the example code, and the multi-loop approach. Apparently there _are_ some things a regex can't do!

Cheers all,

Evan
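For readers following along, the slice-then-scan idea described above can be sketched as one self-contained subroutine. This is only a sketch under assumptions, not the code Evan ended up with: the name find_positions, the shortened sample data, and the choice to end a record at the next `><a name = ` anchor, at `</pre>`, or at the end of the string are all made up for illustration. It also scans the sliced record in memory rather than through an echo pipe.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread). Returns
# (first Query start, first Sbjct start, last Query end, last Sbjct end)
# for the record introduced by <a name = $id>, or an empty list if absent.
sub find_positions {
    my ($string, $id) = @_;

    # Slice out one record. Assumed terminators: the next record's anchor,
    # a closing </pre>, or the end of the string.
    my ($record) = $string =~ m{(<a name = \Q$id\E>.*?)(?=><a name = |</pre>|\z)}s
        or return;

    my (%first, %last);
    for my $line (split /\n/, $record) {
        next unless $line =~ /^(Query|Sbjct):\s+(\d+)\s+[a-z]+\s+(\d+)\s*$/;
        $first{$1} = $2 unless exists $first{$1};   # keep the first start only
        $last{$1}  = $3;                            # overwritten until the last line
    }
    return ($first{Query}, $first{Sbjct}, $last{Query}, $last{Sbjct});
}

# Shortened, made-up data in the same layout as the question.
my $page = <<'END';
><a name = 32169281></a>first record
 Score =  721 bits (360), Expect = 0.0

Query: 11  catgggtgtt 20
           ||||||||||
Sbjct: 1   catgggtgtt 10

Query: 21  tttcactctg 30
           ||||||||||
Sbjct: 11  tttcactctg 20

><a name = 16647294></a>second record
Query: 249   acaaaaagagaagaagaagaaca 271
             |||||||||||||||||||||||
Sbjct: 64265 acaaaaagagaagaagaagaaca 64243
END

print join(' ', find_positions($page, 32169281)), "\n";   # 11 1 30 20
print join(' ', find_positions($page, 16647294)), "\n";   # 249 64265 271 64243
```

Because the record is cut out before any line is examined, the "last" values cannot leak in from a later record, which was the problem with the earlier attempts.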
Re: Re: Re: Re: Regex perplexity by eweaverp (Scribe) on Jun 30, 2003 at 22:50 UTC
Re: Regex perplexity by Skeeve (Parson) on Jul 01, 2003 at 06:28 UTC
I would do it like this:

    my (%first, %last);
    while (<DATA>) {
        next unless /^(Query|Sbjct):\s+(\d+)\s+(?:.+)\s(\d+)\s$/;
        $last{$1}= $3;
        next if exists $first{$1};
        $first{$1}=$2;
    }
    print<<RESULT;
    $first{'Query'} $first{'Sbjct'}
    $last{'Query'} $last{'Sbjct'}
    RESULT
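The hash-based idea above handles one record. Since the original poster's files hold several records, one possible extension is to key the same two hashes by record ID and collect everything in a single pass. This is a sketch, not Skeeve's code: the name positions_by_id, the nested hash layout, and the shortened sample data are assumptions for illustration, and the `//=` operator needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): one pass over the lines,
# returning { id => { first => {Query, Sbjct}, last => {Query, Sbjct} } }.
sub positions_by_id {
    my (@lines) = @_;
    my (%pos, $id);
    for (@lines) {
        # A new anchor starts a new record; remember its ID.
        if (/<a name = (\d+)>/) { $id = $1; next; }
        next unless defined($id)
            && /^(Query|Sbjct):\s+(\d+)\s+\S+\s+(\d+)\s*$/;
        $pos{$id}{first}{$1} //= $2;   # set once, on the first line seen
        $pos{$id}{last}{$1}    = $3;   # overwritten on every line
    }
    return \%pos;
}

# Shortened, made-up data in the same layout as the question.
my @lines = split /^/, <<'END';
><a name = 32169281></a>first record
Query: 11  catgggtgtt 20
           ||||||||||
Sbjct: 1   catgggtgtt 10

Query: 21  tttcactctg 30
           ||||||||||
Sbjct: 11  tttcactctg 20

><a name = 16647294></a>second record
Query: 249   acaaaaagagaagaagaagaaca 271
             |||||||||||||||||||||||
Sbjct: 64265 acaaaaagagaagaagaagaaca 64243
END

my $pos = positions_by_id(@lines);
for my $rec (sort { $a <=> $b } keys %$pos) {
    my $p = $pos->{$rec};
    print "$rec: $p->{first}{Query} $p->{first}{Sbjct} ",
          "$p->{last}{Query} $p->{last}{Sbjct}\n";
}
```

The sequence column is matched with `\S+` rather than `[a-z]+` so that alignment lines containing gap characters would still match; that is a guess about the wider data, not something shown in the sample.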