eweaverp has asked for the wisdom of the Perl Monks concerning the following question:

Hola Monks,

I am trying to extract the first and last Query and Subject numbers from data like the following:
><a name = 32169281></a><a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=32169281&dopt=GenBank">emb|AJ544060.1|GGA544060</a> Gallus gallus mRNA for granzyme A precursor (GZMA gene)
Length = 1100

Score = 721 bits (360), Expect = 0.0
Identities = 360/360 (100%)
Strand = Plus / Plus

Query: 11  catgggtgtttttttcactctgtccacctctgctgccatcgttctcctgatacttcctgg 70
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1   catgggtgtttttttcactctgtccacctctgctgccatcgttctcctgatacttcctgg 60

Query: 71  agatttgtgcgtggatatcattggaggacatgaagtagcaccacactcaagaccatttat 130
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 61  agatttgtgcgtggatatcattggaggacatgaagtagcaccacactcaagaccatttat 120

Query: 131 ggccatgctcaaaggaaaagaattttgtggaggagctttgatcaagccaagctgggtgtt 190
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 121 ggccatgctcaaaggaaaagaattttgtggaggagctttgatcaagccaagctgggtgtt 180

Query: 191 aacagctgctcattgcaatctgaagggaggcagagttattcttggagcccattcacggac 250
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 181 aacagctgctcattgcaatctgaagggaggcagagttattcttggagcccattcacggac 240

Query: 251 aaaaagagaagaagaagaacaggttattgagattgcagaagaaattcgctacccagacta 310
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 241 aaaaagagaagaagaagaacaggttattgagattgcagaagaaattcgctacccagacta 300

Query: 311 ctgtcccgaaagaaaggaacatgacattatgctgttgaagcttaagaaaagagcaaaaat 370
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 301 ctgtcccgaaagaaaggaacatgacattatgctgttgaagcttaagaaaagagcaaaaat 360

><a name = 16647294></a><a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=16647294&dopt=GenBank">gb|AC016999.8|</a> Homo sapiens BAC clone RP11-40B20 from 2, complete sequence
Length = 93778

Score = 46.6 bits (23), Expect = 0.028
Identities = 23/23 (100%)
Strand = Plus / Minus

Query: 249   acaaaaagagaagaagaagaaca 271
             |||||||||||||||||||||||
Sbjct: 64265 acaaaaagagaagaagaagaaca 64243
The results should be 249 64265 271 64243 (for the second, smaller block) and 11 1 370 360 (for the first block). I have tried this, where $id is the <a name=> number:
my @positions = $string =~ m/<a name = $id>.*?Query: (\d+).*?Sbjct: (\d+).*?<\/pre>/s;
push (@positions, $string =~ m/<a name = $id>.*?Query: \d+\s+[a-z]+ (\d+)\n.*?\nSbjct: \d+\s+[a-z]+ (\d+)\n<\/pre>/s);
But this doesn't grab the last Query properly. How can I avoid getting the wrong 'last' numbers? I need them from the same number block, and combinations of greedy and non-greedy matching aren't working out for me: I get either partial results (not the true last ones) or the last ones from the last block in the page.

Help!

Thanks,

Evan



Replies are listed 'Best First'.
Re: Regex perplexity
by pzbagel (Chaplain) on Jun 30, 2003 at 19:14 UTC
    Needs a little more than a regex, I think. Here's my take on it:
    #!/usr/bin/perl -w
    use strict;

    my ($firstQ, $firstS);
    my ($lastQ, $lastS);

    while (!(defined($firstQ) && defined($firstS))) {
        $_=<>;
        $firstQ=$1 if(/^Query:\s+?(\d+)[\sgcat]*(\d+)/) && do{$lastQ=$2};
        $firstS=$1 if(/^Sbjct:\s+?(\d+)[\sgcat]*(\d+)/) && do{$lastS=$2};
    }
    while(<>) {
        $lastQ=$2 if (/^Query:\s+?(\d+)[\sgcat]*(\d+)/);
        $lastS=$2 if (/^Sbjct:\s+?(\d+)[\sgcat]*(\d+)/);
    }
    print "First:\t$firstQ\t$firstS\n";
    print "Last:\t$lastQ\t$lastS\n";

    ## Output of large dataset:
    #############
    First:  11      1
    Last:   370     360

    ## Output of small dataset:
    #############
    First:  249     64265
    Last:   271     64243

    The code works with either a single data item or multiple data items spread across several lines. You could make the regexes in the second while loop slightly more efficient by dropping the first pair of capturing parentheses, but I kept them identical to those in the first while loop so you can see the symmetry and what the regex is doing.

    HTH
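    The note about dropping the first capture in the second loop could look roughly like the sketch below. The sub name first_and_last and the idea of passing the lines in as a list are assumptions made for illustration, not part of the code above. The sketch also stops at the end of the input if no Query/Sbjct pair is ever found.

```perl
use strict;
use warnings;

# Hypothetical helper (name and interface are assumptions for this sketch):
# takes alignment lines, returns (firstQ, firstS, lastQ, lastS).
sub first_and_last {
    my @lines = @_;
    my ($firstQ, $firstS, $lastQ, $lastS);

    # Phase 1: consume lines until the first Query and the first Sbjct line
    # have both been seen. Both coordinates are captured here.
    while (defined(my $line = shift @lines)) {
        ($firstQ, $lastQ) = ($1, $2) if $line =~ /^Query:\s+(\d+)[\sgcat]*(\d+)/;
        ($firstS, $lastS) = ($1, $2) if $line =~ /^Sbjct:\s+(\d+)[\sgcat]*(\d+)/;
        last if defined $firstQ && defined $firstS;
    }

    # Phase 2: only the trailing coordinate matters now, so a single
    # capture group is enough.
    for my $line (@lines) {
        $lastQ = $1 if $line =~ /^Query:\s+\d+[\sgcat]*(\d+)/;
        $lastS = $1 if $line =~ /^Sbjct:\s+\d+[\sgcat]*(\d+)/;
    }
    return ($firstQ, $firstS, $lastQ, $lastS);
}

# Usage with the small dataset from the question:
my @small = (
    'Query: 249   acaaaaagagaagaagaagaaca 271',
    '             |||||||||||||||||||||||',
    'Sbjct: 64265 acaaaaagagaagaagaagaaca 64243',
);
print join(' ', first_and_last(@small)), "\n";   # prints "249 64265 271 64243"
```

    Like the original, the character class [\sgcat] assumes lowercase sequences without gaps or ambiguity codes.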

      Thanks. I wasn't entirely clear on the input format: each file has a number of those data elements in it (i.e. the example is one file), which I need to extract by ID number. I have altered your code to this (where $id_flag is set once we come across the appropriate ID number):

      sub FindPositions {
          my $string = $_[0];
          my $id = $_[1];
          my ($firstQ, $firstS);
          my ($lastQ, $lastS);
          my $id_flag;
          my $line;

          # pipe-ize the string
          my $string_pipe = new FileHandle("echo \'$string\' |") or die;

          while (!(defined($id_flag) && defined($firstQ) && defined($firstS))) {
              $line = <$string_pipe>;
              $id_flag = 1 if ($line =~ /<a name = $id>/);
              $firstQ = $1 if ($line =~ /^Query:\s+?(\d+)[\sgcat]*(\d+)/) && do{$lastQ=$2};
              $firstS = $1 if ($line =~ /^Sbjct:\s+?(\d+)[\sgcat]*(\d+)/) && do{$lastS=$2};
          }
          foreach $line (<$string_pipe>) {
              $lastQ = $2 if ($line =~ /^Query:\s+?(\d+)[\sgcat]*(\d+)/);
              $lastS = $2 if ($line =~ /^Sbjct:\s+?(\d+)[\sgcat]*(\d+)/);
          }
          return ($firstQ, $firstS, $lastQ, $lastS);
      }
      But I'm not sure how to make it grab the appropriate ending values. As is, it grabs the last ones in the file.

      Thanks
        Never mind; I'm dumb. I just add this line:
        # slice out the appropriate part of the string
        ($string) = $string =~ /(><a name = $id>.*?<\/pre>)/s;

        to the beginning of the subroutine and remove the $id_flag weirdness and it works great. Thanks for the example code, and the multi-loop approach. Apparently there _are_ some things a regex can't do!

        Cheers all,

        Evan
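        For readers following along, here is a minimal sketch of the "slice first, then extract" idea without the shell pipe. It is not Evan's final code. The sub name find_positions is hypothetical, and because the sample data does not show where </pre> falls, the sketch assumes a block ends at </pre>, at the next "><a name = " anchor, or at the end of the string, whichever comes first. Once the block is isolated, a regex with /g in list context collects every (start, end) pair, so the first and last values are simply the ends of each list.

```perl
use strict;
use warnings;

# Hypothetical rewrite for illustration: slice out one block by ID, then
# pull the coordinates from that block only.
sub find_positions {
    my ($string, $id) = @_;

    # Assumption: a block runs until "</pre>", the next "><a name = "
    # anchor, or the end of the string.
    my ($block) = $string =~ /(><a name = \Q$id\E>.*?)(?=<\/pre>|><a name = |\z)/s
        or return;

    # In list context, m//g returns all captures: (start1, end1, start2, end2, ...)
    my @q = $block =~ /^Query:\s+(\d+)\s+\S+\s+(\d+)\s*$/mg;
    my @s = $block =~ /^Sbjct:\s+(\d+)\s+\S+\s+(\d+)\s*$/mg;
    return unless @q && @s;

    # First start coordinate and last end coordinate of each kind.
    return ($q[0], $s[0], $q[-1], $s[-1]);
}
```

        Called as find_positions($page, 16647294) on a string holding the sample page, this should return the four numbers for that block only, since nothing outside the sliced block is ever scanned.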
Re: Regex perplexity
by Skeeve (Parson) on Jul 01, 2003 at 06:28 UTC
    I would do it like this:
    my (%first, %last);
    while (<DATA>) {
        next unless /^(Query|Sbjct):\s+(\d+)\s+(?:.*)\s(\d+)\s*$/;
        $last{$1}= $3;
        next if exists $first{$1};
        $first{$1}=$2;
    }
    print<<RESULT;
    $first{'Query'} $first{'Sbjct'} $last{'Query'} $last{'Sbjct'}
    RESULT
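
    Since the original question involves several blocks per file, the same first/last bookkeeping can be keyed by the ID in the "<a name = ...>" anchor. The following is only a sketch of that extension: the sub name positions_by_id and the nested hash layout are assumptions for illustration, not something posted in this thread.

```perl
use strict;
use warnings;

# Hypothetical extension: one pass over the lines, remembering which
# "<a name = ID>" block each alignment line belongs to.
sub positions_by_id {
    my @lines = @_;
    my (%pos, $id);

    for my $line (@lines) {
        # A header line starts a new block; remember its ID.
        if ($line =~ /<a name = (\d+)>/) {
            $id = $1;
            next;
        }
        next unless defined $id;

        my ($kind, $start, $end) =
            $line =~ /^(Query|Sbjct):\s+(\d+)\s+\S+\s+(\d+)\s*$/
            or next;

        # Keep the first start coordinate, overwrite the end coordinate
        # every time so the last one wins.
        $pos{$id}{first}{$kind} = $start unless exists $pos{$id}{first}{$kind};
        $pos{$id}{last}{$kind}  = $end;
    }
    return \%pos;
}
```

    With the result in $p, the four numbers for one block would be $p->{$id}{first}{Query}, $p->{$id}{first}{Sbjct}, $p->{$id}{last}{Query} and $p->{$id}{last}{Sbjct}.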