in reply to Regex perplexity

Needs a little more than a regex, I think. Here's my take on it:
#!/usr/bin/perl -w
use strict;

my ($firstQ, $firstS);
my ($lastQ, $lastS);

# First loop: read until the first Query and Sbjct start positions are found.
while (!(defined($firstQ) && defined($firstS))) {
    defined($_ = <>) or last;    # stop at end of input instead of looping forever
    $firstQ = $1 if (/^Query:\s+?(\d+)[\sgcat]*(\d+)/) && do { $lastQ = $2 };
    $firstS = $1 if (/^Sbjct:\s+?(\d+)[\sgcat]*(\d+)/) && do { $lastS = $2 };
}

# Second loop: keep overwriting the end positions until the input runs out.
while (<>) {
    $lastQ = $2 if (/^Query:\s+?(\d+)[\sgcat]*(\d+)/);
    $lastS = $2 if (/^Sbjct:\s+?(\d+)[\sgcat]*(\d+)/);
}

print "First:\t$firstQ\t$firstS\n";
print "Last:\t$lastQ\t$lastS\n";

## Output of large dataset:
#############
First:  11      1
Last:   370     360

## Output of small dataset:
#############
First:  249     64265
Last:   271     64243

The code works with either a single data item or multiple data items spread across several lines. You could make the regexes in the second while loop more efficient by removing the first set of capturing parentheses, but I kept them identical to those in the first while loop so you can see the symmetry and what the regex is doing.
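A minimal sketch of that second-loop simplification, assuming input lines shaped like the hypothetical samples below (the thread does not show the actual data). Only the end position is captured, so it lands in $1 rather than $2:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample lines; the real input format is an assumption.
my @lines = (
    "Query: 11  gcatgcatgg 20\n",
    "Sbjct: 1   gcatgcatgg 10\n",
    "Query: 21  ttgcaagcat 30\n",
    "Sbjct: 11  ttgcaagcat 20\n",
);

my ($lastQ, $lastS);
for my $line (@lines) {
    # The start position is still matched by \d+ but no longer captured,
    # so the single remaining capture group ($1) holds the end position.
    $lastQ = $1 if $line =~ /^Query:\s+?\d+[\sgcat]*(\d+)/;
    $lastS = $1 if $line =~ /^Sbjct:\s+?\d+[\sgcat]*(\d+)/;
}

print "Last:\t$lastQ\t$lastS\n";
```

Any gain is likely small on files of this size, so keeping both loops symmetrical is a reasonable trade.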

HTH

Re: Re: Regex perplexity
by eweaverp (Scribe) on Jun 30, 2003 at 22:24 UTC
    Thanks. I wasn't entirely clear about the input format: each file contains a number of those data elements (i.e. the example is one file), which I need to extract by ID number. I have altered your code to this (where $id_flag is set once we come across the appropriate ID number):

    sub FindPositions {
        my $string = $_[0];
        my $id = $_[1];
        my ($firstQ, $firstS);
        my ($lastQ, $lastS);
        my $id_flag;
        my $line;

        # pipe-ize the string
        my $string_pipe = new FileHandle("echo \'$string\' |") or die;

        while (!(defined($id_flag) && defined($firstQ) && defined($firstS))) {
            $line = <$string_pipe>;
            $id_flag = 1 if ($line =~ /<a name = $id>/);
            $firstQ = $1 if ($line =~ /^Query:\s+?(\d+)[\sgcat]*(\d+)/) && do { $lastQ = $2 };
            $firstS = $1 if ($line =~ /^Sbjct:\s+?(\d+)[\sgcat]*(\d+)/) && do { $lastS = $2 };
        }

        foreach $line (<$string_pipe>) {
            $lastQ = $2 if ($line =~ /^Query:\s+?(\d+)[\sgcat]*(\d+)/);
            $lastS = $2 if ($line =~ /^Sbjct:\s+?(\d+)[\sgcat]*(\d+)/);
        }

        return ($firstQ, $firstS, $lastQ, $lastS);
    }

    But I'm not sure how to make it grab the appropriate ending values. As is, it grabs the last ones in the file.

    Thanks
      Nevermind; I'm dumb. I just add this line:
      # slice out the appropriate part of the string
      ($string) = $string =~ /(><a name = $id>.*?<\/pre>)/s;

      to the beginning of the subroutine and remove the $id_flag weirdness and it works great. Thanks for the example code, and the multi-loop approach. Apparently there _are_ some things a regex can't do!
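
      A minimal sketch of the slice-first idea, for anyone following along. It is not the subroutine above: it swaps the echo pipe for a split on newlines (an echo '...' pipe would break on a string containing a single quote) and folds the two loops into one. The name find_positions, the sample data, and the record layout (an anchor line without the leading ">", closed by </pre>) are all assumptions made for illustration:

      ```perl
      use strict;
      use warnings;

      # Hypothetical helper: returns (firstQ, firstS, lastQ, lastS) for one record.
      sub find_positions {
          my ($string, $id) = @_;

          # Slice out only the record that belongs to this ID.
          my ($record) = $string =~ /(<a name = \Q$id\E>.*?<\/pre>)/s;
          return unless defined $record;

          my ($firstQ, $firstS, $lastQ, $lastS);
          for my $line (split /\n/, $record) {
              if ($line =~ /^Query:\s+(\d+)[\sgcat]*(\d+)/) {
                  $firstQ = $1 unless defined $firstQ;   # keep the first start
                  $lastQ  = $2;                          # overwrite until the last end
              }
              elsif ($line =~ /^Sbjct:\s+(\d+)[\sgcat]*(\d+)/) {
                  $firstS = $1 unless defined $firstS;
                  $lastS  = $2;
              }
          }
          return ($firstQ, $firstS, $lastQ, $lastS);
      }

      # Hypothetical two-record input; the real file layout is assumed.
      my $data = join "\n",
          '<a name = 101>',
          'Query: 11  gcatgcatgg 20',
          'Sbjct: 1   gcatgcatgg 10',
          'Query: 21  ttgcaagcat 30',
          'Sbjct: 11  ttgcaagcat 20',
          '</pre>',
          '<a name = 202>',
          'Query: 249 gcatgcatgg 258',
          'Sbjct: 64265 gcatgcatgg 64256',
          '</pre>',
          '';

      my @pos = find_positions($data, 202);
      print join("\t", @pos), "\n";
      ```

      Because the slice stops at the first </pre> after the anchor, the "last" values can no longer leak in from later records in the file.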

      Cheers all,

      Evan
        And... that made me realize that this:
        sub FindPositions {
            my $string = $_[0];
            my $id = $_[1];

            # slice out the appropriate part of the string
            ($string) = $string =~ /(><a name = $id>.*?<\/pre>)/s;

            my @positions = $string =~ m/<a name = $id>.*?Query: (\d+).*?Sbjct: +(\d+).*?<\/pre>/s;
            push (@positions, $string =~ m/<a name = $id>.*Query: \d+\s+[a-z]+ (\d+)\n.*\nSbjct: \d+\s+[a-z]+ (\d+)\n<\/pre>/s);

            return @positions;
        }

        works too. First and last. Hmm. Anyway...