Strange regex behavior - beware chunk boundaries!

by tame1 (Pilgrim)
on Aug 14, 2005 at 21:02 UTC

tame1 has asked for the wisdom of the Perl Monks concerning the following question:

UPDATE: Seems that the data I was testing for landed right on a buffer chunk boundary. Not good. Let that be a lesson to all us part timers - boundaries suck!
Recently, I had to drag my old perl knowledge up from the bottom of my brain to do a little db creation work.

I am taking a web page from our local chamber of commerce, where they give an alphabetical listing of their members, and sucking it in with Net::HTTP. I then cycle through the $buffer looking for occurrences of "ID=XXXX". Those are links to "more info" on each company. Using the built-up array of IDs, I then pull each company's individual data.

Anyhow, to make a long story short (too late, right?), one business, serial number 3975, is always skipped!!! The regex $buf =~ /ID=([0-9]+)/ seems to think 3975 doesn't match! The only fix I have found is to first write the main web page to a file, then read the file back in. THEN it matches.

Here is the code I am/was using:

    #!/usr/bin/perl
    use strict;
    use Net::HTTP;
    use HTML::Strip;
    use LWP::Simple;

    my $DOMAIN="";
    my $MAIN_LIST="AlphabeticalListing.asp";
    my $HOMEDIR="/home/jrobiso2/Documents/CDS/Chamber/";
    my $list_file="/tmp/AlphabeticalListing.html";
    my @listing;

    ### Get initial listing of data from the main page.
    open(SRC, "+>$list_file") or die "Cannot open file: $!\n";
    my $http = Net::HTTP->new(Host => $DOMAIN) || die $!;
    $http->keep_alive;
    $http->write_request(GET => "/$MAIN_LIST", 'User-Agent' => "Mozilla/5.0");
    my($code, $mess, %h) = $http->read_response_headers;

    ## Build the listing of company numbers from the javascript window.open
    ## calls inside the main listing html page.
    while (1) {
        my $buf;
        my $n = $http->read_entity_body($buf, 1024);
        die "read failed: $!" unless defined $n;
        last unless $n;
        # if ($buf =~ /ID=([0-9]+)/) {    # OLD CODE
        #     push @listing, $1;          # THAT FAILED
        # }
        #
        print SRC $buf;
    }
    close(SRC);

    open(SRC, "<$list_file") || die "Cannot open file: $!\n";
    while (<SRC>) {
        if (/ID=([0-9]+)/) {
            push @listing, $1;
        }
    }
    @listing = sort @listing;

From the code above you can see the actual site and page I am trying to steal from. If anyone can enlighten me as to what is wrong (what I have done wrong) I would greatly appreciate it, as this has taken 3 hours of my time and made me feel very stupid. I am using perl 5.8.6.

What does this little button do . .<Click>; "USER HAS SIGNED OFF FOR THE DAY"

Replies are listed 'Best First'.
Re: Strange regex behavior
by BrowserUk (Patriarch) on Aug 14, 2005 at 21:15 UTC

    I haven't looked, but I bet that the number 3975 spans a 1024-byte boundary in the file.

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.

      Or it could be the second match in one 1024-byte block.

      I think the easiest way to do this would be to pull the entire page into $buf, and then do:

      @listing = ($buf =~ /ID=(\d+)/g);
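A small sketch of the difference, using made-up sample data rather than the actual chamber page: a plain `if` match captures only the first ID in a buffer, while a `/g` match in list context returns every one.

```perl
use strict;
use warnings;

# Hypothetical buffer holding two IDs (sample data, not the real page).
my $buf = '<a href="info.asp?ID=3975">A</a> <a href="info.asp?ID=3871">B</a>';

# A single match: only the first ID ends up in $1.
my @first_only;
if ($buf =~ /ID=([0-9]+)/) {
    push @first_only, $1;
}

# /g in list context: every capture in the string is returned.
my @all = ($buf =~ /ID=(\d+)/g);

print "if: @first_only\n";   # if: 3975
print "/g: @all\n";          # /g: 3975 3871
```

So the second ID in any one read was silently dropped by the old `if` version, independent of any boundary problem.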

Re: Strange regex behavior
by Tanktalus (Canon) on Aug 14, 2005 at 21:46 UTC
    $ perl -e '$/=undef;$t=<>;foreach(3815,3975,3871){$i=index($t,"ID=$_");printf"ID=$_:%d(chunk=%d,offset=%d)\n",$i,int($i/1024),$i%1024}' AlphabeticalListing.asp
    ID=3815:103836(chunk=101,offset=412)
    ID=3975:104688(chunk=102,offset=240)
    ID=3871:105271(chunk=102,offset=823)

    Ok, perhaps I could have used some spaces on that one-liner, but I was having too much fun this way. Oddly, it seems that 3975 is the first match in its chunk, so I would have expected it to be 3871 that got missed.
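For readers who would like the spaces, here is a spelled-out sketch of the same arithmetic. The helper name `chunk_position` and the synthetic text are assumptions for illustration. It also assumes every read returns exactly 1024 bytes, which a network read is not guaranteed to do, so the real chunk edges may have fallen elsewhere.

```perl
use strict;
use warnings;

# Report where $needle starts in $text and which fixed-size chunk holds it.
# (Hypothetical helper, not part of the original one-liner.)
sub chunk_position {
    my ($text, $needle, $chunk_size) = @_;
    my $i = index($text, $needle);
    return if $i < 0;    # needle not found
    return ($i, int($i / $chunk_size), $i % $chunk_size);
}

# Synthetic text that puts "ID=3975" at byte 104688, the position reported above.
my $text = ('x' x 104688) . 'ID=3975';
my ($index, $chunk, $offset) = chunk_position($text, 'ID=3975', 1024);

printf "ID=3975:%d(chunk=%d,offset=%d)\n", $index, $chunk, $offset;
# ID=3975:104688(chunk=102,offset=240)
```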

    You should try this as your loop:

    while (1) {
        my $buf;
        my $n = $http->read_entity_body($buf, 1024);
        die "read failed: $!" unless defined $n;
        last unless $n;
        push @listing, $buf =~ /ID=(\d+)/g;
    }

    Note that the problem still could exist where the literal string "ID=xxxx" crosses over the boundary - say "ID=3" at the end of one 1024-byte chunk, and "975" at the beginning of the next. It's probably easiest to slurp the whole thing in, and then do a single global match.
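To make the boundary problem concrete, here is a sketch with made-up data and a tiny 8-byte chunk size standing in for the 1024-byte reads. Matching chunk by chunk truncates or misses every ID that straddles a boundary, while a single global match over the slurped text finds them all.

```perl
use strict;
use warnings;

# Sample body and chunk size are assumptions chosen to force splits.
my $page       = 'xxxxID=3815yyyyID=3975zzzzID=3871';
my $chunk_size = 8;

# Cut the body into fixed-size pieces, the way a chunked read delivers it.
my @chunks;
for (my $pos = 0; $pos < length $page; $pos += $chunk_size) {
    push @chunks, substr($page, $pos, $chunk_size);
}

# Per-chunk matching: each piece is searched in isolation.
my @per_chunk = map { /ID=(\d+)/g } @chunks;

# Slurped matching: join everything first, then match once.
my @slurped = (join('', @chunks) =~ /ID=(\d+)/g);

print "per chunk: @per_chunk\n";   # per chunk: 3 387
print "slurped:   @slurped\n";     # slurped:   3815 3975 3871
```

Note that the per-chunk version does not just skip IDs; it can also return truncated ones, which is arguably worse.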

Re: Strange regex behavior
by insaniac (Friar) on Aug 14, 2005 at 21:16 UTC
    update: my sleepy eyes didn't look very well at your code :-/
    forget my post ;-)

    is it just a typo, or did you forget a tilde?

    if (/ID=([0-9]+)/) {

    shouldn't this be:

    if (/ID=~([0-9]+)/) {

    to ask a question is a moment of shame
    to remain ignorant is a lifelong shame

      Remember, ID= is part of the actual text I am searching for. Perhaps you were thinking along the lines of $var =~ /some-text/, right?

      Anyhow, without specifying what var to work on, that "if" works on $_.
      So it's really if $_ =~ /ID=([0-9]+)/
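A quick sketch of that point, with made-up sample lines: a bare match operates on `$_`, and a tilde placed inside the pattern is just a literal character to look for.

```perl
use strict;
use warnings;

# Made-up sample lines, standing in for lines read from the saved page.
my @lines = ('<a href="info.asp?ID=42">', 'no id on this line');

my (@implicit, @explicit);
for (@lines) {
    push @implicit, $1 if /ID=([0-9]+)/;         # no target given: matches $_
    push @explicit, $1 if $_ =~ /ID=([0-9]+)/;   # the same match, spelled out
}

# With "~" inside the pattern, the text would have to contain "ID=~42".
my $tilde_matched = ('ID=42' =~ /ID=~([0-9]+)/) ? 1 : 0;

print "implicit: @implicit\n";                   # implicit: 42
print "explicit: @explicit\n";                   # explicit: 42
print "tilde pattern matched: $tilde_matched\n"; # tilde pattern matched: 0
```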


      Nope. That one is as it should be, I think. --Jon

Re: Strange regex behavior - beware chunk boundaries!
by dws (Chancellor) on Aug 15, 2005 at 03:24 UTC

    If anyone can enlighten me as to what is wrong I would greatly appreciate it.

    You might find Matching in Huge Files useful. It describes a technique for matching across chunk boundaries without having to first suck the entire file into memory.
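One way to match across chunk boundaries without slurping is to carry the unfinished tail of each chunk into the next read. The sketch below is an assumption-laden illustration, not necessarily the technique that node describes: `extract_ids` is a hypothetical helper, and the chunk source is a plain closure over sample data. With Net::HTTP, the closure would wrap the `read_entity_body` call from the original loop.

```perl
use strict;
use warnings;

# Collect every ID=<digits> from a stream of chunks, carrying partial
# matches over chunk boundaries. $next_chunk is a code ref that returns
# the next chunk, or undef (or an empty string) at end of input.
sub extract_ids {
    my ($next_chunk) = @_;
    my $pending = '';
    my @ids;
    while (defined(my $chunk = $next_chunk->())) {
        last unless length $chunk;
        $pending .= $chunk;
        # Accept only IDs followed by a non-digit, so the number is complete.
        while ($pending =~ /ID=(\d+)(?=\D)/g) {
            push @ids, $1;
        }
        # Keep just the tail that could be the start of a split "ID=<digits>".
        $pending = $pending =~ /(I|ID|ID=\d*)\z/ ? $1 : '';
    }
    # At end of input, a trailing ID with digits is complete after all.
    push @ids, $1 if $pending =~ /ID=(\d+)\z/;
    return @ids;
}

# Sample chunks (made up) that split two of the three IDs across reads.
my @chunks = ('xxxxID=3', '815yyyyI', 'D=3975zz', 'zzID=387', '1');
my @ids = extract_ids(sub { shift @chunks });

print "@ids\n";   # 3815 3975 3871
```

The memory held at any time is one chunk plus a short tail, which is the point of avoiding the slurp.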

Node Type: perlquestion [id://483722]
Approved by Corion