pg has asked for the wisdom of the Perl Monks concerning the following question:

I have this piece of code to capture an HTTP request from a browser:

sub recv_req {
    my $browser = shift;
    my $content_length = 0;
    my $req = "";
    while (1) {
        my $chunk;
        $browser->recv($chunk, 10000);
        if ($chunk =~ m/Content-Length: (\d*)/) {
            $content_length = $1;
            print "content_length = $content_length\n";
        }
        $req .= $chunk;
        last if ($chunk =~ "\r\n\r\n");
    }
    $req =~ /(.*?)\r\n\r\n(.*)/s;
    if (length($2) > 0) {
        $content_length -= length($2);
        print "after -, content_length = $content_length\n";
        print "[$2]\n"; # I added this line minutes ago to capture $2
    }
    while ($content_length > 0) {
        my $chunk;
        $browser->recv($chunk, $content_length);
        $req .= $chunk;
        $content_length -= length($chunk);
    }
    return $req;
}

For this line,

print "after -, content_length = $content_length\n";

I expect it to print 0 the last time it goes into the loop (if ever). However, it occasionally prints -2. (I cannot remember whether this happens with other web sites, but it definitely happens with certain requests to this site.)

This has been going on for a while, but today I decided to figure out what's going on, so I added:

print "[$2]\n";

to capture the content.

In one instance, I captured this, when I tried to delete a message:

after -, content_length = -2
[node_id=3628&deletemsg_540904478=yup&op=message&message=&message_send
+=talk&.cgi #this line break here is caused by screen width
fields=deletemsg_540904478
]

It seems to me that there is a "\r\n" at the very end of the content, which is not counted in Content-Length, and that's where that -2 comes from.
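The arithmetic behind that -2 can be reproduced without a browser. This is only a sketch under the assumption that the client appends one stray CRLF after the body; the request line, the body, and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical request: the body is 9 bytes, but a stray CRLF follows it
# that the Content-Length header does not count.
my $body = "a=1&b=two";
my $req  = "POST /x HTTP/1.0\015\012"
         . "Content-Length: " . length($body) . "\015\012"
         . "\015\012"
         . $body . "\015\012";

# Split the header block from whatever arrived after the blank line.
my ($head, $rest) = $req =~ /\A(.*?)\015\012\015\012(.*)\z/s;
my ($content_length) = $head =~ /^Content-Length:\s*(\d+)/mi;

# 9 promised bytes minus 11 received bytes gives -2.
my $remaining = $content_length - length($rest);

# Keep only the bytes the header promised; the rest is not body data.
my $trimmed = substr($rest, 0, $content_length);

print "remaining = $remaining\n";
print "[$trimmed]\n";
```

Trimming with substr() keeps the counter from going negative, at the cost of silently discarding whatever follows the declared body.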

I think there is a standards conformance issue somewhere, but I would like others to double-check my script.

Replies are listed 'Best First'.
Re: wrong Content-Length
by dws (Chancellor) on Jan 10, 2004 at 07:04 UTC
    If you're on Win32, throw in a   binmode($browser); (or equivalent).
Re: wrong Content-Length
by Popcorn Dave (Abbot) on Jan 10, 2004 at 06:07 UTC
    I'm not sure if this is going to help you, but I ran into something similar when I was doing some page scraping. Some pages used \r\n, some used \n, and I believe that some even used \r, although it's been a while and I could be wrong about the last one.

    Is it possible that you're checking for a Windows return structure when it's a Unix one?

    Hope that helps!

    There is no emoticon for what I'm feeling now.

Re: wrong Content-Length
by William G. Davis (Friar) on Jan 10, 2004 at 06:58 UTC
            $browser->recv($chunk, 10000);

    Most common platforms use IO blocks of 4096 bytes, so reading *just* that much is usually a good idea: you use exactly one block, which is probably big enough. If you really like 10000, then you should use 10240 instead, but I know of very few platforms that use IO blocks of that size. Even with the smaller block size, odds are you often won't fill it entirely, since most HTTP requests aren't that big, and if one is, you'll probably start reading before much of it has been sent. But at least this way you only use one block instead of the three that 10000 bytes would need.
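A minimal sketch of reading in 4096-byte chunks follows. It assumes a platform where socketpair() with AF_UNIX is available and uses a local socket pair in place of a real browser connection; the 5000-byte payload and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Socket;

# A connected pair of local sockets stands in for browser and proxy.
socketpair(my $client, my $server, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
    or die "socketpair: $!";

# Hypothetical 5000-byte payload, deliberately larger than one block.
my $payload = 'x' x 5000;
defined syswrite($client, $payload) or die "syswrite: $!";
shutdown($client, 1);    # finished writing, so the reader will see EOF

my $req   = '';
my $reads = 0;
while (1) {
    my $chunk = '';
    # Ask for at most one 4096-byte block per call.
    defined recv($server, $chunk, 4096, 0) or die "recv: $!";
    last if length($chunk) == 0;    # EOF: peer closed its write side
    $req .= $chunk;
    $reads++;
}

print "read ", length($req), " bytes in $reads calls\n";
```

Each call returns at most 4096 bytes, so the caller has to loop and accumulate no matter which block size is chosen.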

    last if ($chunk =~ "\r\n\r\n");

    Popcorn Dave addressed this. \r\n, though a common idiom, cannot be used to portably match CRLF. \n means linefeed on Unix, carriage return on MacOS, and on Windows, when reading from a text file, CRLF is converted to LF so \n works properly. Instead of \r\n, use the octals \015\012 to match CRLF. Please also read this.
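As a rough sketch of the octal form, the check below looks for \015\012\015\012 in the accumulated buffer. The helper name headers_complete and the sample chunks are hypothetical. Testing the whole buffer rather than only the latest chunk is a choice made for this sketch, so that a terminator split across two reads is still found.

```perl
use strict;
use warnings;

my $CRLF = "\015\012";   # explicit octets, not the platform-dependent \r\n

# Hypothetical helper: true once the buffer contains the blank line
# that ends the header block.
sub headers_complete {
    my ($buffer) = @_;
    return index($buffer, $CRLF . $CRLF) >= 0 ? 1 : 0;
}

# Simulated chunks; the terminator is deliberately split across the two.
my @chunks = (
    "GET / HTTP/1.0${CRLF}Host: example.invalid\015",
    "\012\015\012",
);

my $req = '';
my @seen;
for my $chunk (@chunks) {
    $req .= $chunk;
    push @seen, headers_complete($req);
}

print "@seen\n";   # prints "0 1"
```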

    Other than that, everything looks pretty good. You might want to consider replacing length() with something else, though, since length() returns the length of a string in characters, whereas Content-Length: contains the size of the POST data/response in bytes, which means multi-byte Unicode characters might cause you some problems. See this node here. Also, many of those regular expressions could probably be replaced with the more efficient simple string manipulation functions like index(), rindex(), and substr().
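Both points can be illustrated with a short sketch. The sample string and request are made up, and Encode is used here only to make the character/byte difference visible. Data that comes straight from recv() on a socket with no encoding layer is normally already octets, so the mismatch mainly appears once a string has been decoded.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical POST value: three characters, one of them multi-byte.
my $text  = "hi\x{263a}";
my $bytes = encode('UTF-8', $text);   # the octets that travel on the wire

my $chars  = length($text);     # 3 characters
my $octets = length($bytes);    # 5 bytes: 1 + 1 + 3

# Split headers from body with index()/substr() instead of a regex.
my $sep = "\015\012\015\012";
my $req = "POST /x HTTP/1.0\015\012Content-Length: $octets" . $sep . $bytes;

my $pos = index($req, $sep);
die "no header terminator found" if $pos < 0;

my $head = substr($req, 0, $pos);
my $body = substr($req, $pos + length($sep));

print "chars=$chars octets=$octets body=", length($body), "\n";
```

Content-Length has to be compared against the octet count (5 here), not the character count (3).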