Other than ignorance, no.
--shmem
_($_=" "x(1<<5)."?\n".q·/)Oo. G°\ /
/\_¯/(q /
---------------------------- \__(m.====·.(_("always off the crowd"))."·
");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Ok, this is quite interesting. I've tried all of the suggested approaches, and based on a sample of actual text that I need to run on, the following is the overall best performing:
$str =~ /\A.{0,995}?<html/i
(I added the '<' to the text to make it more specific, and also removed the /s qualifier -- in my situation, both of these tweaks boosted performance.)
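For anyone wanting to try this on their own data, here is a minimal sketch of that check wrapped in a sub. The name `has_early_html_tag` is made up for illustration; only the regex comes from the post above.

```perl
use strict;
use warnings;

# Hypothetical helper: true if "<html" begins within the first 1000 characters.
sub has_early_html_tag {
    my ($str) = @_;
    # \A anchors at the start of the string; .{0,995}? lazily skips up to
    # 995 characters, so "<html" (5 characters) must end by character 1000.
    # Without /s, "." does not match "\n".
    return $str =~ /\A.{0,995}?<html/i ? 1 : 0;
}

print has_early_html_tag("<HTML><body>hi</body></HTML>") ? "match\n" : "no match\n";
```

One thing to watch: because /s was dropped, a newline before the tag (say, a doctype line followed by `<html>` on the next line) makes the match fail. That is harmless only if the real input never has such a newline.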
But now the plot thickens ... what I just realized is that sometimes my content is gzipped (i.e. served by an Apache web server with mod_deflate, aka "Content-Encoding: gzip", or deflate, or compress). I have used some good CPAN modules that inflate this type of content, but now I'm faced with the same dilemma ... if I've got a 20KB gzipped file (100KB inflated), and I still only care about checking the first 1000 characters of the inflated content, is there a way to do a "partial inflate" so I don't have to incur the full overhead of a total-file inflation? I know this seems like a long-shot, but I figured I'd ask for ideas.
MFN
So you are just checking whether there's a <html> tag in some data stream scraped off a webserver? That explains why the regexp is fastest - normally opening html tags are pretty much at the beginning of an HTML page.
Dunno whether partial gunzipping is possible with a Perl module, but you can always open a pipe to/from gunzip and kill the process off once you have read 1000 bytes.
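For what it's worth, a partial inflate does look doable with IO::Uncompress::Gunzip, whose `read` method accepts a length. The sketch below is untested against real mod_deflate output; the helper name `inflate_prefix` and the demo data are made up for illustration, and it assumes the gzipped body is already in memory as a scalar.

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

# Hypothetical helper: inflate at most $want bytes from an in-memory gzip buffer.
sub inflate_prefix {
    my ($gz_ref, $want) = @_;
    my $z = IO::Uncompress::Gunzip->new($gz_ref)
        or die "gunzip failed: $GunzipError";
    my $buf = '';
    # read() with an explicit length returns once $want bytes are available
    # (or the stream ends); the stream is consumed block by block.
    my $status = $z->read($buf, $want);
    die "read failed: $GunzipError" if $status < 0;
    $z->close;
    return $buf;
}

# Demo data (assumption): roughly 100 KB of filler behind an <html> tag.
my $page = "<html><body>" . ("lorem ipsum " x 9000) . "</body></html>";
gzip(\$page => \my $gz) or die "gzip failed: $GzipError";

my $head = inflate_prefix(\$gz, 1000);
print length($head), " bytes inflated\n";
print "looks like HTML\n" if $head =~ /\A.{0,995}?<html/i;
```

The module pulls compressed input in chunks, so somewhat more than 1000 bytes may get inflated internally before `read` returns. How much that saves on a 20KB file would need benchmarking; the constructor's `BlockSize` option is the knob to experiment with.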
May I ask what nefarious purpose you need all that for?
--shmem