Re^2: string pattern match, limited to first 1000 characters?

Re^3: string pattern match, limited to first 1000 characters? by shmem (Chancellor) on Jun 23, 2007 at 15:23 UTC
Other than ignorance, no.
--shmem _($_=" "x(1<<5)."?\n".q·/)Oo. G°\ / /\_¯/(q / ---------------------------- \__(m.====·.(_("always off the crowd"))."· ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re^4: string pattern match, limited to first 1000 characters? by ManFromNeptune (Scribe) on Jun 24, 2007 at 04:40 UTC
Ok, this is quite interesting. I've tried all of the suggested approaches, and based on a sample of the actual text I need to run on, the following is the best performing overall: `$str =~ /\A.{0,995}?<html/i` (I added the '<' to the text to make it more specific, and also removed the /s qualifier -- in my situation, both of these tweaks boosted performance.)

But now the plot thickens ... what I just realized is that sometimes my content is gzipped (i.e. served by an Apache web server with mod_deflate, aka "Content-Encoding: gzip", or deflate, or compress). I have used some good CPAN modules that inflate this type of content, but now I'm faced with the same dilemma ... if I've got a 20KB gzipped file (100KB inflated), and I still only care about checking the first 1000 characters of the inflated content, is there a way to do a "partial inflate" so I don't have to incur the full overhead of inflating the whole file? I know this seems like a long shot, but I figured I'd ask for ideas.

MFN
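A small sketch of how the bounded pattern behaves may help here. The helper name `html_near_start` is hypothetical, not something from the thread. The sketch keeps the /s modifier on purpose: without it, `.` does not match a newline, so a pattern anchored at `\A` cannot get past a line break before `<html` (for example, a DOCTYPE line followed by a newline). Whether that matters depends on the input.

```perl
use strict;
use warnings;

# Hypothetical helper: true if "<html" starts within the first 995
# characters, so the five-character literal ends by character 1000.
sub html_near_start {
    my ($str) = @_;
    # \A anchors at the start of the string; .{0,995}? lazily skips at
    # most 995 characters; /s lets "." cross newlines; /i ignores case.
    return $str =~ /\A.{0,995}?<html/is ? 1 : 0;
}

# Same pattern without /s, as used in the post, for comparison.
sub html_near_start_no_s {
    my ($str) = @_;
    return $str =~ /\A.{0,995}?<html/i ? 1 : 0;
}
```

With these definitions, `html_near_start("<!DOCTYPE html>\n<HTML>")` is true while `html_near_start_no_s` on the same string is false, and both are false when `<html` first appears 2000 characters in.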
Re^5: string pattern match, limited to first 1000 characters? by shmem (Chancellor) on Jun 24, 2007 at 10:04 UTC
So you are just checking whether there's a `<html>` tag in some data stream scraped off a webserver? That explains why the regexp is fastest - normally opening html tags are pretty much at the beginning of an HTML page. Dunno whether partial gunzipping is possible with a perl module, but you can always open a pipe to/from gunzip and kill the process off once you have read 1000 bytes. May I ask what nefarious purpose you need all that for?
--shmem
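On the partial-inflate question, one option that avoids an external gunzip process is a streaming read. This is a minimal sketch, assuming the IO::Uncompress::Gunzip module is installed and the gzipped body is already held in a scalar; the helper name `inflate_prefix` is hypothetical. It reads until it has the requested number of inflated bytes and then stops. It has not been benchmarked against a full inflate.

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw($GunzipError);

# Hypothetical helper: return at most $limit bytes of inflated data
# from the gzipped scalar referenced by $gz_ref.
sub inflate_prefix {
    my ($gz_ref, $limit) = @_;

    my $z = IO::Uncompress::Gunzip->new($gz_ref)
        or die "gunzip failed: $GunzipError\n";

    my $out = '';
    while (length($out) < $limit) {
        my $chunk = '';
        # read() returns the number of inflated bytes, 0 at end of
        # stream, or a negative value on error.
        my $n = $z->read($chunk, $limit - length($out));
        die "gunzip failed: $GunzipError\n" if $n < 0;
        last if $n == 0;
        $out .= $chunk;
    }
    $z->close;
    return $out;
}
```

Usage would look like `my $head = inflate_prefix(\$gzipped, 1000);` followed by the regexp check on `$head`. For "Content-Encoding: deflate", IO::Uncompress::AnyInflate offers the same interface and may be a better fit, though that is untested here.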
Re^6: string pattern match, limited to first 1000 characters? by ManFromNeptune (Scribe) on Jun 25, 2007 at 17:32 UTC