I don't understand why you used ^\A. Is there a reason?

---
$world=~s/war/peace/g

Re^3: string pattern match, limited to first 1000 characters?
by shmem (Chancellor) on Jun 23, 2007 at 15:23 UTC
    Other than ignorance, no.

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
      Ok, this is quite interesting. I've tried all of the suggested approaches, and based on a sample of actual text that I need to run on, the following is the overall best performing:
      $str =~ /\A.{0,995}?<html/i
      (I added the '<' to the pattern to make it more specific, and also removed the /s modifier -- in my situation, both of these tweaks boosted performance.)
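
A side effect of dropping /s worth checking: without it, "." does not match a newline, so the pattern can only succeed when "<html" sits on the first line of the string. The sketch below compares the two variants; the helper names has_early_html and has_early_html_multiline are made up for illustration and are not from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: the pattern from the post, without /s.
# \A anchors at the start of the string; the lazy .{0,995}? allows at most
# 995 characters before the 5-character "<html", so a match ends by 1000.
# Without /s, "." cannot match "\n", so the tag must be on the first line.
sub has_early_html {
    my ($str) = @_;
    return $str =~ /\A.{0,995}?<html/i ? 1 : 0;
}

# Hypothetical helper: same pattern with /s, so "." may cross newlines.
sub has_early_html_multiline {
    my ($str) = @_;
    return $str =~ /\A.{0,995}?<html/si ? 1 : 0;
}

my $doc = "<!DOCTYPE html>\n<HTML><head></head></HTML>";
print has_early_html($doc), "\n";                                  # 0
print has_early_html_multiline($doc), "\n";                        # 1
print has_early_html_multiline(("x" x 996) . "<html>"), "\n";      # 0
```

If the real input usually starts with a doctype or a blank line, the /s version may be the one that matches what is actually wanted, even if it benchmarks slower.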

      But now the plot thickens ... what I just realized is that sometimes my content is compressed (i.e. served by an Apache web server with mod_deflate, aka "Content-Encoding: gzip", or deflate, or compress). I have used some good CPAN modules that inflate this type of content, but now I'm faced with the same dilemma ... if I've got a 20KB gzipped file (100KB inflated), and I still only care about checking the first 1000 characters of the inflated content, is there a way to do a "partial inflate" so I don't have to incur the full overhead of inflating the whole file? I know this seems like a long shot, but I figured I'd ask for ideas.

      MFN
        So you are just checking whether there's an <html> tag in some data stream scraped off a webserver? That explains why the regexp is fastest - normally opening html tags are pretty much at the beginning of an HTML page.

        Dunno whether partial gunzipping is possible with a Perl module, but you can always open a pipe to/from gunzip and kill the process off once you have read 1000 bytes.
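
For what it's worth, the same "read a bit, then stop" idea looks doable in-process with IO::Uncompress::Gunzip, which inflates incrementally as you read from it. The following is only a sketch under assumptions: the compressed body is already in a scalar, it is gzip-encoded (deflate content would need something like IO::Uncompress::AnyInflate instead), and the helper name inflate_prefix and the BlockSize value of 512 are invented for illustration. How much work this really saves depends on the block size, since each compressed block that is read gets inflated in full.

```perl
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

# Hypothetical helper: inflate at most $limit bytes of a gzipped string.
sub inflate_prefix {
    my ($gz, $limit) = @_;

    # A small BlockSize (an assumed value) keeps each read of compressed
    # input small, so less data is inflated than is strictly needed.
    my $z = IO::Uncompress::Gunzip->new(\$gz, BlockSize => 512)
        or die "gunzip failed: $GunzipError\n";

    my $out = '';
    while (length($out) < $limit) {
        my $chunk = '';
        my $n = $z->read($chunk, $limit - length($out));
        die "read failed: $GunzipError\n" if $n < 0;
        last if $n == 0;    # end of compressed stream
        $out .= $chunk;
    }
    $z->close;              # the rest of the stream is never inflated
    return $out;
}

# Usage: build a gzipped sample in memory, then inflate only its head.
my $html = "<html>" . ("x" x 100_000);
gzip(\$html => \my $gz) or die "gzip failed: $GzipError\n";

my $head = inflate_prefix($gz, 1000);
print length($head), "\n";    # 1000
```

The returned prefix can then go straight into the regexp check, without the full document ever being inflated.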

        May I ask what nefarious purpose you need all that for?

        --shmem
