js1 has asked for the wisdom of the Perl Monks concerning the following question:

Enlightened Ones,

I have an HTTP log file which I need to parse and pull certain values out of. I started the script below to do this but got stuck when I realised that one of the strings I need to test for spans 2 lines. The string I'm trying to find here is "Content Compression to Date". Here's the relevant extract:

    <tr>
    <td height="16" width="180" class="jnpsInput" nowrap><strong>Content Compression to Date</strong></td>
    <td height="16" width="19" class="jnpsInput" nowrap><strong>=</strong></td>

This is my script:

    while(<>){
        if(/Content-Length:\s*(\d*)/){
            $contentlength=$1;
            print "\nContent-Length: $contentlength" if $DEBUG;
        }
        next if $contentlength < 15000;
        if(/Expires:\s*\S*\s*\S*\s*\S*\s*\S*\s*(\S*)/){
            $expires=$1;
            print "\nExpires: $expires" if $DEBUG;
        }
        if(/**** ???? ******/){
            print "\nContent Compression to Date found\n" if $DEBUG;
        }
    }

Does anyone know what regex I need to find this?

Thanks for any help.

js1.

Replies are listed 'Best First'.
Re: multiple line regex
by matija (Priest) on Apr 23, 2004 at 09:13 UTC
    When you try to parse HTML with a regex, you are in a state of sin. When you try to parse HTML tables with regexes, you are really riding the bad karma train.

    If you insist on parsing HTML yourself, do yourself a favor and use HTML::Parser or HTML::TokeParser.

    Given that your data is in an HTML table, though, I strongly recommend HTML::TableExtract.
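
    A minimal sketch of the HTML::TokeParser route, assuming the module (part of the HTML-Parser distribution on CPAN, not core Perl) is installed. The helper name strong_labels and the sample markup are made up for illustration. get_trimmed_text collapses runs of whitespace, newlines included, so a label broken across two lines comes back as a single string:

    ```perl
    use strict;
    use warnings;
    use HTML::TokeParser;    # CPAN module, assumed to be installed

    # Hypothetical helper: return the whitespace-normalised text of every
    # <strong> element found in $html.
    sub strong_labels {
        my ($html) = @_;
        my $p = HTML::TokeParser->new( \$html ) or die "cannot parse input";
        my @labels;
        while ( $p->get_tag('strong') ) {
            # Reads up to the closing tag and collapses whitespace
            push @labels, $p->get_trimmed_text('/strong');
        }
        return @labels;
    }

    # Made-up sample in which the label is split across two lines
    my $html = "<tr>\n<td nowrap><strong>Content Compression\nto Date</strong></td>\n"
             . "<td nowrap><strong>=</strong></td>\n</tr>\n";

    print "$_\n" for strong_labels($html);
    ```

    Comparing each returned label with eq 'Content Compression to Date' then takes the place of the multi-line regex.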

      I think he is not really parsing the file - he is rather just scanning it for particular values - so I think in this case using regexps is justified. Of course that's just a guess.
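
      If the log is small enough to read into memory in one go, that kind of scan might be sketched as follows. The helper name count_phrase, the sample text and the file name in the trailing comment are placeholders, not anything from the original post:

      ```perl
      use strict;
      use warnings;

      # Hypothetical helper: count how often the phrase occurs in $text.
      # \s+ lets the words be separated by spaces, tabs or newlines.
      sub count_phrase {
          my ($text) = @_;
          my $count = () = $text =~ /Content\s+Compression\s+to\s+Date/g;
          return $count;
      }

      # Made-up sample standing in for a slurped log file
      my $log = "<strong>Content Compression\nto Date</strong>\n"
              . "other stuff\n"
              . "Content Compression to Date\n";

      print count_phrase($log), "\n";

      # Slurping a real file could look like this (path is a placeholder):
      # my $text = do { local $/; open my $fh, '<', 'access.log' or die $!; <$fh> };
      ```

      Because the whole text is in a single string, no line buffering is needed; the cost is holding the entire file in memory.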

        Parse

        4. Computer Science. To analyze or separate (input, for example) into more easily processed components.

        Granted, this is a log file with a (probably) fixed format, so it's only slightly evil to use a regex instead of a proper parser. But it is parsing nonetheless.

Re: multiple line regex
by BrowserUk (Patriarch) on Apr 23, 2004 at 09:37 UTC

    If you can't slurp the whole file then you could do something like this.

    #! perl -slw
    use strict;

    my $buffer;

    while( <DATA> ) {
        $buffer .= $_;    # Accumulate lines in a buffer

        # Use \s+ instead of space to allow for newlines between words
        # The /g option is required to set pos()
        if( $buffer =~ /Content\s+Compression\s+to\s+Date/g ) {
            print "Content Compression to Date found";

            # Once you find something you're looking for
            # throw away everything that preceded it with substr()
            substr( $buffer, 0, pos( $buffer ) ) ='';
        }
    }

    __DATA__
    Content Compression to Date
    other stuff
    other stuff
    other stuff
    Content Compression
    to Date
    other stuff
    other stuff
    other stuff
    Content
    Compression to Date
    Content Compression to
    Date
    other stuff
    other stuff
    Content Compression to Date
    other stuff
    other stuff

    Output

    P:\test>test2
    Content Compression to Date found
    Content Compression to Date found
    Content Compression to Date found
    Content Compression to Date found
    Content Compression to Date found

    Notes:

    You will need to use the /g option on ALL the regexes, even though you are only looking for the first match each time, because pos() is only set by a match that uses /g. (I never quite understood that.)

    You will also need to do the substr thing to throw away everything already matched after each successful match.
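
    The two notes can be sketched in isolation like this. The helper name consume_match and the sample buffer are hypothetical. The sketch also shows $+[0], the end offset of the last successful match, which is available without /g; that is an alternative, not what the code above uses:

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: if the phrase is in the buffer, drop everything up
    # to and including the match and return true. Takes a reference so that
    # the caller's buffer is modified.
    sub consume_match {
        my ($buf_ref) = @_;
        if ( $$buf_ref =~ /Content\s+Compression\s+to\s+Date/g ) {
            # pos() is the offset just past the match; cut up to there
            substr( $$buf_ref, 0, pos($$buf_ref) ) = '';
            return 1;
        }
        return 0;
    }

    my $buffer = "junk Content\nCompression to Date tail";

    # Without /g the match succeeds but pos() is left unset
    $buffer =~ /Content/;
    print defined pos($buffer) ? "pos is set\n" : "pos is undef\n";

    # $+[0] gives the end of the last successful match, with or without /g
    print "end offset via \@+: ", $+[0], "\n";

    print "found\n" if consume_match( \$buffer );
    print "remaining: [$buffer]\n";
    ```

    Trimming the buffer after each hit keeps it from growing with text that has already been matched, and stops the same occurrence from being reported twice.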


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail

      Thanks for your replies. Both of them are very useful.

      js1.