multiple line regex

js1 has asked for the wisdom of the Perl Monks concerning the following question:

Enlightened Ones,

I have an HTTP log file that I need to parse to pull out certain values. I started the script below, but got stuck when I realised that one of the strings I need to test for spans two lines. The string I'm trying to find is "Content Compression to Date". Here's the relevant extract:

 <tr>
            <td height="16" width="180" class="jnpsInput" nowrap><strong>Content
              Compression to Date</strong></td>
            <td height="16" width="19" class="jnpsInput" nowrap><strong>=</strong></td>

This is my script:

while(<>){

        if(/Content-Length:\s*(\d*)/){
                $contentlength=$1;
                print "\nContent-Length: $contentlength" if $DEBUG;
        }

        next if $contentlength < 15000;

        if(/Expires:\s*\S*\s*\S*\s*\S*\s*\S*\s*(\S*)/){
                $expires=$1;
                print "\nExpires: $expires" if $DEBUG;
        }

        if(/**** ???? ******/){
                print "\nContent Compression to Date found\n" if $DEBUG;
        }
}

Does anyone know what regex I need to find this?

Thanks for any help.

js1.


Re: multiple line regex by matija (Priest) on Apr 23, 2004 at 09:13 UTC
When you try to parse HTML with a regex, you are in a state of sin. When you try to parse HTML tables with regexes, you are really riding the bad karma train. If you insist on parsing HTML yourself, do yourself a favor and use HTML::Parser or HTML::TokeParser. Given that your data is in an HTML table, though, I strongly recommend HTML::TableExtract.
Re: Re: multiple line regex by zby (Vicar) on Apr 23, 2004 at 10:17 UTC
I think he is not really parsing the file - he is rather just scanning it for particular values - so I think in this case using regexps is justified. Of course that's just a guess.
Re: Re: Re: multiple line regex by Fletch (Bishop) on Apr 23, 2004 at 12:46 UTC
Parse: 4. Computer Science. To analyze or separate (input, for example) into more easily processed components. Granted, this is a log file with a (probably) fixed format, so it's only slightly evil to use a regex instead of a proper parser. But it is parsing nonetheless.
Re: Re: Re: Re: multiple line regex by zby (Vicar) on Apr 23, 2004 at 14:33 UTC
Re: multiple line regex by BrowserUk (Patriarch) on Apr 23, 2004 at 09:37 UTC
If you can't slurp the whole file then you could do something like this.

#! perl -slw
use strict;

my $buffer;

while( <DATA> ) {
    $buffer .= $_;    # Accumulate lines in a buffer

    # Use \s+ instead of space to allow for newlines between words
    # The /g option is required to set pos()
    if( $buffer =~ /Content\s+Compression\s+to\s+Date/g ) {
        print "Content Compression to Date found";

        # Once you find something you're looking for
        # throw away everything that preceded it with substr()
        substr( $buffer, 0, pos( $buffer ) ) ='';
    }
}

__DATA__
Content Compression to Date
other stuff
other stuff
other stuff
Content Compression to Date
other stuff
other stuff
other stuff
Content Compression to Date
Content Compression to Date
other stuff
other stuff
Content Compression to Date
other stuff
other stuff

Output

P:\test>test2
Content Compression to Date found
Content Compression to Date found
Content Compression to Date found
Content Compression to Date found
Content Compression to Date found

Notes: You will need to use the /g option on ALL the regexes, even though you are only looking for the first match each time, in order for pos to be set. (I never quite understood that?) You will also need to do the substr thing after each successful match to throw away everything already matched.

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
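For comparison, here is a rough sketch of two variations on the above, not a drop-in replacement for the original script. The first is the slurp approach, for when the whole log does fit in memory. The second is the same line-by-line buffer, but it reads the end of the match from $+[0] (the @+ array in perlvar) rather than pos(), so the regex does not need /g. The helper names count_in_text and count_in_lines and the sample data are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count the phrase in a string holding the whole
# input. \s+ matches spaces, tabs and newlines, so a phrase that wraps
# across lines is still found.
sub count_in_text {
    my ($text) = @_;
    my $count = () = $text =~ /Content\s+Compression\s+to\s+Date/g;
    return $count;
}

# Hypothetical helper: read line by line, accumulating a buffer.
# $+[0] is the offset just past the last successful match, so it can
# stand in for pos() and the match does not need /g.
sub count_in_lines {
    my ($fh) = @_;
    my ( $buffer, $count ) = ( '', 0 );
    while ( my $line = <$fh> ) {
        $buffer .= $line;
        # Loop rather than test once, in case one line holds several matches.
        while ( $buffer =~ /Content\s+Compression\s+to\s+Date/ ) {
            $count++;
            my $end = $+[0];
            substr( $buffer, 0, $end, '' );    # drop everything through the match
        }
    }
    return $count;
}

# Made-up sample: the first occurrence wraps across two lines.
my $sample = "<strong>Content\n   Compression to Date</strong>\n"
           . "other stuff\n"
           . "Content Compression to Date\n";

print count_in_text($sample), "\n";    # prints 2

# An in-memory filehandle stands in for the real log file here.
open my $fh, '<', \$sample or die "Cannot open in-memory handle: $!";
print count_in_lines($fh), "\n";       # prints 2
close $fh;
```

With a real file, the usual slurp idiom is `my $text = do { local $/; <$fh> };`, after which count_in_text($text) applies directly. The buffered version never discards text that has not matched, so on a very large log you would probably also want to trim the buffer down to its last line or two after each read.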
Re: Re: multiple line regex by js1 (Monk) on Apr 23, 2004 at 09:52 UTC
Thanks for your replies. Both of them are very useful. js1.