ww has asked for the wisdom of the Perl Monks concerning the following question:

SOLVED. See reply to self below

There perhaps should be a question mark in the title.

Caveat: Fri nite & Sat am brainlock? Maybe. But...

I'm trying to extract from the output of a linkchecker all chunks which report errors. My problem? This minimal test:

    #! /usr/bin/perl -w
    use strict;
    use 5.018;

    # test errmsg match
    # sample (and partial; see the chunking in the next code) errmsgs from file:
    #   Result</td><td bgcolor="#db4930">Error: 404....</td>    ( or 301 etc.)
    #   Result</td><td bgcolor="#db4930">Error: SSLError: [Errno 1] _ssl.c:504:....</td>

    my $errmsg = qr[Result</td><td bgcolor=".{7}">Error:.*?(?=</td>)];

    my @data_sample = (
        '<tr><td bgcolor="#db4930">Result</td><td bgcolor="#db4930">Error: 404 Not Found</td></tr>',
        '<tr><td bgcolor="#db4930">Result</td><td bgcolor="#db4930">Error: SSLError: [Errno 1] _ssl.c:504: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed</td></tr>',
        '<tr><td foo bar baz> abcde </td></tr>'
    );

    my $data_line;
    for $data_line(@data_sample) {
        if ( $data_line =~ /$errmsg/ ) {
            say "\t FOUND IT: $data_line \n";
        } else {
            say "\t NO MATCH ON $data_line \n";
        }
    }
    produces the expected results:
    C:\>test_err_finder.pl
             FOUND IT: <tr><td bgcolor="#db4930">Result</td><td bgcolor="#db4930">Error: 404 Not Found</td></tr>

             FOUND IT: <tr><td bgcolor="#db4930">Result</td><td bgcolor="#db4930">Error: SSLError: [Errno 1] _ssl.c:504: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed</td></tr>

             NO MATCH ON <tr><td foo bar baz> abcde </td></tr>

    C:\>

whereas, the selfsame ( $errmsg) regex here:

    #!/usr/bin/perl -w
    use strict;
    use 5.018;

    # find linkchecker error reports in html report, linkchecker-out20151120.html

    $/ = '<table align="left" border="0" cellspacing="0" cellpadding="1"';

    my ($fh, $item, @erritems);
    my $trterminator = qr[</tr>\n</table></td></tr></table>];

    # errmsg from file:
    #   Result</td><td bgcolor="#db4930">Error: 404....</td>    ( or 301 etc.)
    #   Result</td><td bgcolor="#db4930">Error: SSLError: [Errno 1] _ssl.c:504:....</td>
    my $errmsg = qr[Result</td><td bgcolor=".{7}">Error:.*?(?=</td>)];
    my $eot = qr[</small></body></html>];

    open ($fh, "<", 'linkchecker-out20151120.html') or die "Can't open, $!";

    while (<$fh> ) {
        if ( $_ =~ /$eot/ ) {
            last;
        } else {
            $_ = <$fh>;
            $item = $_;
            $item =~ s/\n//gs;
            $item .= "\n\n";
        }
        if ( $item =~ /$errmsg/ ) {
            push @erritems, $item;
        }
    }

    say "Errors id'ed in LinkChecker output, 'linkchecker-out20151120.html'\n";

    for $_(@erritems) {
        print $/;
        say $_;
    }

    catches the SSL issues BUT FAILS TO OUTPUT THE '404' ERRORS (of which there is exactly one in the linkchecker log)!

Many "print ($var);" debugging items have been removed here, but all point to consistency between the actual file contents and the minimal test above. Thus, contrary to the usual wise advice to include data, I'm omitting it for now, since even individual chunks run to about 0.5KB and even three samples (out of approximately 1000 chunks) would extend this verbose query to "TL;DR" status.

I'm hoping fresh eyes or greater wisdom will spot what I'm missing.

Replies are listed 'Best First'.
Re: Example of inconsistent regex matching
by RichardK (Parson) on Nov 21, 2015 at 16:33 UTC

    Parsing HTML by hand is always fragile, and in my experience never ends well. Have you considered a module like HTML::TreeBuilder and then use its look_down method to select just the elements you're interested in?

      Having expressed that wisdom myself, I blush to admit I expected to write a quick and dirty one-off for a job that -- while tedious -- I could have done in far less time manually, and, undoubtedly, were I familiar with HTML::TreeBuilder or some of its cousins, with one of those.

      But the truly horrid html (tables nested in tables; etc.) also disinclined me to attack it with a tool that pretty much expects (as I understand it; pls correct me if this is wrong) more-or-less standards-compliant source.

      In short, ++; thank you ... but at this point, I'm far more interested in the apparent regex anomaly than in re-writing code that might get used as often as annually.


      ++$anecdote ne $data

        What's wrong with nested tables?
        The way forward always starts with a minimal test.
Re: Example of inconsistent regex matching
by AnomalousMonk (Archbishop) on Nov 21, 2015 at 16:02 UTC

        while (<$fh> ) {
            if ( $_ =~ /$eot/ ) {
                last;
            } else {
                $_ = <$fh>;
                $item = $_;
                $item =~ s/\n//gs;
                $item .= "\n\n";
            }
            if ( $item =~ /$errmsg/ ) {
                push @erritems, $item;
            }
        }

    Maybe I'm just missing something too, and I certainly don't understand the structure of your data file, but here's what I understand of the code fragment above:
    • read a line | paragraph;
    • test if the line | paragraph has an  $eot match and exit the loop (and error log processing) if so;
    • if the line | paragraph is not an  $eot match, throw it away and read the next line | paragraph from the file (update: which is not tested for an  $eot match);
    • do some newline massaging on the line | paragraph;
    • push the (massaged) line | paragraph to the  @erritems array if it is an  $errmsg match.
    What I don't understand is why you're throwing away half the lines | paragraphs in the file up to the point of the  $eot line | paragraph match.

    Update: I just noticed you're reading the file in paragraph mode
        $/ = '<table align="left" border="0" cellspacing="0" cellpadding="1"';
    and so changed all instances of line(s) to paragraph(s).
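
    The effect of the second read can be seen in isolation. This is a minimal sketch, not the OP's script: the six-record string, the '|' separator and the sub name records_examined are all made up for illustration, and the file is an in-memory one.

```perl
use strict;
use warnings;
use 5.018;

# Hypothetical stand-in for the report: six records separated by '|'.
my $data = 'r1|r2|r3|r4|r5|r6|';

# Mirror the shape of the original loop: one read in the while condition,
# a second read inside the body.
sub records_examined {
    my ($text) = @_;
    local $/ = '|';                  # custom record separator, as in the OP
    local $_;                        # keep the caller's $_ intact
    open my $fh, '<', \$text or die "Can't open in-memory file: $!";
    my @seen;
    while (<$fh>) {                  # read #1: in the OP, only tested against $eot
        $_ = <$fh>;                  # read #2: overwrites the record from read #1
        last unless defined;         # guard for an odd number of records
        chomp;                       # strips the trailing '|' (current $/)
        push @seen, $_;
    }
    close $fh;
    return @seen;
}

my @seen = records_examined($data);
say "examined: @seen";               # prints "examined: r2 r4 r6"
```

    Only every second record reaches the point where the OP's code applies $errmsg, so an error sitting in an odd-numbered record is never tested.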


    Give a man a fish:  <%-{-{-{-<

Re: Example of inconsistent regex matching
by Athanasius (Archbishop) on Nov 22, 2015 at 04:03 UTC

    Hello ww,

    The $errmsg regex is working fine, as the following demonstrates:

    #! perl -w
    use strict;
    use 5.018;

    my $start  = qr{<table align="left" border="0" cellspacing="0" cellpadding="1"};
    my $end    = qr{</table>};
    my $errmsg = qr{Result</td><td bgcolor=".{7}">Error:.*?(?=</td>)};

    while (<DATA>)
    {
        /$errmsg/ && say if /$start/ .. /$end/;
    }

    __DATA__
    <html><body><small>
    <table align="left" border="0" cellspacing="0" cellpadding="1">
    <tr><td bgcolor="#db4930">Result</td><td bgcolor="#db4930">Error: 404 Not Found</td></tr>
    </table>
    <table align="left" border="0" cellspacing="0" cellpadding="1">
    <tr><td bgcolor="#db4930">Result</td><td bgcolor="#db4930">Error: SSLError: [Errno 1] _ssl.c:504: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed</td></tr>
    </table>
    <table align="left" border="0" cellspacing="0" cellpadding="1">
    <tr><td foo bar baz> abcde </td></tr>
    </table>
    <tr><td bgcolor="INVALID">Result</td><td bgcolor="#db4930">Error: 404 Not Found</td></tr>
    </small></body></html>

    Output:

    13:55 >perl 1458_SoPW.pl
    <tr><td bgcolor="#db4930">Result</td><td bgcolor="#db4930">Error: 404 Not Found</td></tr>

    <tr><td bgcolor="#db4930">Result</td><td bgcolor="#db4930">Error: SSLError: [Errno 1] _ssl.c:504: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed</td></tr>

    13:55 >

    The main limitation of the above approach is that it fails to handle nested tables.

    As AnomalousMonk and tye have indicated, the problem almost certainly lies in the logic used to split the input into “paragraphs.” If I were debugging this, I’d begin by printing out the value of $item immediately before the line if ( $item =~ /$errmsg/ ) {.
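
    To see what each “paragraph” actually contains when $/ is a string, it can help to dump the records on a tiny input. The following is only a sketch: the miniature $html, the short '<table>' separator (standing in for the long '<table align="left" ...' string) and the sub name split_records are assumptions for illustration.

```perl
use strict;
use warnings;
use 5.018;

# Hypothetical miniature of the report; '<table>' stands in for the long
# '<table align="left" ...' separator used in the original script.
my $html = 'head<table>A: Error: 404</td><table>B: ok</td><table>C: Error: SSL</td></html>';

sub split_records {
    my ($text, $sep) = @_;
    local $/ = $sep;              # records END with the separator string
    open my $fh, '<', \$text or die "Can't open in-memory file: $!";
    my @records = <$fh>;          # list context: every record, separator still attached
    close $fh;
    return @records;
}

my @records = split_records($html, '<table>');
printf "%d: [%s]\n", $_, $records[$_] for 0 .. $#records;
```

    Each record ends with the separator, so the opening tag of a table is attached to the end of the preceding record, and the final record holds the last table's content together with the closing markup.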

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: Example of inconsistent regex matching (., /s)
by tye (Sage) on Nov 21, 2015 at 19:04 UTC

    In a Perl regex, . doesn't match the newline character by default.
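
    A minimal sketch of that behaviour, using a made-up chunk with an embedded newline and a simplified version of the pattern:

```perl
use strict;
use warnings;
use 5.018;

# Hypothetical chunk with a newline inside the error text.
my $chunk = "Error: 404\nNot Found</td>";

my $without_s = $chunk =~ m{Error:.*?</td>}  ? 1 : 0;   # . cannot cross "\n"
my $with_s    = $chunk =~ m{Error:.*?</td>}s ? 1 : 0;   # /s lets . match "\n"

say "without /s: $without_s";   # prints "without /s: 0"
say "with /s:    $with_s";      # prints "with /s:    1"
```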

    - tye        

      Thank you, tye... but "brevity" and "baffled" both begin with "b" and I don't understand your observation. Have I missed such a use? I've removed all the newlines internal to each chunk (paragraph) and replaced them with a pair of \n (at Ln29-30) before applying the regex (or so I think).

        My second guess would be that you are only reporting one match per paragraph and you don't look for a match in the paragraph that contains $eot. So one of the places where you think there is a "\n\n", there might actually be a "\n \n" or even "\n\r\n", which won't count as a paragraph boundary to Perl. The missing error could be clumped together with its prior error in a single paragraph that looks like 2 paragraphs. Or it could be similarly clumped with the $eot.

        And it would be easier for you to debug the situation than for me to make guesses remotely.

        - tye        

Re: Example of brainfog (Was: inconsistent regex matching)
by ww (Archbishop) on Nov 22, 2015 at 15:35 UTC

    First, thanks to all who tried to help... and especially to those (most of you) who nailed the problem immediately, pointing out that the way I used the input record separator, $/, had logical problems, as AnomalousMonk, tye (second guess) and Athanasius noted.

    The solution involved simply moving - as an elsif... - the /$eot/ test down below the regex looking for error messages (Ln33-34 in the OP's "whereas...." code block).
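
    The corrected script itself is not shown here, so the following is only one possible reading of that fix. The miniature $report, the short '<table>' separator and the sub name find_errors are made up for illustration; $errmsg and $eot are copied from the OP. The sketch also reads exactly one record per iteration.

```perl
use strict;
use warnings;
use 5.018;

my $errmsg = qr[Result</td><td bgcolor=".{7}">Error:.*?(?=</td>)];
my $eot    = qr[</small></body></html>];

# Hypothetical miniature report; '<table>' stands in for the real separator.
# The only 404 sits in the last record, together with the $eot markup.
my $report = join '',
    '<html><body><small><table>',
    qq{<tr><td bgcolor="#db4930">Result</td><td bgcolor="#db4930">Error: SSLError: x</td></tr>\n<table>},
    qq{<tr><td>Result</td><td>Valid</td></tr>\n<table>},
    qq{<tr><td bgcolor="#db4930">Result</td><td bgcolor="#db4930">Error: 404 Not Found</td></tr>\n},
    '</small></body></html>';

sub find_errors {
    my ($text, $sep) = @_;
    local $/ = $sep;
    open my $fh, '<', \$text or die "Can't open in-memory file: $!";
    my @erritems;
    while ( my $item = <$fh> ) {        # exactly one read per iteration
        $item =~ s/\n//g;               # drop newlines inside the chunk
        if ( $item =~ /$errmsg/ ) {     # look for an error first ...
            push @erritems, $item;
        }
        elsif ( $item =~ /$eot/ ) {     # ... and only then for end-of-text
            last;
        }
    }
    close $fh;
    return @erritems;
}

my @erritems = find_errors($report, '<table>');
say scalar(@erritems), " error chunk(s) found";   # prints "2 error chunk(s) found"
```

    With the $eot test first, the last record would end the loop before its 404 was ever examined.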

    ... and for anyone still reading, mea culpa, I should have noted (in the OP!) that the html in the raw data is horribly and unnecessarily convoluted and that the /$eot/ sequence occurs only once at the very end of the data file. Its only utility is to suppress an inconsequential warning. Also, my $trterminator = qr[</tr>\n</table></td></tr></table>]; is never used (it appears only once but that did not produce any warnings, an occurrence suggesting I'd better review the docs).

    Still, despite the shortcomings of the OP, Monks and the Monastery came thru in "class A" fashion. Again, thanks!

    check Ln42!

      Also, my $trterminator = qr[</tr>\n</table></td></tr></table>]; is never used (it appears only once but that did not produce any warnings, an occurrence suggesting I'd better review the docs).

      Interesting. I noticed that my $trterminator is never used, but overlooked the fact that no warning is generated. Turns out, that’s the expected behaviour. From perldiag:

      Name "%s::%s" used only once: possible typo

      (W once) Typographical errors often show up as unique variable names. If you had a good reason for having a unique name, then just mention it again somehow to suppress the message. The our declaration is also provided for this purpose.

      NOTE: This warning detects package symbols that have been used only once. This means lexical variables will never trigger this warning.

      I’ve learnt something! Also warnings::unused could be a helpful addition to the toolbox:

      13:02 >perl -wE "my $x = 42; say 'hi';"
      hi

      13:02 >perl -Mwarnings::unused -wE "my $x = 42; say 'hi';"
      hi
      Unused variable my $x at -e line 1.

      13:02 >

      Hope that helps,

      Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,