Maire has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I'm new to Perl, and I have run into a problem with a regex.

Essentially, I am working with HTML files that take the format of the extract exemplified below:

<blockquote>
<p><b>Joos van Cleve</b> - Lucretia (detail)</p>
</blockquote>
<p>beautiful</p>
</blockquote>
<p>indeed I am</p>
<footer><a href

I want to capture and print the contents of the line (minus the HTML tags) preceding the <footer> line (so, in this case, I want to print the words "indeed I am"). I am using the following script to try to do this:

open(FILE, "C:/Users/li/data_collection/posts/165644996453.html");
while (<FILE>) {
    if ( /(?<=<p>)(.*)(?=<\/p>\s+<footer>)/s ) {
        print "$1\n";
    }
}

However, when I run the script, nothing is printed. I am almost certain that the error arises from the way in which I've tried to get the regex to work over multiple lines. I've tried several fixes that I've found on various websites, some of which are reproduced below, but nothing solves the problem.

( /(?<=<p>)(.*)(?=<\/p><footer\>)/s )
( /(?<=<p>)(.*)(?=<\/p>(<footer>))/m )
( /(?<=<p>)(.*)(?=<\/p>\s+<footer>)/gm )
( /(?<=<p>)(.*)(?=<\/p>\n<footer)/g )

Any advice would be greatly appreciated.

Cheers!

Replies are listed 'Best First'.
Re: Matching regular expression over multiple lines
by haukex (Archbishop) on Oct 15, 2017 at 09:53 UTC

    Welcome to Perl and the Monastery! The best thing to do is not use regexes to try to parse HTML. Here, I'm using Mojo::DOM, which is modern and fairly easy to use:

    #!/usr/bin/env perl
    use warnings;
    use strict;
    use Mojo::DOM;

    my $filename = 'C:/Users/li/data_collection/posts/165644996453.html';

    # slurp the whole file into memory
    open my $fh, '<', $filename or die $!;
    my $html = do { local $/; <$fh> };
    close $fh;

    my $dom = Mojo::DOM->new($html);
    my $text = $dom->find('footer')->last->previous->text;
    print $text,"\n"; # prints "indeed I am"

    The problem you're probably having with the regex in your solution is that while (<FILE>) is only reading one line at a time, but to match over multiple lines, you need to read multiple lines (or the whole file) into memory.
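To make the line-at-a-time point concrete, here is a minimal sketch that runs the same pattern against each line separately and then against the whole input. The sample markup is modelled on the extract in the question, the in-memory filehandle stands in for the real file, and the pattern is a plain capture group chosen for illustration rather than the lookaround regex from the original script.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Sample markup modelled on the extract in the question (illustrative data).
my $html = join "\n",
    '<blockquote>',
    '<p><b>Joos van Cleve</b> - Lucretia (detail)</p>',
    '</blockquote>',
    '<p>beautiful</p>',
    '</blockquote>',
    '<p>indeed I am</p>',
    '<footer>',
    '';

# Illustrative pattern: a tag-free <p> immediately followed by <footer>.
my $re = qr{<p>([^<]*)</p>\s+<footer>};

# 1) Line at a time: each string holds a single line, so "</p>" and
#    "<footer>" are never in the same string and the pattern cannot match.
open my $fh, '<', \$html or die "Cannot open in-memory file: $!";
my $line_hits = 0;
while ( my $line = <$fh> ) {
    $line_hits++ if $line =~ $re;
}
close $fh;

# 2) Whole input at once: \s+ can now match the newline between the tags.
open $fh, '<', \$html or die "Cannot open in-memory file: $!";
my $all = do { local $/; <$fh> };
close $fh;
my ($text) = $all =~ $re;

print "line-by-line matches: $line_hits\n";
print "slurped match: ", ( defined $text ? $text : '(none)' ), "\n";
```

With this sample data the loop finds no matches, while the slurped string yields "indeed I am".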

    Update: Just to make clear what's going on in that my $text line: $dom->find('footer') returns a list of <footer> elements (probably only one?), ->last picks the last one of those, ->previous goes one node back to the <p> element, and ->text gets the text content of that element.

      Thank you for the welcome and the very clear explanation! This worked brilliantly for me (and probably saved me a lot of time in the future).

      Just out of curiosity, I went back and tried to solve the original problem with the regex after your tip about the "while" element only reading one line. You were absolutely right, and I should have been writing the following:

      open( FILE, "C:/Users/li/data_collection/posts/165644996453.html" ) || die "couldn't open\n";
      while ( <FILE> ) {
          $data .= $_;
      }
      if ( $data =~ m/(?<=<p>)(.*)(?=<\/p>\s+<footer>)/g ) {
          print "$1\n";
      }
      (code taken from dsb's answer in Re: Apply regex to entire file, not just individual lines ?).

      Thanks again!
        while ( <FILE> ) { $data .= $_; }

        That'll work, but it's not particularly efficient because it chops the file up line by line and then puts it back together. You could use the same "slurp" idiom I showed (do { local $/; <$fh> }), which will read the entire file in one go, which is more efficient.
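As a rough sketch of that slurp idiom in isolation, the snippet below wraps it in a small helper. The name slurp_file and the temporary file are assumptions made only so the example runs on its own; they are not part of the original script.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a whole file into one scalar.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    # Localising $/ (set to undef) makes <$fh> return everything up to EOF.
    my $content = do { local $/; <$fh> };
    close $fh;
    return $content;
}

# Write a small temporary file so the sketch runs anywhere.
my ( $tmp_fh, $tmp_name ) = tempfile( UNLINK => 1 );
print {$tmp_fh} "<p>indeed I am</p>\n<footer>\n";
close $tmp_fh;

my $data = slurp_file($tmp_name);
print length($data), " characters read in one go\n";
```

Because $/ is localised inside the do block, the normal line-by-line behaviour of <$fh> is restored as soon as the block ends, so other reads in the program are unaffected.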

        an alternative to using a regex [quoted from here]

        I just wrote about this in general here: Parsing HTML/XML with Regular Expressions

      I now predict a 5 levels deep nested discussion about which HTML parsing module is better/faster/more compliant. With at least 20 nodes, at least one of which will contain a benchmark and another one criticizing said benchmark.

      That's what I love about this site :-D


      holli

      You can lead your users to water, but alas, you cannot drown them.
Re: Matching regular expression over multiple lines
by holli (Abbot) on Oct 15, 2017 at 09:55 UTC
    Don't parse html with regexes. Seriously,

    DO NOT PARSE HTML WITH REGEXES

    Better men than us have tried that and failed. Use the right tool for the job:
    use HTML::TagParser;

    my $html = qq[
    <blockquote>
    <p><b>Joos van Cleve</b> - Lucretia (detail)</p>
    </blockquote>
    <p>beautiful</p>
    </blockquote>
    <p>indeed I am</p>
    <footer>
    ];

    my $parser = HTML::TagParser->new( $html );
    print $parser->getElementsByTagName( "footer" )->previousSibling->innerText;


    holli

    You can lead your users to water, but alas, you cannot drown them.
      It depends. I agree when it comes to arbitrary HTML.

      But sometimes with simple output generated automatically - like pdftohtml - regex is the right tool.

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Je suis Charlie!

      Thank you for the code and the tip. I didn't know that there was an alternative to using a regex, so this was incredibly helpful, thanks!

      Hi

      What you posted is the equivalent of using regex

      DO NOT PARSE HTML With low level parsers like HTML::TagParser

      Use a "DOM" like HTML::Tree / XML::Twig / XML::LibXML / Mojo::DOM...

        > with low level parsers like ...

        Please explain.

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

Re: Matching regular expression over multiple lines
by tybalt89 (Monsignor) on Oct 15, 2017 at 14:28 UTC
    #!/usr/bin/perl -l

    # http://perlmonks.org/?node_id=1201392

    use strict;
    use warnings;

    local $/ = '<footer>';

    open my $fh, '<', \<<END;
    <blockquote>
    <p><b>Joos van Cleve</b> - Lucretia (detail)</p>
    </blockquote>
    <p>beautiful</p>
    </blockquote>
    <p>indeed I am</p>
    <footer><a href
    END

    while(<$fh>) {
        /^(.*)\n.*<footer>/m and print $1 =~ s/<.*?>//gr;
    }
      Thank you!