Matching regular expression over multiple lines

Maire has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I'm new to Perl, and I have run into a problem with a regex.

Essentially, I am working with HTML files that take the format of the extract exemplified below:

 
<blockquote>
<p><b>Joos van Cleve</b> - Lucretia (detail)</p>
</blockquote>
<p>beautiful</p>
</blockquote>
<p>indeed I am</p>
<footer><a href

I want to capture and print the contents of the line (minus the HTML tags) preceding the <footer> line (so, in this case, I want to print the words "indeed I am"). I am using the following script to try to do this:

 
open(FILE, "C:/Users/li/data_collection/posts/165644996453.html");  
while (<FILE>) {
    if ( /(?<=<p>)(.*)(?=<\/p>\s+<footer>)/s ) {
        print "$1\n";
    }
}

However, when I run the script, nothing is printed. I am almost certain that the error arises from the way in which I've tried to get the regex to work over multiple lines. I've tried several fixes that I've found on various websites, some of which are reproduced below, but nothing solves the problem:

( /(?<=<p>)(.*)(?=<\/p><footer\>)/s  )
( /(?<=<p>)(.*)(?=<\/p>(<footer>))/m )
( /(?<=<p>)(.*)(?=<\/p>\s+<footer>)/gm )
( /(?<=<p>)(.*)(?=<\/p>\n<footer)/g )

Any advice would be greatly appreciated.

Cheers!


Replies are listed 'Best First'.
Re: Matching regular expression over multiple lines by haukex (Archbishop) on Oct 15, 2017 at 09:53 UTC
Welcome to Perl and the Monastery! The best thing to do is not to use regexes to try to parse HTML. Here, I'm using Mojo::DOM, which is modern and fairly easy to use:

#!/usr/bin/env perl
use warnings;
use strict;
use Mojo::DOM;

my $filename = 'C:/Users/li/data_collection/posts/165644996453.html';

# slurp the whole file into memory
open my $fh, '<', $filename or die $!;
my $html = do { local $/; <$fh> };
close $fh;

my $dom = Mojo::DOM->new($html);
my $text = $dom->find('footer')->last->previous->text;
print $text,"\n";  # prints "indeed I am"

The problem you're probably having with the regex in your solution is that `while (<FILE>)` only reads one line at a time, but to match over multiple lines, you need to read multiple lines (or the whole file) into memory.

Update: Just to make clear what's going on in that `my $text` line: `$dom->find('footer')` returns a list of `<footer>` elements (probably only one?), `->last` picks the last one of those, `->previous` goes back to the preceding sibling element (the `<p>`), and `->text` gets the text content of that element.
Re^2: Matching regular expression over multiple lines by Maire (Scribe) on Oct 16, 2017 at 05:51 UTC
Thank you for the welcome and the very clear explanation! This worked brilliantly for me (and will probably save me a lot of time in the future). Just out of curiosity, I went back and tried to solve the original problem with the regex after your tip about the "while" loop only reading one line at a time. You were absolutely right, and I should have been writing the following:

open( FILE, "C:/Users/li/data_collection/posts/165644996453.html" ) || die "couldn't open\n";
while ( <FILE> ) {
    $data .= $_;
}
if ( $data =~ m/(?<=<p>)(.*)(?=<\/p>\s+<footer>)/g ) {
    print "$1\n";
}

(code taken from dsb's answer in Re: Apply regex to entire file, not just individual lines ?). Thanks again!
Re^3: Matching regular expression over multiple lines by haukex (Archbishop) on Oct 16, 2017 at 11:59 UTC
`while ( <FILE> ) { $data .= $_; }`

That'll work, but it's not particularly efficient because it chops the file up line by line and then puts it back together. You could use the same "slurp" idiom I showed (`do { local $/; <$fh> }`), which reads the entire file in one go.

> an alternative to using a regex

I just wrote about this in general here: Parsing HTML/XML with Regular Expressions
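
For anyone following along, here is a minimal sketch of why the line-by-line loop finds nothing while the slurped version matches. It is only an illustration under some assumptions: the sample markup is copied from the question, an in-memory filehandle stands in for the real file, and the variable names (`$sample`, `$re`, `$line_hits`) are made up for this example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Sample markup copied from the question; an in-memory filehandle
# stands in for the real file so the sketch is self-contained.
my $sample = <<'END';
<blockquote>
<p><b>Joos van Cleve</b> - Lucretia (detail)</p>
</blockquote>
<p>beautiful</p>
</blockquote>
<p>indeed I am</p>
<footer><a href
END

# Same pattern as in the question, compiled once.
my $re = qr{(?<=<p>)(.*)(?=</p>\s+<footer>)};

# Line by line: no single line contains both "</p>" and "<footer>",
# so the lookahead can never succeed.
open my $by_line, '<', \$sample or die "open failed: $!";
my $line_hits = grep { /$re/ } <$by_line>;
close $by_line;

# Slurped: with $/ undefined, <$fh> returns the whole file at once.
open my $fh, '<', \$sample or die "open failed: $!";
my $html = do { local $/; <$fh> };
close $fh;

my ($text) = $html =~ $re;

print "line-by-line matches: $line_hits\n";   # 0
print "slurped match: $text\n";               # indeed I am
```

Note that no `/s` modifier is needed here: `.` stays within one line, and the `\s+` in the lookahead is what crosses the newline between `</p>` and `<footer>`.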
Re^4: Matching regular expression over multiple lines by Maire (Scribe) on Oct 16, 2017 at 14:26 UTC
Re^2: Matching regular expression over multiple lines by holli (Abbot) on Oct 15, 2017 at 10:03 UTC
I now predict a 5-levels-deep nested discussion about which HTML parsing module is better/faster/more compliant. With at least 20 nodes, at least one of which will contain a benchmark and another one criticizing said benchmark. That's what I love about this site :-D

holli
You can lead your users to water, but alas, you cannot drown them.
Re: Matching regular expression over multiple lines by holli (Abbot) on Oct 15, 2017 at 09:55 UTC
Don't parse HTML with regexes. Seriously, DO NOT PARSE HTML WITH REGEXES. Better men than us have tried that and failed. Use the right tool for the job:

use HTML::TagParser;

my $html = qq[
<blockquote>
<p><b>Joos van Cleve</b> - Lucretia (detail)</p>
</blockquote>
<p>beautiful</p>
</blockquote>
<p>indeed I am</p>
<footer>
];

my $parser = HTML::TagParser->new( $html );
print $parser->getElementsByTagName( "footer" )->previousSibling->innerText;

holli
You can lead your users to water, but alas, you cannot drown them.
Re^2: Matching regular expression over multiple lines by LanX (Saint) on Oct 15, 2017 at 11:52 UTC
It depends. I agree when it comes to arbitrary HTML, but sometimes, with simple automatically generated output (like that of pdftohtml), a regex is the right tool.

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
Je suis Charlie!
Re^2: Matching regular expression over multiple lines by Maire (Scribe) on Oct 16, 2017 at 05:53 UTC
Thank you for the code and the tip. I didn't know that there was an alternative to using a regex, so this was incredibly helpful, thanks!
Re^2: Matching regular expression over multiple lines by Anonymous Monk on Oct 15, 2017 at 23:41 UTC
Hi,

What you posted is the equivalent of using a regex.

DO NOT PARSE HTML with low-level parsers like HTML::TagParser.

Use a "DOM" like HTML::Tree / XML::Twig / XML::LibXML / Mojo::DOM...
Re^3: Matching regular expression over multiple lines by LanX (Saint) on Oct 16, 2017 at 01:52 UTC
> with low level parsers like ...

Please explain.

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
Je suis Charlie!
Re: Matching regular expression over multiple lines by tybalt89 (Monsignor) on Oct 15, 2017 at 14:28 UTC
#!/usr/bin/perl -l

# http://perlmonks.org/?node_id=1201392

use strict;
use warnings;

# read records that end at each '<footer>' instead of at each newline
local $/ = '<footer>';

open my $fh, '<', \<<END;
<blockquote>
<p><b>Joos van Cleve</b> - Lucretia (detail)</p>
</blockquote>
<p>beautiful</p>
</blockquote>
<p>indeed I am</p>
<footer><a href
END

while(<$fh>) {
    # capture the line before '<footer>' and strip its tags
    /^(.*)\n.*<footer>/m and print $1 =~ s/<.*?>//gr;
}
Re^2: Matching regular expression over multiple lines by Maire (Scribe) on Oct 16, 2017 at 05:47 UTC
Thank you!
