Re: Matching regular expression over multiple lines

Welcome to Perl and the Monastery! The best thing to do is not use regexes to try to parse HTML. Here, I'm using Mojo::DOM, which is modern and fairly easy to use:

#!/usr/bin/env perl
use warnings;
use strict;
use Mojo::DOM;

my $filename = 'C:/Users/li/data_collection/posts/165644996453.html';

# slurp the whole file into memory
open my $fh, '<', $filename or die $!;
my $html = do { local $/; <$fh> };
close $fh;

my $dom = Mojo::DOM->new($html);
my $text = $dom->find('footer')->last->previous->text;
print $text,"\n";  # prints "indeed I am"

The problem you're probably having with the regex in your solution is that while (<FILE>) reads only one line at a time, so each match attempt sees a single line. To match across multiple lines, you need to read multiple lines (or the whole file) into memory first.
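To illustrate the difference, here is a minimal sketch that runs one pattern against the same data twice, once line by line and once against the whole string. The sample markup and the pattern are assumptions made up for the demonstration; the real file may look different.

```perl
#!/usr/bin/env perl
use warnings;
use strict;

# Hypothetical sample input; the real post's markup may differ.
my $html = "<p>indeed I am</p>\n<footer>posted by someone</footer>\n";
my $re   = qr{<p>(.*?)</p>\s+<footer>}s;

# Line by line: no single line contains both </p> and <footer>,
# so the pattern never matches.
open my $fh, '<', \$html or die "Cannot open in-memory file: $!";
my $line_hits = 0;
while ( my $line = <$fh> ) {
    $line_hits++ if $line =~ $re;
}
close $fh;

# Whole string: the pattern can now span the newline.
my ($whole_match) = $html =~ $re;

print "line-by-line matches: $line_hits\n";   # prints 0
print "whole-file match: $whole_match\n";     # prints "indeed I am"
```

The in-memory filehandle (opening a reference to a scalar) is only there to keep the example self-contained; a real file behaves the same way.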

Update: Just to make clear what's going on in the my $text line: $dom->find('footer') returns a collection of all <footer> elements (probably only one?), ->last picks the last one of those, ->previous goes back to the preceding sibling element, here the <p>, and ->text gets the text content of that element.
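The same chain can be taken apart one call at a time. This is only a sketch against made-up markup (the real post's structure is an assumption here), and it needs Mojolicious installed:

```perl
#!/usr/bin/env perl
use warnings;
use strict;
use Mojo::DOM;

# Hypothetical markup standing in for the real post.
my $dom = Mojo::DOM->new(<<'HTML');
<article>
  <p>first paragraph</p>
  <p>indeed I am</p>
  <footer>posted by someone</footer>
</article>
HTML

my $footers = $dom->find('footer');   # Mojo::Collection of matching elements
my $footer  = $footers->last;         # last element in that collection
my $para    = $footer->previous;      # preceding sibling element (whitespace text nodes are skipped)
my $text    = $para->text;            # text directly inside that element

print $footers->size, " footer(s) found\n";
print $para->tag, ": $text\n";
```

Note that ->previous returns undef when there is no preceding sibling element, so code that handles arbitrary files should check for that before calling ->text.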


Re^2: Matching regular expression over multiple lines by Maire (Scribe) on Oct 16, 2017 at 05:51 UTC
Thank you for the welcome and the very clear explanation! This worked brilliantly for me (and will probably save me a lot of time in the future). Just out of curiosity, I went back and tried to solve the original problem with the regex after your tip about the "while" loop only reading one line at a time. You were absolutely right, and I should have been writing the following: `open( FILE, "C:/Users/li/data_collection/posts/165644996453.html" ) || die "couldn't open\n"; while ( <FILE> ) { $data .= $_; } if ( $data =~ m/(?<=<p>)(.*)(?=<\/p>\s+<footer>)/g ) { print "$1\n"; }` (code taken from dsb's answer in Re: Apply regex to entire file, not just individual lines ?). Thanks again!
Re^3: Matching regular expression over multiple lines by haukex (Archbishop) on Oct 16, 2017 at 11:59 UTC
`while ( <FILE> ) { $data .= $_; }` That'll work, but it's not particularly efficient, because it chops the file up line by line and then puts it back together. You could use the same "slurp" idiom I showed (`do { local $/; <$fh> }`), which reads the entire file in one go and is more efficient. On "an alternative to using a regex": I just wrote about this in general here: Parsing HTML/XML with Regular Expressions
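As a rough sketch of how the slurp idiom could replace the concatenation loop, here it is wrapped in a small helper and combined with a regex. The name slurp_file, the temporary sample file, and the pattern are all assumptions for the sake of a self-contained example:

```perl
#!/usr/bin/env perl
use warnings;
use strict;
use File::Temp qw(tempfile);

# Write a small hypothetical sample file so the example is self-contained.
my ( $out, $filename ) = tempfile( UNLINK => 1 );
print {$out} "<p>indeed I am</p>\n<footer>posted by someone</footer>\n";
close $out or die "Cannot close $filename: $!";

# slurp_file is a hypothetical helper name, not part of any module.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    # Undefining $/ inside the block makes <$fh> return everything at once;
    # "local" restores the normal line separator when the block exits.
    my $content = do { local $/; <$fh> };
    close $fh;
    return $content;
}

my $data = slurp_file($filename);
my ($text) = $data =~ m{<p>(.*?)</p>\s+<footer>}s;
print defined $text ? "$text\n" : "no match\n";
```

The /s modifier lets . match newlines too, which matters if the paragraph itself is wrapped across several lines.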
Re^4: Matching regular expression over multiple lines by Maire (Scribe) on Oct 16, 2017 at 14:26 UTC
Ah, that makes more sense! Thanks a lot.
Re^2: Matching regular expression over multiple lines by holli (Abbot) on Oct 15, 2017 at 10:03 UTC
I now predict a nested discussion, 5 levels deep, about which HTML parsing module is better/faster/more compliant. With at least 20 nodes, at least one of which will contain a benchmark and another one criticizing said benchmark. That's what I love about this site :-D holli You can lead your users to water, but alas, you cannot drown them.