in reply to Matching regular expression over multiple lines
Welcome to Perl and the Monastery! The best thing to do is not use regexes to try to parse HTML. Here, I'm using Mojo::DOM, which is modern and fairly easy to use:
    #!/usr/bin/env perl
    use warnings;
    use strict;
    use Mojo::DOM;

    my $filename = 'C:/Users/li/data_collection/posts/165644996453.html';

    # slurp the whole file into memory
    open my $fh, '<', $filename or die $!;
    my $html = do { local $/; <$fh> };
    close $fh;

    my $dom = Mojo::DOM->new($html);
    my $text = $dom->find('footer')->last->previous->text;
    print $text,"\n"; # prints "indeed I am"
The problem you're probably having with the regex in your solution is that while (<FILE>) reads only one line at a time. To match across multiple lines, you need to read multiple lines (or the whole file) into memory first.
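To make the difference concrete, here is a minimal sketch using only core Perl. The sample string and the pattern are made up for illustration (they are not from the original question), and an in-memory filehandle stands in for a real file so the script is self-contained:

```perl
#!/usr/bin/env perl
use warnings;
use strict;

# Hypothetical sample data standing in for a file's contents.
my $data = "start\nindeed\nI am\nend\n";

# A pattern that spans a line break; \s+ matches the newline.
my $re = qr/indeed\s+I am/;

# 1) Line by line: each iteration sees a single line only.
open my $fh, '<', \$data or die $!;
my $line_hits = 0;
while ( my $line = <$fh> ) {
    $line_hits++ if $line =~ $re;
}
close $fh;

# 2) Slurp: with $/ locally undefined, <$fh> returns everything at once.
open $fh, '<', \$data or die $!;
my $whole = do { local $/; <$fh> };
close $fh;
my $slurp_hit = $whole =~ $re ? 1 : 0;

print "line-by-line matches: $line_hits\n";   # prints 0
print "slurped match: $slurp_hit\n";          # prints 1
```

This only demonstrates the reading issue on plain text. For real HTML, a parser is still the better tool.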
Update: Just to make clear what's going on in that my $text line: $dom->find('footer') returns a collection of <footer> elements (probably only one?), ->last picks the last one of those, ->previous goes back to the preceding sibling element (the <p> element), and ->text gets the text content of that element.
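The same chain can be written one step at a time, which makes it easier to inspect each intermediate value. This is only a sketch: the HTML string below is an assumption, since the actual file isn't shown in the thread, and the variable names are just for illustration.

```perl
#!/usr/bin/env perl
use warnings;
use strict;
use Mojo::DOM;

# Assumed markup; the real file from the question is not shown.
my $html = '<div><p>indeed I am</p><footer>posted today</footer></div>';

my $dom     = Mojo::DOM->new($html);
my $footers = $dom->find('footer');   # Mojo::Collection of matching elements
my $footer  = $footers->last;         # last <footer> in that collection
my $para    = $footer->previous;      # preceding sibling element, here the <p>
my $text    = $para->text;            # text directly inside that element

print $footers->size, "\n";   # number of <footer> elements found
print $para->tag, "\n";       # tag name of the sibling element
print $text, "\n";
```

Note that ->last returns undef if no <footer> is found, and ->previous returns undef if the footer has no preceding sibling element, so on real input you may want to check those before calling the next method.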
Replies are listed 'Best First'.

Re^2: Matching regular expression over multiple lines
by Maire (Scribe) on Oct 16, 2017 at 05:51 UTC
by haukex (Archbishop) on Oct 16, 2017 at 11:59 UTC
by Maire (Scribe) on Oct 16, 2017 at 14:26 UTC

Re^2: Matching regular expression over multiple lines
by holli (Abbot) on Oct 15, 2017 at 10:03 UTC