Maire has asked for the wisdom of the Perl Monks concerning the following question:
Hi Monks,
I'm new to Perl, and I have run into a problem with a regex.
Essentially, I am working with HTML files that follow the format of the extract below:
<blockquote>
<p><b>Joos van Cleve</b> - Lucretia (detail)</p>
</blockquote>
<p>beautiful</p>
</blockquote>
<p>indeed I am</p>
<footer><a href
I want to capture and print the contents of the line (minus the HTML tags) preceding the <footer> line (so, in this case, I want to print the words "indeed I am"). I am using the following script to try to do this:
open(FILE, "C:/Users/li/data_collection/posts/165644996453.html");
while (<FILE>) {
    if ( /(?<=<p>)(.*)(?=<\/p>\s+<footer>)/s ) {
        print "$1\n";
    }
}
However, when I run the script, nothing is printed. I am almost certain that the error arises from the way in which I've tried to get the regex to work over multiple lines. I've tried several fixes that I've found on various websites, some of which are reproduced below, but none of them solves the problem:
( /(?<=<p>)(.*)(?=<\/p><footer\>)/s )
( /(?<=<p>)(.*)(?=<\/p>(<footer>))/m )
( /(?<=<p>)(.*)(?=<\/p>\s+<footer>)/gm )
( /(?<=<p>)(.*)(?=<\/p>\n<footer)/g )
Any advice would be greatly appreciated.
Cheers!
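A note on the likely cause, offered as a sketch rather than a definitive answer: with the default input record separator, `while (<FILE>)` hands the regex one line at a time, so a pattern that needs `</p>` on one line and `<footer>` on the next can never match, whatever modifiers are used. The example below assumes the file looks like the extract above, with `</p>` and `<footer>` on separate lines, and works on the whole text as a single string. The helper name `text_before_footer` and the inline sample are illustrative and not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text of the <p>...</p> that sits
# directly before <footer>, with any inner tags removed.
sub text_before_footer {
    my ($html) = @_;

    # (?:(?!</p>).)* consumes characters without crossing a closing </p>,
    # so the capture stays inside a single paragraph. The /s modifier lets
    # "." match newlines; \s* allows the line break before <footer>.
    return unless $html =~ m{<p>((?:(?!</p>).)*)</p>\s*<footer\b}s;

    my $text = $1;
    $text =~ s/<[^>]+>//g;    # crude tag stripping, fine for simple markup
    return $text;
}

# Sample input copied from the extract in the question.
my $html = <<'HTML';
<blockquote>
<p><b>Joos van Cleve</b> - Lucretia (detail)</p>
</blockquote>
<p>beautiful</p>
</blockquote>
<p>indeed I am</p>
<footer><a href
HTML

my $found = text_before_footer($html);
print "$found\n" if defined $found;
```

To apply this to a real file, the contents could be read in one go, for example with `open(my $fh, '<', $path) or die $!; my $html = do { local $/; <$fh> };`, where `$path` stands for the file name. For anything beyond simple, regular markup, an HTML parser module is usually a safer choice than a regex.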