Regex on HTML across multiple lines with WWW::Mechanize->content()

fixles has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm trying to extract some info from HTML where I need to identify it over multiple lines. The HTML is as follows.

                    <td class="Label" align="right">Last Login:</td>
                    <td>Yesterday</td>

I'm trying to extract the word Yesterday, but whatever I try fails. Is there some way of matching the newline or the spaces? Printing a join of @TD = $mech->content() =~ m|<td>(.*?)</td>|g shows Yesterday, but also everything else within a TD tag. The HTML is indented by about 20 spaces, like the example above. Does anyone know how I could use a regex to extract just Yesterday from this HTML?

Many thanks,
James


Re: Regex on HTML across multiple lines with WWW::Mechanize->content() by Corion (Patriarch) on Jul 25, 2011 at 18:36 UTC
Have you looked at perlre and what it says about newlines?

Personally, I don't hand-parse HTML using regular expressions anymore. I use HTML::TreeBuilder::XPath together with HTML::Selector::XPath. There is also a TreeBuilder plugin, WWW::Mechanize::TreeBuilder.

A CSS selector for your example could be `td.Label + td`, so the code for finding the relevant node(s) would be:

    use HTML::Selector::XPath qw(selector_to_xpath);

    my $query = selector_to_xpath('td.Label + td');
    my @nodes = $mech->content->findnodes($query);
    for (@nodes) {
        print $_->as_HTML;
    }
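For readers who want to try the selector approach without the Mechanize plugin, here is a minimal sketch that parses a plain HTML string directly with HTML::TreeBuilder::XPath. The `$html` variable is a hypothetical stand-in for whatever `$mech->content` returns, and the surrounding `<table>`/`<tr>` markup is an assumption about the page, not something shown in the question.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw(selector_to_xpath);

# Stand-in for the string returned by $mech->content (assumed markup).
my $html = <<'HTML';
<table>
  <tr>
    <td class="Label" align="right">Last Login:</td>
    <td>Yesterday</td>
  </tr>
</table>
HTML

# Parse the string into a tree that understands XPath queries.
my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Translate the CSS selector into XPath, then collect the text of each match.
my $query  = selector_to_xpath('td.Label + td');
my @values = map { $_->as_trimmed_text } $tree->findnodes($query);

print "$_\n" for @values;

$tree->delete;    # release the parse tree
```

Note that `td.Label + td` selects the cell after every label cell, so a page with several labelled rows would return several values; narrowing it to the "Last Login:" row would need an extra check on the label text.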
Re: Regex on HTML across multiple lines with WWW::Mechanize->content() by onelesd (Pilgrim) on Jul 26, 2011 at 06:15 UTC
Use the /s and/or /m modifier, documented in perlre:

    m   Treat string as multiple lines. That is, change "^" and "$" from matching the start or end of the string to matching the start or end of any line anywhere within the string.

    s   Treat string as single line. That is, change "." to match any character whatsoever, even a newline, which normally it would not match.

Used together, as /ms, they let the "." match any character whatsoever, while still allowing "^" and "$" to match, respectively, just after and just before newlines within the string.
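Applied to the HTML in the question, one way this could look is sketched below. It assumes the page content is already in a string (`$html` here is a hypothetical stand-in for `$mech->content`) and that the value cell directly follows the "Last Login:" label cell. It anchors on the label and uses `\s*` to absorb the newline and the indentation, since `\s` matches newlines regardless of modifiers; /s is there so that "." can cross a newline inside the value, and /x only allows the pattern to be spaced out and commented.

```perl
use strict;
use warnings;

# Stand-in for $mech->content, laid out like the HTML in the question.
my $html = <<'HTML';
<tr>
                    <td class="Label" align="right">Last Login:</td>
                    <td>Yesterday</td>
</tr>
<tr>
                    <td class="Label" align="right">Status:</td>
                    <td>Active</td>
</tr>
HTML

# /x: whitespace and comments in the pattern are ignored.
# /s: "." also matches a newline, so a value spanning lines is still captured.
my ($last_login) = $html =~ m{
    <td [^>]* > \s* Last \s Login: \s* </td>   # the label cell
    \s*                                        # newline and indentation
    <td [^>]* > \s* (.*?) \s* </td>            # the value cell, captured
}xs;

print defined $last_login ? "$last_login\n" : "no match\n";
```

Anchoring on the label is what keeps the match from returning every `<td>` on the page, which is the problem with the unanchored `m|<td>(.*?)</td>|g` attempt.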