searching for one or more instances of text between tags

kachunyu has asked for the wisdom of the Perl Monks concerning the following question:

I have an HTML file that I want to search through looking for text inside tags of the form:

<div><ul class="test">                 
  <li><a href="/www/?category=Earth">earth_text</a></li></div>

where I want to find and print "earth_text". I can do this with a one-liner:

perl -lne 'BEGIN{undef $/} while (/<div><ul class="test">.*?<li><a href=\"\/www\/\?category=.*?\">(.*?)<\/a><\/li><\/div>/sg){print $1}' index.html

However the file also has cases where there are multiple entries; for example:

<div><ul class="test">                 
  <li><a href="/www/?category=Earth">earth_text</a></li>
  <li><a href="/www/?category=Space">space_text</a></li></div>

Is there a way to find one or multiple instances of <li>...</li>? That is, since they are associated with the same code block, I want perl to return $1="earth_text" and $2="space_text" in the above example and spit out the text like:

print $1 . "; " . $2

Ideally this should work for an arbitrary number of matches including 1.

Replies are listed 'Best First'.
Re: searching for one or more instances of text between tags by choroba (Cardinal) on Oct 13, 2015 at 08:52 UTC
Don't use regular expressions to parse HTML. Use a tool that already knows how to parse HTML. For example, here's how you can get the desired output in XML::XSH2, a wrapper around XML::LibXML:

open :F html file.html ;
echo xsh:join('; ', //ul[@class='test']/li/a) ;

لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: searching for one or more instances of text between tags by Skeeve (Parson) on Oct 13, 2015 at 08:14 UTC
It should certainly be possible, though not with a single regexp. You would also need to specify more precisely what you regard as "the same code block". Most importantly: consider using one of the modules created for parsing HTML/XML for tasks like this. I have learned that, in the long run, it is worth the additional effort of installing and learning one.

`s$$([},&%#}/&/]+}%&{});#$&&s&&$^X.($'^"%]=\&(\|?{%` `+`.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
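A minimal sketch of the "not with a single regexp" idea from the reply above, using core Perl only: one pattern isolates each `<div><ul class="test">...</div>` block, and a second global match in list context collects every link text inside it. The sample input is copied from the question, and `extract_blocks` is a hypothetical helper name. It assumes the markup looks exactly like the samples (no nested `<div>` inside a block, attributes in the same order), which is why the parser-based replies are the safer route for real pages.

```perl
use strict;
use warnings;

# Sample input copied from the question; a real script would slurp index.html.
my $html = <<'HTML';
<div><ul class="test">
  <li><a href="/www/?category=Earth">earth_text</a></li></div>
<p>unrelated</p>
<div><ul class="test">
  <li><a href="/www/?category=Earth">earth_text</a></li>
  <li><a href="/www/?category=Space">space_text</a></li></div>
HTML

# extract_blocks() is a hypothetical helper, not part of any module.
# It returns one "; "-joined string per <div><ul class="test"> block.
sub extract_blocks {
    my ($text) = @_;
    my @lines;
    # Pass 1: capture the body of each block up to the first closing </div>.
    while ($text =~ m{<div><ul class="test">(.*?)</div>}sg) {
        my $block = $1;
        # Pass 2: with /g in list context, the match returns every capture.
        my @texts = $block =~ m{<li><a href="/www/\?category=[^"]*">(.*?)</a></li>}sg;
        push @lines, join('; ', @texts) if @texts;
    }
    return @lines;
}

print "$_\n" for extract_blocks($html);
```

Because the second match returns a list, the same code handles one entry or any number of entries without needing `$1`, `$2`, and so on.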
Re: searching for one or more instances of text between tags by AppleFritter (Vicar) on Oct 13, 2015 at 09:47 UTC
I'd suggest using XPath for that; this is exactly the sort of thing it was made for. There's a decent module on CPAN, XML::XPath; here's the XPath spec (and a newer one that came out while I wasn't looking, apparently), and here's a tutorial introducing and discussing XPath selector syntax. This approach may not work if your HTML isn't well-formed, though; in your example, for instance, there is no `</ul>` tag.
Re: searching for one or more instances of text between tags by Jenda (Abbot) on Oct 13, 2015 at 14:28 UTC
In the fairly likely case that your HTML is not valid XML, use HTML::Parser.

Jenda
Enoch was right! Enjoy the last years of Rome.
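A rough sketch of what the HTML::Parser suggestion could look like for the sample in the question. HTML::Parser is event-based and does not require well-formed markup, so the handlers below keep two flags to track whether the parser is inside a `<ul class="test">` and inside a link. The variable names and the rule that a closing `</div>` also ends the list (because the sample has no `</ul>`) are assumptions made for this illustration.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Sample input copied from the question.
my $html = <<'HTML';
<div><ul class="test">
  <li><a href="/www/?category=Earth">earth_text</a></li></div>
<div><ul class="test">
  <li><a href="/www/?category=Earth">earth_text</a></li>
  <li><a href="/www/?category=Space">space_text</a></li></div>
HTML

my @groups;         # one array ref of link texts per <ul class="test">
my $in_list = 0;    # true while inside <ul class="test">
my $in_link = 0;    # true while inside an <a> within that list

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub {
        my ($tag, $attr) = @_;
        if ($tag eq 'ul' && ($attr->{class} // '') eq 'test') {
            $in_list = 1;
            push @groups, [];              # start a new group
        }
        elsif ($in_list && $tag eq 'a') {
            $in_link = 1;
            push @{ $groups[-1] }, '';     # start a new link text
        }
    }, 'tagname, attr' ],
    text_h => [ sub {
        my ($text) = @_;
        # Text may arrive in several chunks, so append rather than assign.
        $groups[-1][-1] .= $text if $in_link;
    }, 'dtext' ],
    end_h => [ sub {
        my ($tag) = @_;
        $in_link = 0 if $tag eq 'a';
        # Assumption: the sample omits </ul>, so </div> also closes the list.
        $in_list = 0 if $tag eq 'ul' || $tag eq 'div';
    }, 'tagname' ],
);
$parser->parse($html);
$parser->eof;

my @lines = map { join('; ', @$_) } grep { @$_ } @groups;
print "$_\n" for @lines;
```

Each `<ul class="test">` yields one output line, with its link texts joined by "; ", whether the list holds one entry or several.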
Re: searching for one or more instances of text between tags by tangent (Parson) on Oct 13, 2015 at 15:02 UTC
Here's a way to do it using HTML::TreeBuilder::XPath:

use HTML::TreeBuilder::XPath;
my $html = '<div><ul class="test">
  <li><a href="/www/?category=Earth">earth_text</a></li>
  <li><a href="/www/?category=Space">space_text</a></li></div>';
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;
my @values = $tree->findvalues('//li/a');
print "$_\n" for @values;

Or, if you need a more specific selection or other information returned:

my @links = $tree->findnodes('//ul[@class="test"]/li/a');
for my $link (@links) {
    print $link->attr('href'), "\n";
    print $link->as_text, "\n";
}