kachunyu has asked for the wisdom of the Perl Monks concerning the following question:

I have an HTML file that I want to search through looking for text inside tags of the form:
<div><ul class="test"> <li><a href="/www/?category=Earth">earth_text</a></li></div>
where I want to find and print "earth_text". This I can do with a one liner:
perl -lne 'BEGIN{undef $/} while (/<div><ul class="test">.*?<li><a href=\"\/www\/\?category=.*?\">(.*?)<\/a><\/li><\/div>/sg){print $1}' index.html
However the file also has cases where there are multiple entries; for example:
<div><ul class="test"> <li><a href="/www/?category=Earth">earth_text</a></li> <li><a href="/www/?category=Space">space_text</a></li></div>
Is there a way to find one or multiple instances of <li>...</li>? That is, since they are associated with the same code block, I want perl to return $1="earth_text" and $2="space_text" in the above example and spit out the text like:
print $1 . "; " . $2
Ideally this should work for an arbitrary number of matches including 1.

Re: searching for one or more instances of text between tags
by choroba (Cardinal) on Oct 13, 2015 at 08:52 UTC
    Don't use regular expressions to parse HTML. Use a tool that already knows how to parse HTML. For example, here's how you can get the desired output in XML::XSH2, a wrapper around XML::LibXML:
    open :F html file.html ; echo xsh:join('; ', //ul[@class='test']/li/a) ;
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: searching for one or more instances of text between tags
by Skeeve (Parson) on Oct 13, 2015 at 08:14 UTC

    For sure it should be possible. Not using one regexp though.

    But I think you'd need to specify better what you regard as "the same code block".

    But most importantly: consider using one of the modules created for parsing HTML/XML for tasks like this. I have learned that, in the long run, it is worth the additional effort of installing and learning one.
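
To illustrate the "more than one regexp" idea, here is a minimal sketch using core Perl only. It assumes that "the same code block" means everything between `<div><ul class="test">` and the next `</div>`; the helper name `link_texts_per_block` is made up for this example. It is just as fragile as any regex applied to HTML and will break on nested divs or reordered attributes.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one "a; b; c" string per <ul class="test"> block.
sub link_texts_per_block {
    my ($html) = @_;
    my @lines;

    # Step 1: capture the body of each block, up to the next </div>.
    while ($html =~ m{<div><ul class="test">(.*?)</div>}sg) {
        my $block = $1;    # copy $1 before the inner match overwrites it

        # Step 2: in list context, /g returns every captured link text.
        my @texts = $block =~ m{<li><a href="/www/\?category=[^"]*">(.*?)</a></li>}sg;

        push @lines, join('; ', @texts) if @texts;
    }
    return @lines;
}

my $html = '<div><ul class="test"> <li><a href="/www/?category=Earth">earth_text</a></li> '
         . '<li><a href="/www/?category=Space">space_text</a></li></div>';

print "$_\n" for link_texts_per_block($html);    # prints "earth_text; space_text"
```

Because the inner match runs in list context, the same code handles one, two, or any number of `<li>` entries per block.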


    s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
    +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
Re: searching for one or more instances of text between tags
by AppleFritter (Vicar) on Oct 13, 2015 at 09:47 UTC

    I'd suggest using XPath for that; this is exactly the sort of thing it was made for. There's a decent module on CPAN, XML::XPath; here's the XPath spec (and a newer one that came out while I wasn't looking, apparently), and here's a tutorial introducing and discussing XPath selector syntax.

    This approach may not work if your HTML isn't well-formed, though; in your example, for instance, there's no </ul> tag.
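
As a rough sketch of the XML::XPath approach: the snippet below uses the markup from the question, but with the missing `</ul>` added by hand so that it is well-formed XML (that repair is an assumption of this example, not something the module does for you). The XPath expression is the same one used elsewhere in this thread.

```perl
use strict;
use warnings;
use XML::XPath;

# The question's markup with the missing </ul> added so it parses as XML.
my $xml = '<div><ul class="test"> <li><a href="/www/?category=Earth">earth_text</a></li> '
        . '<li><a href="/www/?category=Space">space_text</a></li></ul></div>';

my $xp = XML::XPath->new(xml => $xml);

# In list context, findnodes returns the matching element nodes;
# string_value gives the text content of each one.
my @texts = map { $_->string_value } $xp->findnodes('//ul[@class="test"]/li/a');

print join('; ', @texts), "\n";
```

On real-world pages that are not valid XML, the parse step is likely to fail, which is why the other replies point to HTML-aware parsers.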

Re: searching for one or more instances of text between tags
by Jenda (Abbot) on Oct 13, 2015 at 14:28 UTC

    In the fairly likely case that your HTML is not valid XML, use HTML::Parser.
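
A minimal sketch of what that could look like with the HTML::Parser version 3 event API. The helper name `link_texts` and the state flags are made up for this example, and because the question's sample has no `</ul>`, the sketch assumes the list also ends at `</div>`.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: collects the text of every <a> inside <ul class="test">.
sub link_texts {
    my ($html) = @_;
    my (@texts, $in_list, $in_link);

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $attr) = @_;
            $in_list = 1 if $tag eq 'ul' && ($attr->{class} // '') eq 'test';
            if ($in_list && $tag eq 'a') {
                $in_link = 1;
                push @texts, '';    # start a new entry for this link
            }
        }, 'tagname, attr' ],
        # Text may arrive in several chunks, so append to the current entry.
        text_h => [ sub { $texts[-1] .= $_[0] if $in_link }, 'dtext' ],
        end_h => [ sub {
            my ($tag) = @_;
            $in_link = 0 if $tag eq 'a';
            # Assumption: treat </div> as the end of the list as well,
            # since the sample markup never closes the <ul>.
            $in_list = 0 if $tag eq 'ul' || $tag eq 'div';
        }, 'tagname' ],
    );

    $parser->parse($html);
    $parser->eof;
    return @texts;
}

my $html = '<div><ul class="test"> <li><a href="/www/?category=Earth">earth_text</a></li> '
         . '<li><a href="/www/?category=Space">space_text</a></li></div>';

print join('; ', link_texts($html)), "\n";    # prints "earth_text; space_text"
```

HTML::Parser only tokenizes the input and does not build a tree, so the state tracking is up to the caller; the tree-based modules in the other replies do that bookkeeping for you.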

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

Re: searching for one or more instances of text between tags
by tangent (Parson) on Oct 13, 2015 at 15:02 UTC
    Here's a way to do it using HTML::TreeBuilder::XPath
    use HTML::TreeBuilder::XPath;

    my $html = '<div><ul class="test"> <li><a href="/www/?category=Earth">earth_text</a></li> <li><a href="/www/?category=Space">space_text</a></li></div>';

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($html);
    $tree->eof;

    my @values = $tree->findvalues('//li/a');
    print "$_\n" for @values;
    Or if you need more specific selection, or other info returned
    my @links = $tree->findnodes('//ul[@class="test"]/li/a');
    for my $link (@links) {
        print $link->attr('href'), "\n";
        print $link->as_text, "\n";
    }