Allasso has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I am looking for a way to extract links from HTML documents. I have been working with HTML::Parser and HTML::LinkExtor, which do fine for what they do; however, I am finding that the list of links they produce is incomplete. For example, the links found in:

    <style>
    background: url('images/bg_content.jpg')
    </style>
    <style type="text/css" media="screen">@import "../style.css";</style>
    <script type="text/javascript">
    ...
    s1.addParam("flashvars","file=testmovie.flv&image=../media/flash/previews/testmovie_prvw.jpg");
    ...
    </script>

are ignored.

If someone can either:

point out what I am missing in order to use HTML::LinkExtor or HTML::Parser the way I want.

suggest another module that they know to be more robust (so I don't have to keep trying them all)

point me to a resource that lists all possible situations in an HTML document in which a link can occur (whether strictly standard or quirks mode), so that I can write my own.

I will post code if desired, but didn't feel it was necessary for the question as such.

Re: More robust link finding than HTML::LinkExtor/HTML::Parser?
by ww (Archbishop) on May 07, 2011 at 21:56 UTC
    H::LE and H::P skipped what you seem to be calling a link in the code beginning at line 8 because it isn't an HTML link.

    Line 8 of your sample (the opening <script type="text/javascript"> tag) is a declaration that what follows -- until the final </script> -- is to be handled by javascript.

    As to the first two, I tend to lean to Corion's view: they're being handled with css (initially); neither is a simple HTML link... which (without exception that I can think of OTTOMH) implies that the address/filename will be where I have an ellipsis in a construct like:

    <a href="...">rendered_link_Label_here</a>
        or an
    <img src="address...filename.typ">
        or similar.

    Perhaps you should explore for modules which will chase down css and js... or perhaps, depending on your actual goal, you don't need to worry about the stylesheet or flash sources, etc.

      ...because it isn't an HTML link.

      I just called it a "link", meaning something that links to another file. Is there a more appropriate name to call it when it appears in something other than an HTML tag?
        I think the problem is context. Yes, you called it simply "a link" but you did so in the context of purported failures by two HTML-oriented modules.

        Just as you probably wouldn't want to use a fishing net to dig potatoes, the links for which those modules fish are HTML links; rooting around in javascript or styling links with CSS requires a different tool.

        I am unaware of any alternate name or word; I think the solution is to be careful about your context.

Re: More robust link finding than HTML::LinkExtor/HTML::Parser?
by Corion (Patriarch) on May 07, 2011 at 20:08 UTC

    I would say that these links fall outside the realm of "HTML" and more into the realm of "CSS". Scanning through the CSS namespace, I don't see any module that immediately seems to provide a list of outside resources linked to by the CSS though.

Re: More robust link finding than HTML::LinkExtor/HTML::Parser?
by Allasso (Monk) on May 07, 2011 at 22:12 UTC
    Thanks for the input; yes, it sounds reasonable. I do indeed need to know all outside references, so CSS and Javascript are important to me. I am looking into CSS::SAC at the moment, and will report if I find any breakthroughs (Lord willing).
      I think the only way is to load it in a browser and let it work itself through normally. Various add-ins offer to show all the media on a page; you could write an add-in that accessed that list and returned it programmatically rather than displaying it in a GUI.
Re: More robust link finding than HTML::LinkExtor/HTML::Parser?
by Anonymous Monk on May 08, 2011 at 03:18 UTC
      HTML::LinkExtor / HTML::Parser are robust. They do a different job, but they are robust. Implying they aren't robust is poor form.

      Yes, I agree. I was not mindful of the wording of my question.
      Thank you for the links.

      I wish to have a script that works independently of a browser. So I don't think WWW::Mechanize::Firefox will work for me, unless you were seeing a way that I could utilize this to come up with code for a script that works independently of Firefox. If so, please let me know.

      The second link looks more promising, now I just have to try to figure out what Mozilla is doing here :-)

      I believe that HTML::LinkExtor will work fine for extracting the links in the HTML robustly :-); I just need now to find a way to extract them from CSS and JS.
        The second link looks more promising, now I just have to try to figure out what Mozilla is doing here :-)

        The second link is for use with WWW::Mechanize::Firefox.

        You need some kind of browser, something to interpret the javascript, there is no way around that.

        The other candidate is WWW::Scripter, a WWW::Mechanize subclass, but it's an alpha version, and my simple test didn't yield anything useful, :)

        My other thought was to go straight for the supporting module CSS::DOM, but that didn't work out. Same goes for CSS/CSS::SAC/CSS::Tiny.

        I figure this ought to be robust enough for CSS:

        ## http://cpansearch.perl.org/src/NEVESENIN/CSS-Packer-1.000001/lib/CSS/Packer.pm
        our $DICTIONARY = {
            'STRING1' => qr~"(?>(?:(?>[^"\\]+)|\\.|\\"|\\\s)*)"~,
            'STRING2' => qr~'(?>(?:(?>[^'\\]+)|\\.|\\'|\\\s)*)'~
        };
        our $URL    = 'url\(\s*(' . $DICTIONARY->{STRING1} . '|' . $DICTIONARY->{STRING2} . '|[^\'"\s]+?)\s*\)';
        our $IMPORT = '\@import\s+(' . $DICTIONARY->{STRING1} . '|' . $DICTIONARY->{STRING2} . '|' . $URL . ')([^;]*);';
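
For what it's worth, here is a rough sketch of how the patterns quoted above might be applied to pull the targets of url(...) and @import out of a chunk of CSS. The patterns are copied from the CSS::Packer source linked above; the css_links helper and the sample stylesheet are made up for illustration and are not part of CSS::Packer. It only sees what is literally written in the CSS text, so it does nothing for URLs built up in javascript.

```perl
use strict;
use warnings;

# Patterns copied from the CSS::Packer source quoted in the thread.
my %DICTIONARY = (
    STRING1 => qr~"(?>(?:(?>[^"\\]+)|\\.|\\"|\\\s)*)"~,
    STRING2 => qr~'(?>(?:(?>[^'\\]+)|\\.|\\'|\\\s)*)'~,
);
my $URL    = 'url\(\s*(' . $DICTIONARY{STRING1} . '|' . $DICTIONARY{STRING2} . '|[^\'"\s]+?)\s*\)';
my $IMPORT = '\@import\s+(' . $DICTIONARY{STRING1} . '|' . $DICTIONARY{STRING2} . '|' . $URL . ')([^;]*);';

# Hypothetical helper (not part of CSS::Packer): returns the targets of
# quoted @import rules first, then the targets of every url(...).
sub css_links {
    my ($css) = @_;
    my @links;

    # Pass 1: @import "file.css"; and @import 'file.css';
    # When the target is written as url(...), $2 is defined and the
    # entry is left for pass 2 so it is not reported twice.
    while ( $css =~ /$IMPORT/gi ) {
        push @links, $1 unless defined $2;
    }

    # Pass 2: every url(...), quoted or bare, including @import url(...).
    while ( $css =~ /$URL/gi ) {
        push @links, $1;
    }

    # Strip one pair of matching surrounding quotes, if present.
    s/^(["'])(.*)\1$/$2/s for @links;
    return @links;
}

# Sample stylesheet, invented for this example.
my $css = <<'CSS';
@import "../style.css";
body  { background: url('images/bg_content.jpg') }
.logo { background: url(images/logo.png) no-repeat }
CSS

print "$_\n" for css_links($css);
```

To use this on a whole page you would still have to collect the CSS yourself first (the contents of <style> elements, style attributes, and any linked stylesheets), and relative paths would need resolving against the right base URL.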