James2000 has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,

Is it possible to create a Perl script to list all of the <a>...</a> tags in an HTML file, including ones that span multiple text lines?

Is it possible to ONLY list the href="..." or href='...' sections of <a>...</a> tags?

Thanks for your help,

James

Re: Listing all <a>...</a> tags in HTML file
by Fletch (Bishop) on Nov 29, 2007 at 18:25 UTC

    If you've got other interests besides the contents of href attributes, then yes, you want to look at one of the parsing modules (see also HTML::TreeBuilder). But if you just want a list of URLs, HTML::LinkExtor may need less scaffolding to get going.
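For the "just a list of URLs" case, a minimal sketch with HTML::LinkExtor might look like the following. The sample HTML and the variable names are made up for illustration. The callback filters on the tag name because HTML::LinkExtor also reports link attributes from other tags, such as src on <img>.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical sample document; the second <a> tag spans two lines.
my $html = <<'HTML';
<p><a href="http://example.com/one">One</a>
<a class="x"
   href='/two'>Two</a>
<img src="logo.png"></p>
HTML

my @hrefs;

# The callback receives the lowercased tag name, then attribute/value pairs
# for the link attributes found on that tag.
my $extor = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    push @hrefs, $attr{href} if $tag eq 'a' && defined $attr{href};
});

$extor->parse($html);
$extor->eof;    # signal end of document so any buffered text is flushed

print "$_\n" for @hrefs;
```

To read from a file instead of a string, HTML::LinkExtor inherits parse_file from HTML::Parser.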

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

Re: Listing all <a>...</a> tags in HTML file
by toolic (Bishop) on Nov 29, 2007 at 18:18 UTC
    It looks like HTML::TokeParser can do this. Look at the EXAMPLES section. There is even a tutorial here at the monastery on using this CPAN module.
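As a rough sketch of the HTML::TokeParser approach (the sample HTML and variable names are assumptions, not taken from the module's EXAMPLES section), the loop below collects each href together with the link text, including for tags that span several lines:

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample document with <a> elements broken across lines.
my $html = <<'HTML';
<p>See <a href="http://example.com/one">the first
page</a> and <a
  href='/two' title="Second">the second</a>.</p>
HTML

# A scalar reference makes the parser read from the string, not from a file.
my $p = HTML::TokeParser->new(\$html) or die "Cannot parse HTML";

my @links;
while (my $tag = $p->get_tag('a')) {
    my $href = $tag->[1]{href};              # element 1 is the attribute hash
    next unless defined $href;               # skip anchors without an href
    my $text = $p->get_trimmed_text('/a');   # text up to the closing </a>
    push @links, [$href, $text];
}

printf "%s\t%s\n", @$_ for @links;
```

Element 3 of the array returned by get_tag holds the original text of the start tag, which may be useful if the goal is to list the tags themselves rather than just the href values.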
Re: Listing all <a>...</a> tags in HTML file
by locked_user sundialsvc4 (Abbot) on Nov 29, 2007 at 18:46 UTC

    This is exactly one of those situations where you might start panicking at the thought of “regular-expression hell” ... until you see the light at the end of the tunnel:   CPAN!

    If you surf to http://search.cpan.org and type in “HTML::”, you will at this writing be rewarded with 4,408 hits! Well, maybe that's a bit much, so if instead you search for “HTML href parser” you get a mere 104 pre-written packages to choose from.

    So much for futzing with “regular-expression hell!” :-)

    So, now your task stops being one of trying to figure out (as though you were the first person on the planet...) “how do I do this (from scratch)?” (Answer:   you don't!) Instead, you have this broad collection of high-level widgets to choose from, and so now your dual questions become:   “which one of these is the best for my task?” and “how do I use this?” Quite a difference.

    Generally, you'd like to find the most specific widget, the one most focused on your particular task. CPAN gives you a lot of that.

    Dictum Ne Agas:   Do Not Do A Thing Already Done!

    Incidentally... when I select and decide to use a CPAN module for an application-specific purpose, I still like to create a “my-application specific” package for use in my application. This package will encapsulate the “what, not how” of whatever my application is actually trying to do. In this way I compartmentalize my code into just one place, and I will clearly document what “my application” is doing. (Now I can say... “If you want to discover what that is, just perldoc the module. If you want to discover how we're doing it at the moment, read the module's source.”) If the first CPAN-module that I decided to employ isn't cutting the mustard anymore, I can re-implement just this one package so that it employs a different CPAN-module but provides the same services for my application as the previous version of this package did.

    Oops... let me clarify that thought...

    “My application-specific package” will use a CPAN module to do the work ... that's the “how” ... but all of the mumbo-jumbo of actually doing that will be encapsulated in a package that is specific to my app. The rest of the app will use my package, while my package will in turn use the CPAN module to actually get the job done.
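One way to picture that layering is the sketch below. The package name MyApp::Links and its functions are hypothetical, and HTML::TokeParser is just one possible choice for the “how”. Callers only see hrefs() and img_srcs(), so the parsing module could be swapped later without touching the rest of the app.

```perl
use strict;
use warnings;

{
    # Hypothetical application-specific package: callers see "what", not "how".
    package MyApp::Links;
    use HTML::TokeParser;

    # Return the value of $attr for every <$tag> in $html, in document order.
    sub attribute_values {
        my ($html, $tag, $attr) = @_;
        my $p = HTML::TokeParser->new(\$html) or die "Cannot parse HTML";
        my @values;
        while (my $t = $p->get_tag($tag)) {
            push @values, $t->[1]{$attr} if defined $t->[1]{$attr};
        }
        return @values;
    }

    # The services the rest of the application relies on.
    sub hrefs    { attribute_values($_[0], 'a',   'href') }
    sub img_srcs { attribute_values($_[0], 'img', 'src') }
}

# Application code: no mention of the CPAN module here.
my $html  = '<a href="/one">One</a> <img src="logo.png"> <a name="top">no href</a>';
my @hrefs = MyApp::Links::hrefs($html);
my @srcs  = MyApp::Links::img_srcs($html);

print "href: $_\n" for @hrefs;
print "src:  $_\n" for @srcs;
```

In a real application the package would normally live in its own .pm file with POD, so that perldoc documents the “what” as described above.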

      sundialsvc4 wrote:
      Dictum Ne Agas:   Do Not Do A Thing Already Done!

      A corollary of the same principle:

      Wait long enough, and someone will write what you need.

      Sorry, although I took 4 years of it in high school, I can't provide the Latin equivalent.

      Thanks, all, for the quick responses,

      I think HTML::TokeParser will meet all my requirements (for example, listing src=... sections in <img> tags), and I can avoid taking the regular-expression route!

      Thanks again,

      James