Krambambuli has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

I've just worked out a solution for a specific problem I had to solve, and I'd like to share the results.

However, I'm unsure about two things:
  1. Should I post/submit this code somewhere?
  2. If yes, what would be the recommended way to proceed?
As for the problem itself: I wanted to be able to extract links _and_ their "associated text", even if that is a fairly vague notion.

So I decided to write the code that, given the following HTML (note the intentionally nested links and somewhat sloppy HTML):
<garbage>
<a href="URL1"> text1.0 <img src="SRC1"> text1.1 <br> <garbage2> text1.2
  <a href="URL2"> text2.0 <img src="SRC2"> text2.1 <br> <garbage2> <garbage3> text2.2
    <a href="URL3"> text3.0 <img src="SRC3"> text3.1 <br> <garbage3> text3.2 <x1 a="a" b="b"> text3.3 </a>
  text2.3 <haha> </a>
  text1.3 <oho>
  <a href="URL4"> text4.0 <img src="SRC4"> text4.1 <br> <garbage4> text4.2 <x1 a="a" b="b"> text4.3 </a>
  text1.4
</a>
would produce the following (or similar) output:
A URL3 TEXT: ' text3.0 text3.1 text3.2 text3.3 '
A URL2 TEXT: ' text2.0 text2.1 text2.2 text2.3 '
A URL4 TEXT: ' text4.0 text4.1 text4.2 text4.3 '
A URL1 TEXT: ' text1.0 text1.1 text1.2 text1.3 text1.4 '
I did my homework and searched first, and found a number of good nodes, among them:

Extracting full links from HTML,
Process a HTML file to get information from it.,
Getting to grips with HTML::Parser.

I decided to make a first attempt with HTML::Parser; I will probably continue and add at least a few other variants, as I'm curious to get hands-on knowledge of HTML::LinkExtractor and HTML::TokeParser.

The code I've come up with so far is on my Scratchpad.
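
For readers who cannot see the Scratchpad, here is a minimal sketch of how an HTML::Parser solution might look. It is an illustration only, not the Scratchpad code: the names extract_links, @open and @done are made up for the example, and the rule "text belongs to the innermost link that is still open" is an assumption about what "associated text" should mean.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: returns a list of { href => ..., text => ... } hashes,
# in the order in which the links are closed (innermost first).
sub extract_links {
    my ($html) = @_;
    my (@open, @done);    # stack of open links, list of finished links

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $attr) = @_;
            push @open, { href => $attr->{href}, text => '' } if $tag eq 'a';
        }, 'tagname, attr' ],
        text_h => [ sub {
            my ($text) = @_;
            # Assumption: text is credited to the innermost open link only.
            $open[-1]{text} .= $text if @open;
        }, 'dtext' ],
        end_h => [ sub {
            my ($tag) = @_;
            push @done, pop @open if $tag eq 'a' and @open;
        }, 'tagname' ],
    );
    $parser->parse($html);
    $parser->eof;

    push @done, reverse @open;    # links never closed in sloppy HTML

    for my $link (@done) {
        $link->{text} =~ s/\s+/ /g;     # collapse runs of whitespace
        $link->{text} =~ s/^ | $//g;    # trim both ends
    }
    return grep { defined $_->{href} } @done;    # skip anchors without href
}

my $sample = '<a href="U1"> one <b> x </b> <a href="U2"> two </a> three </a>';
printf "A %s TEXT: '%s'\n", $_->{href}, $_->{text} for extract_links($sample);
```

Because HTML::Parser only reports events and does not build a tree, the nested links stay nested, and unknown tags such as <garbage2> simply produce start events that the sketch ignores.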

Thank you for any help/suggestions/opinions.

Replies are listed 'Best First'.
Re: Parsing HTML - once again
by GrandFather (Saint) on May 09, 2007 at 21:16 UTC

    You may also be interested in HTML::TreeBuilder. Consider:

    use warnings;
    use strict;
    use HTML::TreeBuilder;

    my $html = <<'HTML';
    <a href="URL1"> text1.0 <img src="SRC1"> text1.1 <br> <garbage2> text1.2
      <a href="URL2"> text2.0 <img src="SRC2"> text2.1 <br> <garbage2> <garbage3> text2.2
        <a href="URL3"> text3.0 <img src="SRC3"> text3.1 <br> <garbage3> text3.2 <x1 a="a" b="b"> text3.3 </a>
      text2.3 <haha> </a>
      text1.3 <oho>
      <a href="URL4"> text4.0 <img src="SRC4"> text4.1 <br> <garbage4> text4.2 <x1 a="a" b="b"> text4.3 </a>
      text1.4
    </a>
    HTML

    my $tree = HTML::TreeBuilder->new_from_content ($html);

    for my $elt ($tree->look_down ('_tag', 'a')) {
        print "A " . $elt->attr ('href') . "\n\tTEXT: '";
        my @text_segs;

        for my $child ($elt->content_list ()) {
            next if ref $child and $child->{_tag} ne 'a';
            last if ref $child;
            push @text_segs, $child;
        }

        print "$_ " for @text_segs;
        print "\n";
    }

    Prints:

    A URL1
        TEXT: ' text1.0 text1.1 text1.2
    A URL2
        TEXT: ' text2.0 text2.1 text2.2
    A URL3
        TEXT: ' text3.0 text3.1 text3.2 text3.3
    A URL4
        TEXT: ' text4.0 text4.1 text4.2 text4.3

    DWIM is Perl's answer to Gödel
      Thank you for this one; I forgot to mention HTML::TreeBuilder, but I had it in mind too.

      I hope to find the time to add the other variations too; once that is done, I'll want to make the style of the different approaches uniform and then comment a bit on the pros and cons of each.

      I'll probably avoid benchmarking (for the reasons given in No More Meaningless Benchmarks!), but I'm not sure yet. In the end, I hope to have collected a few code samples that might make good reading for anyone taking on an HTML parsing task.

      I'm still looking for a good title ("Parsing HTML"?) and a good place to put the whole thing in the end. I'm tempted to make it a set of linked nodes in 'Code Catacombs', but I'm not sure yet.

      Thanks again.
Re: Parsing HTML - once again
by naikonta (Curate) on May 09, 2007 at 15:44 UTC
    Code like this is usually a candidate for Code Catacombs or Cool Uses for Perl. I appreciate your intention to share. But what does your program actually do?

    Open source softwares? Share and enjoy. Make profit from them if you can. Yet, share and enjoy!

      Well, about what the program does.. hmm, I really hoped that showing the input, saying that the program parses it and showing the output would say much more than I'd be able to say in words :)

      So: I want to be able to extract links from HTML _and_ associate each of them with a reasonable link text, even if the links are nested and the text that should pertain to a given link is interspersed with various other tags.

      In order to do so, I wanted to explore the modules that make this sort of operation as easy as possible, so that I get a good start the next time I have to do something similar and know the strengths and weaknesses of the available tools better.

      HTML::Parser would be Way#1; having the same problem solved alongside it with Way#2 (HTML::LinkExtractor), Way#3 (HTML::TokeParser), etc. would give an easy overview/comparison of the available tools/modules.
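
      To give an idea of what such a comparison might show, here is a rough sketch of the HTML::TokeParser variant (Way#3). It is only an illustration under the same assumption as before, namely that text belongs to the innermost link still open; links_via_tokens, @open and @done are names invented for the example.

      ```perl
      use strict;
      use warnings;
      use HTML::TokeParser;
      use HTML::Entities qw(decode_entities);

      # Hypothetical helper: pulls tokens one at a time instead of using callbacks.
      sub links_via_tokens {
          my ($html) = @_;
          my $p = HTML::TokeParser->new(\$html) or die "Cannot create parser";
          my (@open, @done);    # stack of open links, list of finished links

          while (my $tok = $p->get_token) {
              my $type = $tok->[0];
              if ($type eq 'S' and $tok->[1] eq 'a') {
                  push @open, { href => $tok->[2]{href}, text => '' };
              }
              elsif ($type eq 'T' and @open) {
                  # Text tokens arrive undecoded, so entities are decoded here.
                  $open[-1]{text} .= decode_entities($tok->[1]);
              }
              elsif ($type eq 'E' and $tok->[1] eq 'a' and @open) {
                  push @done, pop @open;
              }
          }
          push @done, reverse @open;    # links that were never closed

          for my $link (@done) {
              $link->{text} =~ s/\s+/ /g;     # collapse runs of whitespace
              $link->{text} =~ s/^ | $//g;    # trim both ends
          }
          return grep { defined $_->{href} } @done;
      }

      my $sample = '<a href="U1"> one <a href="U2"> t&amp;w </a> three </a>';
      printf "A %s TEXT: '%s'\n", $_->{href}, $_->{text} for links_via_tokens($sample);
      ```

      The logic is the same stack of open links; the difference is that the loop asks for tokens itself, which can be easier to follow than handlers when the extraction rules grow.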

        Well, about what the program does.. hmm, I really hoped that showing the input, saying that the program parses it and showing the output would say much more than I'd be able to say in words :)
        Well, it's too obvious I needed to ask. You want to share your effort, fine. But what do you expect people to find special about it in the context of sharing? I'm not saying your effort is useless, not at all. Just that you asked in which part of PM your code should be posted.

        So if you want to seek advice on what you've done, SoPW is the place, and that's what you did. And I did suggest two other places. Again, my question was meant to help you clarify for yourself where to post.


        Open source softwares? Share and enjoy. Make profit from them if you can. Yet, share and enjoy!