dbarron has asked for the wisdom of the Perl Monks concerning the following question:

OK, I need to parse web pages (which I wrote, but which have since been modified by others) and extract the pertinent information stored within <div> </div> tags of class 'listing'. I'll list a sample entry below and then detail what I wish to parse out of it, with the format of another sample. Hopefully this will make sense, and I'll gladly accept any advice as to which modules to use to make this easier.
<div class="listing"> Agave parryi&nbsp;&nbsp;&nbsp;&nbsp; <span style="font-weight: normal;">Parry's agave</span> <br>$20.00&nbsp; 3 quart&nbsp;&nbsp;&nbsp; $12.00 Quart <br><span id="native">Native</span>&nbsp;&nbsp;&nbsp; Sun to part shade&nbsp; Zones 5-10&nbsp; Family: <i>Amaryllidaceae</i> <br>From the Southwest comes this lovely agave.&nbsp; Thick spiny leaves adorn this hardy agave.&nbsp; Ultimate clump size is about 36" with each leaf being maybe 5" across. The flower stalk can reach 12 feet tall. Please plant in well drained soil in a place where children don't play. <span id="hummingbird">Hummingbirds</span> </div>
Ok, what I'd like to get out of this (and there's a lot more html junk around it to ignore) is:
Latin name (ie agave parryi)
Common name (Parry's agave)
Pot price ($20.00)
Pot size (3 quart)
Pot price ($12.00)
Pot size (quart)
Origin: Native
Exposure: Sun to part shade
Hardiness: 5-10
Family: Amaryllidaceae
Text description: From the Southwest comes this lovely agave.  Thick spiny leaves adorn this hardy agave.  Ultimate clump size is about 36" with each leaf being maybe 5" across. The flower stalk can reach 12 feet tall. Please plant in well drained soil in a place where children don't play.
Special Features: Hummingbirds (there are others of those, but I think I can handle generalizing them)
Ok, sorry for such a long post...but I wanted to give a good thorough example.

Replies are listed 'Best First'.
Re: HTML Parsing (ick)
by tangent (Parson) on Aug 20, 2014 at 04:01 UTC
    This little snippet should get you most of the way:
    use strict;
    use warnings;
    use HTML::TreeBuilder::XPath;

    my $html = q|<div class="listing"> Agave parryi&nbsp;&nbsp;&nbsp;&nbsp; ... </div>|;
    $html =~ s/&nbsp;/ /g;    # turn non-breaking space entities into plain spaces

    my $tree  = HTML::TreeBuilder::XPath->new_from_content($html);
    my @nodes = $tree->findnodes('//div[@class="listing"]');

    for my $node (@nodes) {
        # content_list returns child elements (objects) and bare text (plain strings)
        for my $content ($node->content_list) {
            if (ref $content) {
                my $text = $content->as_text or next;    # skip empty elements such as <br>
                my $tag  = $content->tag;
                print "<$tag> $text\n\n";
            }
            else {
                print "$content\n\n";
            }
        }
    }
    Output:
    Agave parryi
    <span> Parry's agave
    $20.00 3 quart $12.00 Quart
    <span> Native
    Sun to part shade Zones 5-10 Family:
    <i> Amaryllidaceae
    From the Southwest comes this lovely agave. Thick spiny leaves adorn this hardy agave. Ultimate clump size is about 36" with each leaf being maybe 5" across. The flower stalk can reach 12 feet tall. Please plant in well drained soil in a place where children don't play.
    <span> Hummingbirds
      Ah....excellent, and my thanks, tangent. Now I'll see if I can get the TreeBuilder module installed under Windows. I couldn't get one of its dependencies to compile cleanly under Linux; I think it may have been HTML::Entities.
        Ok, after beating the dead horse (cpan) a bit...and manually doing makes and make installs....I have your test program running and ready to expand it! Thanks again.
Re: HTML Parsing (ick)
by Anonymous Monk on Aug 19, 2014 at 20:27 UTC
      Yes, I did, but I wondered whether HTML::Parser was the appropriate choice, as (for instance) the description element was apparently not retrievable via its methods.
      In that case, it seemed I might as well write it all from scratch (and I didn't want to do that). I used to be fairly good with Perl, but I haven't really used it in about five years (change of occupations, lifestyle, etc). Thus, I was trolling the waters to see if anyone had some good suggestions besides what I'd recently found.
Re: HTML Parsing (ick)
by Anonymous Monk on Aug 19, 2014 at 22:02 UTC
Re: HTML Parsing (ick)
by Anonymous Monk on Aug 19, 2014 at 21:19 UTC

    In addition to parsing the HTML, it looks like you'll need some regular expressions to pull the individual fields out of the text. What is your level of experience with those? For beginners, there are a number of tutorials, for example perlrequick, perlretut, and the ones here on PerlMonks. If you already have some experience with them, could you show your current code and describe what it's supposed to be doing vs. what it's actually doing? (How do I post a question effectively?)
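As a rough illustration of the regular-expression step, here is a minimal sketch that works on two of the text chunks from the sample listing once the &nbsp; entities have been turned into spaces. The helper names parse_pots and parse_status are made up for this example, and the patterns assume every listing follows the same layout as the sample (a price followed by a size, and an exposure followed by "Zones N-M"); real pages that others have edited may need looser patterns.

```perl
use strict;
use warnings;

# Hypothetical helper: split a line such as
# "$20.00  3 quart    $12.00 Quart" into price/size pairs.
sub parse_pots {
    my ($line) = @_;
    my @pots;
    # A price, then everything up to the next "$" (or end of string) is the size.
    while ($line =~ /(\$\d+(?:\.\d{2})?)\s+([^\$]+?)\s*(?=\$|\z)/g) {
        push @pots, { price => $1, size => $2 };
    }
    return @pots;
}

# Hypothetical helper: pull exposure and hardiness zones out of a line such as
# "    Sun to part shade  Zones 5-10  Family: ".
sub parse_status {
    my ($line) = @_;
    my %field;
    if ($line =~ /^\s*(.+?)\s+Zones\s+(\d+)\s*-\s*(\d+)/) {
        @field{qw(exposure zone_min zone_max)} = ($1, $2, $3);
    }
    return %field;
}

my @pots = parse_pots('$20.00  3 quart    $12.00 Quart ');
printf "%s for %s\n", $_->{price}, $_->{size} for @pots;

my %status = parse_status('    Sun to part shade  Zones 5-10  Family: ');
print "Exposure: $status{exposure}; Hardiness: $status{zone_min}-$status{zone_max}\n";
```

Both helpers return nothing when a line does not match, so the caller can log the listings that need a closer look instead of silently storing bad data.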

Re: HTML Parsing (ick)
by locked_user sundialsvc4 (Abbot) on Aug 20, 2014 at 12:27 UTC

    What you would ideally have done (at the time) was to wrap each significant piece of content in a <span> with, say, an identifying class=, even if that class definition specified nothing further as to the content.   The class name would have served as a semantic tag that conclusively identifies, within the data stream itself, what each relevant bit of content is, so that an XPath expression (like the one shown in a previous comment) could be used consistently to extract it.   Without such tags, “parsing the HTML is the easy part, and reliably picking out the data within that HTML is the hard part.”   It will depend on finding totally reliable place markers within the templates, and on making 100% sure that the extraction gets all the right data in every case.

    That being said ... what are your chances, now, of being able to make changes to the templates which (I hope ...) drive the production of those web pages?   Or do you work for someone else now?   ;-)   If you could add span tags with dummy class names, that certainly would make this job far easier and more reliable.   (With such tags, the whole job could be done using XSLT stylesheets.)
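To make the idea of class-tagged spans concrete, here is a small sketch using HTML::TreeBuilder::XPath, the module from the earlier reply, instead of XSLT. The markup and the class names (latin, common, exposure, zones) are invented for illustration; they are not in the original pages and would have to be added to the templates first.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical markup: these class names are assumptions, not the real pages.
my $html = <<'HTML';
<div class="listing">
  <span class="latin">Agave parryi</span>
  <span class="common">Parry's agave</span>
  <span class="exposure">Sun to part shade</span>
  <span class="zones">5-10</span>
</div>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

my @plants;
for my $listing ($tree->findnodes('//div[@class="listing"]')) {
    my %plant;
    for my $field (qw(latin common exposure zones)) {
        # One XPath per field, relative to the current listing;
        # an empty string comes back when the span is missing.
        $plant{$field} = $listing->findvalue(qq{./span[\@class="$field"]});
    }
    push @plants, \%plant;
}
$tree->delete;    # free the parse tree once the data has been copied out

print "$plants[0]{latin} ($plants[0]{common}), zones $plants[0]{zones}\n";
```

With markup like this, each field is addressed by name rather than by its position in the text, so reordering or restyling a listing would not break the extraction.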

      My plan is to suck all the data in the web pages into a database and make it a database-driven system with dynamic web pages. I ran into problems with the complexity of the website (much of it style sheet and editing program issues, and possibly user error) and decided the best thing was to avoid those complexities and make it all form based. Yes, in hindsight, if I had known I was going that way, I could have tagged even more....