HTML Parsing (ick)

dbarron has asked for the wisdom of the Perl Monks concerning the following question:

Ok, I need to parse webpages (that I wrote, but which have been modified by others) and extract pertinent information stored within <div> </div> tags of class 'listing'. I'll list a sample entry below and then detail what I wish to parse out of it, in the format of another sample. Hopefully this will make sense, and I'll gladly accept any advice as to which modules to use to make this easier.

<div class="listing">
Agave parryi&nbsp;&nbsp;&nbsp;&nbsp;
<span style="font-weight: normal;">Parry's agave</span>
<br>$20.00&nbsp; 3 quart&nbsp;&nbsp;&nbsp;
$12.00 Quart
<br><span id="native">Native</span>&nbsp;&nbsp;&nbsp; Sun
to part shade&nbsp; Zones 5-10&nbsp; Family: <i>Amaryllidaceae</i>
<br>From the Southwest comes this lovely agave.&nbsp; Thick spiny
leaves adorn this hardy agave.&nbsp; Ultimate
clump size is about 36" with each leaf being maybe 5" across. The
flower stalk can reach 12 feet tall. Please plant in well drained soil
in a place
where children don't play. <span id="hummingbird">Hummingbirds</span>
</div>

Ok, what I'd like to get out of this (and there's a lot more html junk around it to ignore) is:
Latin name (i.e. Agave parryi)
Common name (Parry's agave)
Pot price ($20.00)
Pot size (3 quart)
Pot price ($12.00)
Pot size (quart)
Origin: Native
Exposure: Sun to part shade
Hardiness: 5-10
Family: Amaryllidaceae
Text description: From the Southwest comes this lovely agave. Thick spiny leaves adorn this hardy agave. Ultimate clump size is about 36" with each leaf being maybe 5" across. The flower stalk can reach 12 feet tall. Please plant in well drained soil in a place where children don't play.
Special Features: Hummingbirds (there are others of those, but I think I can handle generalizing)
Ok, sorry for such a long post, but I wanted to give a good thorough example.


Replies are listed 'Best First'.
Re: HTML Parsing (ick) by tangent (Parson) on Aug 20, 2014 at 04:01 UTC
This little snippet should get you most of the way:

    use HTML::TreeBuilder::XPath;

    my $html = q|<div class="listing"> Agave parryi&nbsp;&nbsp;&nbsp;&nbsp; ... </div>|;
    # turn non-breaking-space entities into plain spaces before parsing
    $html =~ s/&nbsp;/ /g;
    my $tree = HTML::TreeBuilder::XPath->new_from_content($html);
    my @nodes = $tree->findnodes('//div[@class="listing"]');
    for my $node (@nodes) {
        # content_list returns child elements (refs) and bare text (strings)
        my @contents = $node->content_list;
        for my $content (@contents) {
            if (ref $content) {
                # child element: skip empty ones such as <br>
                my $text = $content->as_text or next;
                my $tag = $content->tag;
                print "<$tag> $text\n\n";
            }
            else {
                # plain text sitting directly inside the <div>
                print "$content\n\n";
            }
        }
    }

Output:

    Agave parryi
    <span> Parry's agave
    $20.00 3 quart $12.00 Quart
    <span> Native
    Sun to part shade Zones 5-10 Family:
    <i> Amaryllidaceae
    From the Southwest comes this lovely agave. Thick spiny leaves adorn this hardy agave. Ultimate clump size is about 36" with each leaf being maybe 5" across. The flower stalk can reach 12 feet tall. Please plant in well drained soil in a place where children don't play.
    <span> Hummingbirds
Re^2: HTML Parsing (ick) by dbarron (Novice) on Aug 20, 2014 at 11:54 UTC
Ah, excellent, and my thanks, tangent. Now I'll see if I can get the TreeBuilder module installed under Windows. I couldn't get one of its dependencies to compile cleanly under Linux; I think it may have been HTML::Entities.
Re^3: HTML Parsing (ick) by dbarron (Novice) on Aug 20, 2014 at 13:05 UTC
Ok, after beating the dead horse (cpan) a bit, and manually doing makes and make installs, I have your test program running and ready to expand! Thanks again.
Re: HTML Parsing (ick) by Anonymous Monk on Aug 19, 2014 at 20:27 UTC
HTML parsing with Perl is an extremely common task; did you try Googling it? HTML::Parser and HTML::TreeBuilder immediately come to mind.
Re^2: HTML Parsing (ick) by dbarron (Novice) on Aug 20, 2014 at 00:01 UTC
Yes, I did, but I wondered if I was choosing appropriately in using HTML::Parser, as (for instance) the description element was apparently not retrievable via its methods. In that case, it seemed I might as well write it all from scratch (and I didn't want to do that). I used to be fairly good with Perl, but I haven't really used it in about five years (change of occupations, lifestyle, etc). Thus, I was trolling the waters to see if anyone had some good suggestions besides what I'd recently found.
Re^3: HTML Parsing (ick) by Anonymous Monk on Aug 20, 2014 at 00:33 UTC
"Yes, I did, but I wondered if I was choosing appropriately in using HTML::Parser, as the description element was apparently not retrievable via its methods." HTML::Parser is low level; it's never a solution :) See Re: How to grab a portion of file with regex (don't)(parsing html/xml with xpath/twig/dom, because html::parser is low level), Re: How to grab a portion of file with regex (parsing html/xml with xpath/twig/dom, because xml::parser is low level), and Re^4: How to grab a portion of file with regex (parsing html/xml with xpath/twig/dom, because ::parser is low level).
Re^4: HTML Parsing (ick) by dbarron (Novice) on Aug 20, 2014 at 00:36 UTC
Re: HTML Parsing (ick) by Anonymous Monk on Aug 19, 2014 at 22:02 UTC
See all the links in Re: Retrieve select information from HTML; they're examples, walkthroughs, and tutorials (for tree-xpath and others). Tools like xpather.pl and htmltreexpather.pl can give you paths to start with.
Re: HTML Parsing (ick) by Anonymous Monk on Aug 19, 2014 at 21:19 UTC
In addition to parsing the HTML, it looks like you'll need some regular expressions to pull the text elements out of there too. What is your level of experience with those? For beginners, there are a bunch of tutorials, for example perlrequick, perlretut, and the ones here on PerlMonks. If you already have some experience with them, could you show your current code and describe what it's supposed to be doing vs. what it's actually doing? (How do I post a question effectively?)
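As a minimal sketch of the kind of regular expressions meant here, assume the parser has already handed back the plain text chunks with the entities turned into ordinary spaces. The two sample strings and the variable names ($price_line, $info_line, @pots) are illustrative assumptions, not code from the thread; only core Perl is used.

```perl
use strict;
use warnings;

# Sample text chunks copied from the listing, as they might look after
# the HTML has been parsed (hypothetical variable names).
my $price_line = '$20.00  3 quart    $12.00 Quart';
my $info_line  = 'Sun to part shade  Zones 5-10  Family: ';

# Each pot is a dollar amount followed by a size that runs up to the
# next dollar sign or the end of the string.
my @pots;
while ($price_line =~ /(\$\d+(?:\.\d{2})?)\s+([^\$]+?)\s*(?=\$|\z)/g) {
    push @pots, { price => $1, size => lc $2 };
}

# Exposure is everything before "Zones"; hardiness is the range after it.
my ($exposure, $hardiness) =
    $info_line =~ /^\s*(.+?)\s+Zones\s+(\d+\s*-\s*\d+)/;

printf "Pot price (%s), pot size (%s)\n", $_->{price}, $_->{size} for @pots;
print "Exposure: $exposure\n";
print "Hardiness: $hardiness\n";
```

Patterns like these only hold while every listing follows the same layout, so pages edited by hand are worth spot-checking for entries that fail to match.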
Re: HTML Parsing (ick) by locked_user sundialsvc4 (Abbot) on Aug 20, 2014 at 12:27 UTC
What you would have liked to have done (at the time) was to wrap all of the significant content in `<span>`s which had, say, an identifying `class=`, even if that class definition specified nothing further as to the content. The class name would have served as a semantic tag to conclusively identify, within the data stream itself, what the relevant bits of content were, so that an XPath expression (like the one shown in a previous comment) could have been used consistently to extract it. Otherwise, “parsing the HTML is the easy part, and reliably picking out the data within that HTML is the hard part.” It will depend on finding totally reliable place markers within the templates, and making 100% sure that it gets all the right data in every case. That being said ... what are your chances, now, of being able to make changes to the templates which (I hope ...) drive the production of those web pages? Or do you work for someone else now? `;-)` If you could add span tags with dummy class names, that certainly would make this job far more reliable and easy. (With such tags, the whole job could be done using XSLT stylesheets.)
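A rough sketch of what that could look like, assuming the templates could be changed. The class names (latin, common, pot, price, size, family) and the %plant hash are invented here for illustration and appear nowhere in the original pages; the module is the HTML::TreeBuilder::XPath already used earlier in the thread.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical markup: every field carries an identifying class name.
my $html = <<'HTML';
<div class="listing">
<span class="latin">Agave parryi</span>
<span class="common">Parry's agave</span>
<span class="pot"><span class="price">$20.00</span> <span class="size">3 quart</span></span>
<span class="pot"><span class="price">$12.00</span> <span class="size">Quart</span></span>
<span class="family">Amaryllidaceae</span>
</div>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

my %plant;
for my $listing ($tree->findnodes('//div[@class="listing"]')) {
    # Single-valued fields: one XPath each, relative to this listing.
    $plant{$_} = $listing->findvalue(qq{.//span[\@class="$_"]})
        for qw(latin common family);

    # Repeating fields come back as a list of nodes.
    for my $pot ($listing->findnodes('.//span[@class="pot"]')) {
        push @{ $plant{pots} }, {
            price => $pot->findvalue('./span[@class="price"]'),
            size  => $pot->findvalue('./span[@class="size"]'),
        };
    }
}
$tree->delete;    # free the parse tree

print "$plant{latin} ($plant{common}), family $plant{family}\n";
print "  $_->{price} for $_->{size}\n" for @{ $plant{pots} };
```

With the fields tagged this way, no regular expressions are needed to split prices from sizes, which is the reliability gain the reply describes.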
Re^2: HTML Parsing (ick) by dbarron (Novice) on Aug 20, 2014 at 13:03 UTC
My plan is to suck all the data in the web pages into a database and make it a database-driven system with dynamic web pages. I ran into problems with the complexity of the website (much of it style sheet and editing program issues, and possibly user error) and decided the best thing was to avoid those complexities and make it all form based. Yes, in hindsight, if I had known I was going that way, I could have tagged even more.