Perl HTML confusion...

AI Cowboy has asked for the wisdom of the Perl Monks concerning the following question:

I'm having trouble with using Perl to parse an HTML file I have, where I'm trying to grab all <a> and <div> tags if the link or text content matches a certain format (I use a regex for this). However, WWW::Mechanize can only find links (<a> tags), not <div> tags, so that doesn't work. I've tried learning HTML::TreeBuilder but it seems that my brain doesn't understand the documentation very well for some reason.

I'm wondering if you chaps can either direct me to a better, cleaner Perl module that can extract all tags and let me analyze their attributes/text, or help me with my problem with HTML::TreeBuilder?

My problem is that with, for example, http://search.cpan.org/~cjm/HTML-Tree-5.03/lib/HTML/Element.pm#find_by_tag_name, I have no idea what $h is, or where it's coming from. It seems to me that the documentation for TreeBuilder and Element uses variables without explicitly explaining what they are, and this hurts my brain. Some help would be wonderful, as I need to finish this project by the end of the week for my job, and I'm not sure what to do or why I'm not understanding this.

Re: Perl HTML confusion... by trippledubs (Deacon) on Sep 17, 2013 at 20:30 UTC
If you want to get all the divs on a page, you make a tree out of the full page, look down the tree and collect all the divs, and then you can use the as_text method to print out just the text. `use HTML::TreeBuilder; my $tree = HTML::TreeBuilder->new_from_file('test.html'); my @divs = $tree->look_down(_tag => 'div'); print $divs[0]->as_text();` I saved this node to test.html, and so it outputs your first post. Output: I'm having trouble with using Perl to parse an HTML file I have, where I'm trying to grab all <a>... I'm not going to repost it all, but the full text of your first post is there. When you want to match regular expressions against an element's text, you have to pass a sub ref to look_down (for attribute values, look_down also accepts a qr// pattern as the value). There is an example in HTML::Element. Also, here is a quick intro: HTML::Tree(Builder) in 6 minutes. And a more thorough article: HTML::Tree::Scanning
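The sub ref form of look_down mentioned above can be sketched roughly as follows. This is only a minimal illustration: the HTML snippet is made up, and `$pattern` (a four-digit number) is a hypothetical stand-in for whatever format the original poster's regex actually checks. It collects every `a` or `div` element whose `href` or text content matches.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up sample page; in real use this would come from new_from_file().
my $html = <<'HTML';
<html><body>
<div>Order 1234 shipped</div>
<div>No order here</div>
<a href="/orders/1234">details</a>
<a href="/about">about</a>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Hypothetical format to look for: a standalone four-digit number.
my $pattern = qr/\b\d{4}\b/;

my @matches = $tree->look_down(
    # Attribute criteria may be a regex; _tag holds the element name.
    _tag => qr/^(?:a|div)$/,
    # A sub ref receives each candidate element and returns true to keep it.
    sub {
        my $el   = shift;
        my $href = $el->attr('href');
        return 1 if defined $href && $href =~ $pattern;
        return $el->as_text =~ $pattern ? 1 : 0;
    },
);

for my $el (@matches) {
    printf "%s: %s\n", $el->tag, $el->as_text;
}
```

One thing to keep in mind with this approach: as_text includes the text of all descendants, so an outer `div` that wraps a matching inner `div` would be returned as well.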
Re^2: Perl HTML confusion... by AI Cowboy (Beadle) on Sep 17, 2013 at 22:29 UTC
Many thanks for your post and help! This is great :)
Re: Perl HTML confusion... by marinersk (Priest) on Sep 17, 2013 at 17:59 UTC
Welcome AI Cowboy, On the same page you mention ( http://search.cpan.org/~cjm/HTML-Tree-5.03/lib/HTML/Element.pm ), if you scroll to the top and search for `$h`, the first hit shows: `BASIC METHODS new $h = HTML::Element->new('tag', 'attrname' => 'value', ... ); This constructor method returns a new HTML::Element object. The tag name is a required argument; it will be forced to lowercase. Optionally, you can specify other initial attributes at object creation time.` What might not have been clear to you is that this was intended to set your expectations that all references to `$h` on that page were going to refer to an instantiated `HTML::Element` object. As to your original request (recommend a module), I'll leave that to the Monks who dabble far more often in HTML manipulation than I do. Good luck in the hunt!
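In practice, an `$h` usually comes out of a parsed tree rather than from calling the constructor by hand. The sketch below, using a made-up one-line document, shows two things: the tree returned by HTML::TreeBuilder is itself an HTML::Element (so every `$h->method` in the documentation can be called on it), and find_by_tag_name returns a list of further HTML::Element objects, one per matching tag.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up document, just large enough to contain two divs and a link.
my $tree = HTML::TreeBuilder->new_from_content(
    '<html><body><div>first</div><p><a href="/x">link</a></p><div>second</div></body></html>'
);

# HTML::TreeBuilder is a subclass of HTML::Element, so the tree is an "$h" too.
print "tree is an HTML::Element\n" if $tree->isa('HTML::Element');

# find_by_tag_name accepts one or more tag names and, in list context,
# returns every element at or below $tree with one of those names.
my @found = $tree->find_by_tag_name('div', 'a');

for my $h (@found) {
    # Each $h here is a single element: one <div> or one <a>.
    print ref($h), ' <', $h->tag, '> ', $h->as_text, "\n";
}
```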
Re^2: Perl HTML confusion... by AI Cowboy (Beadle) on Sep 17, 2013 at 18:06 UTC
That makes sense, but it raises one more question: is an HTML::Element object such as $h one element, or all the elements on a page? Basically... `$a = HTML::Element->new('a', href => 'http://www.perl.com/');` What does this instantiate? Is it looking for all links on the page www.perl.com? Sorry if this is a silly question, never used these objects/modules before today/yesterday :)
Re^3: Perl HTML confusion... by marinersk (Priest) on Sep 17, 2013 at 18:26 UTC
Yeah, sorry for opening Pandora's Box and then running away. Whatever the `HTML::Element` object is, `$h` is a single instance of that object. What is it? I'd have to read the documentation, same as you. I would expect that someone who actually uses the module will wander by and pick up the torch here. My advice on this topic would be increasingly (if that's even possible) more theoretical than it already is.
Re^3: Perl HTML confusion... by Anonymous Monk on Sep 18, 2013 at 02:49 UTC
`bless({ _tag => "a", href => "http://www.perl.com/" }, "HTML::Element")`
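To spell out what that data structure implies: the constructor builds one standalone element in memory. It does not fetch www.perl.com and does not search anything. A small sketch that pokes at such an object (the variable is named `$link` here instead of `$a`, since `$a` is also the special variable used by sort):

```perl
use strict;
use warnings;
use HTML::Element;

# Creates exactly one <a> element; no network access, no parsing.
my $link = HTML::Element->new('a', href => 'http://www.perl.com/');

# Give the element some text content so it renders as a usable link.
$link->push_content('The Perl homepage');

print $link->tag, "\n";           # the element name
print $link->attr('href'), "\n";  # the attribute passed to new()
print $link->as_HTML, "\n";       # the element serialized as markup

# A freshly built element belongs to no tree yet, so it has no parent.
print "no parent\n" unless defined $link->parent;
```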
Re: Perl HTML confusion... by Happy-the-monk (Canon) on Sep 17, 2013 at 18:05 UTC
`<\a> and <\div>` You are making me dizzy... now, you might not have noticed while typing too fast: slashes in closing xml and html elements go all forward, like this: `/` the character sequences `\a` and `\d` may have funny side effects on stone age terminals... the first one usually rings a bell. Cheers, Sören Créateur des bugs mobiles - let loose once, run everywhere. (hooked on the Perl Programming language)
Re^2: Perl HTML confusion... by AI Cowboy (Beadle) on Sep 17, 2013 at 18:08 UTC
Actually, I was simply trying to find a way to type those tags, without the form thinking I was trying to type PerlMonks-approved HTML into it. Sorry for the confusion!
Re^3: Perl HTML confusion... by marinersk (Priest) on Sep 17, 2013 at 18:34 UTC
Ah, for that I go old school: `&lt;a&gt;` yields `<a>`, for example.
Re^4: Perl HTML confusion... by Happy-the-monk (Canon) on Sep 17, 2013 at 18:46 UTC
Re^5: Perl HTML confusion... by marinersk (Priest) on Sep 17, 2013 at 18:51 UTC
Re: Perl HTML confusion... by Anonymous Monk on Sep 18, 2013 at 03:11 UTC
https://metacpan.org/module/HTML::Tree#DESCRIPTION https://metacpan.org/module/HTML::Tree#new_from_content

    #!/usr/bin/perl --
    use strict; use warnings;
    use HTML::Tree;
    use Data::Dump qw/dd pp /;
    my $t = HTML::Tree->new_from_content(q{
    <a href="1">1</a>
    <a href="q">q</a>
    <a src="ERR">ERR</a>
    <img src="2">
    <link href="3">
    <link src="4ERR">
    });
    dd( $t );
    for my $l ( $t->look_down( qw/ _tag a /) ){
        { local $$l{_parent}; dd( $l ); }
        dd( ref $l, $l->tag, $l->attr('href'), $l->as_text );
    }
    for my $leat (@{ $t->extract_links( ) }) {
        my($link, $element, $attr, $tag) = @$leat;
        print "Hey, there's a '$tag' that links to ", $link, ", in its '$attr' attribute, at ", $element->address(), ".\n";
    }
    print "\n", $t->as_HTML(undef, ' ');
    __END__

see more stuff: Re: TreeBuilder and encoding, Use the twig :) Processing XML efficiently with Perl and XML::Twig

See also htmltreexpather.pl and xpather.pl

htmltreexpather.pl: Parsing HTML / Re^4: Parsing HTML, A regex question, NASA's Astronomy Picture of the Day / Re: NASA's Astronomy Picture of the Day, Re: Extracting HTML content between the h tags, Re^2: Help With Online Table Scraper, Re^4: web::scraper using an xpath, .... HTML Parser suggestions

xpather.pl: Re: Get Node Value from irregular XML (xpather.pl), Re: Having trouble with siblings, Re^2: XML parsing and Lists, Re: Counting number of child nodes based on element value (typos), Re^3: Extracting specific childnodes (xpath whitespace), Re^3: Extracting specific childnodes (play xmllint --shell ), Re: How do i get value of an element if the next elememnt has specific value in XML::LibXML using Xpath?, Re: How to parse xml with namespase vale in XMl:LibXML? ( XPath error : Undefined namespace prefix ), Re^2: How to parse xml with namespase vale in XMl:LibXML? (xmllint --shell setns / xpathtester)

There is a better way :) because xml::parser is low level, you should parse html/xml with xpath/twig/dom, Re: How to grab a portion of file with regex (don't), HTML Parser suggestions

See also the real discouragement Oh Yes You Can Use Regexes to Parse HTML! and the real encouragement Re^2: parsing XML fragments (xml log files) with... a regex, How do I match XML, HTML, or other nasty, ugly things with a regex?, How do I remove HTML from a string?, Re: Parsing webpages