http://qs1969.pair.com?node_id=1054495

AI Cowboy has asked for the wisdom of the Perl Monks concerning the following question:

I'm having trouble with using Perl to parse an HTML file I have, where I'm trying to grab all <a> and <div> tags if the link or text content matches a certain format (I use a regex for this). However, WWW::Mechanize can only find links (<a> tags), not <div> tags, so that doesn't work. I've tried learning HTML::TreeBuilder but it seems that my brain doesn't understand the documentation very well for some reason.

I'm wondering if you chaps can either direct me to a better, cleaner Perl module that can extract all tags and let me analyze their attributes/text, or help me with my problem with HTML::TreeBuilder?

My problem is that with, for example, http://search.cpan.org/~cjm/HTML-Tree-5.03/lib/HTML/Element.pm#find_by_tag_name, I have no idea what $h is, or where it's coming from. It seems - to me - the documentation for TreeBuilder and Element use variables without explaining what they are explicitly, and this hurts my brain. Some help would be wonderful, as I need to finish this project by the end of the week for my job, and I'm not sure what to do or why I'm not understanding this.

Replies are listed 'Best First'.
Re: Perl HTML confusion...
by trippledubs (Deacon) on Sep 17, 2013 at 20:30 UTC
    If you want to get all the divs on a page, you make a tree out of the full page, look down the tree to collect all the divs, and then use the as_text method to print out just the text.
    use HTML::TreeBuilder;
    # Parse the whole file into a tree, then collect every <div> element.
    my $tree = HTML::TreeBuilder->new_from_file('test.html');
    my @divs = $tree->look_down(_tag => 'div');
    # Print the text content of the first <div>.
    print $divs[0]->as_text();

    I saved this node to test.html, and so it outputs your first post.

    Output:

    I'm having trouble with using Perl to parse an HTML file I have, where I'm trying to grab all <a>...

    I'm not going to repost it all, but the full text of your first post is there.

    When you want to match an element's text against a regular expression, you pass a sub ref to look_down (for attribute values, look_down also accepts a qr// object directly). There is an example in HTML::Element. Also, here is a quick intro: HTML::Tree(Builder) in 6 minutes. And a more thorough article: HTML::Tree::Scanning
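
    To illustrate the sub ref approach, here is a minimal sketch that collects every <a> and <div> whose href or text matches a regex. The sample markup, the pattern, and the variable names are made up for illustration; swap in your own file (via new_from_file) and your own regex.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample page, standing in for the real file.
my $html = join "\n",
    '<html><body>',
    '<a href="http://example.com/report-2013.pdf">Annual report</a>',
    '<a href="http://example.com/about">About</a>',
    '<div>Order #12345 shipped</div>',
    '<div>Nothing to see</div>',
    '</body></html>';

# Hypothetical pattern; substitute your own regex.
my $pattern = qr/report|Order #\d+/;

my $tree = HTML::TreeBuilder->new_from_content($html);

# look_down calls the sub once per element, passing the element as the
# first argument, and keeps the elements for which it returns true.
my @matches = $tree->look_down(
    sub {
        my $el  = shift;
        my $tag = $el->tag;
        return 0 unless $tag eq 'a' || $tag eq 'div';
        my $href = $el->attr('href');
        return 1 if defined $href && $href =~ $pattern;
        return $el->as_text =~ $pattern ? 1 : 0;
    }
);

# Report the tag name and text of each matching element.
printf "%s: %s\n", $_->tag, $_->as_text for @matches;
```

    Doing the tag check inside the sub keeps everything in one pass over the tree; you could equally run look_down twice, once per tag, and merge the results.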

      Many thanks for your post and help! This is great :)
Re: Perl HTML confusion...
by marinersk (Priest) on Sep 17, 2013 at 17:59 UTC

    Welcome AI Cowboy,

    On the same page you mention ( http://search.cpan.org/~cjm/HTML-Tree-5.03/lib/HTML/Element.pm ), if you scroll to the top and search for $h, the first hit shows:

    BASIC METHODS
    new
    $h = HTML::Element->new('tag', 'attrname' => 'value', ... );

    This constructor method returns a new HTML::Element object. The tag name is a required argument; it will be forced to lowercase. Optionally, you can specify other initial attributes at object creation time.

    What might not have been clear is that this sets the convention for the rest of that page: every reference to $h there means an instantiated HTML::Element object.

    As to your original request (recommend a module), I'll leave that to the Monks who dabble far more often in HTML manipulation than I do.

    Good luck in the hunt!

      That makes sense, but it raises one more question: is an HTML::Element object such as $h one element, or all the elements on a page? Basically... $a = HTML::Element->new('a', href => 'http://www.perl.com/'); What does this instantiate? Is it looking for all the links on the page www.perl.com? Sorry if this is a silly question; I had never used these objects/modules before today/yesterday :)

        Yeah, sorry for opening Pandora's Box and then running away.

        Whatever the HTML::Element object is, $h is a single instance of that object. What is it? I'd have to read the documentation, same as you.

        I would expect that someone who actually uses the module will wander by and pick up the torch here. Any further advice from me on this topic would be even more theoretical (if that's possible) than it already is.

        bless({ _tag => "a", href => "http://www.perl.com/" }, "HTML::Element")
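
        As a sketch of what that dump implies: new builds one detached element in memory; it does not fetch or search any page. The elements of a real page come from parsing it with HTML::TreeBuilder, and each element you find in the tree is itself an HTML::Element, which is what $h stands for in the documentation. The sample markup and variable names below are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Element;
use HTML::TreeBuilder;

# A single <a> element created from scratch; nothing is downloaded.
my $link = HTML::Element->new('a', href => 'http://www.perl.com/');
$link->push_content('The Perl Home Page');
print $link->as_HTML, "\n";

# Elements of an existing document come from parsing it instead.
# Hypothetical markup standing in for a real page.
my $tree = HTML::TreeBuilder->new_from_content(
    '<p>See <a href="/x">x</a> and <a href="/y">y</a></p>'
);

# In list context, find_by_tag_name returns every matching element;
# each one is an HTML::Element, i.e. an "$h" in the docs' notation.
my @links = $tree->find_by_tag_name('a');
print scalar(@links), " links found\n";
print $_->attr('href'), "\n" for @links;
```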

Re: Perl HTML confusion...
by Happy-the-monk (Canon) on Sep 17, 2013 at 18:05 UTC

    <\a> and <\div>

    You are making me dizzy... You might not have noticed while typing too fast, but the slashes in closing XML and HTML elements all go forward, like this: /

    the character sequences \a and \d may have funny side effects on stone age terminals... the first one usually rings a bell.

    Cheers, Sören

    Créateur des bugs mobiles - let loose once, run everywhere.
    (hooked on the Perl Programming language)

      Actually, I was simply trying to find a way to type those tags, without the form thinking I was trying to type PerlMonks-approved HTML into it. Sorry for the confusion!
        Ah, for that I go old school:

        &lt;a&gt; yields <a>, for example.

Re: Perl HTML confusion...
by Anonymous Monk on Sep 18, 2013 at 03:11 UTC