Re: extracting sub elements from DOM by class

I must admit: I dont understand the DOM and scraping websites is a pain if you dont know it.

The basics of the DOM are actually not all that difficult - it's basically a tree structure with nodes of different types. They're often represented as objects with a base "node" class that supports methods like "what are the children of this node", and the different node types are implemented as subclasses of this node (XML::LibXML works this way; Mojo::DOM AFAIK doesn't, but these are just implementation details). The two most common are "element" nodes, which represent <element>s (including their attributes), and text nodes, which represent any text in between elements. There are also "comment" nodes that represent <!-- comments -->, etc.
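To make the node types a bit more concrete, here is a minimal sketch that walks a tiny document and reports each node's type. The markup is made up for illustration; descendant_nodes and type are Mojo::DOM methods.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup, chosen only because it contains an element,
# a comment, and some text.
my $dom = Mojo::DOM->new('<div id="box"><!-- a note -->Hi <b>there</b></div>');

# descendant_nodes collects every node below the root;
# type tells us what kind of node each one is (tag, text, comment, ...).
my @types = $dom->descendant_nodes->map('type')->each;
print join(', ', @types), "\n";
```

Mojo::DOM calls element nodes "tag", so that's the name to look for in the output.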

In my experience, one of the most common things to confuse people is that this structure is very formal and rigid, so a question like "what is the text content of an element that renders as 'Hello, cool World!', where 'cool' is wrapped in a child element?" is not as obvious as one might think. The outer element has three children: the text "Hello, ", the child element, and the text " World!". To get all the text content means to walk down the tree and include the text child node "cool" of the child element too. Most libraries have functions that do this for you though.
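Here's a small sketch of that difference in Mojo::DOM. The tag names <p> and <b> are my assumption, picked just to have an outer element with one nested element; child_nodes, text and all_text are Mojo::DOM methods.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup: <p> and <b> are assumed tag names.
my $dom = Mojo::DOM->new('<p>Hello, <b>cool</b> World!</p>');
my $p   = $dom->at('p');

# The <p> has three direct children: text, element, text.
print join(', ', $p->child_nodes->map('type')->each), "\n";

# text only looks at the element's own text nodes, so "cool" is missing ...
print $p->text, "\n";

# ... while all_text also descends into child elements.
print $p->all_text, "\n";    # Hello, cool World!
```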

Anyway, one nice thing about Mojo::DOM is that it supports CSS selectors. This is related to the DOM of course, but actually simplifies finding things in the DOM a lot. They're a little bit like a more flexible XPath. See Mojo::DOM::CSS: ids can be selected via #idname and classes can be selected via .classname, with automatic handling of multiple classes, e.g. your class="fa fa fa-mobile-phone" can be selected via e.g. .fa-mobile-phone or perhaps .fa.fa-mobile-phone, though interestingly I don't see a mention of the latter in the docs (it's in the W3C specs though).
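A quick sketch of those selectors, on markup I made up for the purpose (the id "contact" and the second phone number are not from your page):

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup for illustration only.
my $dom = Mojo::DOM->new(<<'HTML');
<div id="contact">
  <span><i class="fa fa fa-mobile-phone"></i>022222222</span>
  <span><i class="fa fa fa-phone"></i>011111111</span>
</div>
HTML

my $by_id    = $dom->at('#contact');                # select by id
my $by_class = $dom->find('.fa-mobile-phone');      # select by one class
my $by_both  = $dom->find('.fa.fa-mobile-phone');   # must have both classes
my $all_fa   = $dom->find('.fa');                   # matches both <i> elements

printf "%s: %d / %d / %d\n",
    $by_id->tag, $by_class->size, $by_both->size, $all_fa->size;
```

at returns the first match (or undef), find returns a Mojo::Collection of all matches.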

Your HTML appears to be structured as an element with class="item-list" containing <div class="item">s that hold the data, so that's what I'd start with. What I think is quite strange is the markup around values like 011111111: it's unclear to me why the class="fa fa fa-phone" isn't on the <span> that actually contains the data but is instead on the empty <i> in front of it. But oh well, we can deal with that too. (Update: Oh, they're Font Awesome icons.)

use Mojo::Base -strict, -signatures;
use Mojo::DOM;
use Mojo::Util qw/trim dumper/;

my $dom = Mojo::DOM->new( do { local $/; <DATA> } );

my %members;
$dom->find('#members-list .item')->map(sub {
    # assume only one .item-title (use ->find instead of ->at otherwise)
    my $name = trim( $_->at('.item-title')->all_text );
    $_->find('.woffice-xprofile-list .fa')->map(sub {
        my $class = $_->attr('class');
        # go up one node from the <i> to the <span>
        my $content = $_->parent->all_text;
        # assume no duplicates
        $members{$name}{$class} = $content;
    });
});
print dumper(\%members);
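
Since the script above reads its input from DATA, here is a self-contained variant you can run directly. The HTML in it is invented: only the id and class names come from the selectors above, and the member name and the second number are placeholders. I've also used ->each instead of ->map (nothing is returned from the callbacks) and trimmed the values.

```perl
use Mojo::Base -strict;
use Mojo::DOM;
use Mojo::Util qw/trim dumper/;

# Invented sample input; the real page will certainly look different.
my $dom = Mojo::DOM->new(<<'HTML');
<div id="members-list">
  <div class="item-list">
    <div class="item">
      <div class="item-title"> Alice </div>
      <div class="woffice-xprofile-list">
        <span><i class="fa fa fa-phone"></i>011111111</span>
        <span><i class="fa fa fa-mobile-phone"></i>022222222</span>
      </div>
    </div>
  </div>
</div>
HTML

my %members;
$dom->find('#members-list .item')->each(sub {
    my $name = trim( $_->at('.item-title')->all_text );
    $_->find('.woffice-xprofile-list .fa')->each(sub {
        # key is the icon's full class attribute, value is the <span>'s text
        $members{$name}{ $_->attr('class') } = trim( $_->parent->all_text );
    });
});
print dumper(\%members);
```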