I must admit: I don't understand the DOM and scraping websites is a pain if you don't know it.

The basics of the DOM are actually not all too difficult - it's basically a tree structure with nodes of different types. They're often represented as objects with a base "node" class that supports methods like "what are the children of this node", and the different node types are implemented as subclasses of this node (XML::LibXML works this way; Mojo::DOM AFAIK doesn't, but these are just implementation details). The two most common are "element" nodes, which represent <element>s (including their attributes), and text nodes, which represent any text in between elements. There are also "comment" nodes that represent <!-- comments -->, etc.
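To make the node types a bit more concrete, here is a minimal sketch that asks Mojo::DOM for the type of each child of an element. The HTML snippet is made up for illustration; note that Mojo::DOM calls element nodes "tag".

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical snippet: one text node, one comment node, one element node
my $dom = Mojo::DOM->new('<div>text<!-- note --><em>elem</em></div>');

# child_nodes returns all children, not just elements;
# each node reports its own type
my @types = $dom->at('div')->child_nodes->map('type')->each;

print join(', ', @types), "\n";
```

With ->children instead of ->child_nodes you would only get the element nodes, which is usually what you want when scraping.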

In my experience, probably one of the most common things to confuse people is that this structure is very formal and rigid, so the answer to a question like "what is the text content of <p>Hello, <b>cool</b> World!</p>?" is not as obvious as one might think. This <p> element has three children: the text "Hello, ", the element <b>, and the text " World!". To get all the text content means to walk down the tree and include the text child node "cool" of the <b> element too. Most libraries have functions that do this for you though.
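As a rough illustration of that difference, assuming a reasonably recent Mojolicious (older versions handled whitespace in ->text differently), the following sketch compares the direct text of the <p> element with its full text content:

```perl
use strict;
use warnings;
use Mojo::DOM;

my $p = Mojo::DOM->new('<p>Hello, <b>cool</b> World!</p>')->at('p');

# Three direct children: a text node, the <b> element, another text node
my $count = $p->child_nodes->size;

# ->text only looks at the text nodes directly below <p>,
# ->all_text also descends into <b>
my $direct = $p->text;
my $all    = $p->all_text;

print "children: $count\n";
print "text:     '$direct'\n";
print "all_text: '$all'\n";
```

Here ->all_text is the "function that does this for you": it includes "cool", while ->text leaves it out.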

Anyway, one nice thing about Mojo::DOM is that it supports CSS selectors. These are related to the DOM of course, but they actually simplify finding things in the DOM a lot. They're a little bit like a more flexible XPath. See Mojo::DOM::CSS: ids can be selected via #idname and classes can be selected via .classname, with automatic handling of multiple classes. For example, your class="fa fa fa-mobile-phone" can be selected via .fa-mobile-phone or perhaps .fa.fa-mobile-phone, though interestingly I don't see a mention of the latter in the docs (it's in the W3C specs though).
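Here's a small sketch of those selectors in action. The markup, the id "contact", and the phone numbers are all made up for the example; only the class names are borrowed from your HTML.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup, loosely modelled on the class names mentioned above
my $dom = Mojo::DOM->new(<<'HTML');
<div id="contact">
  <span><i class="fa fa fa-mobile-phone"></i>0123</span>
  <span><i class="fa fa fa-phone"></i>4567</span>
</div>
HTML

my $by_id    = $dom->at('#contact');                 # select by id
my $by_class = $dom->find('.fa-mobile-phone');       # one class out of several
my $by_both  = $dom->find('.fa.fa-mobile-phone');    # must have both classes
my $icons    = $dom->find('#contact .fa');           # any .fa below #contact

printf "%s: %d match(es)\n", @$_ for
    [ '.fa-mobile-phone',    $by_class->size ],
    [ '.fa.fa-mobile-phone', $by_both->size  ],
    [ '#contact .fa',        $icons->size    ];
```

->at returns the first match (or undef), while ->find returns a Mojo::Collection of all matches.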

Your HTML appears to be structured as an element with class="item-list" that contains <div class="item">s holding the data, so that's what I'd start with. What I think is quite strange is <span><i class="fa fa fa-phone"></i>011111111</span>: it's unclear to me why the class="fa fa fa-phone" isn't on the <span> that actually contains the data but is instead on the empty <i> in front of it. But oh well, we can deal with that too. (Update: Oh, they're Font Awesome icons.)

use Mojo::Base -strict, -signatures;
use Mojo::DOM;
use Mojo::Util qw/trim dumper/;

# slurp the whole HTML document from the __DATA__ section (not shown here)
my $dom = Mojo::DOM->new( do { local $/; <DATA> } );

my %members;
$dom->find('#members-list .item')->map(sub {
    # assume only one .item-title (use ->find instead of ->at otherwise)
    my $name = trim( $_->at('.item-title')->all_text );
    $_->find('.woffice-xprofile-list .fa')->map(sub {
        my $class = $_->attr('class');
        # go up one node from the <i> to the <span>
        my $content = $_->parent->all_text;
        # assume no duplicates
        $members{$name}{$class} = $content;
    });
});
print dumper(\%members);

In reply to Re: extracting sub elements from DOM by class by haukex
in thread extracting sub elements from DOM by class by Discipulus
