
Re^2: getting text from HTML

by IB2017 (Pilgrim)
on May 04, 2020 at 08:43 UTC ( #11116428=note )

in reply to Re: getting text from HTML
in thread getting text from HTML

I am experimenting further with your great solution, but I have an issue with the text being concatenated. Is there a way to separate, say with a single space, the text snippets the script extracts from the different sections of the page? In the first line of the result you can see "You are hereHome", which should be two separate pieces. I can't see any option for this in my $text = $dom->all_text; (apart from the trim flag, all_text(0), which does not apply here).

Of course I can go with something like

$text = $res->dom('h1, h2, h3, p')->each(sub { say 'text: ', shift->all_text });
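For reference, here is a minimal sketch of how the selected pieces could be joined with a single space instead of printed one by one. The inline HTML string is made up for illustration and only stands in for the fetched page; it assumes a reasonably recent Mojolicious.

```perl
use Mojo::Base -strict;
use Mojo::DOM;

# Hypothetical stand-in for the real page, just to show the joining step
my $dom = Mojo::DOM->new('<div><h2>You are here</h2><p>Home</p></div>');

# Collect the text of each matching element, then join with a space
my $text = $dom->find('h2, p')->map('all_text')->join(' ');

say $text;    # You are here Home
```

This only separates text at the boundaries of the selected elements, so anything outside 'h2, p' is dropped rather than joined.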

I am starting to love Mojo...

Replies are listed 'Best First'.
Re^3: getting text from HTML (updated x2)
by haukex (Bishop) on May 04, 2020 at 09:46 UTC

    Looking at the code of Mojo::DOM, it doesn't look like it's directly supported. But luckily it's not too difficult to add (you can of course put the package into its own .pm file):

    Update: I've modified the methods so that they return a nested set of Mojo::Collection objects of the callback results, so that walk is kind of like a tree-based map.

    Update 2: For an even more refined version, see here.

    use Mojo::Base -strict;
    use 5.014;
    use Mojo::UserAgent;
    use Mojo::DOM;
    use Mojo::Util qw/dumper/;

    package Mojo::DOM::Role::TreeWalker {
        use Mojo::Base -strict;
        use Role::Tiny;
        use Mojo::Collection;

        # Walk the tree from this node; always returns a Mojo::Collection
        sub walk { $_[0]->_walk($_[1], 0) || Mojo::Collection->new }

        # Call the callback on this node, then recurse into its children;
        # child results are nested in their own collection
        sub _walk {
            my ($self, $cb, $depth) = @_;
            my $c = Mojo::Collection->new;
            { local $_ = $self; push @$c, $cb->($self, $depth++); }
            my $rv = $self->child_nodes->map('_walk', $cb, $depth);
            push @$c, $rv if @$rv;
            @$c ? $c : ();
        }

        # Like walk, but only call the callback on text-like nodes
        sub walk_text {
            my ($self, $cb) = @_;
            $self->walk(sub {
                $_->type eq 'cdata' || $_->type eq 'raw' || $_->type eq 'text'
                    ? $cb->(@_) : () });
        }
    }

    my $ua = Mojo::UserAgent->new( max_redirects => 3 );
    my $res = $ua->get('')->result;
    die $res->message unless $res->is_success;
    my $dom = $res->dom;
    $dom->find('script, style')->map('remove');

    # Keep only non-blank text nodes, trimmed, as one flat collection
    my $texts = $dom->with_roles('+TreeWalker')->walk_text(sub {
        $_->content=~/\S/ ? $_->content=~s/^\s+|\s+$//gr : () })->flatten;
    print dumper $texts;

    __END__

    bless( [
      "STARLINK MISSION | SpaceX",
      "Jump to navigation",
      "Falcon 9",
      "Falcon Heavy",
      "Dragon",
      "Starship",
      "Updates",
      "About SpaceX",
      "Careers",
      "Shop",
      "You are here",
      "Home",
      "STARLINK MISSION",
      "On Wednesday, April 22 at 3:30 p.m. EDT, or 19:30 p.m. UTC, SpaceX launched its seventh Starlink mission. Falcon 9 lifted off from Launch Complex 39A (LC-39A) at NASA\x{2019}s Kennedy Space Center in Florida.",
      "Falcon 9\x{2019}s first stage previously supported Crew Dragon\x{2019}s first flight to the International Space Station, launch of the RADARSAT Constellation Mission, and the fourth Starlink mission. Following stage separation, SpaceX landed Falcon 9\x{2019}s first stage on the \x{201c}Of Course I Still Love You\x{201d} droneship, which was stationed in the Atlantic Ocean. Falcon 9\x{2019}s fairing previously supported the AMOS-17 mission.",
      "You can watch a replay of the launch below and learn more about the mission",
      "here.",
      "|",
      "Twitter",
      "YouTube",
      "Flickr",
      "Instagram",
      "Privacy",
      "\x{a9} 2020 Space Exploration Technologies Corp."
    ], 'Mojo::Collection' )
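If the nesting and the depth argument are not needed, a flat list of trimmed text pieces can probably also be had without a role, using Mojo::DOM's descendant_nodes. This is only a sketch of that swapped-in approach, not the code from the post above, and the inline HTML is a made-up stand-in for the fetched page.

```perl
use Mojo::Base -strict;
use Mojo::DOM;

# Hypothetical stand-in for the real page
my $dom = Mojo::DOM->new(
    '<div><p>You are here</p><a href="/">Home</a> <script>var x = 1;</script></div>'
);
$dom->find('script, style')->map('remove');

# All text-like descendant nodes in document order, blank ones dropped, trimmed
my $texts = $dom->descendant_nodes
    ->grep(sub { $_->type eq 'text' || $_->type eq 'cdata' || $_->type eq 'raw' })
    ->map('content')
    ->grep(qr/\S/)
    ->map(sub { s/^\s+|\s+$//gr });

say $texts->join(' ');
```

Joining the resulting collection with a space gives the separated text asked about earlier in the thread; the role-based walk remains the better fit when the callback needs the depth or the tree shape.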
