in reply to getting text from HTML

I believe what you're seeing is the concept of the Document Object Model, where basically "text nodes" are anything that's not an element, including everything between <script> tags etc. One easy workaround is to clobber all the tags you don't want:

use Mojo::Base -strict;
use open qw/:std :utf8/;
use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new( max_redirects => 3 );
my $res = $ua->get('http://www.spacex.com/webcast')->result;
die $res->message unless $res->is_success;
my $dom = $res->dom;
$dom->find('script, style')->map('remove');
my $text = $dom->at('body')->all_text;
1 while $text =~ s/\s{2,}/ /g;
say $text;

__END__

Jump to navigation Falcon 9 Falcon Heavy Dragon Starship Updates About SpaceX Careers Shop You are hereHome STARLINK MISSION On Wednesday, April 22 at 3:30 p.m. EDT, or 19:30 p.m. UTC, SpaceX launched its seventh Starlink mission. Falcon 9 lifted off from Launch Complex 39A (LC-39A) at NASA’s Kennedy Space Center in Florida.Falcon 9’s first stage previously supported Crew Dragon’s first flight to the International Space Station, launch of the RADARSAT Constellation Mission, and the fourth Starlink mission. Following stage separation, SpaceX landed Falcon 9’s first stage on the “Of Course I Still Love You” droneship, which was stationed in the Atlantic Ocean. Falcon 9’s fairing previously supported the AMOS-17 mission. You can watch a replay of the launch below and learn more about the mission here. | Twitter YouTube Flickr Instagram Privacy © 2020 Space Exploration Technologies Corp.
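
To see which node types the parser produces, a small offline sketch may help. The HTML string is made up for illustration and is not the real page; descendant_nodes, type and content are regular Mojo::DOM methods. As far as I can tell, Mojo::DOM reports the contents of <script> and <style> as nodes of type "raw".

```perl
use Mojo::Base -strict;
use Mojo::DOM;

# Made-up sample document (an assumption for illustration only).
my $html = '<body><p>Hello</p><script>var x = 1;</script>'
         . '<style>p { color: red }</style></body>';
my $dom = Mojo::DOM->new($html);

# Collect every non-element node and print its type and content.
my @nodes = $dom->descendant_nodes->grep(sub { $_->type ne 'tag' })->each;
say sprintf '%-5s %s', $_->type, $_->content for @nodes;

# Same workaround as above: remove the unwanted elements, then extract text.
$dom->find('script, style')->map('remove');
my $text = $dom->at('body')->all_text;
say $text;
```

Running this offline makes it easy to check what a given Mojolicious version treats as text before pointing the script at a live page.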

Re^2: getting text from HTML
by IB2017 (Pilgrim) on May 04, 2020 at 08:43 UTC

    I am experimenting further with your great solution, but I have an issue with the text being concatenated. Is there a way to separate the text snippets that the script extracts from the different sections of the page, say with a single space? In the first line of the result you can see "You are hereHome", which should be separated. I can't see any option for this in my $text = $dom->all_text; (apart from the trim argument, all_text(0), which does not apply here).

    Of course I can go with something like

    $text = $res->dom('h1, h2, h3, p')->each(sub { say 'text: ', shift->all_text });

    I am starting to love Mojo...
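
A per-element approach like this can also produce a single space-separated string. The following is only a sketch on made-up markup (the HTML string is an assumption, not the real page); it uses the Mojo::Collection methods map and join.

```perl
use Mojo::Base -strict;
use Mojo::DOM;

# Made-up sample markup that reproduces the "You are hereHome" effect.
my $dom = Mojo::DOM->new(
    '<div><h2>You are here</h2><p>Home</p><h1>STARLINK MISSION</h1></div>');

# Text of each matching element, in document order, joined by one space.
my $text = $dom->find('h1, h2, h3, p')->map('all_text')->join(' ')->to_string;
say $text;
```

Two caveats: text outside the selected elements is dropped, and if one matched element is nested inside another, its text appears twice.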

      Looking at the code of Mojo::DOM, it doesn't look like it's directly supported. But luckily it's not too difficult to add (you can of course put the package into its own .pm file):

      Update: I've modified the methods so that they return a nested set of Mojo::Collection objects of the callback results, so that walk is kind of like a tree-based map.

      Update 2: For an even more refined version, see here.

      use Mojo::Base -strict;
      use 5.014;
      use Mojo::UserAgent;
      use Mojo::DOM;
      use Mojo::Util qw/dumper/;

      package Mojo::DOM::Role::TreeWalker {
          use Mojo::Base -strict;
          use Role::Tiny;
          use Mojo::Collection;
          sub walk { $_[0]->_walk($_[1], 0) || Mojo::Collection->new }
          sub _walk {
              my ($self, $cb, $depth) = @_;
              my $c = Mojo::Collection->new;
              { local $_ = $self; push @$c, $cb->($self, $depth++); }
              my $rv = $self->child_nodes->map('_walk', $cb, $depth);
              push @$c, $rv if @$rv;
              @$c ? $c : ();
          }
          sub walk_text {
              my ($self, $cb) = @_;
              $self->walk(sub {
                  $_->type eq 'cdata' || $_->type eq 'raw' || $_->type eq 'text'
                      ? $cb->(@_) : () });
          }
      }

      my $ua = Mojo::UserAgent->new( max_redirects => 3 );
      my $res = $ua->get('http://www.spacex.com/webcast')->result;
      die $res->message unless $res->is_success;
      my $dom = $res->dom;
      $dom->find('script, style')->map('remove');
      my $texts = $dom->with_roles('+TreeWalker')->walk_text(sub {
          $_->content=~/\S/ ? $_->content=~s/^\s+|\s+$//gr : () })->flatten;
      print dumper $texts;

      __END__

      bless( [
        "STARLINK MISSION | SpaceX",
        "Jump to navigation",
        "Falcon 9",
        "Falcon Heavy",
        "Dragon",
        "Starship",
        "Updates",
        "About SpaceX",
        "Careers",
        "Shop",
        "You are here",
        "Home",
        "STARLINK MISSION",
        "On Wednesday, April 22 at 3:30 p.m. EDT, or 19:30 p.m. UTC, SpaceX launched its seventh Starlink mission. Falcon 9 lifted off from Launch Complex 39A (LC-39A) at NASA\x{2019}s Kennedy Space Center in Florida.",
        "Falcon 9\x{2019}s first stage previously supported Crew Dragon\x{2019}s first flight to the International Space Station, launch of the RADARSAT Constellation Mission, and the fourth Starlink mission. Following stage separation, SpaceX landed Falcon 9\x{2019}s first stage on the \x{201c}Of Course I Still Love You\x{201d} droneship, which was stationed in the Atlantic Ocean. Falcon 9\x{2019}s fairing previously supported the AMOS-17 mission.",
        "You can watch a replay of the launch below and learn more about the mission",
        "here.",
        "|",
        "Twitter",
        "YouTube",
        "Flickr",
        "Instagram",
        "Privacy",
        "\x{a9} 2020 Space Exploration Technologies Corp."
      ], 'Mojo::Collection' )
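
If the depth information from walk is not needed, a similar flat list can probably be had from the built-in descendant_nodes method. This is a sketch of that alternative, not the role shown above, and the sample markup is made up for illustration.

```perl
use Mojo::Base -strict;
use 5.014;    # for the /r modifier
use Mojo::DOM;

# Made-up sample markup (an assumption, not the real page).
my $dom = Mojo::DOM->new(
    '<body><h2> You are here</h2><a>Home</a> <script>var x;</script>'
  . '<p>Falcon <b>9</b></p></body>');
$dom->find('script, style')->map('remove');

# Keep text-like nodes that contain non-whitespace, then trim each one.
my $texts = $dom->descendant_nodes
    ->grep(sub { $_->type =~ /^(?:text|cdata|raw)$/ && $_->content =~ /\S/ })
    ->map(sub { $_->content =~ s/^\s+|\s+$//gr });

# Join the snippets with a single space.
say $texts->join(' ');
```

Joining the collection with a space gives the separated text asked for earlier in the thread; the same join(' ') should also work on the flattened result of walk_text.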
Re^2: getting text from HTML
by IB2017 (Pilgrim) on May 03, 2020 at 22:58 UTC

    Very nice, thank you. I am also experimenting with Mojo::UserAgent, which seems more modern than the one I had been using all along.