html parsing

bigup401 has asked for the wisdom of the Perl Monks concerning the following question:

Re: html parsing by haukex (Archbishop) on Mar 22, 2017 at 14:30 UTC
I discussed some of the options for parsing HTML and gave some example code here: "Two classic modules are HTML::Parser and HTML::TreeBuilder, but there are several others, such as Mojo::DOM. If the input is always XHTML, there's XML::Twig and many more XML-based modules." If all you want to do is strip HTML tags, then this is a FAQ: How do I remove HTML from a string? Use HTML::Strip, or HTML::FormatText which not only removes HTML but also attempts to do a little simple formatting of the resulting plain text.
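To make the HTML::TreeBuilder suggestion concrete, here is a minimal sketch that pulls the title and the link targets out of a document. The sample markup and the variable names are made up for illustration, and it assumes HTML::TreeBuilder is installed from CPAN.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup, standing in for a fetched page.
my $html = <<'HTML';
<html><head><title>Example Page</title></head>
<body><p>See <a href="/one">one</a> and <a href="/two">two</a>.</p></body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# In scalar context look_down returns the first matching element (or undef).
my $title_el = $tree->look_down(_tag => 'title');
my $title    = $title_el ? $title_el->as_text : '';

# In list context it returns every matching element.
my @links = map { $_->attr('href') } $tree->look_down(_tag => 'a');

print "title: $title\n";
print "link:  $_\n" for @links;

# Release the tree explicitly once it is no longer needed.
$tree->delete;
```

The same tree can be queried repeatedly, so this approach tends to pay off when more than one piece of the page is needed.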
Re: html parsing by davido (Cardinal) on Mar 22, 2017 at 14:44 UTC
Mojolicious contains an excellent tool kit that includes Mojo::UserAgent and Mojo::DOM. Here's a one-liner: `perl -Mojo -E 'say g("perlmonks.org")->dom->at("title")->text'` Output: `PerlMonks - The Monastery Gates` Minimally, Mojolicious requires no additional external dependencies, just a relatively recent Perl. The distribution size is around 2MB once it's unpacked and installed, and it takes about a minute to install: `cpanm Mojolicious` ...or via your preferred module installation technique. Dave
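For use inside a script rather than on the command line, a rough equivalent might look like the sketch below. It parses a literal string instead of fetching a page, so the markup and variable names here are assumptions for illustration only; it needs Mojolicious installed.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical sample markup, standing in for a fetched page.
my $html = '<html><head><title>Example Page</title></head>'
         . '<body><p class="intro">Hello, <b>world</b>!</p></body></html>';

my $dom = Mojo::DOM->new($html);

# at() returns the first element matching a CSS selector, or undef.
my $title_el = $dom->at('title');
my $title    = $title_el ? $title_el->text : undef;

# all_text gathers text from the element and all of its descendants.
my $intro_el = $dom->at('p.intro');
my $intro    = $intro_el ? $intro_el->all_text : undef;

print "$title\n" if defined $title;
print "$intro\n" if defined $intro;
```

Checking the result of `at` before calling a method on it avoids a fatal error when the page lacks the element.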
Re^2: html parsing by Anonymous Monk on Mar 22, 2017 at 15:07 UTC
Mojo rocks.
Re: html parsing by Corion (Patriarch) on Mar 22, 2017 at 14:28 UTC
See for example App::scrape, which is basically a simple application of HTML::Selector::XPath to extract data from HTML. There are also many other general scrapers, like Web::Scraper and Web::Query. Most of them build on something like HTML::TreeBuilder.
Re: html parsing by marto (Cardinal) on Mar 22, 2017 at 14:32 UTC
I second the suggestion of Mojo::DOM; however, if you're trying to scrape Google search results, I suggest investigating their various APIs rather than parsing results.
Re: html parsing by hippo (Archbishop) on Mar 22, 2017 at 15:12 UTC
HTML::FormatText::Html2text is designed to do precisely that (as its name might suggest).
Re: html parsing by shmem (Chancellor) on Mar 22, 2017 at 15:11 UTC
`my $req = HTTP::Request->new(GET => 'https://www.google.com'); $req->content_type('application/json'); my $res = $ua->request($req);`
Please edit your post to include the initialization of `$ua`:
`use LWP; my $ua = LWP::UserAgent->new; my $req = HTTP::Request->new(GET => 'https://www.google.com'); $req->content_type('application/json'); my $res = $ua->request($req);`
Thank you. You could get the contents of the `<title>` tag just using a regular expression:
`my $title; $res->content =~ m|<title>(.+?)</title>|i and $title = $1;`
but see e.g. Re: Why this simple regex freeze my computer? for caveats.
perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
Re^2: html parsing by Anonymous Monk on Mar 22, 2017 at 15:25 UTC
"You could get the contents of the <title> tag just using a regular expression" Solutions that use regular expressions to parse HTML will never be voted higher than those that actually use a parser. Also your assignment is very low value because it explicitly uses $1 when you could have instead captured the value directly (and safer too).
Re^3: html parsing by shmem (Chancellor) on Mar 22, 2017 at 15:40 UTC
I'd downvote my answer, if I could, not only for the shameless plug.
"Also your assignment is very low value because it explicitly uses $1 when you could have instead captured the value directly (and safer too)."
Providing code for that end might significantly improve this subthread.
perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
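One way "capturing the value directly" could look is a match in list context, sketched below. This is an illustration only, not code from the thread: `$content` is a hypothetical stand-in for `$res->content`, and the `/s` modifier and `\s*` trimming are additions. It remains a regex, so the parser-based answers are still the sturdier choice.

```perl
use strict;
use warnings;

# Hypothetical sample content, standing in for $res->content.
my $content = "<html><head><TITLE>\nExample Page\n</TITLE></head><body></body></html>";

# In list context a match returns its captures, so $title is assigned only
# from this match and stays undef on failure; no stale $1 can leak in.
my ($title) = $content =~ m|<title>\s*(.+?)\s*</title>|is;

print defined $title ? "$title\n" : "no title found\n";
```

The `/s` modifier lets `.` match newlines, which covers titles that span more than one line.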
Re^4: html parsing by Anonymous Monk on Mar 22, 2017 at 15:51 UTC
Re^2: html parsing by bigup401 (Pilgrim) on Mar 22, 2017 at 19:53 UTC
Thanks guys, thanks shmem. Your idea has worked for me.