Remove html tags to obtain plain text

Mj1234 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Remove html tags to obtain plain text by hippo (Archbishop) on Jun 29, 2016 at 10:19 UTC
FAQ: How do I remove HTML from a string?
Re: Remove html tags to obtain plain text by marto (Cardinal) on Jun 29, 2016 at 10:59 UTC
If this really is your source data, and you wish to extract the text of each `td` in a table:

    use strict;
    use warnings;
    use feature 'say';
    use Mojo::DOM;

    my $string = '<style>table{border-collapse: collapse;margin-left: 1cm;font-Family: courier;width: 60%}.hoverTable tr{background: #D8D8D8;} .hoverTable tr:hover{background-color: #ffff99; }</style><table border=2 class="hoverTable">[20160628_151916] <tr><td bgcolor="#366092"><font color="White"> PLAIN TEXT TO BE EXTRACTED</td>';

    # Parse the fragment, then print the text of each child element of every td
    my $dom = Mojo::DOM->new( $string );
    $dom->find('tr td')->each( sub {
        my $td = shift;
        # Double quotes so that "\n" is a real newline, not a literal backslash-n
        say $td->children->map('text')->join("\n");
    });
Re: Remove html tags to obtain plain text by davies (Monsignor) on Jun 29, 2016 at 15:25 UTC
I hope you realise that your HTML is pretty seriously broken. If that's carelessness on your part, it's not a good sign. If it's typical of what you are likely to get in real life, I understand.

I found it easier to write my own parser than to work through the HTML::Parser docs, and I don't remember them defining how it works with broken HTML like yours. My parser, XML::Lenient, is specifically intended to cope. Two ways of extracting your text are shown in the code below:

    use Modern::Perl;
    use XML::Lenient;

    my $p = XML::Lenient->new();
    my $string = '<style>table{border-collapse: collapse;margin-left: 1cm;font-Family: courier;width: 60%}.hoverTable tr{background: #D8D8D8;} .hoverTable tr:hover{background-color: #ffff99; }</style><table border=2 class="hoverTable">[20160628_151916] <tr><td bgcolor="#366092"><font color="White"> PLAIN TEXT TO BE EXTRACTED</td>';

    say $p->innertext($p->within($string, 'td'));
    say $p->wpath($string, 'td/font');

As you don't tell us why you want to use HTML::Parser, I have no idea whether my module would be better for you, though. And remember that I'm biased, like any parent.

Regards,
John Davies
Re^2: Remove html tags to obtain plain text by Mj1234 (Sexton) on Jun 30, 2016 at 05:19 UTC
This is just part of the complete HTML text. I am trying to use HTML::Parser as I am unable to use HTML::Strip.
Re^3: Remove html tags to obtain plain text by Anonymous Monk on Jun 30, 2016 at 06:41 UTC
"I am trying to use HTML::Parser as I am unable to use HTML::Strip."

Then use one of the other "html strip" modules.
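One possible alternative, sketched here as an assumption rather than a recommendation from this thread, is HTML::TreeBuilder from the HTML-Tree distribution. Its `as_text` method returns the text nodes of the parsed tree and, per the HTML::Element documentation, leaves out text under `style` and `script` elements. The helper name `strip_html_tree` is made up for illustration, and the whitespace clean-up at the end is my own addition.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: build a tree from the markup and return its text.
sub strip_html_tree {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $text = $tree->as_text;    # text nodes only
    $tree->delete;                # release the tree explicitly

    # Collapse runs of whitespace and trim both ends.
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;
    return $text;
}

my $plain = strip_html_tree('<p>Hello <b>world</b></p>');
print "$plain\n";
```

Whether this copes with markup as broken as the sample in this thread is something to check against your real data.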
Re: Remove html tags to obtain plain text by Anonymous Monk on Jun 29, 2016 at 10:03 UTC
"I have a string as shown below. I want to extract only the plain text and assign it to a scalar using HTML::Parser. How can this be done?"

Copy and paste from existing solutions that use HTML::Parser. What existing solutions, you say? Go fish.
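For anyone landing here later, a minimal sketch of the HTML::Parser approach asked about might look like the following. It assumes the version 3 API, registers a text handler that appends decoded text to a scalar, and uses `ignore_elements` to skip the contents of `style` and `script`. The function name `html_to_text` and the whitespace normalisation are illustrative choices, not something taken from this thread.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: return the text content of an HTML string as one scalar.
sub html_to_text {
    my ($html) = @_;
    my $text = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' passes the handler text with entities already decoded
        text_h => [ sub { $text .= shift }, 'dtext' ],
    );

    # Skip both the tags and the contents of these elements.
    $parser->ignore_elements(qw(style script));

    $parser->parse($html);
    $parser->eof;    # flush anything still buffered

    # Collapse runs of whitespace and trim both ends.
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;
    return $text;
}

my $plain = html_to_text(
    '<style>td { color: red }</style><table><tr><td><font color="White"> PLAIN TEXT TO BE EXTRACTED</td>'
);
print "$plain\n";
```

Note that this collects every piece of text outside the ignored elements, so stray text elsewhere in the markup (such as a timestamp sitting between `table` and `tr`) would be included too. Restricting the output to `td` contents would need start and end handlers that track whether the parser is currently inside a cell.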