Mj1234 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I have a string as shown below. I want to extract only the plain text and assign it to a scalar using HTML::Parser. How can this be done?

$string = '<style>table{border-collapse: collapse;margin-left: 1cm;font-Family: courier;width: 60%}.hoverTable tr{background: #D8D8D8;} .hoverTable tr:hover{background-color: #ffff99; }</style><table border=2 class="hoverTable">[20160628_151916] <tr><td bgcolor="#366092"><font color="White"> PLAIN TEXT TO BE EXTRACTED</td>';

Replies are listed 'Best First'.
Re: Remove html tags to obtain plain text
by hippo (Archbishop) on Jun 29, 2016 at 10:19 UTC
Re: Remove html tags to obtain plain text
by marto (Cardinal) on Jun 29, 2016 at 10:59 UTC

    If this really is your source data, and you wish to extract each td in a table:

    use strict;
    use warnings;
    use feature 'say';
    use Mojo::DOM;

    my $string = '<style>table{border-collapse: collapse;margin-left: 1cm;font-Family: courier;width: 60%}.hoverTable tr{background: #D8D8D8;} .hoverTable tr:hover{background-color: #ffff99; }</style><table border=2 class="hoverTable">[20160628_151916] <tr><td bgcolor="#366092"><font color="White"> PLAIN TEXT TO BE EXTRACTED</td>';

    my $dom = Mojo::DOM->new( $string );

    # For every td inside a tr, print the text of its child elements
    # (here the font element), one per line.
    $dom->find('tr td')->each( sub {
        my $td = shift;
        say $td->children->map('text')->join("\n");
    });
Re: Remove html tags to obtain plain text
by davies (Monsignor) on Jun 29, 2016 at 15:25 UTC

    I hope you realise that your HTML is pretty seriously broken. If that's carelessness on your part, it's not a good sign. If it's typical of what you are likely to get in real life, I understand.

    I found it easier to write my own parser than to work through the HTML::Parser docs & I don't remember them defining how it works with broken HTML like yours. My parser, XML::Lenient, is specifically intended to cope. Two ways of extracting your text are shown in the code below:

    use Modern::Perl;
    use XML::Lenient;

    my $p = XML::Lenient->new();
    my $string = '<style>table{border-collapse: collapse;margin-left: 1cm;font-Family: courier;width: 60%}.hoverTable tr{background: #D8D8D8;} .hoverTable tr:hover{background-color: #ffff99; }</style><table border=2 class="hoverTable">[20160628_151916] <tr><td bgcolor="#366092"><font color="White"> PLAIN TEXT TO BE EXTRACTED</td>';

    say $p->innertext($p->within($string, 'td'));
    say $p->wpath($string, 'td/font');

    As you don't tell us why you want to use HTML::Parser, I have no idea whether my module would be better for you, though. And remember that I'm biased, like any parent.

    Regards,

    John Davies

      This is just part of the complete HTML text. I am trying to use HTML::Parser as I am unable to use HTML::Strip.

        I am trying to use HTML::Parser as I am unable to use HTML::Strip.

        Then use one of the other "html strip" modules.

Re: Remove html tags to obtain plain text
by Anonymous Monk on Jun 29, 2016 at 10:03 UTC

    I have a string as shown below. I want to extract only the plain text and assign it to a scalar using HTML::Parser. How can this be done?

    Copy and paste from existing solutions that use HTML::Parser. What existing solutions, you say? Go fish.
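
    For readers who land here looking for the HTML::Parser approach itself, here is a minimal sketch of one common pattern: register a text handler that accumulates decoded text, and ask the parser to skip style and script elements. The helper name html_to_text and the whitespace clean-up at the end are assumptions made for illustration, not something from the original question. Note that this collects all text outside the ignored elements, so with the string from the question the bracketed timestamp before the row would be included as well.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: returns the text content of an HTML fragment,
# skipping anything inside <style> and <script>.
sub html_to_text {
    my ($html) = @_;
    my $text = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' passes the text with HTML entities already decoded
        text_h => [ sub { $text .= $_[0] }, 'dtext' ],
    );

    # Suppress all events between these start and end tags
    $parser->ignore_elements(qw(style script));

    $parser->parse($html);
    $parser->eof;

    # Assumed clean-up: collapse whitespace runs and trim both ends
    $text =~ s/\s+/ /g;
    $text =~ s/^ //;
    $text =~ s/ $//;

    return $text;
}

my $string = '<style>table{border-collapse: collapse;}</style>'
           . '<table border=2 class="hoverTable">[20160628_151916] '
           . '<tr><td bgcolor="#366092"><font color="White"> '
           . 'PLAIN TEXT TO BE EXTRACTED</td>';

my $plain = html_to_text($string);
print "$plain\n";
```

    If only the cell contents are wanted, one option is to add start and end handlers that set a flag while inside a td and have the text handler append only when that flag is set.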