reading... html to produce... html

I prefer to avoid jiggering about with the encoding too. In my experience it always ends in tears. :-)

Here's what I do (I'm using HTML::TokeParser::Simple in this case).

#!/usr/bin/perl

use strict;
use warnings;
use lib 'lib';
use MyParser;

# Slurp the sample document from the __DATA__ section below.
my $html = do { local $/; <DATA> };

# Parse the in-memory string rather than a file on disk.
my $p = MyParser->new(string => $html)
  or die "can't parse: $!\n";

my $txt = $p->get_title;
print "$txt\n";

$p->get_tag('p');
$txt = $p->get_txt('p'); # up to the closing p tag
print "*$txt*\n";

__DATA__
<html>
<head>
<title>egrave: &egrave; : eacute : &eacute; : rsquo: &rsquo; : lsquo: &lsquo;</title>
</head>
<body>

<p>
  one 
    <span class="second">two</span> 
    three 
    <br>
    four 
five six

</p>
</body>
</html>

MyParser.pm

package MyParser;

use strict;
use warnings;
use HTML::TokeParser::Simple;

use base qw(HTML::TokeParser::Simple);

sub get_title{
  my ($self) = @_;
  $self->get_tag('title') or return;
  $self->get_txt('title');
}

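sub get_txt{
  my ($self, $tag) = @_;
  my $txt = ''; # start empty so the substitutions below never see undef
  while (my $t = $self->get_token){
    last if $t->is_end_tag($tag);
    next if $t->is_start_tag or $t->is_end_tag; # skip nested tags
    $txt .= $t->as_is if $t->is_text; # raw text, entities left untouched
  }
  for ($txt){
    s/\n/ /g;   # newlines to spaces
    s/^\s+//;   # trim leading whitespace
    s/\s+$//;   # trim trailing whitespace
    s/\s+/ /g;  # collapse runs of whitespace
  }
  return $txt;
}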

1;

If, like me, you often need the title, you can add a method to the wrapper for that, as I've done here. I've also tried to emulate HTML::TokeParser's get_trimmed_text.

It skips any tags found before the required end tag (although this won't apply to titles).
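
The whitespace handling in get_txt can be tried on its own, without the parser. This is only a minimal sketch: normalise_ws is a hypothetical helper name (it is not part of MyParser or HTML::TokeParser::Simple), and the input string is my assumption of what the text tokens of the sample paragraph look like once the tags are dropped.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: applies the same
# substitutions as get_txt to an ordinary string.
sub normalise_ws {
    my ($txt) = @_;
    return '' unless defined $txt;   # nothing collected: return empty string
    for ($txt) {
        s/\n/ /g;     # newlines become spaces
        s/^\s+//;     # strip leading whitespace
        s/\s+$//;     # strip trailing whitespace
        s/\s+/ /g;    # collapse internal runs to a single space
    }
    return $txt;
}

# Assumed input: the text of the sample <p> with its tags removed.
my $raw = "\n  one \n    two \n    three \n    \n    four \nfive six\n\n";
print '*', normalise_ws($raw), "*\n";   # prints *one two three four five six*
```

Only whitespace is touched, so an entity such as &rsquo; comes out exactly as it went in.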

I'd be interested in hearing comments from monks if any glaring faux pas have been committed. :-)

In reply to Re^3: HTML::TokeParser, get_text scrambling rsquo and lsquo by wfsp
in thread HTML::TokeParser, get_text scrambling rsquo and lsquo by tridral
