Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks.

I'm trying to take a webpage, and encode the entities which show up only in the <body>...</body> portion of the webpage, and ONLY those that are in the "high" entity region. I don't want to encode brackets like < and >, and I don't want to make the final output of the page broken when viewed in a browser or web validator. Basically I want to encode all the umlauts and other "foreign language" entities. I've looked at HTML::Entities, but I'm not sure how to only process those in the body, and those of a specific asciibetical value.

What I've come up with so far is this:

s/([\200-\377])/sprintf "&#%d;", ord $1/ge;

Is there a better way?

Replies are listed 'Best First'.
Re: Encoding entities ONLY in the <body></body> of a webpage
by valdez (Monsignor) on Jun 14, 2003 at 14:04 UTC

    You could also use HTML::TokeParser::Simple by Ovid. Given your requirements, here is a little program that recognizes the body and encodes text but not tags:

    #!/usr/bin/perl
    use HTML::TokeParser::Simple;
    use HTML::Entities;
    use strict;
    use warnings;
    use vars qw($filename $parser $in_body);

    die "usage: $0 <filename>\n" unless $filename = shift @ARGV;

    $parser  = HTML::TokeParser::Simple->new( $filename );
    $in_body = 0;

    while ( my $token = $parser->get_token ) {
        if ($in_body) {
            # we are inside BODY
            if ($token->is_text) {
                # it's text: convert only the high-range characters, so that
                # existing entities such as &amp; are not encoded a second time
                print HTML::Entities::encode_entities($token->as_is, "\200-\377");
            }
            else {
                if ($token->is_end_tag( 'body' )) {
                    # we've found the end of the BODY
                    $in_body = 0;
                }
                print $token->as_is;
            }
        }
        else {
            if ($token->is_start_tag( 'body' )) {
                # we've found the beginning of the BODY
                $in_body = 1;
            }
            print $token->as_is;
        }
    }
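
To illustrate why the set of characters passed to encode_entities matters here, this is a minimal sketch comparing the default call with one restricted to the high range. The sample string is made up for illustration and is assumed to hold Latin-1 bytes, not UTF-8.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Hypothetical sample text as Latin-1 bytes; it already contains
# the entity "&amp;", as text taken from a real page often does.
my $text = "Gr\xFC\xDFe &amp; K\xFCsse";

# Default call: besides the high-range characters it also encodes
# characters such as & < > and ", so the existing "&amp;" gets
# encoded a second time.
my $all = encode_entities($text);

# Second argument restricts encoding to the given character set,
# here only the bytes 0x80 to 0xFF.
my $high = encode_entities($text, "\200-\377");

print "default:    $all\n";
print "restricted: $high\n";
```

With the restricted set the existing &amp; is left alone, which is closer to what the original question asks for. Note that encode_entities prefers named entities (such as &uuml;) where it knows one, while the sprintf approach from the question always emits numeric ones.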

    Ciao, Valerio

Re: Encoding entities ONLY in the <body></body> of a webpage
by little (Curate) on Jun 14, 2003 at 13:07 UTC

    You might look into HTML::Parser and HTML::TokeParser, then use one or the other to extract only the content of the body element and process that further.
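
As a rough sketch of that suggestion using plain HTML::TokeParser, the following walks the token stream, tracks whether it is inside the body, and applies the substitution from the question to text tokens only. The helper name encode_body_text and the sample page are hypothetical, and the input is assumed to be Latin-1 bytes.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: returns the page with high-range bytes encoded
# as numeric entities, but only in text found inside <body>.
sub encode_body_text {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "cannot create parser";
    my ($out, $in_body) = ('', 0);

    while (my $token = $parser->get_token) {
        my $type = $token->[0];

        # The original source text sits at a different index
        # depending on the token type.
        my $raw = $type eq 'S'  ? $token->[4]
                : $type eq 'E'  ? $token->[2]
                : $type eq 'PI' ? $token->[2]
                :                 $token->[1];   # T, C and D tokens

        if ($type eq 'S' && $token->[1] eq 'body') {
            $in_body = 1;
        }
        elsif ($type eq 'E' && $token->[1] eq 'body') {
            $in_body = 0;
        }
        elsif ($type eq 'T' && $in_body && !$token->[2]) {
            # Ordinary text inside the body; the third field is true for
            # literal content such as <script> or <style>, which is skipped.
            $raw =~ s/([\200-\377])/sprintf "&#%d;", ord $1/ge;
        }
        $out .= $raw;
    }
    return $out;
}

# Hypothetical sample page, as Latin-1 bytes.
my $page = "<html><head><title>Caf\xE9</title></head>"
         . "<body><p class=\"x\">Gr\xFC\xDFe &amp; more</p></body></html>";

my $result = encode_body_text($page);
print "$result\n";
```

Tags, comments and everything outside the body are passed through unchanged, so the title in this sample keeps its raw byte.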

    On the other hand, I'd like to point out that the meta tags for keywords and description, as well as the title, can and usually will contain entities, or characters that should be replaced with entities to be displayed properly.
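
If the head should be covered as well, one possible shortcut is to run the substitution from the question over the whole page: the character class [\200-\377] cannot match <, >, & or quotes, since those are all ASCII, so the markup itself is not touched. This is only a sketch with a made-up sample page, and it assumes Latin-1 bytes and no high-range characters inside script or style blocks, where entities are not interpreted.

```perl
use strict;
use warnings;

# Hypothetical sample page, as Latin-1 bytes, with high-range
# characters in the title, a meta attribute and the body.
my $page = "<title>Caf\xE9</title>"
         . "<meta name=\"description\" content=\"Men\xFC\">"
         . "<body>Gr\xFC\xDFe</body>";

# Copy the page, then encode every byte from 0x80 to 0xFF as a
# numeric entity; ASCII markup characters never match the class.
(my $encoded = $page) =~ s/([\200-\377])/sprintf "&#%d;", ord $1/ge;

print "$encoded\n";
```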

    Have a nice day
    All decision is left to your taste