
Okay. I tried your sample with ActiveState's perl 5.6.1 and a Russian page I found through Google, and got garbage. I tried the same with perl 5.8.6 (cygwin) and also got garbage, but this time with a helpful warning message:

Parsing of undecoded UTF-8 will give garbage when decoding entities at /usr/lib/perl5/site_perl/5.8/cygwin/HTML/Parser.pm line 104.

For reference, the document I was using was http://www.ras.ru/about.aspx?_Language=ru.

Now, this does look like a character set issue. Namely, the document I had was encoded in UTF-8, but perl assumed it was encoded in ISO-8859-1 (Latin-1). So I modified the script to assume that the document was encoded in UTF-8:

#!perl -w

use HTML::TreeBuilder;
use diagnostics;
use strict;

my $root = HTML::TreeBuilder->new;
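# Read the file through the :utf8 layer so perl decodes it as UTF-8
open(my $fh, '<:utf8', 'sample_document.htm')
    or die "Cannot open sample_document.htm: $!";
while (<$fh>) { $root->parse($_); }
close($fh);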
$root->eof();

my @paras = $root->find_by_tag_name('p');

foreach my $h (@paras) {

        foreach my $item_r ($h->content_refs_list) {
                next if ref $$item_r;

                ### proprietary JavaScript/HTML inserted with substitution
        }

} # end foreach

print $root->as_HTML;

When I ran it, the HTML source that came out looked very different from what went in, but the two looked identical in a web browser. So this is the solution for UTF-8 documents.
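To see why the wrong assumption produces garbage, here is a minimal sketch using only the core Encode module. The two-character sample string is my own illustration, not taken from the page above. The same four bytes come out as two characters when decoded as UTF-8 and as four unrelated characters when decoded as ISO-8859-1.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Two Cyrillic letters, written as escapes so the example does not
# depend on the encoding of this source file (illustrative sample).
my $text  = "\x{41f}\x{440}";
my $bytes = encode( 'UTF-8', $text );    # 2 bytes per letter

# Decode the same bytes under both assumptions.
my $as_utf8   = decode( 'UTF-8',      $bytes );
my $as_latin1 = decode( 'iso-8859-1', $bytes );

printf "bytes:         %d\n", length $bytes;
printf "as UTF-8:      %d characters\n", length $as_utf8;
printf "as iso-8859-1: %d characters\n", length $as_latin1;
print "UTF-8 decode round-trips\n" if $as_utf8 eq $text;
```

Under the Latin-1 assumption every byte becomes its own character, which is the kind of garbage the parser was handed.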

But what about in general? After all, you can't assume that all incoming documents will be UTF-8. Well, in general you won't be working from the file system; you'll be pulling stuff via HTTP. The nice thing about that is that with HTTP you can usually determine the character set from the Content-Type header.
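As a rough sketch of that idea, the charset parameter can be pulled out of a Content-Type value with a regular expression. The helper name charset_from_content_type and the sample header values are made up for illustration; real headers can be messier than this pattern allows for.

```perl
use strict;
use warnings;

# Hypothetical helper: return the charset named in a Content-Type
# value, lowercased, or nothing if the header does not name one.
sub charset_from_content_type {
    my ($content_type) = @_;
    return unless defined $content_type;
    return lc $1
      if $content_type =~ /;\s*charset\s*=\s*["']?([^\s"';]+)/i;
    return;
}

# Made-up sample header values
for my $header ( 'text/html; charset=utf-8',
                 'text/html;charset="KOI8-R"',
                 'text/html' )
{
    my $charset = charset_from_content_type($header);
    printf "%-28s => %s\n", $header,
      defined $charset ? $charset : '(none given)';
}
```

When the header names no charset, something else has to supply it, for example a meta http-equiv tag inside the document.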

After working on it a bit, I have the following, which works in general but requires perl 5.8. It should be easy to rework into an HTTP proxy using the HTTP::Proxy module (it looks like more code than it really is, since perltidy tends to put in excessive spaces):

use LWP::UserAgent;
use HTML::Parser;
use HTML::TreeBuilder;
use Encode;

use strict;

my $ua = LWP::UserAgent->new;
$ua->timeout(10);
$ua->env_proxy;

my $charset = undef;

sub set_charset_from_content_type {
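    # Accept an optional space after ';', optional quotes, and any case
    if ( defined $_[0] and $_[0] =~ /;\s*charset\s*=\s*["']?([^\s"';]+)/i ) {
        $charset ||= $1;
    }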
}

# This parser is active only until we get the charset
my $mini_parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [
        sub {
              $_[0] eq 'meta'
              and $_[1]->{'http-equiv'}
              and lc( $_[1]->{'http-equiv'} ) eq 'content-type'
              and set_charset_from_content_type( $_[1]->{'content'} );
        },
        "tagname, attr"
    ],
    end_h => [
        sub {
            $_[0] eq 'head'
              and do { $charset ||= "iso-8859-1" }
        },
        "tagname"
    ]
);
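# This doesn't do what you think it does: it does not decode anything.
# It tells the parser that its input is raw, undecoded UTF-8 bytes, so
# that expanded entities come out as UTF-8 bytes too; see the
# HTML::Parser documentation.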
$mini_parser->utf8_mode(1);

my $root = HTML::TreeBuilder->new;

my $isfirst          = 1;
my $unencoded_buffer = '';
my $result           = '';

sub process_lwp_response {
    my ( $chunk, $resp_object ) = @_;
    $unencoded_buffer .= $chunk;
    if ( !$charset ) {
        if ($isfirst) {
            $isfirst = 0;
            set_charset_from_content_type(
                $resp_object->header('Content-Type') );
        }
        $mini_parser->parse($chunk);
    }
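    if ($charset) {
        $mini_parser = undef;

        # With FB_QUIET, decode() removes the bytes it decoded from
        # $unencoded_buffer and leaves any incomplete trailing sequence
        # there for the next chunk.
        $root->parse(
            decode( $charset, $unencoded_buffer, Encode::FB_QUIET ) );
    }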
}

my $targeturl = 'http://www.ras.ru/about.aspx?_Language=ru';
#   $targeturl = shift;

my $response =
  $ua->get( $targeturl, ':content_cb' => \&process_lwp_response );

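if ( $response->is_success ) {

    # If no charset ever turned up (no header, no <meta>, no </head>),
    # fall back to iso-8859-1 so the buffered content still gets parsed.
    if ( !$charset ) {
        $charset = 'iso-8859-1';
        $root->parse(
            decode( $charset, $unencoded_buffer, Encode::FB_QUIET ) );
    }
    $root->eof();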

    # original code
    my @paras = $root->find_by_tag_name('p');

    foreach my $h (@paras) {

        foreach my $item_r ( $h->content_refs_list ) {
            next if ref $$item_r;

            ### proprietary JavaScript/HTML inserted with substitution
        }

    }    # end foreach
    print $root->as_HTML;
}
else {
    die $response->status_line;
}

Update: To work completely properly, this really needs the HTML::Parser patch I mention below. However, that's an HTML::Parser bug; this code would be fine if HTML::Parser behaved better in utf-8 environments.
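The buffering in process_lwp_response above relies on decode() with Encode::FB_QUIET consuming only the bytes it could decode. Here is a small self-contained sketch of that behaviour; the sample text and the chunk boundary are my own, chosen so that a two-byte character is split across two chunks.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Six Cyrillic letters as escapes (illustrative sample text)
my $text  = "\x{41f}\x{440}\x{438}\x{432}\x{435}\x{442}";
my $bytes = encode( 'UTF-8', $text );    # 2 bytes per letter

# Pretend the network split the data in the middle of the second letter
my @chunks = ( substr( $bytes, 0, 3 ), substr( $bytes, 3 ) );

my $buffer  = '';
my $decoded = '';
for my $chunk (@chunks) {
    $buffer .= $chunk;

    # Decode what is complete; FB_QUIET leaves the remainder in $buffer
    $decoded .= decode( 'UTF-8', $buffer, Encode::FB_QUIET );
    printf "%d undecoded byte(s) held back\n", length $buffer;
}

print $decoded eq $text ? "all text recovered\n" : "text was damaged\n";
```

Decoding each chunk on its own, without carrying the leftover bytes forward, would damage any character that straddles a chunk boundary.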

@/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/;
map{y/X_/\n /;print}map{pop@$_}@/for@/

In reply to Re^3: Parsing HTML and Inserting JavaScript/HTML into Documents by fizbin
in thread Parsing HTML and Inserting JavaScript/HTML into Documents by hackdaddy
