Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

In response to node 499980, I've written up this example of how to use HTTP::Proxy to mangle html on the fly, while paying attention to character set issues. Character set issues in html are an annoying pain, and it is quite possible I haven't accounted for absolutely everything yet. However, I've accounted for enough to successfully handle all the character set issues different translations of http://www.debian.org/, which by the way is a wonderful example set of non-utf8, non-latin1 encoded pages.

It seems to me that answers of the form "use HTTP::Proxy" happen occasionally around here and it would be a good idea if this example were somewhere in CUFP.

The example mangling is to place a paragraph sign (¶) after each (non-empty) paragraph that links to the start of that paragraph, but this is easily customizeable. In fact almost all of the code deals with character set issues; the mangling is just a tiny bit at the end. This code also shows an example of using an HTML::TreeBuilder in the mangling process, which is a bit tricky since it requires that the entire document be read first.

First, a bit of explanation: this code contains several package statements and it might be a bit confusing at first what each package is doing there. However, this bit of code from the main package should help to clarify things:

$proxy->push_filter( mime => 'text/html', response => utf8HeaderFilter->new(), response => FromLocalFilter->new('iso-8859-1'), response => RemoveContentTypeMeta->new(), response => HTMLNormalize->new(), response => HTTP::Proxy::BodyFilter::htmlparser->new($parser, + rw => 1), response => ToUtf8Filter->new() );
In order, those filters:
  • modify the headers of the response to say that the body is coming out as utf-8
  • Determine the input character set, and decode the incoming data into perl's unicode strings. (with a default character set of iso-8859-1)
  • Strip any <meta> tag in the output that may give the content type and character set. (Not necessary, since the headers override this, but nice if, e.g., one were to be saving the proxied files to disk)
  • Normalize the html so that each paragraph ends properly with a </p>
  • Use an HTML::Parser object to mangle the html to insert the ¶-links.
  • Encode the response body properly as utf-8.

So hoping that that teaser has whetted your appetite, here's the code itself:

#!perl # This package determines the local character set, and # translates everything into wide perl strings. package FromLocalFilter; use base qw( HTTP::Proxy::BodyFilter ); use HTML::Parser; use Encode; use strict; use warnings; # Build up a list of characters that need to get # substituted to get around HTML::Parser bug w.r.t. # utf-8 characters containing 0xA0 # See http://rt.cpan.org/NoAuth/Bug.html?id=15068 my $a0chars = ""; for (0x80..0xd7FF,0xE000..0xFDCF,0xFE00..0xFFFD) { my $c = chr($_); $a0chars .= $c if encode(q[utf-8],$c) =~ /\xA0/; } our $a0regexp = qr/([$a0chars])/; # End HTML::Parser bug workaround; the only other # line relating to working around this bug is below where # I do an s/// substitution based on $a0regexp sub set_charset { my ($self,$cset) = @_; ### Debugging: ### print "Charset: $1\n"; return unless $cset; $cset = Encode::resolve_alias($cset) || lc($cset); $cset =~ s/\s/-/g; if (! $self->{charset} ) { $self->{charset} = $cset; $self->{charseto} = find_encoding($cset); die "Unsupported character set: $cset" unless $self->{char +seto}; } elsif ($self->{charset} ne $cset) { warn join(" ", "Charset $cset attempted, but charset", $self->{charset}, 'already active for', $self->{current_uri}); } } sub set_charset_from_content_type { if ($_[1] and $_[1] =~ /.*; *charset=(\S+)/) { $_[0]->set_charset($1); } } sub default_charset { my $self = shift; if (!$self->{charset}) { my $default_charset = $self->{default_charset}; $self->set_charset($default_charset); } } sub init { my ($self, $default_charset) = @_; $self->{default_charset} = $default_charset; } sub get_parser { my ($self) = @_; my $mini_parser = HTML::Parser->new(api_version => 3, start_h => [ sub { $_[0] eq 'meta' and $_[1]->{'http-equiv'} and lc($_[1]->{'http-equiv'}) eq 'content-type' and $self->set_charset_from_content_type($_[1]->{'co +ntent'}) }, "tagname, attr" ], end_h => [ sub { if ($_[0] eq 'head') { $self->default_charset; $_[1]->eof; # abort parsing $self->free_parser; } }, "tagname,self" ], end_document_h => [ sub { $self->default_charset; $self->free_parser; }, ""], default_h => [""] ); $mini_parser->utf8_mode(1); $mini_parser; } sub free_parser { my ($self) = @_; $self->{mini_parser} = undef; } sub begin { my ( $self, $message) = @_; $self->{charset} = ''; $self->{current_uri} = $message->request->uri; if ( $message->header('Content-Type') ) { $self->set_charset_from_content_type($message->header('Content +-Type')); } $self->{mini_parser} = $self->get_parser(); $self->{mini_parser_dist} = 0; } sub filter { my ( $self, $dataref, $message, $protocol, $buffer ) = @_; if ($self->{mini_parser}) { my $pstr = substr($$dataref, $self->{mini_parser_dist}); $self->{mini_parser}->parse($pstr); } if ( !defined($buffer) ) { if ($self->{mini_parser}) {$self->{mini_parser}->eof;} my $d=''; $buffer = \$d; } $$buffer = $$dataref; $$dataref = ''; if ($self->{charset}) { my $decoded = decode($self->{charseto}, $$buffer, Encode::FB_Q +UIET); # HTML::Parser bug workaround: (http://rt.cpan.org/NoAuth/Bug. +html?id=15068) $decoded =~ s/$a0regexp/sprintf('&#x%02X;',ord($1))/ge; $$dataref = $decoded; } $self->{mini_parser_dist} = length ($$buffer); } # Implements Appendix F.1 (Detection Without External Encoding Informa +tion) # of the XML Recommendation (http://www.w3.org/TR/2004/REC-xml-2004020 +4/#sec-guessing) # # As a plus side, rewrites an initial xml declaration (or adds a new o +ne) # so that the named character set is utf-8 (which is accurate given th +e other # transformations we want to apply) package FromLocalXmlFilter; use base qw( FromLocalFilter ); use Encode; use strict; use warnings; # regexp, charset, amt. to strip, provisional charset my $enctable = [ [qr/^ \x00 \x00 \xFE \xFF/x, 'UCS-4BE', 4, ''], [qr/^ \xFF \xFE \x00 \x00/x, 'UCS-4LE', 4, ''], [qr/^ \x00 \x00 \xFF \xFE/x, 'UCS-42143', 4, ''], [qr/^ \xFE \xFF \x00 \x00/x, 'UCS-43412', 4, ''], [qr/^ \xFE \xFF/x, 'UTF-16BE', 2, ''], [qr/^ \xFF \xFE/x, 'UTF-16LE', 2, ''], [qr/^ \xEF \xBB \xBF/x, 'UTF-8', 3, ''], [qr/^ \x00 \x00 \x00 \x3C/x, '', 0, 'UCS-4BE'], [qr/^ \x3C \x00 \x00 \x00/x, '', 0, 'UCS-4LE'], [qr/^ \x00 \x00 \x3C \x00/x, '', 0, 'UCS-42143'], [qr/^ \x00 \x3C \x00 \x00/x, '', 0, 'UCS-43412'], [qr/^ \x00 \x3C \x00 \x3F/x, '', 0, 'UTF-16BE'], [qr/^ \x3C \x00 \x3F \x00/x, '', 0, 'UTF-16LE'], [qr/^ \x3C \x3F \x78 \x6D/x, '', 0, 'ASCII'], [qr/^ \x4C \x6F \xA7 \x94/x, '', 0, 'cp37'], # EBCDIC # default ]; sub get_parser { return undef; } sub begin { my ($self) = shift; $self->{init4} = 0; $self->{initgt} = 0; $self->{provisional_charset} = ''; $self->SUPER::begin(@_); } sub filter { my $self = shift; my ( $dataref, $message, $protocol, $buffer ) = @_; if (! $self->{init4}) { if (length($$dataref) >= 4) { $self->{init4} = 1; for (@$enctable) { if ($$dataref =~ $_->[0]) { $self->set_charset($_->[1]); substr($$dataref,0,$_->[2],''); $self->{provisional_charset} = $_->[1] || $_->[3]; } if (! $self->{provisional_charset} ) { $self->default_charset; $self->{provisional_charset} = $self->{charset}; } } } else { # come back later if ($buffer) { $$buffer .= $$dataref; $$dataref = ''; return; } } } if (! $self->{initgt}) { my $predec = $$dataref; my $provstuff = decode($self->{provisional_charset}, $predec, Encode::FB_QUIET); if ($provstuff =~ />/) { $self->{initgt} = 1; if ($provstuff =~ /^(\<\? xml \s+ version \s* = \s* (?:'[^']*'|"[^"]*") \s+) (?:encoding \s* = \s* ('[^']*'|"[^"]*"|\S+))? (.*? \?\>)/x) { my ($preenc, $cset, $postenc) = ($1,$2,$3); $cset =~ s/^["']// if $cset; $cset =~ s/['"]$// if $cset; $self->set_charset($cset); $self->default_charset(); substr($provstuff,0,$+[0], qq[${preenc}encoding="utf-8"$postenc]); } else { $self->default_charset(); substr($provstuff, 0, 0, qq[<?xml version="1.0" encoding="UTF-8"?>\n]); } $$dataref = encode($self->{provisional_charset}, $provstuf +f) . $predec; } else { # come back later if ($buffer) { $$buffer .= $$dataref; $$dataref = ''; return; } } } $self->SUPER::filter(@_); } # This takes wide perl strings and outputs them as UTF8 package ToUtf8Filter; use base qw( HTTP::Proxy::BodyFilter ); use Encode; use strict; use warnings; sub filter { my ( $self, $dataref, $message, $protocol, $buffer ) = @_; $$dataref = encode('utf-8', $$dataref); } # This makes the header say that the document is in utf8, even if the # text says otherwise. (As it may after our filtering) package utf8HeaderFilter; use base qw( HTTP::Proxy::HeaderFilter ); use strict; use warnings; sub filter { my ($self, $headers, $message) = @_; ### Debugging: ### print "Header: ", $headers->header('Content-Type'), "\n"; if ($headers->header('Content-Type') and $headers->header('Content-Type') =~ m[^(application/xhtml(\+xml)?|text/html?)\s*($|;)]) { $message->header('Content-Type' => "$1; charset=utf-8") } } # This is only for debugging - if enabled, it prints out what # url is attempting to be loaded package outFilter; use base qw( HTTP::Proxy::HeaderFilter ); use strict; use warnings; sub filter { my ($self, $headers, $message) = @_; print "Going for: ", $message->uri, "\n"; } # This is another proxy filter used for debugging # Inserted into a filter chain, it logs the text as it passes through package MonitorBodyFilter; use base qw( HTTP::Proxy::BodyFilter ); use strict; use warnings; use Data::Dumper; sub filter { my ( $self, $dataref, $message, $protocol, $buffer ) = @_; local $Data::Dumper::Indent = 0; local $Data::Dumper::Useqq = 1; print Data::Dumper->Dump([$$dataref],[$self->{name} || '$dataref'] +), "\n"; } sub init { my ($self, $name) = @_; $self->{name} = $name; return $self; } # This filter removes any Content-Type information (including # charset info) from an included <meta> tag, because at this # point that information's inaccruate anyway. package RemoveContentTypeMeta; use HTML::Parser; use HTTP::Proxy::BodyFilter::htmlparser; use strict; use warnings; sub new { my $p = HTML::Parser->new(api_version => 3); $p->handler(default => sub { $_[0]->{output} .= $_[1]; }, 'self,text +'); $p->handler(start => sub { unless ($_[2] eq 'meta' and $_[3]->{'http-equiv'} and lc($_[3]->{'http-equiv'}) eq 'content-type') {$_[0]->{output} .= $_[1];} }, 'self,text,tag,attr'); return HTTP::Proxy::BodyFilter::htmlparser->new($p, rw => 1); } # This makes certain everything has closing tags, but does # it in a slightly extreme fashion by passing everything into # an HTML::TreeBuilder instance. # # This means it must wait for the entire document to be absorbed # before it does anything. However, this is a good example of # how to use HTML::TreeBuilder in a proxy filter. # # A less extreme example would use a simple HTML::Parser to handle # closing any open <p> elements before other <p> elements, <h?> # elements and <form>. (Or, depending on how strict you wanted to # be, before lists or other elements paragraphs aren't supposed to # contain) package HTMLNormalize; use HTML::TreeBuilder; use base qw( HTTP::Proxy::BodyFilter ); use strict; use warnings; sub begin { my ($self, $message) = @_; $self->{treeparser} = HTML::TreeBuilder->new(); } sub filter { my ( $self, $dataref, $message, $protocol, $buffer ) = @_; $self->{treeparser}->parse($$dataref); if ($buffer) { $$dataref=''; return } $self->{treeparser}->eof(); $$dataref = $self->{treeparser}->as_HTML('<>&', ' ', {}); $self->{treeparser}->delete(); } package main; use HTTP::Proxy; use HTTP::Proxy::BodyFilter::htmlparser; use strict; use warnings; my $proxy = HTTP::Proxy->new( port => 3128 ); # The actual subroutine to add an <a name=""> link my $pstartsub = sub { $_[0]->{output} .= $_[2]; if ($_[1] eq 'p') { $_[0]->{foundtext} = 0; $_[0]->{plnum}++; $_[0]->{output} .= '<a name="para_link_'; $_[0]->{output} .= $_[0]->{plnum}; $_[0]->{output} .= '"></a>'; } }; my $pendsub = sub { if ($_[1] eq 'p' and $_[0]->{foundtext}) { $_[0]->{output} .= '<a href="#para_link_'; $_[0]->{output} .= $_[0]->{plnum}; $_[0]->{output} .= qq[">\xB6</a>]; } $_[0]->{output} .= $_[2]; }; my $parser = HTML::Parser->new(api_version => 3); $parser->handler(start => $pstartsub, 'self,tagname,text' ); $parser->handler(end => $pendsub, 'self,tagname,text' ); $parser->handler(text => sub { $_[0]->{output} .= $_[1]; $_[0]->{foundtext} ||= ($_[1] =~ /\S/) +; }, 'self,text'); $parser->handler(default => sub { $_[0]->{output} .= $_[1]; }, 'self,t +ext'); $parser->handler(start_document => sub { $_[0]->{plnum} = 0; } ); my $xparser = HTML::Parser->new(api_version => 3); $xparser->handler(start => $pstartsub, 'self,tagname,text' ); $xparser->handler(end => $pendsub, 'self,tagname,text' ); $xparser->handler(default => sub { $_[0]->{output} .= $_[1]; }, 'self, +text'); $xparser->handler(text => sub { $_[0]->{output} .= $_[1]; $_[0]->{foundtext} ||= ($_[1] =~ /\S/ +);}, 'self,text'); $xparser->handler(start_document => sub { $_[0]->{plnum} = 0; } ); $xparser->xml_mode(1); # Debugging: # $proxy->push_filter( request => outFilter->new() ); $proxy->push_filter( mime => 'text/html', response => utf8HeaderFilter->new(), response => FromLocalFilter->new('iso-8859-1'), response => RemoveContentTypeMeta->new(), response => HTMLNormalize->new(), response => HTTP::Proxy::BodyFilter::htmlparser->new($parser, + rw => 1), response => ToUtf8Filter->new() ); # To catch application/xhtml+xml $proxy->push_filter( mime => 'application/xhtml', response => utf8HeaderFilter->new(), response => FromLocalXmlFilter->new('utf-8'), response => HTTP::Proxy::BodyFilter::htmlparser->new($xparser +, rw => 1), response => ToUtf8Filter->new() ); $proxy->start;
Update: Added character set detection and handling for xhtml documents according to Appendix F of the XML recommendation.
--
@/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/

In reply to Mangle HTML on-the-fly using HTTP::Proxy by fizbin

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others lurking in the Monastery: (4)
As of 2024-03-28 18:20 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found