GaijinPunch has asked for the wisdom of the Perl Monks concerning the following question:

Hi

If you dig through here, you'll find where I asked this same question before. However, the code from that thread isn't working anymore. I'm assuming a recent change in Mechanize.

I'm actually using WWW::Scripter, which is a subclass of Mechanize. I'm get()ing some Japanese pages which are encoded in euc-jp. The text is all converted to and spit out in utf-8... hex representations. The browsers generally read this fine, so it's not the end of the world as I'm matching up to about 10 strings. However, I'd like to be able to read it if possible.
use WWW::Scripter;
use Encode qw(from_to);
use Jcode;

my $w = new WWW::Scripter();
$w->get( 'www.page.html' );

my $html1 = $w->content();
from_to( $html1, 'utf8', 'euc-jp' );             # No good
my $html2 = Jcode->new( $w->content() )->euc();  # No good but worked with older version of Mechanize.

The Japanese strings come back with a Unicode hex representation of each character, as found here.

I've read around on a few Japanese blogs and whatnot. Looks like some people are changing the contents of HTTP::Message.pm but I can't find out exactly where.

Replies are listed 'Best First'.
Re: WWW::Mechanize & encoding
by moritz (Cardinal) on Jun 28, 2011 at 13:03 UTC
    If you mean entities like &#12443; and &#x309B;, those are not UTF-8 specific, but they give you the codepoint.

    You can convert those to "normal" characters (and not HTML entities) with HTML::Entities. You can then encode the resulting string with Encode::encode in any encoding you like (provided it supports those characters).
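The two steps described above can be sketched roughly like this. The input string is a made-up sample (the same character written once as a decimal and once as a hex entity), not output from the original poster's page, and EUC-JP is assumed as the target only because that is what the question asks for.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);
use Encode qw(encode);

# Hypothetical sample input: U+309B as a decimal and as a hex entity.
my $html = 'dakuten: &#12443; and &#x309B;';

# Step 1: turn the entities into real characters.
# Called in scalar context, decode_entities returns the decoded string.
my $text = decode_entities($html);

# Step 2: turn the character string into octets in the wanted encoding.
my $octets = encode('EUC-JP', $text);

printf "%d characters, %d octets\n", length($text), length($octets);
```

The point of keeping the steps separate is that $text is a character string you can match against, while $octets is only meant for output.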

Re: WWW::Mechanize & encoding
by Anonymous Monk on Jun 28, 2011 at 13:44 UTC

    The Japanese strings come back with a Unicode hex representation of each character, as found here.

    What does that mean?

    Are the bytes wrong, or are you getting html entities, or what?

    Please make sure your code compiles and structure your code as an effective bug report aka a test-case, for example

    #!/usr/bin/perl --
    use WWW::Scripter;
    use Encode qw' from_to ';
    use Jcode;
    use URI::file;
    use File::Temp;
    use Test::More tests => 3;

    my $fh = File::Temp->new( SUFFIX => '.html' );
    my $filename = $fh->filename ;
    my $uri = URI::file->new_abs( $filename )->as_string;

    print $fh <<"__HTML__";
    <html><head>
    <title> title \x62\x6c\x61\x68 </title>
    </head><body>
    \x62\x6c\x61\x68\x20\x62\x6c\x61\x68\x20\x62\x6c\x61\x68
    \xFF\xFF\xFF
    </body></html>
    __HTML__

    ok(close $fh, "write tempfile ");

    my $w = WWW::Scripter->new ( qw/ autocheck 1 /);
    #~ my $w = WWW::Mechanize->new ( qw/ autocheck 1 /);
    $w->get( $uri );

    my $html1 = $w->content();
    from_to( $html1, 'utf8', 'euc-jp' );
    my $html2 = Jcode->new( $w->content() )->euc();

    is( $html1, "something", "something blah");
    is( $html2, "something else", "something else blah");
    __END__
    $ prove pm.911748.pl
    pm.911748.pl .. 1/3
    #   Failed test 'something blah'
    #   at pm.911748.pl line 43.
    #          got: '<html><head>
    # <title> title blah </title>
    # </head><body>
    # blah blah blah
    # &yuml;&yuml;&yuml;
    # </body>
    # </html>'
    #     expected: 'something'

    #   Failed test 'something else blah'
    #   at pm.911748.pl line 44.
    #          got: '<html><head>
    # <title> title blah </title>
    # </head><body>
    # blah blah blah
    # &yuml;&yuml;&yuml;
    # </body>
    # </html>'
    #     expected: 'something else'
    # Looks like you failed 2 tests of 3.
    pm.911748.pl .. Dubious, test returned 2 (wstat 512, 0x200)
    Failed 2/3 subtests

    Test Summary Report
    -------------------
    pm.911748.pl (Wstat: 512 Tests: 3 Failed: 2)
      Failed tests:  2-3
      Non-zero exit status: 2
    Files=1, Tests=3,  1 wallclock secs ( 0.06 usr +  0.00 sys =  0.06 CPU)
    Result: FAIL
      Means I'm getting the HTML entities.

      Sorry --- wasn't joking about the drinking part. Rough couple of weeks. I will post the code as per your instruction, but it's pretty late here, and I've got another day of computers beating the crap out of me in just a few hours. :/

      cheers

        Means I'm getting the HTML entities.

        WWW::Mechanize warns about content not being the same as res->content

        and indeed, in my program, ->content has entities but res->content doesn't.

        It makes sense, since WWW::Scripter uses HTML::DOM/innerHTML to get content.

        I'm not sure if Father Chrysostomos considers this the correct behaviour, but it matches my experience with Firefox browser.

        View-Source/Ctrl+U is different from document.body.parentNode.innerHTML

Re: WWW::Mechanize & encoding
by Anonymous Monk on Jun 28, 2011 at 13:11 UTC
    WWW::Mechanize already decodes the character encoding (here: EUC-JP) implicitly. Jcode is from the Perl 4 era; always use the Encode module instead. This works:

    use WWW::Scripter qw();
    use HTML::Entities qw(decode_entities);

    my $w = WWW::Scripter->new;
    $w->get('file:///tmp/nhg.euc-jp.html');
    decode_entities $w->content; # returns a Perl string, use Encode::encode to prepare it for output.
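As a side note on the "use Encode instead of Jcode" advice, here is a minimal sketch of the decode-at-input, encode-at-output pattern using only the core Encode module. The sample text is an assumption for illustration (the three characters U+65E5 U+672C U+8A9E), and the octets are produced in the script itself to stand in for a raw EUC-JP page body; nothing here comes from the original poster's page.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample: three kanji, first turned into EUC-JP octets
# to stand in for the raw bytes of a fetched page.
my $raw_octets = encode('EUC-JP', "\x{65E5}\x{672C}\x{8A9E}");

# Decode once at the input boundary: octets in, Perl characters out.
my $text = decode('EUC-JP', $raw_octets);

# From here on, work with $text as characters (length, regexes, ...).
printf "%d octets became %d characters\n", length($raw_octets), length($text);

# Encode once at the output boundary, in whatever encoding the consumer expects.
my $utf8_octets = encode('UTF-8',  $text);
my $euc_octets  = encode('EUC-JP', $text);
```

When the fetching library has already decoded the body for you, only the last step is needed; decoding or transcoding a second time is what tends to produce garbage.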
      Thanks, but I seem to have hit a snag w/ Encoder
      use Encode::Encoder qw(encoder)
      HTML::Entities qw(deocde_entities);

      my $w = WWW::Scripter->new;
      $w->get("file:///tmp/nhg.euc-jp.html")
      decode_entities $w->content;               # this is okay I think
      my $euc = encoder( $w->content )->euc_jp;  # this gives an error
      I assume my syntax for encoder() is wrong:
      "\x{00bd}" does not map to euc-jp at /usr/lib64/perl5/vendor_perl/5.12.2/x86_64-linux/Encode/Encoder.pm line 88.

      Shouldn't have started drinking so early..

        You're doing it wrong because you are not paying attention and you are writing sloppy code.

        I said in my code comment that decode_entities returns something. In your code, you discard the return value, but you need to assign it to a variable or pass it as a parameter to a function if you want to make use of it.

        The function is named decode_entities, not deocde_entities.

        Your code is lacking two ; statement separators.

        use WWW::Scripter qw();
        use HTML::Entities qw(decode_entities);
        use Encode qw(encode);

        my $w = WWW::Scripter->new;
        $w->get('file:///tmp/nhg.euc-jp.html');
        print encode('EUC-JP', decode_entities($w->content)); # output octets go to STDOUT, encoded as EUC-JP
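For anyone puzzled by the "does not map to euc-jp" message quoted earlier in the thread: it most likely means the string contained a character that EUC-JP cannot represent, and the encoder was running in a strict mode. The sketch below is an assumption about that cause, not a diagnosis of the original page. It uses a made-up input containing U+00BD and shows how Encode's CHECK argument changes the behaviour.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical input containing U+00BD (VULGAR FRACTION ONE HALF),
# the character named in the error message from the thread.
my $text = "1\x{00BD} cups";

# Default CHECK: unmappable characters are quietly replaced
# with a substitution character, so this call does not die.
my $lenient = encode('EUC-JP', $text);

# Strict CHECK: die on the first unmappable character.
# LEAVE_SRC keeps encode() from modifying $text in place.
my $strict = eval {
    encode('EUC-JP', $text, Encode::FB_CROAK | Encode::LEAVE_SRC);
};
my $error = $@;

print "lenient result: ", length($lenient), " octets\n";
print "strict result:  ", (defined $strict ? "ok" : "died: $error");
```

Whether substituting or dying is the better choice depends on the job: for display a substitution may be acceptable, while for string matching it is usually safer to find out which characters are being lost.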