myfrndjk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I want to print Japanese characters to an HTML file as crawled content. I tried encoding the output as both cp1252 and UTF-8 when printing to the HTML file, but the Japanese characters don't show up; I get junk values instead. For example, " 【外資系転職の " is printed as " ÂyŠOŽ‘Œn“] ". Thanks in advance.

use strict;
use warnings;
use HTML::TreeBuilder::XPath;
use LWP::UserAgent;
use HTTP::Request;
use HTML::Entities;
use HTML::Strip;
use Encode qw( decode_utf8 encode_utf8 );

open( OUT, '>:utf8', "C:/Users/jeyakuma/Desktop/test1.html" );

my $URL     = 'http://job.japantimes.com/';
my $agent   = LWP::UserAgent->new( agent => "Mozilla/5.0" );
my $request = HTTP::Request->new( GET => $URL );
my $response = $agent->request($request);

# Check the outcome of the response
if ( $response->is_success ) {
    my $xp = HTML::TreeBuilder::XPath->new_from_content( $response->decoded_content );
    my $raw_html = $xp->findnodes_as_string( '//td[@class="text12"]');
    my $hs = HTML::Strip->new();
    my $clean_text = $hs->parse($raw_html);
    $clean_text = decode_utf8($hs->parse(encode_utf8($raw_html)));
    $hs->eof;
    print OUT $clean_text;
}
elsif ( $response->is_error ) {
    print "Error:$URL\n";
    print $response->error_as_HTML;
}

Replies are listed 'Best First'.
Re: How to print(encode) japanese characters in HTML using perl crawler
by zwon (Abbot) on Nov 09, 2014 at 15:26 UTC
    Works for me. I guess the problem is that the tool you're using to see the output file doesn't know it is in UTF-8.

      Hi, I tried opening it in both Firefox and Chrome, and in both I get the junk values. Is there any other reason for this?

        Yes. Firefox isn't recognizing the proper character encoding. If you go to View -> Character Encoding and select "Unicode", you should see the proper characters.

        Note that even though you've named your output file with the .html extension, it's not actually an HTML file; it's just raw UTF-8 encoded text. As such, it lacks the metadata that comes along with web pages to help browsers render them correctly, such as a declaration of the character encoding.
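
One way to give the browser that missing metadata is to wrap the text in a minimal HTML page that declares its own encoding. The following is only a sketch, not the original poster's code: the helper name wrap_html, the page title, and the relative output path test1.html are all made up for illustration, and the sample string stands in for the text extracted by the crawler.

```perl
use strict;
use warnings;
use utf8;    # this sketch contains a UTF-8 string literal in its source

# Hypothetical helper: wrap already-decoded text in a minimal HTML page
# that declares its encoding, so the browser does not have to guess.
sub wrap_html {
    my ($text) = @_;

    # Escape the characters that are special in HTML.
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;

    return <<"HTML";
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Crawl output</title>
</head>
<body>
<p>$text</p>
</body>
</html>
HTML
}

# Stand-in for the text extracted by the crawler (an assumption).
my $page = wrap_html('【外資系転職の');

# Hypothetical output path. The :encoding(UTF-8) layer turns Perl
# characters into UTF-8 bytes on write.
my $file = 'test1.html';
open( my $out, '>:encoding(UTF-8)', $file ) or die "Cannot open $file: $!";
print {$out} $page;
close($out) or die "Cannot close $file: $!";
```

With the meta charset line in place, the browser should read the file as UTF-8 without a manual change under View -> Character Encoding. This assumes the text was correctly decoded into Perl characters before printing; if it was not, the declaration alone will not repair it.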