PerlMonks  

Re: Unicode Woes

by BigLug (Chaplain)
on Oct 01, 2004 at 09:47 UTC


in reply to Unicode Woes

For those who want some code to play with:
    #!/usr/bin/perl
    use URI::Escape;
    use Encode;
    require LWP::UserAgent;

    # Build the query text from the command-line arguments
    my $escape = uri_escape(join('. ', @ARGV));

    my $ua = LWP::UserAgent->new;
    my $response = $ua->get("http://babelfish.altavista.com/tr?trtext=$escape&lp=en_ja");

    if ($response->is_success) {
        $result = $response->content;  # or whatever
    }
    else {
        die $response->status_line;
    }

    # Flag the raw bytes as UTF-8 (no validation is done here)
    Encode::_utf8_on( $result );

    my ($translation) = $result =~ /\Q<td bgcolor=white class=s><div style=padding:10px;>\E(.+?)\Q<\/div>\E/;
    $original = $translation;

    # Show any non-ASCII characters as \x{....} escapes
    $translation =~ s/([^[:ascii:]])/sprintf("\\x{%.4x}",ord $1)/ge;

    print $translation ."\n". length($original) ."\n". ord(substr($original,0,1));
Run this (at least on my machine) and $translation has no visible contents, yet it has a length of 5!

If your machine gives you something sensible, please let me know.

(You can probably remove the Encode calls there; they were just making sure that the resulting string *was* in utf8 according to Perl.)
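For anyone comparing approaches: a minimal sketch of turning raw bytes into a character string with Encode::decode rather than flipping the internal flag with Encode::_utf8_on. The $bytes value is a hypothetical stand-in for $response->content (the UTF-8 encoding of a single Japanese character), not output from the real server.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for $response->content:
# the three UTF-8 bytes that encode U+6728.
my $bytes = "\xE6\x9C\xA8";

# decode() checks that the bytes really are UTF-8 and returns a
# character string. FB_CROAK makes it die on malformed input;
# LEAVE_SRC keeps $bytes unmodified.
my $chars = decode('UTF-8', $bytes,
                   Encode::FB_CROAK() | Encode::LEAVE_SRC());

printf "bytes: %d, characters: %d, first code point: U+%04X\n",
    length($bytes), length($chars), ord($chars);
```

Unlike _utf8_on, decode will complain if the bytes are not valid UTF-8, which makes it easier to tell an encoding problem from a content problem.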


Cheers!
Rick
If this is a root node: Before responding, please ensure your clue bit is set.
If this is a reply: This is a discussion group, not a helpdesk ... If the discussion happens to answer a question you've asked, that's incidental.

Re^2: Unicode Woes
by Anonymous Monk on Oct 01, 2004 at 11:37 UTC
    $translation has no visible contents
    Using Data::Dumper and redirecting the output into a file shows that it consists of ASCII NULs. The content of $response as returned by LWP is already wrong in that way. I tried LWP::Simple and got the same thing. Adding use open ':utf8'; does not help.
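A small sketch of why the NULs stayed invisible in the original script, and one way to make them show up. NUL (code point 0) counts as ASCII, so the [^[:ascii:]] substitution never touches it. The five-NUL string here is a hypothetical stand-in for the extracted $translation, chosen only to match the reported length of 5.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical stand-in for the extracted $translation: five NUL bytes.
my $translation = "\0" x 5;

# The escaping step from the original script: it rewrites only
# non-ASCII characters, so NULs pass through unchanged.
(my $escaped = $translation) =~ s/([^[:ascii:]])/sprintf("\\x{%.4x}", ord $1)/ge;

# Useqq makes Data::Dumper print control characters as escapes
# instead of emitting them raw.
local $Data::Dumper::Useqq = 1;
local $Data::Dumper::Terse = 1;
my $dump = Dumper($escaped);

print "length: ", length($escaped), "\n";
print $dump;
```

With Useqq set, the dump can be read directly on a terminal without redirecting to a file.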
Re^2: Unicode Woes
by graff (Chancellor) on Oct 01, 2004 at 22:57 UTC
    Hrrm. I'm not well versed with LWP stuff. I went to that web site with a browser, typed in an English word and got back a Japanese word (in utf8) -- that's fine (the page source had nothing strange about it). I tried wget from the command line with the url string that you would post to get that same translation:
    $ wget -O /tmp/junk 'http://babelfish.altavista.com/tr?trtext=tree&lp=en_ja'
    and I think wget gave me the same output that went to the browser -- that's fine. (But when I tried again later, it gave me a null byte where the Japanese should have been. Having overwritten the original try, I can't be sure now.)

    When I run your test script, $translation ends up with a null byte. I tried printing $result to STDERR, and redirected that to a file. The file (i.e. the full web page content returned by LWP->get) had null bytes where the browser (and maybe wget) output had a Japanese character.

    So I'm guessing there is something wrong with how you are making or sending the request to the server, but I can't imagine what to try next in order to figure out the problem and fix it. Good luck.
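To compare the LWP output with a saved wget or browser copy, it can help to list exactly where the null bytes sit. A minimal sketch, assuming the page is already in memory as a byte string; nul_offsets is a hypothetical helper name and $page is made-up sample data, not the real page.

```perl
use strict;
use warnings;

# Return the byte offset of every NUL in a byte string.
sub nul_offsets {
    my ($bytes) = @_;
    my @offsets;
    while ($bytes =~ /\0/g) {
        # pos() points just past the match, so step back one byte.
        push @offsets, pos($bytes) - 1;
    }
    return @offsets;
}

# Hypothetical fragment: two NULs where the translated text should be.
my $page  = "<div>\0\0</div>";
my @found = nul_offsets($page);

print "NUL bytes at offsets: @found\n";
```

For a file such as the /tmp/junk written by wget, open it with the '<:raw' layer and read it into a scalar first, so that no decoding layer changes the bytes before they are scanned.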
