fanticla has asked for the wisdom of the Perl Monks concerning the following question:
Dear Monks,
a rather silly problem: I'm using the WWW::Dict::Leo::Org module (it fetches and parses the content of an online dictionary, HTML in UTF-8), but I am experiencing the following encoding problem: I cannot properly display characters such as äüö.
The script is quite simple:
    use strict;
    use warnings;
    use WWW::Dict::Leo::Org;
    use Data::Dumper;

    my $leo = new WWW::Dict::Leo::Org();
    my @matches = $leo->translate("test");

    open (OUT, "output.txt");
    binmode(OUT, ":utf8");
    print OUT Dumper(\@matches);
    close OUT;
If I open output.txt, for example with Notepad++, I see that the encoding is right (UTF-8), but it fails to display characters such as äüö properly.
If I do not explicitly set the :utf8 layer (the HTML site is UTF-8 encoded) and I open output.txt, I get an ANSI-encoded file, and äüö are not displayed correctly. If I then change the encoding in Notepad++ from ANSI to UTF-8, all characters are displayed correctly!
Does anyone have a suggestion as to what I am doing wrong? Thanks, Cla
Re: WWW::Dict::Leo::Org encoding issue
by Corion (Patriarch) on Jun 13, 2010 at 17:43 UTC
Are you sure that your tool, Notepad++, recognizes a UTF-8 encoded file when it opens one?
Re: WWW::Dict::Leo::Org encoding issue
by ikegami (Patriarch) on Jun 13, 2010 at 18:49 UTC
I'm confused. So it's working fine? Or do you want to output "ANSI" (which is probably really cp1252)?
by fanticla (Scribe) on Jun 13, 2010 at 19:32 UTC
Notepad++ normally recognizes the encoding correctly. I want the output to be in UTF-8. I read it at a later stage to display the text in a Text widget. The reading works like this:
Of course the Text widget doesn't display the äöü correctly. I am not an expert on encodings, but I have been able to cope with all other encoding issues so far...
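A minimal sketch of reading a UTF-8 file back as decoded characters, which is the form a Unicode-aware widget generally expects. It assumes the file really contains UTF-8; the file name leo_sample.txt and the sample word are made up for illustration, and the widget itself is left out.

```perl
use strict;
use warnings;

my $file = 'leo_sample.txt';   # hypothetical file name

# Write a small UTF-8 sample so the example is self-contained.
open(my $out, '>:encoding(UTF-8)', $file) or die "Cannot write $file: $!";
print {$out} "M\x{e4}dchen\n";
close($out) or die "Cannot close $file: $!";

# Read it back through a decoding layer: each element of @text now holds
# characters rather than raw UTF-8 bytes.
open(my $in, '<:encoding(UTF-8)', $file) or die "Cannot read $file: $!";
chomp(my @text = <$in>);
close($in);

# "Maedchen" with a-umlaut is 7 characters; read without the layer it
# would be 8 bytes.
printf "%d characters\n", length $text[0];
```

If the length reported here matches the number of visible letters, the string is decoded and can be handed to the widget as is.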
by ikegami (Patriarch) on Jun 13, 2010 at 19:38 UTC
by fanticla (Scribe) on Jun 13, 2010 at 20:28 UTC
by Corion (Patriarch) on Jun 13, 2010 at 19:35 UTC
Are you sure that your (Tk?) widget understands UTF8?
Re: WWW::Dict::Leo::Org encoding issue
by graff (Chancellor) on Jun 14, 2010 at 03:09 UTC
It looks like you are opening OUT for read access, then trying to write to it. And you're not checking for any errors, so when something goes wrong, you don't hear about it. So your script is not changing the contents of the file. Try opening for write access -- the nicest way would be:
BTW, I think Data::Dumper will make sure to convert unicode characters to their "\x{h*}" form, rather than printing actual utf8-encoded byte strings.
ikegami's point about printing a BOM character first is simply that many tools (including Notepad, WordPad and other M$ utils) rely on a file-initial BOM as a sort of "magic word" that tells the tool how it should interpret the file contents. So, after the kind of open statement shown above, I would do:
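As a sketch of the two suggestions above (open for writing with an error check, then print a BOM first), assuming plain strings in place of the Data::Dumper output. The sample words are hypothetical stand-ins for the translate() results, and this is not necessarily the code graff posted.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the result of $leo->translate(...).
my @lines = ("M\x{e4}dchen", "T\x{fc}r", "sch\x{f6}n");

my $file = 'output.txt';

# '>' opens for writing; the encoding layer turns characters into UTF-8 bytes.
open(my $out, '>:encoding(UTF-8)', $file)
    or die "Cannot open $file for writing: $!";

# U+FEFF at the very start of the file serves as the BOM that many
# Windows editors use to detect UTF-8.
print {$out} "\x{FEFF}";
print {$out} "$_\n" for @lines;

close($out) or die "Cannot close $file: $!";
```

The three-argument open with a lexical filehandle keeps the mode and the encoding layer in one place, so a separate binmode call is not needed.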
Re: WWW::Dict::Leo::Org encoding issue
by wwe (Friar) on Jun 14, 2010 at 10:50 UTC
I'm using Notepad2 for checking the file. See the code here:
The file contains:
Updated: fixed spelling mistakes
Re: WWW::Dict::Leo::Org encoding issue
by Krambambuli (Curate) on Jun 14, 2010 at 15:20 UTC
Looking into your output.txt with a hex viewer should allow for a first important distinction: does the file contain what you want or need, in the form you want or need? Once this is clarified, you'll have a handle to track down the issue: towards Perl, towards Notepad++/Windows, or even in both directions.
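In place of a separate hex viewer, a few lines of Perl can show the raw bytes. This is only a sketch: it writes its own sample file (the name hexcheck_sample.txt is made up) so that it runs standalone; point $file at output.txt and drop the writing part to inspect the real file.

```perl
use strict;
use warnings;

my $file = 'hexcheck_sample.txt';   # hypothetical; use output.txt for the real check

# Create a sample file (BOM followed by a-, o- and u-umlaut) so the sketch
# runs on its own.
open(my $out, '>:encoding(UTF-8)', $file) or die "Cannot write $file: $!";
print {$out} "\x{FEFF}\x{e4}\x{f6}\x{fc}\n";
close($out) or die "Cannot close $file: $!";

# Read the first bytes raw, with no decoding layer, and show them in hex.
open(my $in, '<:raw', $file) or die "Cannot read $file: $!";
read($in, my $head, 16);
close($in);

my $hex = join ' ', map { sprintf '%02X', ord } split //, $head;
print "$hex\n";
print "starts with UTF-8 BOM\n" if $head =~ /^\xEF\xBB\xBF/;
```

A file that starts with EF BB BF carries a UTF-8 BOM, and ä, ö, ü should appear as C3 A4, C3 B6, C3 BC. Sequences such as C3 83 C2 A4 would point to text that was encoded twice.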
by ikegami (Patriarch) on Jun 14, 2010 at 17:49 UTC
Apparently so, since everything appears correctly in the editor once he switches it to the right mode. The question he's asking is how to convince his editor to automatically switch to the right mode (UTF-8 encoded instead of "ANSI" encoded).
Re: WWW::Dict::Leo::Org encoding issue
by Yary (Pilgrim) on Jun 14, 2010 at 17:44 UTC
Try adding this line after "binmode":
by ikegami (Patriarch) on Jun 14, 2010 at 17:53 UTC
is simpler than equivalent