in reply to Re^2: NCR & CER to UTF-8
in thread NCR & CER to UTF-8
1) When you use perl with data that isn't ASCII, it's generally a good idea to tell perl what encoding each filehandle uses, with binmode, for example: binmode(\*STDOUT, ':utf8'); By default, even in utf8 locales, STDIN, STDOUT, and STDERR are assumed to be latin-1. If you had use warnings turned on, you would have gotten a lot of "Wide character in print" warnings, one for each print of a character outside latin-1.
2) From that screenshot, it appears that what you have is mostly valid utf8, but the thing you are using to view it expects it to be latin-1, not utf8.
3) It's dangerous to HTML-unescape text containing unescaped HTML or XML tags; after doing so it is impossible to tell the difference between what was &lt;foo&gt; and what was <foo>, so things that were not tags before become tags.
4) Why aren't you using an HTML parser, such as HTML::TreeBuilder?
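
To illustrate point 1), here is a minimal sketch of setting the output layer before printing. The sample string is my own, not taken from your data, and I'm assuming your terminal or viewer really does expect utf8.

```perl
use strict;
use warnings;

# Declare the encoding of STDOUT once, before anything is printed.
binmode(STDOUT, ':utf8');

# Example string (an assumption, not from the thread): U+1EC7 is outside latin-1.
my $text = "Vi\x{1EC7}t Nam";

# With the layer in place this prints UTF-8 bytes and emits no
# "Wide character in print" warning.
print "$text\n";
```

If you comment out the binmode line and keep use warnings, the same print should produce the warning described in point 1).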
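
To illustrate point 3), here is a rough sketch using only core Perl. The helper decode_ncr and the sample input are hypothetical, made up for this reply. It handles numeric character references only (not named entities such as &lt;), and it is not a substitute for a real parser. The idea is to leave references to markup-significant characters escaped, so that escaped text cannot turn into a tag.

```perl
use strict;
use warnings;

# Code points that are significant in markup: " & ' < >
my %markup_significant = map { $_ => 1 } (34, 38, 39, 60, 62);

# Hypothetical helper: decode numeric character references in $text.
# If $keep_markup_escaped is true, references to markup-significant
# characters are left exactly as they were.
sub decode_ncr {
    my ($text, $keep_markup_escaped) = @_;
    $text =~ s{(&#(?:[xX]([0-9A-Fa-f]{1,6})|([0-9]{1,7}));)}{
        my $cp = defined $2 ? hex($2) : $3;
        # Leave out-of-range references and (optionally) markup characters alone.
        ($cp > 0x10FFFF || ($keep_markup_escaped && $markup_significant{$cp}))
            ? $1
            : chr($cp);
    }ge;
    return $text;
}

# Sample input (an assumption): one escaped "tag" and one real tag.
my $input = 'Vi&#7879;t &#60;b&#62; <b>Nam</b>';

my $naive = decode_ncr($input, 0);   # escaped <b> now looks like a real tag
my $safe  = decode_ncr($input, 1);   # escaped <b> stays as &#60;b&#62;

binmode(STDOUT, ':utf8');
print "naive: $naive\n";
print "safe:  $safe\n";
```

In the naive result the two <b> tags can no longer be told apart, which is exactly the problem. In the safe result only the letter is decoded.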
Re^4: NCR & CER to UTF-8
by vnpenguin (Beadle) on Nov 01, 2005 at 16:14 UTC