hexdump -C etest.txt
00000000 57 65 72 20 42 61 72 62 61 72 61 20 6c 69 76 65 |Wer Barbara live|
00000010 20 65 72 6c 65 62 65 6e 20 6d c3 b6 63 68 74 65 | erleben m..chte|
00000020 2c 20 68 61 74 20 69 6e 20 4d c3 bc 6e 63 68 65 |, hat in M..nche|
00000030 6e 20 69 6d 6d 65 72 20 77 69 65 64 65 72 20 64 |n immer wieder d|
00000040 69 65 20 47 65 6c 65 67 65 6e 68 65 69 74 2c 20 |ie Gelegenheit, |
00000050 73 69 65 20 73 69 6e 67 65 6e 20 7a 75 20 68 c3 |sie singen zu h.|
00000060 b6 72 65 6e 2e 20 42 65 73 6f 6e 64 65 72 65 20 |.ren. Besondere |
00000070 41 75 66 74 72 69 74 74 65 20 77 65 72 64 65 20 |Auftritte werde |
00000080 69 63 68 20 61 62 20 73 6f 66 6f 72 74 20 69 6d |ich ab sofort im|
00000090 20 41 6e 73 63 68 6c 75 c3 9f 20 61 6e 20 64 69 | Anschlu.. an di|
000000a0 65 20 45 6e 67 65 6c 77 6f 72 74 65 20 61 6e 6b |e Engelworte ank|
000000b0 c3 bc 6e 64 69 67 65 6e 2e 0a 0a |..ndigen...|
000000bb
The above is a cut-n-paste from the webpage.html -- both the html and the txt show the same missing characters.
I did read several things about UTF-8. I suppose the confusion lies in => if I create the file, I get my Latin-1. If I didn't create the file, there is only ASCII.
The snippet you show is encoded in UTF-8.
Next step: determine the encoding of the file in which umlauts display correctly on your terminal.
Or even better: configure a clean UTF-8 environment.
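To see why the dump reads as UTF-8, here is a minimal sketch that takes the bytes for "möchte" from the second row of the hexdump and decodes them. The variable names are mine, purely for illustration. The pair c3 b6 collapses into the single character U+00F6 (ö), which is why the byte count and the character count differ.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Bytes copied from the hexdump row at offset 0x10: "m", c3 b6, "chte".
my $bytes = "\x6d\xc3\xb6\x63\x68\x74\x65";

# Strict decoding: FB_CROAK makes decode() die on malformed UTF-8,
# LEAVE_SRC keeps $bytes intact so it can still be inspected afterwards.
my $text = decode( 'UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC() );

printf "bytes: %d, characters: %d\n", length($bytes), length($text);  # 7 and 6
printf "second character: U+%04X\n", ord( substr( $text, 1, 1 ) );    # U+00F6

# Encode again on output; this assumes the terminal expects UTF-8.
binmode STDOUT, ':encoding(UTF-8)';
print "$text\n";
```

If decode() dies here, the input is not valid UTF-8, and it is worth trying ISO-8859-1 next.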
I suppose the confusion lies in => if I create the file, I get my Latin-1. If I didn't create the file, there is only ASCII.
I'm confused indeed. If you don't create a file, it doesn't exist, neither with ASCII nor with UTF-8.
Speaking of confusion, I think you are trying to achieve too much in one step. For example, the title of your question mentions HTML::Parser, which doesn't appear in the posting at all.
So, small steps:
- Make sure you know which encoding your terminal understands. There's no point in proceeding before you have done this step.
- Find out which encoding your source files use. It seems to be UTF-8.
- In your Perl scripts, decode everything coming in from the outside (except when a module does it for you), and encode everything going out. Add use utf8; and write your program files in UTF-8.
- If something doesn't work, find out where you violate any of the points of the previous steps.
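The steps above can be sketched roughly as follows. This is only an illustration under a few assumptions: the script itself is saved as UTF-8, the terminal expects UTF-8, and a temporary file (created here with File::Temp) stands in for the real input file.

```perl
use strict;
use warnings;
use utf8;    # this source file is assumed to be saved as UTF-8
use File::Temp qw(tempfile);

# Hypothetical sample file standing in for the real input.
my ( $tmp_fh, $path ) = tempfile( UNLINK => 1 );
binmode $tmp_fh, ':encoding(UTF-8)';    # encode characters on the way out
print {$tmp_fh} "Wer Barbara live erleben möchte\n";
close $tmp_fh or die "Cannot close $path: $!";

# Decode on the way in: the I/O layer turns UTF-8 bytes into characters.
open my $in, '<:encoding(UTF-8)', $path or die "Cannot open $path: $!";
my $line = <$in>;
close $in;
chomp $line;

# Inside the program we now count characters, not bytes.
my $chars = length $line;

# Encode on the way out, matching the (assumed) UTF-8 terminal.
binmode STDOUT, ':encoding(UTF-8)';
print "$line ($chars characters)\n";
```

The point of the layers is that decoding and encoding happen only at the edges of the program; everything in between deals with characters.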
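#!/usr/bin/perl -w
# legaget.pl
use strict;
use Encode;
my $filename = "engleword.html";
# Report the OS error via $! ($1 is a regex capture variable, not the error).
open my $fh, "<", $filename or die "Cannot open $filename: $!";
while ( my $line = <$fh> ) {
    # $line holds undecoded bytes, so encode() treats each byte as a Latin-1 character.
    print encode( "utf8", $line );
}
close($fh);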
What I have learned...
- use utf8; only tells Perl that the source code itself is written in UTF-8 (string literals, identifiers). It is not for encoding data read from or written to files.
- I still have to grab the HTML and write it to a file; I would still like to encode the string in place. Maybe later.
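A small sketch of what use utf8; changes, assuming the script file is saved as UTF-8. The pragma is lexically scoped, so the same literal can be read both ways in one file; the variable names are just for illustration.

```perl
use strict;
use warnings;

my ( $with, $without );
{
    use utf8;    # the literal below is read as characters
    $with = length "möchte";
}
{
    no utf8;     # the literal below is read as raw bytes
    $without = length "möchte";
}

# With the pragma the ö is one character; without it, two bytes.
print "with use utf8: $with, without: $without\n";    # 6 and 7
```

Nothing here touches file or terminal encoding, which is still handled by decode/encode or I/O layers.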
I have come across this encoding problem as a graphic artist. Customers used MS Word to generate text and then pasted the result into HTML, Adobe PageMaker, PDF, etc. Everything was just hunky-dory on a WinBox, but on a Mac or Linux the results had missing characters. MS was late adopting Unicode. MS thought they had another answer with OpenType (I think it was), a font technology in partnership with Adobe. That fell apart. But in pre-XP MS text products the first byte set the encoding for the text file. I used to have a freeware program on the PC that manually changed that byte.
Forgive me, I worked on this silly problem all day, but I'm loving Perl.