Re^3: HTML::Parser, file, print to Terminal

If you show the output from

hexdump -C your-file-with-ligatures.txt

we can determine the encoding for you.

What I meant by binmode is that instead of

use open ":encoding(UTF-8)";

try

binmode STDOUT, ":encoding(UTF-8)";

I also recommend reading up on encodings, Unicode, and how Perl handles them.
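
As a minimal sketch of the binmode suggestion, assuming the terminal expects UTF-8: the byte string below is hard-coded for illustration (it stands in for data read from a file), and the variable names are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: these are the UTF-8 bytes for "möchte" (6d c3 b6 63 68 74 65),
# as they would arrive from a file opened without any encoding layer.
my $bytes = "m\xc3\xb6chte";

# Decode the raw bytes into a Perl character string.
my $text = decode('UTF-8', $bytes);

# From here on, character strings printed to STDOUT are encoded as UTF-8.
binmode STDOUT, ':encoding(UTF-8)';

print "$text\n";
print length($text), " characters, ", length($bytes), " bytes\n";
```

The difference between the two lengths is the point: the umlaut is one character but two bytes, and the output layer is what turns it back into those two bytes for the terminal.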


Re^4: HTML::Parser, file, print to Terminal by victor_charlie (Novice) on Jul 13, 2010 at 14:27 UTC
hexdump -C etest.txt

00000000  57 65 72 20 42 61 72 62  61 72 61 20 6c 69 76 65  |Wer Barbara live|
00000010  20 65 72 6c 65 62 65 6e  20 6d c3 b6 63 68 74 65  | erleben m..chte|
00000020  2c 20 68 61 74 20 69 6e  20 4d c3 bc 6e 63 68 65  |, hat in M..nche|
00000030  6e 20 69 6d 6d 65 72 20  77 69 65 64 65 72 20 64  |n immer wieder d|
00000040  69 65 20 47 65 6c 65 67  65 6e 68 65 69 74 2c 20  |ie Gelegenheit, |
00000050  73 69 65 20 73 69 6e 67  65 6e 20 7a 75 20 68 c3  |sie singen zu h.|
00000060  b6 72 65 6e 2e 20 42 65  73 6f 6e 64 65 72 65 20  |.ren. Besondere |
00000070  41 75 66 74 72 69 74 74  65 20 77 65 72 64 65 20  |Auftritte werde |
00000080  69 63 68 20 61 62 20 73  6f 66 6f 72 74 20 69 6d  |ich ab sofort im|
00000090  20 41 6e 73 63 68 6c 75  c3 9f 20 61 6e 20 64 69  | Anschlu.. an di|
000000a0  65 20 45 6e 67 65 6c 77  6f 72 74 65 20 61 6e 6b  |e Engelworte ank|
000000b0  c3 bc 6e 64 69 67 65 6e  2e 0a 0a                 |..ndigen...|
000000bb

The above is a cut-n-paste from the webpage.html -- both the html and the txt show the same missing characters.

I did read several things about UTF-8. I suppose the confusion lies in => if I create the file, I get my Latin-1. If I didn't create the file, there is only ASCII.
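
The dump can also be read by hand: the byte pairs c3 b6, c3 bc and c3 9f are the two-byte UTF-8 sequences for ö, ü and ß, whereas Latin-1 would use the single bytes f6, fc and df. A minimal sketch of checking this in Perl follows; `is_valid_utf8` is a hypothetical helper written for this example, not part of any module, and a successful strict decode only shows that the bytes are consistent with UTF-8.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns 1 if $bytes decodes cleanly as strict UTF-8, else 0.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK argument may modify its input
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

my $from_dump = "h\xc3\xb6ren";    # 68 c3 b6 72 65 6e, as in the dump
my $latin1    = "h\xf6ren";        # the same word as Latin-1 bytes

printf "dump bytes valid UTF-8:    %s\n", is_valid_utf8($from_dump) ? 'yes' : 'no';
printf "Latin-1 bytes valid UTF-8: %s\n", is_valid_utf8($latin1)    ? 'yes' : 'no';
```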
Re^5: HTML::Parser, file, print to Terminal by moritz (Cardinal) on Jul 13, 2010 at 14:38 UTC
The snippet you show is encoded in UTF-8. Next step: determine the encoding of the file in which umlauts display correctly on your terminal. Or even better: configure a clean UTF-8 environment.

> I suppose the confusion lies in => if I create the file, I get my Latin-1. If I didn't create the file, there is only ASCII.

I'm confused indeed. If you don't create a file, it doesn't exist, neither with ASCII nor with UTF-8.

Speaking of confusion, I think you are trying to achieve too much in one step. For example, the title of your question mentions HTML::Parser, which doesn't appear in the posting at all.

So, small steps:

1. Make sure you know which encoding your terminal understands. There's no point in proceeding before you have done this step.
2. Find out what encodings your source files are. Seems to be UTF-8.
3. In your Perl scripts, decode everything coming from the outside (except when a module does it for you), and encode everything that goes out.
4. `use utf8;`, and write your program files in UTF-8.
5. If something doesn't work, find out where you violate any of the points of the previous steps.

Perl 6 - links to (nearly) everything that is Perl 6.
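
The decode-on-input, encode-on-output step can be sketched with I/O layers. This is only an illustration under assumptions: `transcode_utf8` is a hypothetical helper, the input is a hard-coded byte string read through an in-memory filehandle instead of a real file, and ISO-8859-1 stands in for whatever encoding the terminal actually understands.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes UTF-8 bytes and returns the same text
# as bytes in the $target encoding.
sub transcode_utf8 {
    my ($input_bytes, $target) = @_;
    my $output_bytes = '';

    # Decode on the way in: the layer turns UTF-8 bytes into characters.
    open my $in, '<:encoding(UTF-8)', \$input_bytes
        or die "cannot open input: $!";

    # Encode on the way out: the layer turns characters into $target bytes.
    open my $out, ">:encoding($target)", \$output_bytes
        or die "cannot open output: $!";

    while (my $line = <$in>) {
        print {$out} $line;    # $line holds characters, not bytes
    }

    close $in;
    close $out or die "cannot close output: $!";
    return $output_bytes;
}

my $utf8   = "Anschlu\xc3\x9f\n";    # 'Anschluß' as UTF-8 bytes (c3 9f)
my $latin1 = transcode_utf8($utf8, 'ISO-8859-1');
printf "%d bytes in, %d bytes out\n", length($utf8), length($latin1);
```

With real files the same two `open` calls would take a filename instead of a scalar reference; inside the loop the program only ever sees characters.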
Re^6: HTML::Parser, file, print to Terminal by victor_charlie (Novice) on Jul 13, 2010 at 19:37 UTC
Okay, this DOES work...

#!/usr/bin/perl -w
# legaget.pl
use strict;
use Encode;

my $filename = "engleword.html";

# $! holds the system error message if open fails
open FILE, "<", $filename or die $!;
while ( my $line = <FILE> ) {
    # encode each line as UTF-8 before printing it
    print encode( "utf8", $line );
}
close(FILE);

What I have learned... `use utf8;` is for Unicode source code, filenames, deals with legacy stuff, not for encoding.

I still have to grab the html and write to a file, and I would still like to encode the string in place. Maybe later.

I have come across this encoding problem as a graphic artist. Customers used MSWord to generate text and then pasted the resulting text into html, or Adobe Pagemaker, PDF, etc., and everything is just hunky-dory on a WinBox, but on a Mac or Linux the results had missing characters. MS was late adopting Unicode. MS thought they had another answer with OpenType (I think it was), a font technology in partnership with Adobe. That fell apart. But in pre-XP MS text products the first byte set the encoding for the text file. I used to have a FreeWare program on the PC that manually changed that byte.

Forgive me, I worked on this silly problem all day, but I'm loving Perl.