Re^4: Matching ┬ & А type characters with a regex

Replies are listed 'Best First'.

Re^5: Matching ┬ & А type characters with a regex
by ikegami (Patriarch) on Feb 12, 2009 at 19:52 UTC

They are already clean if you view them correctly. It's like reading "Comment ça va?" and saying it's gibberish because it doesn't look like English. It's not gibberish, it's just not English. What you have isn't junk, it's just not US-ASCII, iso-latin-1, cp1252, Shift_JIS or whatever else you might want it to be.

Just like "Comment ça va?" can be translated to English, it is possible to translate what you have to US-ASCII/iso-latin-1/cp1252/Shift_JIS/etc. The translation may not be perfect. You may encounter 1:0, 1:N and N:1 relations (a character with no counterpart, one character that becomes several, several characters that collapse into one).
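
To make the 1:0 case concrete, here is a minimal sketch, assuming US-ASCII as the destination and Encode's default behaviour (no CHECK argument). The helper name to_ascii_lossy is made up for illustration and is not part of any module.

```perl
use strict;
use warnings;

use Encode qw( encode );

# Hypothetical helper: encode a string of characters as US-ASCII bytes.
# With no CHECK argument, Encode replaces each character that has no
# US-ASCII counterpart with a substitution character.
sub to_ascii_lossy {
    my ($text) = @_;
    return encode('US-ASCII', $text);
}

# \x{e7} is c-cedilla; it has no counterpart in US-ASCII.
print to_ascii_lossy("Comment \x{e7}a va?"), "\n";

# Three different accented letters come out as the same substitution
# character, so the original text can't be recovered from the result.
print to_ascii_lossy("\x{e7}\x{e9}\x{fc}"), "\n";
```

The output is valid US-ASCII, but information has been lost along the way, which is why the choice of destination encoding matters.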

So the question is: to which encoding would you like it translated? What follows are a couple of methods for converting your text. The latter allows you to configure (via from_to's fourth argument) what happens when a character in the source can't be represented by the destination encoding. See the Encode docs for details.

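use strict;
use warnings;

my $src_enc = 'UTF-8';
my $dst_enc = '...';   # the encoding you want to convert to

# Decode input from $src_enc and encode output as $dst_enc via I/O layers.
binmode(STDIN,  ":encoding($src_enc)");
binmode(STDOUT, ":encoding($dst_enc)");

print while <STDIN>;

or

use strict;
use warnings;

use Encode qw( from_to );

my $src_enc = 'UTF-8';
my $dst_enc = '...';   # the encoding you want to convert to

while (<>) {
   # from_to converts the bytes in $_ in place.
   from_to($_, $src_enc, $dst_enc);
   print;
}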
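
As a rough sketch of what the fourth argument does, the example below passes two of Encode's documented fallback constants to from_to. It assumes UTF-8 input and US-ASCII output purely for illustration; convert_bytes is a hypothetical helper, not something Encode provides.

```perl
use strict;
use warnings;

use Encode qw( from_to );

# Hypothetical helper: convert a copy of some UTF-8 bytes to $dst_enc,
# passing $check through as from_to's fourth argument.
sub convert_bytes {
    my ($bytes, $dst_enc, $check) = @_;
    from_to($bytes, 'UTF-8', $dst_enc, $check);   # modifies $bytes in place
    return $bytes;
}

my $utf8 = "Comment \xC3\xA7a va?";   # \xC3\xA7 is c-cedilla encoded as UTF-8

# FB_HTMLCREF: replace unrepresentable characters with decimal
# character references.
print convert_bytes($utf8, 'US-ASCII', Encode::FB_HTMLCREF), "\n";
# prints "Comment &#231;a va?"

# FB_CROAK: die instead of quietly losing data.
my $ok = eval { convert_bytes($utf8, 'US-ASCII', Encode::FB_CROAK); 1 };
print "conversion failed: $@" unless $ok;
```

Which fallback is appropriate depends on whether you would rather keep going with marked-up output or stop at the first character that doesn't fit.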

Update: Added code. Cleaned up some sentences.
