in reply to Re^2: Arabic Encodding Problem
in thread Arabic Encodding Problem
If there are a bunch of HTML files like this (and also a bunch that really are utf8), and you don't want to waste too much time sorting them out, you can add a subroutine like this to your program:
(update: removed hyphen from "cp1256")use Encode; sub check_encoding { my ( $inp_name ) = @_; open( my $fh, '<:raw', $inp_name ) or return "$inp_name: open fail +ed: $!"; my $str = ''; until ( $str =~ /[^[:ascii:]]/ ) { $str = <$fh>; } if ( $str =~ /^[[:ascii:]]+$/ ) { return "ascii"; } eval { $_ = decode( 'utf8', $str, Encode::FB_CROAK ) }; if ( $@ ) { return "cp1256"; # We assume Arabic only, so if not utf8, the +n cp1256 } else { return "utf8"; } }
Call that subroutine for each file name, and it will return the string that you should use for the encoding spec when you open the file for parsing. If you handle data for any language other than Arabic, and encounter the same problem, you'll need to tweak this to return some other non-unicode encoding, depending on the language.
You'll want to read the man page for Encode, especially the part about "Handling Malformed Data".
|
|---|