Re^3: Arabic Encodding Problem

The non-ASCII content in that HTML data is also non-UTF8. Treating it as CP-1256 will probably yield suitable results.

If there are a bunch of HTML files like this (and also a bunch that really are utf8), and you don't want to waste too much time sorting them out, you can add a subroutine like this to your program:

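use strict;
use warnings;
use Encode;

sub check_encoding
{
    my ( $inp_name ) = @_;
    open( my $fh, '<:raw', $inp_name ) or return "$inp_name: open failed: $!";
    # Find the first line with a non-ASCII byte; stop cleanly at end of file
    my $str;
    while ( defined( my $line = <$fh> ) ) {
        if ( $line =~ /[^[:ascii:]]/ ) {
            $str = $line;
            last;
        }
    }
    close $fh;
    return "ascii" unless defined $str;  # no non-ASCII bytes anywhere
    # decode() dies under FB_CROAK if the bytes are not well-formed utf8
    eval { decode( 'utf8', $str, Encode::FB_CROAK ) };
    if ( $@ ) {
        return "cp1256";  # We assume Arabic only, so if not utf8, then cp1256
    }
    else {
        return "utf8";
    }
}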

(update: removed hyphen from "cp1256")

Call that subroutine for each file name, and it will return the string that you should use for the encoding spec when you open the file for parsing. (If the file can't be opened, it returns an error message instead, so check for that before using the result.) If you handle data for any language other than Arabic, and encounter the same problem, you'll need to tweak this to return some other non-Unicode encoding, depending on the language.
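For what it's worth, here is a rough sketch of how the returned string might be plugged into open(). The read_decoded name is made up for illustration, and it assumes the only useful return values are "ascii", "utf8" and "cp1256", treating anything else as the "open failed" message:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original post): slurp a file as
# decoded text, given the encoding name returned by check_encoding().
sub read_decoded
{
    my ( $file, $enc ) = @_;
    # Anything other than the three known names is assumed to be an error string
    die "unusable encoding result: $enc\n"
        unless $enc =~ /^(?:ascii|utf8|cp1256)$/;
    # Use the strict UTF-8 layer for "utf8"; the other names go to :encoding() as-is
    my $layer = $enc eq 'utf8' ? ':encoding(UTF-8)' : ":encoding($enc)";
    open( my $fh, "<$layer", $file ) or die "$file: open failed: $!\n";
    local $/;    # slurp mode: read the whole file in one go
    my $text = <$fh>;
    close $fh;
    return $text;
}

# Usage, assuming check_encoding() is defined in the same program:
# for my $file ( @ARGV ) {
#     my $html = read_decoded( $file, check_encoding( $file ) );
#     # hand $html to your parser here
# }
```

Dying on an unexpected value is just one choice; you could equally skip the file and log the message.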

You'll want to read the man page for Encode, especially the part about "Handling Malformed Data".
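As a small illustration of why the check mode matters (a sketch only; the two sample bytes are my own choice, being alef and beh in cp1256): with the default check, decode() quietly substitutes U+FFFD for malformed input, whereas FB_CROAK makes it die, which is what lets you tell the two encodings apart.

```perl
use strict;
use warnings;
use Encode;

# Two bytes that are Arabic letters in cp1256 but malformed as UTF-8
my $bytes = "\xC7\xC8";

# Default check: no error, malformed bytes turn into U+FFFD
my $lossy = decode( 'UTF-8', $bytes );

# FB_CROAK: decode() dies on malformed input, so the eval returns false.
# LEAVE_SRC keeps $bytes intact so it can be decoded again below.
my $is_utf8 = eval {
    decode( 'UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC );
    1;
};

# Decoding the same bytes as cp1256 gives the intended characters
my $arabic = decode( 'cp1256', $bytes );

print "valid UTF-8? ", ( $is_utf8 ? "yes" : "no" ), "\n";
```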
