The non-ASCII content in that HTML data is also non-UTF8. Treating it as CP-1256 will probably yield suitable results.

If there are a bunch of HTML files like this (and also a bunch that really are utf8), and you don't want to waste too much time sorting them out, you can add a subroutine like this to your program:

use Encode; sub check_encoding { my ( $inp_name ) = @_; open( my $fh, '<:raw', $inp_name ) or return "$inp_name: open fail +ed: $!"; my $str = ''; until ( $str =~ /[^[:ascii:]]/ ) { $str = <$fh>; } if ( $str =~ /^[[:ascii:]]+$/ ) { return "ascii"; } eval { $_ = decode( 'utf8', $str, Encode::FB_CROAK ) }; if ( $@ ) { return "cp1256"; # We assume Arabic only, so if not utf8, the +n cp1256 } else { return "utf8"; } }
(update: removed hyphen from "cp1256")

Call that subroutine for each file name, and it will return the string that you should use for the encoding spec when you open the file for parsing. If you handle data for any language other than Arabic, and encounter the same problem, you'll need to tweak this to return some other non-unicode encoding, depending on the language.

You'll want to read the man page for Encode, especially the part about "Handling Malformed Data".


In reply to Re^3: Arabic Encodding Problem by graff
in thread Arabic Encodding Problem by malak

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.