in reply to Guess between UTF8 and Latin1/ISO-8859-1

I think your situation is actually a lot simpler than others here would have it. If it's really true that you are only dealing with characters in the "Latin1" range (you seem pretty confident about that), and if the only point of uncertainty about your data is whether it's utf8 or iso-8859-1 (that is, you really don't need to worry about any other encoding that might be using the upper half of the byte table), then you just need to test a particular set of conditions using byte semantics.

The conditions can be stated in pseudo-code as follows:

    if there are no bytes with the 8th bit set
        then there's no problem -- nevermind
    else if ( any bytes match /[\xc0\xc1\xc4-\xff]/,
              or an odd number of bytes match /[\x80-\xff]/ )
        then it must be Latin1
    else
        make a copy
        delete everything that could be utf8 forms of Latin1 characters:
            s/\xc2[\xa0-\xbf]|\xc3[\x80-\xbf]//g;
        if this removes all bytes with 8th-bit set
            then the original data is almost certainly utf8
        else
            the original data is definitely Latin1
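
For concreteness, here's how that tree might look as actual Perl (the sub name and its return values are my own invention; it operates on raw, undecoded bytes):

    use strict;
    use warnings;

    sub guess_latin1_or_utf8 {
        my ($bytes) = @_;

        # no 8th-bit-set bytes: plain ASCII either way, nothing to decide
        return 'ascii' unless $bytes =~ /[\x80-\xff]/;

        # bytes that can't occur in a utf8 rendering of Latin1 text,
        # or an odd count of high bytes, rule utf8 out
        my $high = () = $bytes =~ /[\x80-\xff]/g;
        return 'latin1' if $bytes =~ /[\xc0\xc1\xc4-\xff]/ or $high % 2;

        # strip every byte pair that utf8 would use for a printable
        # Latin1 character; if no high bytes survive, it was utf8
        (my $copy = $bytes) =~ s/\xc2[\xa0-\xbf]|\xc3[\x80-\xbf]//g;
        return $copy =~ /[\x80-\xff]/ ? 'latin1' : 'utf8';
    }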
Now, any of your assurances (assumptions?) might happen to be wrong: there could be "noise" in the data, causing a few non-ASCII values to appear "unintentionally"; Latin1 might not be the only single-byte encoding in use; or utf8 encoding might be in use and the data might include some unicode characters that are outside the Latin1 range. (I've seen that last case rather often, where Word or some equally clever app uses stuff in the U+2000 range for "recommended forms" of certain punctuation marks -- why these are recommended escapes me at the moment.) If any of that could be true for your data, then this simple decision tree could be misleading.

(That last contingency, finding utf8 code points that don't map to Latin1, could be handled if you apply bart's more broadly scoped means for detecting things that look like utf8.)
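
I won't reproduce bart's code here, but a test in that spirit -- matching only well-formed utf8 byte sequences across the whole string -- might look like this (my own rendering of the standard utf8 byte patterns, not bart's exact regex):

    my $looks_like_utf8 = ($bytes =~ /\A(?:
          [\x00-\x7f]                        # ASCII
        | [\xc2-\xdf][\x80-\xbf]             # 2-byte forms
        | \xe0[\xa0-\xbf][\x80-\xbf]         # 3-byte forms, no overlongs
        | [\xe1-\xec\xee\xef][\x80-\xbf]{2}
        | \xed[\x80-\x9f][\x80-\xbf]         # no surrogates
        | \xf0[\x90-\xbf][\x80-\xbf]{2}      # 4-byte forms
        | [\xf1-\xf3][\x80-\xbf]{3}
        | \xf4[\x80-\x8f][\x80-\xbf]{2}
    )*\z/x);

That would correctly accept, for instance, the bytes "\xe2\x80\x9c" of a utf8 "smart quote" (U+201C), which the narrower decision tree above would misread as Latin1.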

Update: I adjusted the regex for matching things that look like utf8 renderings of Latin1 characters -- it used to be /[\xc2\xc3][\x80-\xbf]/, which was a bit broader than it needed to be for the situation described in the OP. In utf8, the byte pairs "\xc2\x80" thru "\xc2\x9f" would map to "\x80" thru "\x9f" in Latin1, which do not represent any printable characters. (This fact alone might motivate a check such as

    if any bytes match /[\x80-\x9f]/
        then it's pretty sure not to be Latin1
but again, whether this would be enough to conclude that it must be utf8 is just a matter of how much you trust your data, and your knowledge of it.)
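
To see that mapping concretely (a throwaway check, not production code):

    use Encode qw(decode);

    # the utf8 byte pair \xc2\x9f decodes to the single code point
    # U+009F -- a C1 control, which has no printable Latin1 glyph
    printf "U+%04X\n", ord decode('utf8', "\xc2\x9f");   # prints U+009F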

One more update: while those byte-level tests are kinda neat, I think I would end up preferring a simpler, two-step approach (which I think someone else must have mentioned by now):

eval "\$_ = decode('utf8',\$orig_data,Encode::FB_CROAK)"; if ($@) { # it's not utf8, and so must be iso-8859-1 }