if there are no bytes with the 8th bit set
    then there's no problem -- nevermind
else if ( any bytes match /[\xc0\xc1\xc4-\xff]/,
          or an odd number of bytes match /[\x80-\xff]/ )
    then it must be Latin1
else
    make a copy
    delete everything that could be utf8 forms of Latin1 characters:
        s/\xc2[\xa0-\xbf]|\xc3[\x80-\xbf]//g;
    if this removes all bytes with 8th-bit set,
        then the original data is almost certainly utf8
    else
        the original data is definitely Latin1
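
For anyone who wants to try the heuristic above outside Perl, here is a minimal Python sketch of the same steps. It is not the original author's code: the function name `guess_encoding` and the string labels it returns are made up for illustration, and the three patterns are direct transcriptions of the regexes above into Python `bytes` patterns.

```python
import re

# Lead bytes that never occur when U+0080..U+00FF is encoded as UTF-8
# (only 0xC2 and 0xC3 are valid lead bytes for that range).
IMPOSSIBLE_IN_UTF8_LATIN1 = re.compile(rb"[\xc0\xc1\xc4-\xff]")

# Any byte with the 8th bit set.
HIGH_BIT = re.compile(rb"[\x80-\xff]")

# Two-byte UTF-8 forms of the printable Latin1 characters U+00A0..U+00FF.
UTF8_FORMS_OF_LATIN1 = re.compile(rb"\xc2[\xa0-\xbf]|\xc3[\x80-\xbf]")


def guess_encoding(data: bytes) -> str:
    """Return 'ascii', 'latin1', or 'utf8' (hypothetical labels)."""
    high_bytes = HIGH_BIT.findall(data)

    # No 8th-bit bytes: plain ASCII, nothing to decide.
    if not high_bytes:
        return "ascii"

    # An impossible lead byte, or an odd count of high bytes, rules out
    # UTF-8-encoded Latin1 (which always uses exactly two high bytes
    # per non-ASCII character).
    if IMPOSSIBLE_IN_UTF8_LATIN1.search(data) or len(high_bytes) % 2 == 1:
        return "latin1"

    # Work on a copy: strip everything that could be a UTF-8 form of a
    # Latin1 character, then see whether any high bytes are left over.
    stripped = UTF8_FORMS_OF_LATIN1.sub(b"", data)
    if not HIGH_BIT.search(stripped):
        return "utf8"  # "almost certainly", per the heuristic
    return "latin1"


if __name__ == "__main__":
    print(guess_encoding(b"plain text"))              # ascii
    print(guess_encoding("café".encode("utf-8")))     # utf8
    print(guess_encoding("café".encode("latin-1")))   # latin1
```

Like the pseudocode, this only distinguishes Latin1 from UTF-8 that carries Latin1-range characters; UTF-8 text with characters beyond U+00FF will hit the first Latin1 branch and be reported as `latin1`.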