in reply to What's the best way to detect character encodings, Windows-1252 v. UTF-8?

I agree with bartmoritz. Due to some properties of UTF-8, it's very unlikely that cp1252-encoded text would be valid UTF-8*.

    use Encode qw( decode );

    my $bytes = '...';
    my $txt;
    if (!eval {
        $txt = decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1  # No exception
    }) {
        $txt = decode('Windows-1252', $bytes);
    }

* — Unless the encoded text contains no bytes above 0x7F, in which case it doesn't matter if you treat it as Windows-1252 or UTF-8.
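A minimal, self-contained sketch of that fallback as a helper sub (the sub name `decode_guess` and the sample strings are my own, not from the thread):

    use strict;
    use warnings;
    use Encode qw( decode );

    # Try UTF-8 first; if the bytes are malformed UTF-8, fall back
    # to Windows-1252, which can decode any byte sequence.
    sub decode_guess {
        my ($bytes) = @_;
        my $txt;
        if (!eval {
            $txt = decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
            1;  # no exception
        }) {
            $txt = decode('Windows-1252', $bytes);
        }
        return $txt;
    }

    # "café" encoded two ways:
    my $utf8_bytes   = "caf\xC3\xA9";  # valid UTF-8 for "café"
    my $cp1252_bytes = "caf\xE9";      # lone 0xE9 is malformed UTF-8

    # Both decode to the same four-character string "café":
    # the first via UTF-8, the second via the Windows-1252 fallback.
    my $a = decode_guess($utf8_bytes);
    my $b = decode_guess($cp1252_bytes);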

Re^2: What's the best way to detect character encodings, Windows-1252 v. UTF-8? (Areas of confusion)
by ikegami (Patriarch) on Jun 17, 2011 at 15:53 UTC

    That code would only guess wrong if all of the following are true:

    • The text is encoded using Windows-1252 (or iso-8859-1),
    • At least one of [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷] is present,
    • All instances of [ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞß] are always followed by exactly one of [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
    • All instances of [àáâãäåæçèéêëìíîï] are always followed by exactly two of [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
    • All instances of [ðñòóôõö÷] are always followed by exactly three of [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
    • None of [øùúûüýþÿ] are present, and
    • None of [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿] are present except where previously mentioned.

    In other words, that code is very reliable.
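    For concreteness, here is the smallest pathological case I can construct (my own example, not from the thread): cp1252 "Ã©" is the byte pair 0xC3 0xA9, which is also a valid UTF-8 encoding of "é", so the UTF-8 attempt succeeds and the fallback never fires.

        use strict;
        use warnings;
        use Encode qw( decode );

        # 0xC3 ("Ã" in cp1252) followed by 0xA9 ("©" in cp1252)
        # happens to be a well-formed UTF-8 sequence for U+00E9 ("é").
        my $bytes = "\xC3\xA9";
        my $txt = eval {
            decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
        };
        # $txt is defined and equals "é": the guess picks UTF-8,
        # which is wrong if the author really meant cp1252 "Ã©".

    Real cp1252 text almost never pairs up its high bytes this way, which is why the heuristic works so well in practice.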

Re^2: What's the best way to detect character encodings, Windows-1252 v. UTF-8?
by Jim (Curate) on Jun 17, 2011 at 16:10 UTC
    my $bytes = '...';

    How do I ensure that $bytes are bytes, not characters? I'm on Microsoft Windows and the text files are in the DOS format (i.e., CR-LF newlines). In other words, what I/O layer must I use? '<:raw'?

    Jim

      open(my $fh, '<:raw:perlio', $qfn)

      and

      open(my $fh, '<', $qfn);
      binmode($fh);

      would do, but then you'd have to do the CRLF translation yourself.

      open(my $fh, '<', $qfn)

      will actually work and properly do the CRLF translation (unless you set some default layers somewhere), despite decoding and CRLF translation being done in the wrong order. Note that

      open(my $fh, '<:encoding(UTF-8)', $qfn)

      also decodes and does CRLF translation in the wrong order. That's why

      open(my $fh, '<:encoding(UTF-16le)', $qfn)

      doesn't work on Windows (of all places!).
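      One way to sidestep the ordering problem entirely is to read raw bytes and do both steps by hand, decoding first and translating CRLF second. A sketch under that approach (`read_text` is my own helper name, not from the thread):

          use strict;
          use warnings;
          use Encode qw( decode );

          # Read a file's raw bytes, decode them, then translate
          # CRLF to LF — i.e. the two steps in the right order.
          sub read_text {
              my ($qfn, $enc) = @_;
              open(my $fh, '<:raw', $qfn)
                  or die "Can't open $qfn: $!";
              my $bytes = do { local $/; <$fh> };
              close($fh);
              my $txt = decode($enc, $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
              $txt =~ s/\r\n/\n/g;   # CRLF translation after decoding
              return $txt;
          }

      Because the translation happens on decoded characters, this works even for UTF-16le input, where the CR and LF code units are not single bytes.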

        So I think you're saying I should do the simplest thing and just open the files without specifying any I/O layer. In this case, Perl will do what I want. It will slurp the bytes of the file into a variable that it understands contains bytes, not characters, and it will also do what I want it to do with newlines, which is effectively to pass them through unmolested.

        What does '<:raw:perlio' do, exactly?

        Jim

Re^2: What's the best way to detect character encodings, Windows-1252 v. UTF-8?
by Jim (Curate) on Jun 17, 2011 at 15:56 UTC

    Thank you very much, ikegami.

    Unless it's valid US-ASCII, in which case it doesn't matter if you use Windows-1252 or UTF-8.

    Yep. Any purely ASCII text files will simply get a UTF-8 byte order mark prefixed to them, forcing them into Unicode goodness.

    EBCDIC text files will be blown to smithereens. In the context of what I'm doing, I don't care.

    Jim

      • A purely US-ASCII text file cannot contain a Unicode BOM.
      • BOMs don't force Unicode goodness, whatever that means.
      • I don't know why you bring up EBCDIC. You said only Windows-1252 and UTF-8 are possible.

      I changed the wording of the text you quoted in the hopes of being clearer.

        Uh, I was writing whimsically and lightheartedly. (My goodness, you can find fault and contention in the most innocuous and innocent places, ikegami.)

        I know an ASCII text file cannot contain a Unicode BOM. The whole point of what I'm doing is to convert all the text files to Unicode if they aren't Unicode already. A purely ASCII text file is also a Unicode text file, just as it is also a text file in almost all other character encodings (but not EBCDIC, for example). So I'm going to add a BOM to all purely ASCII text files to make them not purely ASCII text files anymore. I'm doing this because, for better or worse, the world is now full of software that requires Unicode and is insistent that the Unicode-ness be unequivocal (i.e., that the text includes a BOM).
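        For what it's worth, the BOM-prepending step can be done in place with a few lines of raw I/O. A sketch, assuming files small enough to slurp (`add_utf8_bom` is my own helper name):

            use strict;
            use warnings;

            # Prepend a UTF-8 BOM (EF BB BF) to a file that lacks one.
            sub add_utf8_bom {
                my ($qfn) = @_;
                open(my $in, '<:raw', $qfn) or die "Can't read $qfn: $!";
                my $bytes = do { local $/; <$in> };
                close($in);
                return if $bytes =~ /^\xEF\xBB\xBF/;   # already has a BOM
                open(my $out, '>:raw', $qfn) or die "Can't write $qfn: $!";
                print $out "\xEF\xBB\xBF", $bytes;
                close($out);
            }

        The early return makes it safe to run over the same tree twice without doubling up BOMs.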

        I mentioned EBCDIC as a lark. Smile, would ya! :-)

        Thank you again for your help.

        Jim