I see.
Also, just to nitpick a little, it's not the OS guessing the file's encoding, it's the file tool.
Thank you for the delousifying reference at file. I pulled out what I thought was relevant. I've "known" this before, but if you get behind on reading, things change:
file tests each argument in an attempt to classify it. There are three sets of tests, performed in this order: filesystem tests, magic tests, and language tests. The first test that succeeds causes the file type to be printed.
The filesystem tests are based on examining the return from a stat(2) system call. The program checks to see if the file is empty, or if it's some sort of special file. Any known file types appropriate to the system you are running on (sockets, symbolic links, or named pipes (FIFOs) on those systems that implement them) are intuited if they are defined in the system header file <sys/stat.h>.
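(An aside of mine, not man page text.) To make the stat(2) step concrete, here is a minimal Python sketch of that kind of filesystem test. The function name `filesystem_test` and the labels it returns are my own inventions for illustration; this is not how file itself is implemented, only the same idea expressed with Python's `os` and `stat` modules.

```python
import os
import stat

def filesystem_test(path):
    """Classify a path from lstat() alone; return None if the content must be read."""
    st = os.lstat(path)  # lstat, so a symlink is reported rather than followed
    mode = st.st_mode
    if stat.S_ISLNK(mode):
        return "symbolic link"
    if stat.S_ISDIR(mode):
        return "directory"
    if stat.S_ISFIFO(mode):
        return "fifo (named pipe)"
    if stat.S_ISSOCK(mode):
        return "socket"
    if stat.S_ISCHR(mode):
        return "character special"
    if stat.S_ISBLK(mode):
        return "block special"
    if stat.S_ISREG(mode) and st.st_size == 0:
        return "empty"
    return None  # regular non-empty file: fall through to the content tests
```

A `None` result here just means the path is an ordinary non-empty file, so the later tests would have to look inside it.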
If a file does not match any of the entries in the magic file, it is examined to see if it seems to be a text file. ASCII, ISO-8859-x, non-ISO 8-bit extended-ASCII character sets (such as those used on Macintosh and IBM PC systems), UTF-8-encoded Unicode, UTF-16-encoded Unicode, and EBCDIC character sets can be distinguished by the different ranges and sequences of bytes that constitute printable text in each set. If a file passes any of these tests, its character set is reported. ASCII, ISO-8859-x, UTF-8, and extended-ASCII files are identified as `text' because they will be mostly readable on nearly any terminal; UTF-16 and EBCDIC are only `character data' because, while they contain text, it is text that will require translation before it can be read. In addition, file will attempt to determine other characteristics of text-type files. If the lines of a file are terminated by CR, CRLF, or NEL, instead of the Unix-standard LF, this will be reported. Files that contain embedded escape sequences or overstriking will also be identified.
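(Another aside of mine, not man page text.) The "ranges and sequences of bytes" idea can be sketched in a few lines of Python. This is a deliberately simplified approximation, not file's actual algorithm: the function name `guess_charset` and its labels are hypothetical, it skips EBCDIC and the control-character checks entirely, and it only recognizes UTF-16 when a byte order mark is present.

```python
def guess_charset(data: bytes) -> str:
    """Rough guess at a character set from byte ranges (simplified sketch)."""
    if not data:
        return "empty"
    # A byte order mark is the easy way to spot UTF-16.
    if data.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "UTF-16 (BOM found)"
    # Every byte below 0x80: plain 7-bit ASCII.
    if all(b < 0x80 for b in data):
        return "ASCII"
    # High bytes that form valid multi-byte sequences: UTF-8.
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        pass
    # ISO-8859-x leaves 0x80-0x9F for control codes, so printable text
    # normally avoids that range (assumption: NEL, 0x85, is ignored here).
    if all(not (0x80 <= b <= 0x9F) for b in data):
        return "ISO-8859"
    return "non-ISO extended-ASCII"
```

Note that KOI8-R or CP1251 Cyrillic would come back as "ISO-8859" here, because a byte-range test can tell an 8-bit family apart from UTF-8 but cannot tell which 8-bit code page was actually meant.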
Once file has determined the character set used in a text-type file, it will attempt to determine in what language the file is written. The language tests look for particular strings (cf. <names.h> ) that can appear anywhere in the first few blocks of a file. For example, the keyword .br indicates that the file is most likely a troff(1) input file, just as the keyword struct indicates a C program. These tests are less reliable than the previous two groups, so they are performed last. The language test routines also test for some miscellany (such as tar(1) archives).
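(One last aside of mine, not man page text.) The language test amounts to a keyword lookup over the start of the file. A toy Python version, using only the two keywords the man page mentions, might look like the following. The table `KEYWORD_HINTS`, the function `guess_language`, and the 4096-character window standing in for "the first few blocks" are all assumptions of mine, not values taken from <names.h>.

```python
# Hypothetical keyword table; only the two examples from the man page.
KEYWORD_HINTS = [
    (".br", "troff input"),
    ("struct", "C program"),
]

def guess_language(text, window=4096):
    """Return a language hint if a known keyword appears near the start of text."""
    head = text[:window]  # assumed stand-in for "the first few blocks"
    for keyword, label in KEYWORD_HINTS:
        if keyword in head:
            return label
    return None  # no keyword matched
```

A bare substring match like this misfires easily (any prose containing the word "struct" would be called C), which fits the man page's remark that these tests are the least reliable and run last.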
I do now. Life is like a box of chocolates with pre tags for this particular Forrest Gump. The engine that parses the XML is gonna look at [ ] and create a hyperlink, isn't it? I think I'm gonna go back to code tags, even when the content has Cyrillic. That way others can make a clean download without having to copy and paste off the screen.