in reply to Re^9: Mixed Unicode and ANSI string comparisons?
in thread Mixed Unicode and ANSI string comparisons?

I don't see what decoding has to do with translating from one language to another.

The data is. They are free-form descriptions produced by researchers from many countries. Parts of most of them will be in Latin (the language, not the encoding); parts will be in the researchers' own languages.

It's not a case of "translating from one language to another"; it is a matter of having someone who understands what is in the file, so that you can decide how to decode it. The files go back decades; researchers move on. The data continues to exist.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^11: Mixed Unicode and ANSI string comparisons?
by soonix (Chancellor) on Dec 15, 2015 at 12:05 UTC

      The first module covers 10 European languages; the small sample I saw contained Cyrillic, Arabic, Urdu, and what I think (but can't swear to) were Korean and Japanese.

      The second appears to be completely undocumented, but given its author, I'm guessing it is designed to try to determine which of the multitude of Unicrap encodings a file contains, rather than anything to do with ISO-8859-x stuff.



        The missing comments and description, together with the author's well-known name, made me stop short, so I had a (short) look at the source.
        It seems to try to distinguish several ISO-8859-x variants and codepages, and that seemed relevant enough to the problem at hand. Otherwise I wouldn't have mentioned it.

        But more important was my other half-sentence: would it be feasible to build a list of researchers' names (or some other type of ID) and their preferred encodings? Or did most of them author only one or two records?
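
        To make the lookup idea concrete, here is a minimal sketch in Python (the same shape would work in Perl with the Encode module). Everything in it is an assumption for illustration: the researcher IDs, the encodings attached to them, the `decode_record` helper, and the choice of Latin-1 as a last-resort fallback. It only shows the mechanics of "look up the author, try their usual encoding, fall back if that fails"; it does not solve the harder problem of records whose author is unknown.

```python
# Hypothetical table: researcher ID -> the encoding that researcher usually used.
# Names and encodings are invented for illustration only.
PREFERRED_ENCODING = {
    "ivanov": "iso8859_5",    # Cyrillic
    "al-sayed": "iso8859_6",  # Arabic
    "tanaka": "shift_jis",    # Japanese
}

# Assumed fallback: Latin-1 maps every byte to a character, so it never fails,
# but the result may be mojibake and should be flagged for human review.
FALLBACK_ENCODING = "latin-1"


def decode_record(author_id, raw):
    """Decode raw bytes using the author's preferred encoding if known.

    Returns a (text, encoding_used) pair so the caller can tell when the
    fallback was used.
    """
    encoding = PREFERRED_ENCODING.get(author_id)
    if encoding is not None:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            # The bytes are not valid in the preferred encoding;
            # fall through to the fallback below.
            pass
    return raw.decode(FALLBACK_ENCODING), FALLBACK_ENCODING
```

        Returning the encoding that was actually used makes it easy to collect the records that hit the fallback and hand just those to someone who can read them.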