The first module covers 10 European languages; the small sample I saw, however, contained Cyrillic, Arabic, Urdu, and what I think (but can't swear to) were Korean and Japanese.
The second appears to be completely undocumented, but given its author, I'm guessing it is designed to try to determine which of the multitude of Unicrap encodings a file contains, rather than anything to do with ISO-8859-x stuff.
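For a sense of what that style of guessing looks like in practice, here is a minimal sketch using the stock Encode::Guess that ships with Encode (not necessarily the module in question; the filename and the suspect list are made up):

    use strict;
    use warnings;
    use Encode;
    use Encode::Guess;                 # bundled with Encode

    # Hypothetical file of raw octets
    my $octets = do {
        open my $fh, '<:raw', 'record.dat' or die $!;
        local $/; <$fh>;
    };

    # Guess among a short suspect list. Caveat: this works best for the
    # UTF-* families and multi-byte CJK encodings; almost any octet
    # sequence is "valid" ISO-8859-x, so single-byte code pages usually
    # come back as ambiguous.
    my $enc = guess_encoding( $octets, qw/cp1251 cp1256 iso-8859-1/ );

    if ( ref $enc ) {
        my $chars = $enc->decode( $octets );
        print 'guessed ', $enc->name, "\n";
    }
    else {
        warn "no unambiguous guess: $enc\n";   # $enc is a diagnostic string here
    }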
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
In the absence of evidence, opinion is indistinguishable from prejudice.
The missing comments and description, together with the author's well-known name, gave me pause, so I took a (short) look at the source.
It seems to try to distinguish several ISO-8859-x variants and code pages, and that seemed relevant enough to the problem at hand; otherwise I wouldn't have mentioned it.
But more important was the other half of my sentence: would it be feasible to build a list of researchers' names (or some other type of ID) and their preferred encodings? Or did most of them author only one or two records?
"But more important was the other half of my sentence: would it be feasible to build a list of researchers' names (or some other type of ID) and their preferred encodings? Or did most of them author only one or two records?"
On the basis of the very small sample I've seen, there are no authorship identifiers -- individual or institutional. The only semi-consistent thing is the species names in Latin (the language), mostly in Latin-1 encoding; but:
- They appear in comment cards that are free-form and also contain 8-bit chars whose meaning depends on the code page of wherever they originated (see the short demo after this list).
- Often the species names are abbreviated.
- At least 2 of the small sample also used 8-bit chars in the species name itself; specifically, the ligature that combines 'a' and 'e' into a single character (æ).
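The demo mentioned above -- the byte value is just an example, not taken from the actual files -- showing how one octet turns into different characters depending on the code page assumed:

    use strict;
    use warnings;
    use Encode qw(decode);

    binmode STDOUT, ':encoding(UTF-8)';

    my $byte = "\xE6";    # a single 8-bit char as it might appear in a comment card

    for my $cp ( 'iso-8859-1', 'cp1251' ) {
        my $char = decode( $cp, $byte );
        printf "%-10s U+%04X %s\n", $cp, ord $char, $char;
    }
    # iso-8859-1 gives U+00E6 (the a+e ligature); cp1251 gives U+0436 (Cyrillic zhe)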
There are many, many of these files. The comment cards are easy to locate and extract, and the desire is to build a single index to them all, legacy and new; but the institute commissioning the work has neither the skills, nor the funding to pay people with the appropriate skills (languages and science), to inspect and translate/convert them in order to unify them.
They were hoping to throw the problem at a (cheap) computer program and have it magically go away. Like many in research, they've heard of AI but have no appreciation of what's really involved.
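A rough sketch of the kind of indexing pass that implies, assuming the comment cards have already been extracted one per line per file; the try-UTF-8-then-fall-back-to-Latin-1 policy is my own guess at a pragmatic default, not anything they specified:

    use strict;
    use warnings;
    use Encode qw(decode FB_CROAK);

    binmode STDOUT, ':encoding(UTF-8)';

    my %index;    # decoded card text => list of source files

    for my $file ( @ARGV ) {
        open my $fh, '<:raw', $file or die "$file: $!";
        while ( my $card = <$fh> ) {
            chomp $card;

            # Take valid UTF-8 at face value; otherwise assume Latin-1.
            # Wrong for CP1251/CP1256 cards, but deterministic and reversible.
            my $copy = $card;
            my $text = eval { decode( 'UTF-8', $copy, FB_CROAK ) }
                     // decode( 'iso-8859-1', $card );

            push @{ $index{ $text } }, $file;   # real key extraction (species name) left out
        }
    }

    print "$_\t@{ $index{$_} }\n" for sort keys %index;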
I quite literally had no idea what would happen if I threw a bunch of non-Unicode and Unicode strings at Perl's sort. I half hoped it might do something sensible with the mix; hence my question.
Personally, I've reached the point in my career where I am able to choose what work I take on; and this is simply not something I can be bothered with.
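For the record, a quick way to see what does happen (made-up data; the behaviour it relies on -- byte strings being upgraded as Latin-1 when compared against character strings -- is standard cmp semantics):

    use strict;
    use warnings;
    use Encode qw(decode);

    # The same Cyrillic word as raw CP1251 octets and as decoded characters,
    # plus a decoded Latin-1 string containing the a+e ligature.
    my $bytes = "\xE6\xF3\xEA";                      # undecoded octets
    my $chars = decode( 'cp1251',     $bytes );      # "\x{436}\x{443}\x{43a}"
    my $latin = decode( 'iso-8859-1', "C\xE6sar" );  # "C\x{e6}sar"

    # sort uses cmp, which compares codepoints; when a byte string meets a
    # character string it is upgraded as if it were Latin-1, so $bytes is
    # ordered as "\x{e6}\x{f3}\x{ea}", not as the Cyrillic word it really is.
    my @sorted = sort $bytes, $chars, $latin;

    printf "%vd\n", $_ for @sorted;   # codepoints only, so no wide-char warnings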
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
In the absence of evidence, opinion is indistinguishable from prejudice.