But more important was my other half sentence: Would it be feasible to build a list of researchers' names (or some other type of ID) and their preferred encodings? Or did most of them author only one or two records?
On the basis of the very small sample I've seen, there are no authorship -- individual or institution -- identifiers. The only semi-consistent things are the species names, which are in Latin (the language) and mostly in Latin-1 encoding; but:
There are many, many of these files. The comment cards are easy to locate and extract, and the desire is to build a single index to them all, legacy and new; but the institute commissioning the work has neither the skills nor the funding to pay people with the appropriate skills (languages and science) to inspect and translate/convert them in order to unify them.
They were hoping to throw the problem at a (cheap) computer program and have it magically fix the problem. Like many of those in research they've heard of AI, but don't have any appreciation of what's really involved.
I quite literally had no idea what would happen if I threw a bunch of non-unicode & unicode strings at perl's sort. I half hoped that it might do something sensible with the mix; hence I asked my question.
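For illustration only, here is a minimal sketch of why a mix like this sorts badly as raw bytes, and of the usual cheap heuristic for normalising it first. It is written in Python rather than Perl, and everything in it is an assumption: the helper name `decode_guess`, the sample species names, and the premise that every record is either valid UTF-8 or Latin-1.

```python
def decode_guess(raw: bytes) -> str:
    """Decode as UTF-8 when the bytes are valid UTF-8, else assume Latin-1.

    Heuristic only: any other legacy encoding would be silently
    mis-decoded as Latin-1, because every byte is valid Latin-1.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


# Hypothetical sample: the same name stored under two encodings.
names_raw = [
    "Zea mays".encode("ascii"),
    "Isoëtes lacustris".encode("latin-1"),
    "Abies alba".encode("ascii"),
    "Isoëtes lacustris".encode("utf-8"),
]

# Sorting the raw bytes treats the two spellings as different keys.
raw_sorted = sorted(names_raw)

# Decoding first makes them compare equal (plain code point order,
# not locale-aware collation).
decoded_sorted = sorted(decode_guess(b) for b in names_raw)

for name in decoded_sorted:
    print(name)
```

The fallback works here only because Latin-1 bytes above 0x7F rarely form valid UTF-8 sequences by accident; it does nothing for records in any third encoding, and it does not address translation at all.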
Personally, I've reached the point in my career where I am able to choose what work I take on; and this is simply not something I can be bothered with.
In reply to Re^14: Mixed Unicode and ANSI string comparisons?
by BrowserUk
in thread Mixed Unicode and ANSI string comparisons?
by BrowserUk