But more important was my other half sentence: Would it be feasible to build a list of researchers' names (or some other type of ID) and their preferred encodings? Or did most of them author only one or two records?

On the basis of the very small sample I've seen, there are no authorship identifiers, individual or institutional. The only semi-consistent things are the species names, which are in Latin (the language) and mostly in the Latin-1 encoding; but:

  1. They appear in comment cards that are freeform and also contain 8-bit chars that represent different code pages depending on where they originate.
  2. Often the species names are abbreviated.
  3. At least 2 of the small sample also used 8-bit chars in the species name; specifically, the ligature that combines a & e into a single char.
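To make the code-page problem concrete, here is a minimal sketch of the usual fallback approach. The helper name decode_card, the sample genus spelling, and the choice of Latin-1 as the default fallback are all my assumptions, not anything from the real data. It treats a card as UTF-8 if it decodes cleanly and otherwise uses whatever legacy code page the caller names; it cannot tell one 8-bit code page from another, which is exactly why a per-source list of encodings would be needed.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn the raw bytes of a comment card into a Perl
# character string.
# Assumption: bytes that are valid strict UTF-8 really are UTF-8; anything
# else is decoded with the caller-supplied legacy code page (Latin-1 if none).
sub decode_card {
    my ($bytes, $fallback) = @_;
    $fallback //= 'iso-8859-1';
    my $text = eval {
        # FB_CROAK dies on malformed input; LEAVE_SRC keeps $bytes intact.
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return defined $text ? $text : decode($fallback, $bytes);
}

# Made-up sample: the same name with the a+e ligature in three encodings.
my @samples = (
    [ "C\xC3\xA6salpinia", undef   ],   # UTF-8: two bytes for the ligature
    [ "C\xE6salpinia",     undef   ],   # Latin-1: byte 0xE6
    [ "C\x91salpinia",     'cp437' ],   # CP437: byte 0x91
);

for my $sample (@samples) {
    my ($bytes, $fallback) = @$sample;
    my $text = decode_card($bytes, $fallback);
    # Print the code point of the second character rather than the character
    # itself, so the output does not depend on the terminal's encoding.
    printf "%s -> U+%04X\n", $fallback // 'auto', ord substr($text, 1, 1);
}
```

Note that the CP437 case only works because the caller says it is CP437; decoded as Latin-1 the same byte would silently become a different (control) character.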

There are many, many of these files. The comment cards are easy to locate and extract, and the desire is to build a single index to them all, legacy and new; but the institute commissioning the work has neither the skills nor the funding to pay people with the appropriate skills (languages and science) to inspect and translate/convert them in order to unify them.

They were hoping to throw the problem at a (cheap) computer program and have it magically fix things. Like many in research, they've heard of AI but don't have any appreciation of what's really involved.

I quite literally had no idea what would happen if I threw a bunch of non-Unicode & Unicode strings at Perl's sort. I half hoped that it might do something sensible with the mix; hence I asked my question.
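For what it's worth, a small sketch of what the default sort does with such a mix, as far as I understand it: it compares code point by code point, and a string that was never decoded has each of its bytes treated as a code point in the range 0-255 (that is, as if it were Latin-1). The sample strings and the show helper are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same made-up name, three ways.
my $raw_utf8   = "C\xC3\xA6sia";                # undecoded UTF-8 bytes
my $raw_latin1 = "C\xE6sia";                    # undecoded Latin-1 bytes
my $decoded    = decode('UTF-8', $raw_utf8);    # real character string

# Hypothetical helper: render a string as a list of code points.
sub show { return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $_[0] }

# Undecoded Latin-1 happens to compare equal to the decoded string,
# because its bytes already are the right code points...
print "latin1 vs decoded: ", ($raw_latin1 eq $decoded ? "equal" : "differ"), "\n";

# ...but undecoded UTF-8 does not: its two bytes count as two characters.
print "utf8   vs decoded: ", ($raw_utf8 eq $decoded ? "equal" : "differ"), "\n";

# Default sort: plain code point order, so "Cz" sorts ahead of both
# spellings with the ligature, and the mis-decoded one lands in between.
my @sorted = sort ($decoded, $raw_utf8, "Cz");
print show($_), "\n" for @sorted;
```

So the mix does not blow up, but nothing sensible happens either: bytes from any other code page are compared as if they were Latin-1, and the order is by code point rather than alphabetical. For a linguistically sensible order the core module Unicode::Collate is the usual tool, but only once every string has been correctly decoded.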

Personally, I've reached the point in my career where I am able to choose what work I take on; and this is simply not something I can be bothered with.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re^14: Mixed Unicode and ANSI string comparisons? by BrowserUk
in thread Mixed Unicode and ANSI string comparisons? by BrowserUk
