I used to program for a dictionary company in Scotland. We had received EU money to produce a Catalan-English dictionary, but the only electronic Catalan resource we had was a word list. We had to produce a dictionary framework to match our Spanish dictionary in double-quick time, ready for our Catalan translators to turn into a finished text. But how? I'm no linguist.

I noticed that Catalan looks a bit like Spanish, but with French word endings (and if that statement doesn't get me a visit from the Catalonian death squad, nothing will). If you fiddle with the ends of the words, you get something that looks almost, but not quite, like Spanish. This goes against nearly all linguistic theory, but seems to work.

Computers don't do almost too well. While casting about for an approximate solution, I found Arizona U's agrep utility, which does approximate searching. Building a shell script around agrep to produce possible matches sort-of worked, but was painfully slow. Conveniently, CPAN librarian Jarkko Hietaniemi had just come out with a new version of String::Approx, which basically did what agrep did, but in a Perl module, and allowed a bit more control of the fuzziness parameters.

The key to approximate matching is the Levenshtein edit distance: the number of single-character insertions, deletions and substitutions needed to turn one string into another. Set a maximum distance, and anything within it counts as approximately equal. Allowing two changes per word, plus a barrage of about 10 heuristics (a fancy CompSci word for "guesses") to play with the word form, I got a match rate of roughly 70% against the Spanish dictionary text. This was good enough to save weeks of manual compilation time, and it had the neat side effect of an amusing correspondence with Jarkko while I suggested improvements to the package.

(For the curious, setting the edit distance to 1 returned too few possible matches. Setting it to 3 generated thousands of false positives.)
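
For anyone who wants to see the idea rather than take my word for it, here is a minimal sketch of edit-distance matching with a cut-off. It is written in plain Python so it runs without any modules; it is not String::Approx's API or its internal algorithm, and not the script I actually used. The function names and the three-word Spanish list are made up for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions or substitutions needed to turn a into b."""
    # previous[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b; start with the empty prefix of a.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or free match
            ))
        previous = current
    return previous[-1]


def approx_matches(word, candidates, max_edits=2):
    """Return the candidates within max_edits of word (hypothetical helper)."""
    return [c for c in candidates if edit_distance(word, c) <= max_edits]


if __name__ == "__main__":
    # Illustrative word list only, not taken from the real dictionary data.
    spanish = ["libertad", "ciudad", "noche"]
    for catalan in ["llibertat", "ciutat", "nit"]:
        for k in (1, 2):
            print(catalan, k, approx_matches(catalan, spanish, max_edits=k))
```

With a limit of 1, "llibertat" finds nothing; with 2, it finds "libertad" (drop one "l", swap the final "t" for "d"). "nit" never finds "noche" at either setting, which is the sort of word that has to be left to the translators.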

And the tablet? What has it to do with it all? Frankly, very little, apart from the fact it's superconcentrated programmer fuel, and I make it to an old family recipe. It's as good as the combination of sugar, condensed milk, butter and real vanilla can be… try some (link is PDF, alas).

Modification, 21 Oct 02002: Recipe now available in XHTML, as it should have been all along: http://www3.sympatico.ca/scruss/tablet.html

--
foreach(split('',"\3\3\3c>\0>c\177cc\0~c~``\0cc\177cc")) {$_=unpack('B8',$_);tr,01,\40#,;print$_,"\n";}##IYDKINT!


In reply to How I Created a Catalan-English Dictionary from a Spanish-English Dictionary Using Only String::Approx and Approximately 500 grams of Scots Tablet by Willard B. Trophy
