PerlMonks  

How I Created a Catalan-English Dictionary from a Spanish-English Dictionary Using Only String::Approx and Approximately 500 grams of Scots Tablet

by Willard B. Trophy (Hermit)
on Oct 17, 2002 at 19:17 UTC ( [id://206106]=CUFP )

I used to program for a dictionary company in Scotland. We had received EU money to produce a Catalan-English dictionary, but the only electronic Catalan resource we had was a word list. We had to produce a dictionary framework to match our Spanish dictionary in double-quick time, ready for our Catalan translators to turn into a finished text. But how? I'm no linguist.

I noticed that Catalan looks a bit like Spanish, but with French word endings (and if that statement doesn't get me a visit from the Catalonian death squad, nothing will). If you fiddle with the ends of the words, you get something that looks almost, but not quite, like Spanish. This goes against nearly all linguistic theory, but seems to work.

Computers don't do "almost" too well. While casting about for an approximate solution, I found the University of Arizona's agrep utility, which does approximate searching. Building a shell script around agrep to produce possible matches sort-of worked, but was painfully slow. Conveniently, CPAN librarian Jarkko Hietaniemi had just come out with a new version of String::Approx, which basically did what agrep did, but in a Perl module, and allowed a bit more control of the fuzziness parameters.

The key to approximate matching is the Levenshtein edit distance: the number of single-character insertions, deletions, or substitutions needed to turn one string into another. You set how many such changes you will accept before two strings stop counting as approximately equal. Allowing two changes per word, plus a barrage of about 10 heuristics (a fancy CompSci word for "guesses") to play with the word form, I got a match rate of approximately 70% against the Spanish dictionary text. This was good enough to save weeks of manual compilation time, and it had the neat side effect of an amusing correspondence with Jarkko while I suggested improvements to the package.

(For the curious, setting the edit distance to 1 returned too few possible matches. Setting it to 3 generated thousands of false positives.)
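To make the edit-distance threshold concrete, here is a minimal pure-Perl sketch. It is not the original program and does not call String::Approx; it uses the textbook dynamic-programming Levenshtein calculation as a stand-in. The function names `edit_distance` and `within` and the three-word Spanish list are made up for illustration.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Textbook Levenshtein distance: the number of single-character
# insertions, deletions and substitutions needed to turn $s into $t.
sub edit_distance {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my @prev = (0 .. scalar @t);    # distances from the empty prefix of $s
    for my $i (1 .. scalar @s) {
        my @cur = ($i);
        for my $j (1 .. scalar @t) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            $cur[$j] = min(
                $prev[$j] + 1,            # delete a character of $s
                $cur[$j - 1] + 1,         # insert a character of $t
                $prev[$j - 1] + $cost,    # substitute, or keep a match
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Hypothetical helper: the candidates within $max edits of $word.
sub within {
    my ($word, $max, @candidates) = @_;
    return grep { edit_distance($word, $_) <= $max } @candidates;
}

# Illustrative word list only, not taken from the real dictionary data.
my @spanish = qw(hortaliza hospital hora);

print edit_distance('hortalissa', 'hortaliza'), "\n";    # prints 2

for my $max (1 .. 3) {
    my @hits = within('hortalissa', $max, @spanish);
    print "max $max: ", (@hits ? "@hits" : '(none)'), "\n";
}
```

With a threshold of 1 the Catalan hortalissa finds nothing, because reaching hortaliza takes one deletion plus one substitution. On a full headword list, each extra allowed edit widens the net, which fits the experience of too few matches at 1 and a flood of false positives at 3.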

And the tablet? What has it to do with it all? Frankly, very little, apart from the fact it's superconcentrated programmer fuel, and I make it to an old family recipe. It's as good as the combination of sugar, condensed milk, butter and real vanilla can be… try some (link is PDF, alas).

Modification, 21 Oct 02002: Recipe now available in XHTML, as it should have been all along: http://www3.sympatico.ca/scruss/tablet.html

--
foreach(split('',"\3\3\3c>\0>c\177cc\0~c~``\0cc\177cc")) {$_=unpack('B8',$_);tr,01,\40#,;print$_,"\n";}##IYDKINT!


Replies are listed 'Best First'.
Re: How I Created a Catalan-English Dictionary from a Spanish-English Dictionary Using Only String::Approx and Approximately 500 grams of Scots Tablet
by JPaul (Hermit) on Oct 18, 2002 at 02:15 UTC
    -Greetings;

    I double-vote on Helensburgh Tablet being a programmer's fuel; many a night I've spent chowin' it down.

    My grandmother emigrated from Helensburgh (Scotland) back in the '70s and is hailed as one of the finest Tablet makers ever... And the quality goods are worth their weight in pure caffeine.

    JP who doesn't get Tablet anymore since he emigrated to the US,
    - Alexander Widdlemouse undid his bellybutton and his bum dropped off --

      You could try making it; it's really easy. Or get someone local to feel sorry for you, point them to the recipe, and they can make it -- I've actually had that happen for homesick Scots expats.

      I've toyed with adding a couple of shots of espresso to the mix for the ultimate sugar/fat/caffeine hit, but that would be overkill. Since I don't eat chocolate, all pleas for me to make chocolate coated tablet have been ignored.

      If the Perl programming doesn't work out, I could always start up in confectionery. The dictionary staff in two continents are hooked, and I have several addicts in Toronto.pm...

      (did I say I was the person who got Perl into the Collins English Dictionary?)

      --
      foreach(split('',"\3\3\3c>\0>c\177cc\0~c~``\0cc\177cc")) {$_=unpack('B8',$_);tr,01,\40#,;print$_,"\n";}##IYDKINT!

        Greetings;

        You could try making it; it's really easy.
        I think you must have some kind of "EZ-Tablet" recipe, because I've tried making my grandmother's Tablet and all I can get is toffee :P

        JP,
        -- Alexander Widdlemouse undid his bellybutton and his bum dropped off --

Re: How I Created a Catalan-English Dictionary from a Spanish-English Dictionary Using Only String::Approx and Approximately 500 grams of Scots Tablet
by lestrrat (Deacon) on Oct 17, 2002 at 19:27 UTC

    You build up the suspense and give us no code... I'm so disappointed now ;)

Re: How I Created a Catalan-English Dictionary from a Spanish-English Dictionary Using Only String::Approx and Approximately 500 grams of Scots Tablet
by Mr. Muskrat (Canon) on Oct 17, 2002 at 21:08 UTC

    I must agree, code please!

    I'm no linguist but I like to play one on IRC!

      This is what I did. I'm sorry I don't have any code; it's currently hidden away on that same publisher's system.

      1) Took the Spanish-English dictionary, munged the Spanish translations into translator's notes for the Catalan team, and then hashed each entry against its headword, such that $hash{'headword'} would contain the complete entry text.

      2) For each word in the Catalan list:
      a) checked to see if there was an exact match in the hash keys to a Spanish headword; if not:
      b) tried to apply each one of a list of ending heuristics; if one of these matched exactly, used it, else tried a fuzzy match.

      I seem to remember that String::Approx returned a list of possible matches from keys(%hash). I used the simple expedient of using the first one it returned. There were probably better ways of doing it, but this seemed to be adequate.
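The steps above can be sketched roughly as follows. This is a reconstruction from the description, not the original Hortalizer: the toy entries, the two ending rules, and the names `%entry`, `@endings`, `find_headword` and `edit_distance` are all assumptions, and a pure-Perl Levenshtein function stands in for String::Approx. It also assumes the fuzzy step runs against the unmodified Catalan word, and it sorts the keys so that "the first one returned" is repeatable.

```perl
use strict;
use warnings;
use List::Util qw(min first);

# Toy stand-in for the Spanish-English dictionary: headword => entry text.
my %entry = (
    hortaliza   => 'hortaliza: vegetable',
    hospital    => 'hospital: hospital',
    libertad    => 'libertad: freedom',
    universidad => 'universidad: university',
);

# Hypothetical ending heuristics: [ Catalan ending, Spanish ending ].
my @endings = ( [ 'ssa', 'za' ], [ 'tat', 'dad' ] );

# Textbook Levenshtein distance, standing in for String::Approx.
sub edit_distance {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my @prev = (0 .. scalar @t);
    for my $i (1 .. scalar @s) {
        my @cur = ($i);
        for my $j (1 .. scalar @t) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            $cur[$j] = min($prev[$j] + 1, $cur[$j - 1] + 1, $prev[$j - 1] + $cost);
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Returns (Spanish headword, how it was found), or an empty list.
sub find_headword {
    my ($catalan) = @_;

    # a) exact match against a Spanish headword
    return ($catalan, 'exact') if exists $entry{$catalan};

    # b) rewrite the ending and look for an exact match again
    for my $rule (@endings) {
        my ($from, $to) = @$rule;
        next unless $catalan =~ /\Q$from\E\z/;
        (my $guess = $catalan) =~ s/\Q$from\E\z/$to/;
        return ($guess, 'heuristic') if exists $entry{$guess};
    }

    # c) fuzzy match: take the first headword within two edits
    my $hit = first { edit_distance($catalan, $_) <= 2 } sort keys %entry;
    return defined $hit ? ($hit, 'fuzzy') : ();
}

for my $word (qw(hospital hortalissa universitat llibertat xarxa)) {
    my ($head, $how) = find_headword($word);
    print defined $head ? "$word -> $head ($how)\n" : "$word -> no match\n";
}
```

Taking the first fuzzy hit is the simple expedient described above; ranking the candidates by distance and keeping the closest would be one obvious refinement.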

      Sorry for lack of code. If it's any consolation, I remember what I called the program: The Hortalizer. That's because hortalissa is Spanish for vegetable, while the Catalan is hortaliza. I even put in one guess, erm heuristic, just to catch this word.

      --
      foreach(split('',"\3\3\3c>\0>c\177cc\0~c~``\0cc\177cc")) {$_=unpack('B8',$_);tr,01,\40#,;print$_,"\n";}##IYDKINT!

        That's because hortalissa is Spanish for vegetable, while the Catalan is hortaliza.

        I'm afraid you mixed up the Spanish (hortaliza) and the Catalan (hortalissa) words. If I remember correctly, there are no words in Spanish with two s in a row.

        See http://www.diccionarios.com/ for a Catalan-Castilian Spanish, Castilian Spanish-Catalan Dictionary.

        -- Ricardo
        Use MacPerl;

Node Type: CUFP [id://206106]
Approved by hossman
Front-paged by wil