skangas has asked for the wisdom of the Perl Monks concerning the following question:

I've been coding Perl on and off for a couple of years now. Recently, I've been writing some code to make my life easier. For example, I enjoy reading science fiction in English, but my native language is Swedish, which means I constantly have to look up new words. I've been doing this manually using a printed dictionary, but that's getting frustrating.

There is a project which is building an English <=> Swedish dictionary. The dictionary is published as raw XML under the Creative Commons Attribution-Share Alike 2.5 Generic license. They have a web interface, but web interfaces are, at least to me, by their very nature unpleasant.

Therefore, I decided to write a module to do look-ups using this XML data. I also wanted to do it in a way that could be useful to others. Hopefully it'll even get to a state where I can publish it to CPAN. To accomplish this, I felt I should do my homework first. I haven't written a line of code thus far, because I haven't decided on the tools I should use.

The full XML file is 9 MB and is available here. What I want is some easy way to do look-ups, i.e. to get from one word to its translation as simply as possible. There are word classes (verbs, nouns, etc.) as well as multiple translations and examples of how the words are used. These should probably be handled as well.

Some sample data:

<?xml version="1.0" encoding="utf-8"?>
<dictionary created="2009-02-24" last-changed="2009-08-21" name="Folkets lexikon" license="http://creativecommons.org/licenses/by-sa/2.5/" comment="Distributed under the Creative Commons Attribution-Share Alike 2.5 Generic license" origin="http://folkets-lexikon.csc.kth.se" source-language="en" target-language="sv" version="1.1">
<word value="abacus" lang="en" class="nn">
  <translation value="kul|ram"/>
  <explanation value="gammalt räkneredskap med rörliga kulor"></explanation>
</word>
<word value="abaft" lang="en" class="ab">
  <translation value="akter ut"/>
</word>
<word value="abaft" lang="en" class="pp">
  <translation value="akter om" comment="sjöterm"/>
</word>
<word value="abandoned" lang="en" class="jj">
  <translation value="lössläppt"/>
  <translation value="otyglad"/>
  <translation value="utsvävande"/>
  <translation value="fördärvad"/>
  <example value="otyglat beteende">
    <translation value="abandoned behaviour"/>
  </example>
</word>

Now, I've already asked about this in the PerlMonks chatterbox. The general consensus seemed to be that I should use XML::LibXML or XML::Twig for parsing the XML. Both seem like reasonable alternatives, and I'll probably use one of them. However, bart pointed out that I could first read the data into some kind of database and then do the word look-ups from there. This seemed like a good idea because, as he rightly pointed out, XML isn't necessarily well suited for doing lookups.
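For what it's worth, the XML::LibXML route is only a few lines; here is a minimal sketch, assuming the <word>/<translation> layout from the sample above (the inline string stands in for the real 9 MB file, which load_xml can read with location => $filename instead):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Stand-in for the real dictionary file.
my $xml = <<'XML';
<dictionary source-language="en" target-language="sv">
  <word value="abacus" lang="en" class="nn">
    <translation value="kul|ram"/>
  </word>
  <word value="abaft" lang="en" class="pp">
    <translation value="akter om" comment="sjöterm"/>
  </word>
</dictionary>
XML

my $doc = XML::LibXML->load_xml( string => $xml );

# Flatten the tree into headword => [translations].
my %dict;
for my $word ( $doc->findnodes('//word') ) {
    my $headword = $word->getAttribute('value');
    push @{ $dict{$headword} },
        map { $_->getAttribute('value') } $word->findnodes('translation');
}

print join( ', ', @{ $dict{abaft} } ), "\n";   # akter om
```

This discards the class and comment attributes for brevity; a real module would keep them in a richer per-entry structure.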

If I go with this approach, I would ideally have one module with common methods for looking up words. I don't know of any dictionary modules on CPAN, so I guess I'll have to write this myself. (There is Lingua::Translate, but that seems to be for translating sentences rather than looking up dictionary definitions with multiple synonyms etc.) I would like this module to be able to use different "back-ends", similarly to how DBIx::Class does it (or Lingua::Translate, for that matter). This means one should be able to choose between getting the data from the raw XML or using some kind of database. No matter which one you choose, the same methods should be used to look up words.
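To make the back-end idea concrete, here is a sketch of what such a pluggable design might look like. All package and method names here (Lingua::Dictionary, the Memory back-end, lookup) are made up for illustration; nothing in this snippet is an existing CPAN API:

```perl
use strict;
use warnings;

# Hypothetical front-end: delegates lookups to whatever back-end it
# was constructed with, so XML, SQLite, CDB, ... are interchangeable.
package Lingua::Dictionary {
    sub new {
        my ( $class, %args ) = @_;
        return bless { backend => $args{backend} }, $class;
    }
    sub lookup {    # word -> list of translations
        my ( $self, $word ) = @_;
        return $self->{backend}->lookup($word);
    }
}

# Trivial in-memory back-end, standing in for an XML or DB one.
# Each back-end only has to provide the same lookup() method.
package Lingua::Dictionary::Backend::Memory {
    sub new {
        my ( $class, %args ) = @_;
        return bless { data => $args{data} || {} }, $class;
    }
    sub lookup {
        my ( $self, $word ) = @_;
        return @{ $self->{data}{$word} || [] };
    }
}

my $dict = Lingua::Dictionary->new(
    backend => Lingua::Dictionary::Backend::Memory->new(
        data => { abacus => ['kulram'] },
    ),
);
print join( ', ', $dict->lookup('abacus') ), "\n";   # kulram
```

The point of the duck-typed lookup() contract is that callers never need to know which storage is underneath, which is the same shape DBIx::Class uses for its storage layer.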

Using a database does require some setup beyond downloading an XML file. As I'm pretty familiar with DBIx::Class -- and I know it supports SQLite as well as MySQL and whatnot -- I'm thinking of going with that. However, as this is not really "relational" data per se, one could argue I would be better off using something like CouchDB. As I don't really know CouchDB, I would appreciate any suggestions in this matter as well.
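If the SQLite route wins, the data actually maps quite naturally onto a couple of small tables; a possible schema sketch, with table and column names only as a suggestion:

```sql
-- One row per headword entry.  The same spelling can occur with
-- several word classes, as "abaft" does in the sample above.
CREATE TABLE word (
    id    INTEGER PRIMARY KEY,
    value TEXT NOT NULL,
    lang  TEXT NOT NULL,
    class TEXT
);
CREATE INDEX word_value ON word (value);

-- Translations (and, similarly, examples) hang off the entry.
CREATE TABLE translation (
    word_id INTEGER NOT NULL REFERENCES word(id),
    value   TEXT NOT NULL,
    comment TEXT
);
```

With an index on word.value, a lookup is a single indexed join, which is the "relational enough" argument against reaching for a document store here.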

Now, I don't know if I'm approaching this problem in the right way, or if I'm missing something. There may be other obvious ways to solve it that I'm overlooking due to lack of experience. I'd rather not have my first CPAN module be a complete flop. Even if it's not widely used (something I can hardly expect), I do want it to have the potential to be useful at least to some people. Perhaps there are other Swedes who enjoy reading science fiction as much as I do.

I know this is a rather lengthy post, so I thank you for taking the time to read it. I already know the basics when it comes to authoring CPAN modules, such as writing documentation, tests, using Module::Starter, Perl::Critic, and so on, so I'm not looking for advice specifically about this. However, I'm very interested in any pointers, insights or rants you might want to share with me in any and all matters regarding this. Thanks!


Replies are listed 'Best First'.
Re: Looking for suggestions for writing a module to look up translations in a 9 MB XML dictionary
by Anonymous Monk on Nov 03, 2009 at 19:44 UTC

      Thanks for your suggestions.

      I've read up a bit on WordNet and the links you've suggested. However, all this seems a bit over my head considering the scope of this project. I only want a simple word-to-translation mapper, optionally with some extra information displayed about the suggested words. WordNet in this case means too much extra complexity. There are other people who are better suited to build a Swedish WordNet – linguists, for example – and in fact, there already is a project working towards this end. (On top of that, even though they are developing it, the Swedish WordNet doesn't seem to be released anywhere, much less under a free license.)

      Natural Language Toolkit has essentially the same flaw – too much complexity. It seems more aimed at language researchers than petty programmers looking to build a dictionary.

      From what I could see, VisDic was proprietary software and didn't even have any source code around to read. It also seemed to be developed by people with everything but freedom on their mind – the software's successor was available only after accepting a particularly nasty license. To me, this is a sign I should steer clear.

      As for Test::XT, thanks. The boilerplate tests I set up were already configured in the suggested way, but it's always nice to automate these things away.

Re: Looking for suggestions for writing a module to look up translations in a 9 MB XML dictionary
by SuicideJunkie (Vicar) on Nov 03, 2009 at 20:16 UTC

    It seems to me that 9MB is really not very much.

    Couldn't you simply keep it in memory as a hash, and do the lookups instantly? Startup time is covered by the time it takes you to pick up the book and flip to wherever you last stopped reading.
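    That in-memory approach really is tiny once the XML has been flattened into a hash; a core-Perl sketch (the hash here is hand-filled where the real program would fill it from a one-time parse of the file):

```perl
use strict;
use warnings;

# %dict maps each headword to its translations; in the real program
# this would be built once from the 9 MB XML file at startup.
my %dict = (
    abacus => ['kulram'],
    abaft  => [ 'akter ut', 'akter om' ],
);

# Case-insensitive lookup; returns an empty list for unknown words.
sub lookup {
    my ($word) = @_;
    return @{ $dict{ lc $word } || [] };
}

print join( ', ', lookup('Abaft') ), "\n";   # akter ut, akter om
```

Even if the in-memory structure ends up several times the size of the XML, that is still trivial next to the startup cost the parent post mentions.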

Re: Looking for suggestions for writing a module to look up translations in a 9 MB XML dictionary
by afoken (Chancellor) on Nov 04, 2009 at 20:38 UTC

    Hmmm, I would get rid of all the parsing overhead XML introduces. A simple-and-stupid way to implement this could be a Makefile (or similar logic) that generates a fast (non-SQL) database file from the XML master whenever it changes (its timestamp). Ideally, reading from the database file would not change its modification time (as an unwanted side-effect).
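    That timestamp logic is exactly what make gives you for free; a minimal Makefile sketch, where both file names and the build script are placeholders (note the recipe line must start with a tab):

```make
# Regenerate the lookup database only when the XML master is newer.
dictionary.cdb: folkets_en_sv.xml
	perl build_cdb.pl folkets_en_sv.xml dictionary.cdb
```

Reading the generated file never touches its modification time, so repeated lookups don't trigger spurious rebuilds.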

    My very first idea was to use SQLite, but I think djb's CDB should be way faster. Of course, there is a CDB_File on CPAN, and for extra bonus points, it is capable of generating new CDB files.
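    CDB_File's interface is pleasantly small; a sketch of the build-once/read-many cycle, assuming CDB_File is installed (file names are placeholders):

```perl
use strict;
use warnings;
use CDB_File;
use File::Temp qw(tempdir);

my $dir  = tempdir( CLEANUP => 1 );
my $file = "$dir/dict.cdb";

# Build phase: CDB files are immutable, so CDB_File::create writes
# the whole dictionary into a temp file and atomically renames it.
my %source = (
    abacus => 'kulram',
    abaft  => 'akter ut',
);
CDB_File::create( %source, $file, "$file.tmp" )
    or die "couldn't create $file: $!";

# Lookup phase: tie gives ordinary hash access backed by the file.
tie my %dict, 'CDB_File', $file or die "tie failed: $!";
print "$dict{abaft}\n";   # akter ut
```

    One caveat: CDB stores a single string per key, so multiple translations per headword would need to be joined with a separator (or serialized with something like Storable) before storing.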

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
      I'd have to agree with SuicideJunkie on the KISS principle. 9MB is nothing these days, and really all you want is simple word lookups. I can identify because, though I'm a native English speaker, when I tried to read Patrick O'Brian's Master and Commander books (which are truly awesome, btw), I ended up bringing out my giant unabridged Webster's dictionary every page or so. Sometimes it wasn't words I didn't know, but words I knew that I suspected he was making archaic use of. That ended up being the killer app that made me buy a Palm Pilot, as that was the only platform any unabridged English dictionary was available for (Webster's, again). That way, instead of physically manhandling that paper monstrosity, I could just keep my Palm Pilot nearby and look up words. So I say screw it, keep it simple in memory, and use Term::ReadKey or some such to monitor keystrokes and start showing suggestions, a la YouTube's search box JavaScript.
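      The suggestion-box part reduces to a prefix scan over the sorted headwords; a core-Perl sketch of just that piece (wiring it up to Term::ReadKey keystroke handling is left out):

```perl
use strict;
use warnings;

# Sorted, lowercased headwords, as loaded from the dictionary.
my @headwords = sort qw(abacus abaft abandon abandoned abase abate);

# Return up to $limit headwords starting with $prefix, for display
# under the search box as the user types.
sub suggest {
    my ( $prefix, $limit ) = @_;
    $limit ||= 5;
    my @hits = grep { index( $_, lc $prefix ) == 0 } @headwords;
    return @hits > $limit ? @hits[ 0 .. $limit - 1 ] : @hits;
}

print join( ', ', suggest( 'aba', 3 ) ), "\n";   # abacus, abaft, abandon
```

      A linear grep is plenty for ~30,000 headwords; a binary search to the first match would only matter for far larger word lists.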

      Your suggestions will be very valuable; thanks. I wasn't even aware of CDB, which seems like a pretty good fit in this case. The fact that it was written by djb only adds to its attractiveness.