skangas has asked for the wisdom of the Perl Monks concerning the following question:
I've been coding Perl on and off for a couple of years now. Recently, I've been coding some stuff to make my life easier. For example, I enjoy reading science fiction in English, but my native language is Swedish; this means I constantly have to look up new words. I've been doing this manually using a printed dictionary, but that's getting frustrating.
There is a project which is building an English <=> Swedish dictionary. The dictionary is published as raw XML under the Creative Commons Attribution-Share Alike 2.5 Generic license. They have a web interface, but web interfaces are, at least to me, by their very nature unpleasant.
Therefore, I decided to write a module to do look-ups using this XML data. I also want to do it in a way that can be useful to others. Hopefully it'll even get to a state where I can publish it to CPAN. To accomplish this, I felt I should do my homework first. I haven't written a line of code thus far, because I haven't decided on the tools I should use.
The full XML file is 9 MB and is available here. What I want is an easy way to do look-ups, i.e. getting from a word to its translation as directly as possible. There are word classes (verbs, nouns, etc.) as well as multiple translations and examples of how the words are used. These should probably be handled as well.
Some sample data:
<?xml version="1.0" encoding="utf-8"?>
<dictionary created="2009-02-24" last-changed="2009-08-21"
            name="Folkets lexikon"
            license="http://creativecommons.org/licenses/by-sa/2.5/"
            comment="Distributed under the Creative Commons Attribution-Share Alike 2.5 Generic license"
            origin="http://folkets-lexikon.csc.kth.se"
            source-language="en" target-language="sv" version="1.1">
  <word value="abacus" lang="en" class="nn">
    <translation value="kul|ram"/>
    <explanation value="gammalt räkneredskap med rörliga kulor"> </explanation>
  </word>
  <word value="abaft" lang="en" class="ab">
    <translation value="akter ut"/>
  </word>
  <word value="abaft" lang="en" class="pp">
    <translation value="akter om" comment="sjöterm"/>
  </word>
  <word value="abandoned" lang="en" class="jj">
    <translation value="lössläppt"/>
    <translation value="otyglad"/>
    <translation value="utsvävande"/>
    <translation value="fördärvad"/>
    <example value="otyglat beteende">
      <translation value="abandoned behaviour"/>
    </example>
  </word>
Now, I've already asked about this in the PerlMonks Chatterbox. The general consensus seemed to be that I should use XML::LibXML or XML::Twig for parsing the XML. They both seem like reasonable alternatives, and I'll probably use one of them. However, bart pointed out that I could first load the data into some kind of database and then do the word look-ups from there. This seemed like a good idea because, as he rightly pointed out, XML isn't necessarily well suited for doing look-ups.
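To make the "parse first, look up later" idea concrete, here is a minimal sketch, assuming XML::LibXML and a small ASCII-only excerpt of the sample data embedded as a string (the real file would be read from disk instead). The `%index` hash and its layout are my own assumptions for illustration, not anything the dictionary project prescribes.

```perl
use strict;
use warnings;
use XML::LibXML;

# A tiny excerpt of the sample data, embedded so the sketch is self-contained.
my $xml = <<'XML';
<?xml version="1.0" encoding="utf-8"?>
<dictionary source-language="en" target-language="sv">
  <word value="abacus" lang="en" class="nn">
    <translation value="kul|ram"/>
  </word>
  <word value="abaft" lang="en" class="ab">
    <translation value="akter ut"/>
  </word>
  <word value="abaft" lang="en" class="pp">
    <translation value="akter om"/>
  </word>
</dictionary>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Hypothetical in-memory index: word => list of { class, translations }.
my %index;
for my $word ($doc->findnodes('/dictionary/word')) {
    # './translation' matches direct children only, so translations
    # nested inside <example> elements are not mixed in.
    my @translations = map { $_->getAttribute('value') }
                       $word->findnodes('./translation');
    push @{ $index{ lc $word->getAttribute('value') } }, {
        class        => $word->getAttribute('class'),
        translations => \@translations,
    };
}

# One line per word class for the looked-up word.
for my $entry (@{ $index{abaft} || [] }) {
    printf "abaft (%s): %s\n", $entry->{class},
        join(', ', @{ $entry->{translations} });
}
```

Note that the same word can appear once per word class ("abaft" is listed both as `ab` and as `pp`), which is why each key maps to a list of entries rather than to a single one.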
If I go with this approach, I would ideally have one module with common methods for looking up words. I don't know of any dictionary modules on CPAN, so I guess I'll have to write this myself. (There is Lingua::Translate, but this seems to be for translating sentences rather than looking up dictionary definitions with multiple synonyms etc.) I would like this module to be able to use different "back-ends", similarly to how DBIx::Class does it (or Lingua::Translate, for that matter). This means one should be able to choose between getting the data from the raw XML and using some kind of database. No matter which one you choose, the same methods should be used to look up words.
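As a rough sketch of what "same methods, different back-ends" could look like: every package and method name below (`My::Dictionary`, `My::Dictionary::Backend::Memory`, `lookup`, `translate`) is hypothetical and only there to show the shape of the design. It uses nothing outside core Perl.

```perl
use strict;
use warnings;

# Hypothetical back-end that keeps all entries in a Perl hash.
# An XML or database back-end would only need to offer the same lookup().
package My::Dictionary::Backend::Memory;

sub new {
    my ($class, %args) = @_;
    return bless { entries => $args{entries} || {} }, $class;
}

# Returns a list of { class, translations } hashes for a word.
sub lookup {
    my ($self, $word) = @_;
    return @{ $self->{entries}{ lc $word } || [] };
}

# Hypothetical front-end: the only thing a caller talks to.
package My::Dictionary;

sub new {
    my ($class, %args) = @_;
    my $backend = $args{backend} or die "backend required";
    die "backend must implement lookup()" unless $backend->can('lookup');
    return bless { backend => $backend }, $class;
}

# Flattens the translations of all word classes into one list.
sub translate {
    my ($self, $word) = @_;
    return map { @{ $_->{translations} } } $self->{backend}->lookup($word);
}

package main;

my $dict = My::Dictionary->new(
    backend => My::Dictionary::Backend::Memory->new(
        entries => {
            abandoned => [ { class => 'jj', translations => ['otyglad'] } ],
        },
    ),
);

print join(', ', $dict->translate('abandoned')), "\n";
```

The front-end only checks that the back-end can `lookup`, so swapping the in-memory hash for an XML or database back-end would not change the calling code.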
Using a database does require some setting up beyond downloading an XML file. As I'm pretty familiar with DBIx::Class, and I know it supports SQLite as well as MySQL and whatnot, I'm thinking of going with this. However, as the data is not really "relational" per se, one could argue I would be better off using something like CouchDB. As I don't really know CouchDB, I would appreciate any suggestions in this matter as well.
Now, I don't know if I'm approaching this problem in the right way, or if I'm missing something. There may be other obvious ways to solve this problem that I'm missing due to lack of experience. I'd rather not have my first CPAN module be a complete flop. Even if it's not widely used (something I can hardly expect), I do want it to have the potential to be useful to at least some people. Perhaps there are other Swedes who enjoy reading science fiction as much as I do.
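For the database route, a minimal sketch might look like the following. It uses plain DBI with DBD::SQLite and an in-memory database rather than DBIx::Class, simply to keep the example short. The table and column names are assumptions of mine, modelled on the attributes in the sample data.

```perl
use strict;
use warnings;
use DBI;

# In-memory SQLite database; a real module would use a file on disk.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1, AutoCommit => 1 });

# Hypothetical schema: one row per <word>, one row per <translation>.
$dbh->do(<<'SQL');
CREATE TABLE word (
    id    INTEGER PRIMARY KEY,
    value TEXT NOT NULL,
    lang  TEXT NOT NULL,
    class TEXT
)
SQL
$dbh->do(<<'SQL');
CREATE TABLE translation (
    word_id INTEGER NOT NULL REFERENCES word(id),
    value   TEXT NOT NULL,
    comment TEXT
)
SQL
# The index is what makes look-ups by word cheap.
$dbh->do('CREATE INDEX word_value_idx ON word (value)');

my $ins_word = $dbh->prepare(
    'INSERT INTO word (value, lang, class) VALUES (?, ?, ?)');
my $ins_tr = $dbh->prepare(
    'INSERT INTO translation (word_id, value) VALUES (?, ?)');

# Two entries for the same word, one per word class.
for my $entry ([ 'ab', 'akter ut' ], [ 'pp', 'akter om' ]) {
    my ($class, $translation) = @$entry;
    $ins_word->execute('abaft', 'en', $class);
    my $word_id = $dbh->last_insert_id(undef, undef, 'word', 'id');
    $ins_tr->execute($word_id, $translation);
}

my $rows = $dbh->selectall_arrayref(<<'SQL', undef, 'abaft');
SELECT w.class, t.value
FROM word w
JOIN translation t ON t.word_id = w.id
WHERE w.value = ?
ORDER BY w.id
SQL

printf "%s: %s\n", @$_ for @$rows;
```

Two tables and one join cover words, word classes and multiple translations; examples and explanations would need further tables along the same lines.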
I know this is a rather lengthy post, so I thank you for taking the time to read it. I already know the basics when it comes to authoring CPAN modules, such as writing documentation, tests, using Module::Starter, Perl::Critic, and so on, so I'm not looking for advice specifically about this. However, I'm very interested in any pointers, insights or rants you might want to share with me in any and all matters regarding this. Thanks!
Replies are listed 'Best First'.

Re: Looking for suggestions for writing a module to look up translations in a 9 MB XML dictionary
by Anonymous Monk on Nov 03, 2009 at 19:44 UTC
by skangas (Novice) on Nov 04, 2009 at 12:07 UTC

Re: Looking for suggestions for writing a module to look up translations in a 9 MB XML dictionary
by SuicideJunkie (Vicar) on Nov 03, 2009 at 20:16 UTC

Re: Looking for suggestions for writing a module to look up translations in a 9 MB XML dictionary
by afoken (Chancellor) on Nov 04, 2009 at 20:38 UTC
by rastoboy (Monk) on Nov 05, 2009 at 09:42 UTC
by skangas (Novice) on Nov 05, 2009 at 21:06 UTC