Re^3: Creating new character classes for foreign languages

Is it possible to define a double-character property? For example, the Thai 'r' becomes a vowel if, and only if, there are two of them together, as in 'rr'. It is then pronounced differently, and is no longer strictly an 'r'.

Here you are moving away from strictly orthographic matters into phonetics or phonology, which are essentially context-dependent. That takes you out of the domain of merely classifying letter symbols into related groups, which is essentially not context-dependent.

If the goal is to provide a means for doing correct word segmentation of Thai text, the handling of the context-dependent rules (like "rr" becomes "un") should probably be in a separate module. The functions that work on sequences of characters will depend on the functions that define the basic character classes.

(You probably could put the subroutines for character-classes and context-dependent rules together in one module if you want to, but the two sets of subroutines will have very different usages from the caller's point of view. And the overall problem being addressed is probably complicated enough that you will want to segregate portions of the solution into separate modules anyway.)
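To make that split concrete, here is a minimal sketch, not a tested Thai segmenter. It assumes a user-defined property named InThaiConsonant (a hypothetical name) that covers U+0E01 through U+0E2E minus the two vowel letters U+0E24 and U+0E26, plus a separate context-dependent routine, mark_double_ro (also hypothetical), that builds on that class. The rule that a doubled RO RUA only counts when it follows a consonant is my assumption for illustration; the real rules are surely more involved.

```perl
use strict;
use warnings;

# --- Character class: context-free, just a set of code points ---
# Perl resolves \p{InThaiConsonant} by calling a sub of that name in the
# current package. It returns one hex code point or "start<TAB>end" range
# per line. Assumption: U+0E24 and U+0E26 are left out as vowel letters.
sub InThaiConsonant {
    return "0E01\t0E23\n"
         . "0E25\n"
         . "0E27\t0E2E\n";
}

# --- Context-dependent rule: looks at sequences, uses the class above ---
my $RO = "\x{0E23}";    # THAI CHARACTER RO RUA, the Thai 'r'

# Replace each doubled RO RUA that follows a consonant with the
# placeholder "<RR>", so later stages can treat it as a single vowel unit.
sub mark_double_ro {
    my ($text) = @_;
    (my $marked = $text) =~ s/(?<=\p{InThaiConsonant})$RO$RO/<RR>/g;
    return $marked;
}

# Usage: KO KAI + RO RUA + RO RUA + MO MA gets its doubled RO RUA marked,
# while a string with a single RO RUA passes through unchanged.
my $word   = "\x{0E01}\x{0E23}\x{0E23}\x{0E21}";
my $marked = mark_double_ro($word);
```

The class definition knows nothing about neighbouring characters, and the rule never hard-codes the consonant list, so either half could move to its own module without touching the other.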

Just curious: have you looked at Lingua::TH::Segmentation? I just happened to notice it was there, but I haven't tried it myself.

Re^4: Creating new character classes for foreign languages by Polyglot (Chaplain) on May 17, 2009 at 16:48 UTC
Yes, I have looked at that Lingua::TH module. It fails to build on my system, and I have a hard enough time troubleshooting my own code, much less someone else's. The .pm file it has is only 2.2k, which amounts to a very slim algorithm for splitting Thai, given how complex a problem splitting Thai is. I'm actually leaning toward a lexical approach, and working on building a word list in Thai.

In fact, I encountered "wrong number of arguments" errors upon running the 'perl Makefile.PL' command, and commented out about five lines in the Makefile.PL before it would run...only to see a warning that the library file referred to was not present. So I'm thinking that it was designed to accompany some additional file, possibly a word lexicon.

This is one of the reasons I'm embarking on this journey now. There is virtually nothing on CPAN for the Thai language, or for Lao either. (And I did some reading on CPAN today, having never submitted anything there before, and learned that a module's NAMESPACE is supposed to be community directed...but I know of no Thai community among Perl monks.)

My needs go beyond splitting syllables. I plan to create a program which will translate Thai to Lao. There are some specific vowels and consonants that must be transposed in the exchange. Syllable splitting is a beginning, but only a part of the process. These tools I am packaging would be useful for many other purposes as well.

Blessings, ~Polyglot~
Re^5: Creating new character classes for foreign languages by graff (Chancellor) on May 17, 2009 at 17:45 UTC
"So I'm thinking that it was designed to accompany some additional file, possibly a word lexicon."

Yes, that module is clearly intended to serve only as a wrapper around a separate compiled software library (not written in perl), provided here: http://thaiwordseg.sourceforge.net/. You have to install that library first (which will probably involve a simple sequence like `./configure; make; make install`), and then try installing the perl module, which should include some tests that confirm whether the library was found and works as intended.
Re^5: Creating new character classes for foreign languages by jgamble (Pilgrim) on May 17, 2009 at 17:45 UTC
I have nothing to add here, but I want to say that this is a fascinating thread and I want to thank you for starting it.