pc2 has asked for the wisdom of the Perl Monks concerning the following question:
salutations,
we would like your suggestions for the following problem.suppose we have a string, like "cowboycaddog".
what we want is a way to separate the words of "cowboycaddog", observing (hipothetically) that, as an euphony rule of English, t is changed to d before d (t|d -> dd, cat + dog = caddog), considering a given list of isolated words, like:
that were possibly used to form the string.cowboy cow boy cat do dog
t|d -> dd would be a juncture rule that was used to form the string.
thus, cowboycaddog and the above lexicon would output:it would never output "cow-boy-cad-dog", because the word "cad" is not in the lexicon. this is just an example, there could be other juncture rules, for example: i|o -> ito, so the string "territory" and the morpheme list (terri, ory) would output terri-ory, for example.cowboy-cat-dog cow-boy-cat-dog
is this possible? any suggestion is welcome.
thanks in advance.
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: unglue words joined together by juncture rules
by roboticus (Chancellor) on Mar 21, 2008 at 20:57 UTC | |
The bioinformatics crowd seems to do a lot of work around this sort of problem ... matching up fragments of DNA and genes and other squishy bits. You might look around and see what sort of things they're working on. ...roboticus | [reply] |
|
Re: unglue words joined together by juncture rules
by Narveson (Chaplain) on Mar 21, 2008 at 21:56 UTC | |
Save your juncture rules in a hash.
Still to do: decompose each possibility into dictionary words. | [reply] [d/l] |
by oko1 (Deacon) on Mar 21, 2008 at 23:29 UTC | |
> Still to do: decompose each possibility into dictionary words. This just happens to be similar to something I've been fiddling with recently.
The output looks like this: 'cowboycatdog' has the following anagrams: cowboy cat dog cow boy cat dog boy cow cat dog cat cowboy dog dog cowboy cat | [reply] [d/l] |
|
Re: unglue words joined together by juncture rules
by pc2 (Beadle) on Mar 22, 2008 at 12:08 UTC | |
thank you for the responses. more suggestions are welcome.let us explain the problem further. actually, what we are wanting to make is a word separator, based on a lexicon and a set of phonetic rules (of any kind, for example, I am -> I'm, vowel + vowel = long vowel (a + a = â; a + a = aya)) an example of language that has these characteristics is Sanskrit. for example, "Krishnah" + "dhaavati" = "Krishnodhaavati", "namaH" + "te" = "namaste" (aH|dh -> odh; aH|t -> ast; as| -> aH (as at the end of the string turns to aH)), etc.for making things easier, the lexicons used to form the strings are generally quite small, and how we generate it is actually irrelevant for this problem. for example, krishnodhaavati would be analysed by means of the lexicon ('krishnaH', 'dhaavati', 'naH', 'dhaa') (the set of chunks of the string that exist in the dictionary), but how we generate this lexicon doesn't matter, we just want to join its words based on a set of phonetic rules (aH|dh -> odh, a|e -> aye, a|e -> ai, aH|t -> ast, aH|n -> on, for example) in order to obtain back the original string segmented in all its possibilities (namaH-te; krishnaH-dhaavati; namaH-namaH; etc.). of course, there will not always be only one possibility of segmentation, given a big quantity of juncture rules and a (relatively) big lexicon of possible morphemes. | [reply] |
by halley (Prior) on Mar 24, 2008 at 18:42 UTC | |
-- | [reply] [d/l] |
by pc2 (Beadle) on Mar 26, 2008 at 11:53 UTC | |
funny answer. but we can assure you that it is definitely not what we want to do; we HATE when we receive those annoying spams and see those things in Google, it is like we are surrounded by bad people who only want to trick us and get away with this. it is just for a personal project of a morphological analyser.we are not mad at you, we perfectly understand your cynicism... it is hard those days to have good faith in humanity. | [reply] |
|
Re: unglue words joined together by juncture rules
by benizi (Hermit) on Mar 25, 2008 at 08:18 UTC | |
The "right" way to do this is to use Finite State Transducers. They're used quite a bit in morphological analysis (deconstructing a word into its morphemes). I enjoyed Finite State Morphology, by Lauri Karttunen and Kenneth R. Beesley. A lot of the material you'll find will be very academic, and the field is a bit Finnish-heavy (It has far richer morphology than English). But one of the attractive features of the technology is its run-time efficiency. There are a couple widely-used toolkits: Xerox Finite State Toolkit, which comes with the book I mentioned above. (might have licensing issues). And the MIT FST Toolkit. Some relevant acronyms are WFST, FSA, FSM, and FST for weighted finite state transducers, finite state automata, finite state machines, and finite state transducers. I'm pretty sure Google has a toolkit that's relevant, but I can't seem to find it (I think it uses yet-another acronym for a class of machines that contains WFST's.) None of these is a Perl solution. Update: fixed Wikipedia link Update 2: The Google Research-related kit is OpenFST. | [reply] |
|
Re: unglue words joined together by juncture rules
by Gavin (Archbishop) on Mar 22, 2008 at 14:44 UTC | |
| [reply] |
|
Re: unglue words joined together by juncture rules
by mobiusinversion (Beadle) on Apr 01, 2008 at 05:06 UTC | |
You will be responsible for installing your rules into the ungluer via a dispatch table of anonymous subroutines. Please note: I consider that to be intermediate Perl programming. So if you are looking for something trivial, stop reading now. These functions should return an array of arrays. In their simplest form, these functions could take in 1 word and return 2. Keep this in mind: You will be installing inverted associations. So instead of installing rules of the form: you will be installing rules of the form: The solution is in a subroutine (below) called functional_unglue and is called like this: Here is a template you should use: If you think about it, that is the only way a solution makes sense; You want your models to be extensible and you want to be able to change your minds later, and so it must be up to you to install new rules. Here is what you get in return: This solution promises to apply all of your rule sets, and recover all of the possible ungluings, in the fastest and most memory efficient way possible, without any knowledge of the rules themselves. Here it is: So, for example, use it like this: Produces: Or you could try this: Produces: Now why this works and how this works is your job to figure out. (It sounds like you need to get some more advanced books on Perl, try Object Oriented Perl and Higher Order Perl) Try the consonant transposition problem on your own. If you get stuck after trying for a few days, post back, and I'll show you how. Finally, look how easy it is to handle unlimited numbers of rules etc: Produces: 'boy-cowboy-cat-dog-do-dog-caddy-eyes-cat-dog-boy-cats-eyes-cat', 'boy-cowboy-cat-dog-do-dog-caddy-eyes-cat-dog-boy-cat-dye-cat', 'boy-cow-boy-cat-dog-do-dog-caddy-eyes-cat-dog-boy-cats-eyes-cat', 'boy-cow-boy-cat-dog-do-dog-caddy-eyes-cat-dog-boy-cat-dye-cat', 'boy-cowboy-cat-dog-do-dog-cats-eyes-yes-cat-dog-boy-cats-eyes-cat', 'boy-cowboy-cat-dog-do-dog-cats-eyes-yes-cat-dog-boy-cat-dye-cat', 'boy-cowboy-cat-dog-do-dog-cat-dye-yes-cat-dog-boy-cats-eyes-cat', 'boy-cowboy-cat-dog-do-dog-cat-dye-yes-cat-dog-boy-cat-dye-cat', 'boy-cow-boy-cat-dog-do-dog-cats-eyes-yes-cat-dog-boy-cats-eyes-cat', 'boy-cow-boy-cat-dog-do-dog-cats-eyes-yes-cat-dog-boy-cat-dye-cat', 'boy-cow-boy-cat-dog-do-dog-cat-dye-yes-cat-dog-boy-cats-eyes-cat', 'boy-cow-boy-cat-dog-do-dog-cat-dye-yes-cat-dog-boy-cat-dye-cat' In under 1/100th of a second. Happy hunting! | [reply] [d/l] [select] |
|
Re: unglue words joined together by juncture rules
by pc2 (Beadle) on Mar 26, 2008 at 16:10 UTC | |
thank you for the answers. unfortunately, we don't have resources for researching about finite state transducers (which we have already heard about).we are trying to find a solution. we have already posted a question similar to this one (http://www.perlmonks.org/?node_id=658691), and the user GrandFather has given a very interesting partial solution (which uses a recursive search to find all matching combinations of cowboycatdog, based on the lexicon cowboy, cow, boy, cat, at, do, dog): which prints: cow-boy-cat-dog cowboy-cat-dog the only limitation with this method regarding this problem is that it doesn't consider that the words may have been joined together by phonetic rules (for example, cowboycaddog).so, would it be a good idea trying to adapt the above code, or creating one that uses a similar method? any suggestions? | [reply] [d/l] |
by grizzley (Chaplain) on Mar 27, 2008 at 10:46 UTC | |
You need only one more thing I think. And it is look-ahead assertion. Am I right, thinking, that you want match if next letter after 'cad' is 'd' or next phrase after 'namaH' is 'te'?
Update: Of course strings printed as result will include these assertions, but the problem how to remove (?=something) from output IMHO can be left as an exercise for the reader (always wanted to say that :) ) | [reply] [d/l] [select] |
|
Re: unglue words joined together by juncture rules
by mobiusinversion (Beadle) on Mar 28, 2008 at 02:16 UTC | |
the alphabet: and the juncture rules: There are 24 ungluings and I am posting them here. The algorithm I used is of a branch-and-bound genus, and employs a breadth first search on the decision space. Here are the details: Potential *ungluings* are added to a queue. Elements of the queue are rejected when no possible ungluings can be formed on their right hand side (i.e. their as-of-yet-not-unlgued-bit). The solution and runtime stats for this example are as follows: 0.0215 seconds 143 iterations largest queue memory usage: 15,612 bytes largest queue length: 24 items cow-boy-cat-dog-cow-cow-boy-boy-cat-do-dog-cat-do-do-cow-boy-toy-cats-eyes cowboy-cat-dog-cow-cow-boy-boy-cat-do-dog-cat-do-do-cow-boy-toy-cats-eyes cow-boy-cat-dog-cow-cowboy-boy-cat-do-dog-cat-do-do-cow-boy-toy-cats-eyes cow-boy-cat-dog-cow-cow-boy-boy-cat-do-dog-cat-do-do-cowboy-toy-cats-eyes cow-boy-cat-dog-cow-cow-boy-boy-cat-do-dog-cat-do-do-cow-boytoy-cats-eyes cowboy-cat-dog-cow-cowboy-boy-cat-do-dog-cat-do-do-cow-boy-toy-cats-eyes cowboy-cat-dog-cow-cow-boy-boy-cat-do-dog-cat-do-do-cowboy-toy-cats-eyes cowboy-cat-dog-cow-cow-boy-boy-cat-do-dog-cat-do-do-cow-boytoy-cats-eyes cow-boy-cat-dog-cow-cowboy-boy-cat-do-dog-cat-do-do-cowboy-toy-cats-eyes cow-boy-cat-dog-cow-cowboy-boy-cat-do-dog-cat-do-do-cow-boytoy-cats-eyes cow-boy-cat-dog-cow-cow-boy-boy-cat-do-dog-cat-do-do-cow-boy-toy-cat-dye cowboy-cat-dog-cow-cowboy-boy-cat-do-dog-cat-do-do-cowboy-toy-cats-eyes cowboy-cat-dog-cow-cowboy-boy-cat-do-dog-cat-do-do-cow-boytoy-cats-eyes cowboy-cat-dog-cow-cow-boy-boy-cat-do-dog-cat-do-do-cow-boy-toy-cat-dye cow-boy-cat-dog-cow-cowboy-boy-cat-do-dog-cat-do-do-cow-boy-toy-cat-dye cow-boy-cat-dog-cow-cow-boy-boy-cat-do-dog-cat-do-do-cowboy-toy-cat-dye cow-boy-cat-dog-cow-cow-boy-boy-cat-do-dog-cat-do-do-cow-boytoy-cat-dye cowboy-cat-dog-cow-cowboy-boy-cat-do-dog-cat-do-do-cow-boy-toy-cat-dye cowboy-cat-dog-cow-cow-boy-boy-cat-do-dog-cat-do-do-cowboy-toy-cat-dye cowboy-cat-dog-cow-cow-boy-boy-cat-do-dog-cat-do-do-cow-boytoy-cat-dye cow-boy-cat-dog-cow-cowboy-boy-cat-do-dog-cat-do-do-cowboy-toy-cat-dye cow-boy-cat-dog-cow-cowboy-boy-cat-do-dog-cat-do-do-cow-boytoy-cat-dye cowboy-cat-dog-cow-cowboy-boy-cat-do-dog-cat-do-do-cowboy-toy-cat-dye cowboy-cat-dog-cow-cowboy-boy-cat-do-dog-cat-do-do-cow-boytoy-cat-dye The code is relatively simple. Say the word and Ill post it. (In case you'd like to have the crack at it yourself). Could you share an actual example of your problem with us? | [reply] [d/l] [select] |
by pc2 (Beadle) on Mar 30, 2008 at 12:00 UTC | |
salutations, thank you for the answer, but we are not sure we understood what you meant in the penultimate line of your answer.EDIT (Mar 31): in the expression between parenthesis ("In case you'd like to have the crack at it yourself"). | [reply] |
by mobiusinversion (Beadle) on Mar 30, 2008 at 23:01 UTC | |
So "say the word and Ill post it" was unclear??? If thats what you you are referring to than I meant that if you wanted the code that solves your problem, Id give it to you if you said so. I just didnt want to spoil the fun (the description of the code in the first few paragraphs is more than enough to solve it in under an hour) edit (april 01): regarding "to have a crack at it", that is both american and australian slang meaning "to try something". You can see the urban dictionary definition. Other tagged definitions in the urban dictionary include: give it a go have a try jump right in have a go to try | [reply] [d/l] |
|
Re: unglue words joined together by juncture rules
by BrowserUk (Patriarch) on Mar 27, 2008 at 14:25 UTC | |
This needs adapting to handle the multi-character morphems in your Sanscrit example and a post process stage to convert morphed words to their lexicon spellings:
Produces:
A longer sample input with the related morphems and lexicon clearly identified (I don't know Sanscrit :), would allow better testing. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] [select] |
|
Re: unglue words joined together by juncture rules
by pc2 (Beadle) on Mar 30, 2008 at 14:25 UTC | |
salutations, we shall give an actual example of the problem we are trying to solve, based on Sanskrit (which is the best language we can think of for this particular problem, for the many euphony rules it has). consider the lexicon of wordforms (which could be in the form of a hash, with associated meaning) where letter "A" (long vowel) is different from letter "a" (short vowel):also note that the words are in isolated forms, i. e. without any juncture rules. consider the following word: zivAzvaH and the phonetic rules, which occur between words and/or in the final of the sentence:so, for example, ziva + azvas would give zivAzvaH. zivA + azvas would give zivAzvaH. zivA + Azvas would give zivAzvaH. thus, the possible segmentations of zivAzvaH would be: of course, we only want to separate the possible words; whether it makes sense or not in the language is another story. there is yet another example of something we want it to be able to do (NOTE: this second example may be left as something to work on later, maybe): consider a language (which is actually what we are willing to experiment) with a word "abaca", and which has the following rules for joining words: exemplifying: we would like to analyse "ababAcaba" and get: this second situation seems much more complicated, but is not prioritary, maybe we should first concentrate on the first one. | [reply] [d/l] [select] |
by mobiusinversion (Beadle) on Mar 31, 2008 at 00:11 UTC | |
| [reply] |
by BrowserUk (Patriarch) on Mar 30, 2008 at 18:09 UTC | |
These are the input to and outputs from my code somewhere above, for the 3 examples you;ve supplied so far:
The main point of that code is that it constructs regexes to parse the data from the supplied lexicon and morpheme rules automatically. Incomplete yet, and currently leave work still to be done, but a starting point? The more examples it is tried with, the better the code generation can be tailored. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] |
by pc2 (Beadle) on Apr 01, 2008 at 00:37 UTC | |
salutations, thank you for the answer. we will analyze your code, it seems to be a good starting point. of course, it will take some time for us because our Perl knowledge is not very advanced yet. | [reply] |
by BrowserUk (Patriarch) on Apr 01, 2008 at 00:51 UTC | |
by mobiusinversion (Beadle) on Mar 31, 2008 at 05:21 UTC | |
Do that and Ill post the solution ;) | [reply] |
by pc2 (Beadle) on Mar 31, 2008 at 22:26 UTC | |
salutations, thank you for the attention. the problem is that this problem seems difficult to explain without examples. but that examples we gave (the Sanskrit one and the abacAcaba one) are actually what we want to make. so, we thought it would be easier to formulate with several examples (of course, we were wrong, because several complex examples make it difficult to give only one solution that solves everything, right?). trying to state the actual problem, what we want is to be able to take a string of words (any words) joined by whatever rules of combination we may want to create between words (vowel joining, additional euphonic phoneme, consonant swapping, assimilation...) and then separate this string into the possible combinations of words and rules that may have formed it. maybe, based on a lexicon of the isolated wordforms that may have formed the phrase.do you have a solution for it? | [reply] |
by mobiusinversion (Beadle) on Apr 01, 2008 at 04:21 UTC | |