unglue words joined together by juncture rules

pc2 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: unglue words joined together by juncture rules by roboticus (Chancellor) on Mar 21, 2008 at 20:57 UTC
pc2: The bioinformatics crowd seems to do a lot of work around this sort of problem ... matching up fragments of DNA and genes and other squishy bits. You might look around and see what sort of things they're working on. ...roboticus	[reply]
Re: unglue words joined together by juncture rules by Narveson (Chaplain) on Mar 21, 2008 at 21:56 UTC
Save your juncture rules in a hash. my %juncture_of = ( inm => 'imm', # immature abt => 'abst', # abstract adt => 'att', # attempt # etc. ); sub join_euphoniously { my $compound = join '', @_; while ( my ($raw, $joined) = each %juncture_of ) { $compound =~ s/$raw/$joined/; } return $compound; } sub unjoin { my $compound = shift; # maybe no juncture rules applied to this one my @possibilities = ( $compound ); # analyze whether $string resulted from a juncture rule while (my ($raw, $joined) = each %juncture_of ) { for my $possibility (@possibilities) { if ( $possibility =~ s/$joined/$raw/ ) { push @possibilities, $possibility } } } return \@possibilities; } [download] Still to do: decompose each possibility into dictionary words.	[reply] [d/l]
Re^2: unglue words joined together by juncture rules by oko1 (Deacon) on Mar 21, 2008 at 23:29 UTC
> Still to do: decompose each possibility into dictionary words. This just happens to be similar to something I've been fiddling with recently. `#!/usr/bin/perl -w my $str = "cowboycatdog"; chomp(my @list = sort { length($b) <=> length($a) } <DATA>); for my $word (@list){ my $tmp = $str; next unless $tmp =~ s/$word//; my @results; push @results, $word; my @rem = grep ! /^$word$/, @list; for my $w (@rem){ push @results, $w if $tmp =~ s/$w//; } next if length($tmp); push @out, \@results; } print "'$str' has the following anagrams:\n\n", map "@$_\n", @out; __DATA__ cowboy cow boy cat do dog` [download] The output looks like this: 'cowboycatdog' has the following anagrams: cowboy cat dog cow boy cat dog boy cow cat dog cat cowboy dog dog cowboy cat	[reply] [d/l]
Re: unglue words joined together by juncture rules by pc2 (Beadle) on Mar 22, 2008 at 12:08 UTC
thank you for the responses. more suggestions are welcome. let us explain the problem further. actually, what we are wanting to make is a word separator, based on a lexicon and a set of phonetic rules (of any kind, for example, I am -> I'm, vowel + vowel = long vowel (a + a = ā; a + a = aya)) an example of language that has these characteristics is Sanskrit. for example, "Krishnah" + "dhaavati" = "Krishnodhaavati", "namaH" + "te" = "namaste" (aH\|dh -> odh; aH\|t -> ast; as\| -> aH (as at the end of the string turns to aH)), etc. for making things easier, the lexicons used to form the strings are generally quite small, and how we generate it is actually irrelevant for this problem. for example, krishnodhaavati would be analysed by means of the lexicon ('krishnaH', 'dhaavati', 'naH', 'dhaa') (the set of chunks of the string that exist in the dictionary), but how we generate this lexicon doesn't matter, we just want to join its words based on a set of phonetic rules (aH\|dh -> odh, a\|e -> aye, a\|e -> ai, aH\|t -> ast, aH\|n -> on, for example) in order to obtain back the original string segmented in all its possibilities (namaH-te; krishnaH-dhaavati; namaH-namaH; etc.). of course, there will not always be only one possibility of segmentation, given a big quantity of juncture rules and a (relatively) big lexicon of possible morphemes.	[reply]
Re^2: unglue words joined together by juncture rules by halley (Prior) on Mar 24, 2008 at 18:42 UTC
I am hoping that this is for a good cause, but the cynic in me is wondering if this is somehow related to making those google-spamming search websites. You know, they end up making ads for "*Lowest prices for Cowboy Cat Dog* at `www.cowboy-cat-dog.com`!**" I'm glad that the free domain-tasting practice is falling out of favor with the registrars. -- `[ e d @ h a l l e y . c c ]`	[reply] [d/l]
Re^3: unglue words joined together by juncture rules by pc2 (Beadle) on Mar 26, 2008 at 11:53 UTC
funny answer. but we can assure you that it is definitely not what we want to do; we HATE when we receive those annoying spams and see those things in Google, it is like we are surrounded by bad people who only want to trick us and get away with this. it is just for a personal project of a morphological analyser. we are not mad at you, we perfectly understand your cynicism... it is hard those days to have good faith in humanity.	[reply]
Re: unglue words joined together by juncture rules by benizi (Hermit) on Mar 25, 2008 at 08:18 UTC
The "right" way to do this is to use Finite State Transducers. They're used quite a bit in morphological analysis (deconstructing a word into its morphemes). I enjoyed Finite State Morphology, by Lauri Karttunen and Kenneth R. Beesley. A lot of the material you'll find will be very academic, and the field is a bit Finnish-heavy (It has far richer morphology than English). But one of the attractive features of the technology is its run-time efficiency. There are a couple widely-used toolkits: Xerox Finite State Toolkit, which comes with the book I mentioned above. (might have licensing issues). And the MIT FST Toolkit. Some relevant acronyms are WFST, FSA, FSM, and FST for weighted finite state transducers, finite state automata, finite state machines, and finite state transducers. I'm pretty sure Google has a toolkit that's relevant, but I can't seem to find it (I think it uses yet-another acronym for a class of machines that contains WFST's.) None of these is a Perl solution. Update: fixed Wikipedia link Update 2: The Google Research-related kit is OpenFST.	[reply]
Re: unglue words joined together by juncture rules by Gavin (Archbishop) on Mar 22, 2008 at 14:44 UTC
Have you looked at the Prolog modules on CPAN	[reply]
Re: unglue words joined together by juncture rules by mobiusinversion (Beadle) on Apr 01, 2008 at 05:06 UTC
Here it is. First, what you need to know: You will be responsible for installing your rules into the ungluer via a dispatch table of anonymous subroutines. Please note: I consider that to be intermediate Perl programming. So if you are looking for something trivial, stop reading now. These functions should return an array of arrays. In their simplest form, these functions could take in 1 word and return 2. Keep this in mind: You will be installing inverted associations. So instead of installing rules of the form: `t\|d => dd` [download] you will be installing rules of the form: `dd => t\|d` [download] The solution is in a subroutine (below) called functional_unglue and is called like this: `my @results = functional_unglue( @arguments );` [download] Here is a template you should use: `use strict; use Data::Dumper; my @results = functional_unglue ( target => TARGET_STRING, lexicon => { morpheme_1 => undef, morpheme_2 => undef, ... morpheme_n => undef, }, pre_images => { pre_image_function_1 => sub { CODE }, pre_image_function_2 => sub { CODE }, ... pre_image_function_n => sub { CODE }, } ); @results = map{join('-',@$_)}@results; print Dumper([@results]);` [download] If you think about it, that is the only way a solution makes sense; You want your models to be extensible and you want to be able to change your minds later, and so it must be up to you to install new rules. Here is what you get in return: This solution promises to apply all of your rule sets, and recover all of the possible ungluings, in the fastest and most memory efficient way possible, without any knowledge of the rules themselves. Here it is: `sub functional_unglue { my %x = @_; my $x = $x{target}; my $l = $x{lexicon}; my $f = $x{pre_images}; my @q = ([$x]); my @r; while(@q){ my $t = shift @q; my @w = @{$t}; my $w = pop @w; for(keys %$f){ if(my @y = $f->{$_}->($w,$l,$f)){ result: for my $i(0..$#y){ my $n = $#{$y[$i]} == 0 ? 0 : $#{$y[$i]} - 1; for my $j(0..$n){ next result unless exists $l->{$y[$i][$j]} } push @q, [@w,@{$y[$i]}] } } } if(exists $l->{$t->[$#{$t}]}){ push @r, $t } } @r }` [download] So, for example, use it like this: use strict; use Data::Dumper; my @z = functional_unglue ( target => 'cowboycaddog', lexicon => { cowboy => undef, cow => undef, boy => undef, cat => undef, dog => undef, }, pre_images => { concat => sub { ### this turns one long word into two my($x,$l) = @_; my @x; for(keys %$l){ my $s = $_; if($x =~ /^\Q$s\E/){ push @x, [$s,substr($x,length($s))]; } } @x }, simple => sub { ### this turns "WORD1ddWord2" into [WORD1t, dWORD2] my $x = shift; my @x; my %x = ( dd => [ [qw(t d)], ], ); ### it can handle arbitrary substitions, ### not just dd => t\|d, but XY => A\|B ### for any strings X,Y,A,B for(keys %x){ my $s = $_; if($x =~ /\Q$s\E/){ for my $i(0..$#{$x{$s}}){ my @y = ( $`.$x{$s}->[$i][0], $x{$s}->[$i][1].$' ); push @x, length($y[1]) ? [@y] : [$y[0]] } } } @x }, } ); @z = map{join('-',@$_)}@z; print Dumper([@z]); [download] Produces: `$VAR1 = [ 'cowboy-cat-dog', 'cow-boy-cat-dog' ];` [download] Or you could try this: my @z = functional_unglue ( target => 'zivAzvaH', lexicon => { ziva => 1, azvas => 1, zivA => 1, Azvas => 1, }, pre_images => { simple => sub { my $x = shift; my @x; my %x = ( A => [[qw(a a)], [qw(A a)], [qw(a A)], ['A',''], ['','A'], [qw(A A)]], H => [['s','']], ); for(keys %x){ my $s = $_; if($x =~ /\Q$s\E/){ for my $i(0..$#{$x{$s}}){ my @y = ($`.$x{$s}->[$i][0], $x{$s}->[$i][1].$'); push @x, length($y[1]) ? [@y] : [$y[0]] } } } @x }, } ); @z = map{join('-',@$_)}@z; print Dumper([@z]); [download] Produces: `$VAR1 = [ 'ziva-azvas', 'zivA-azvas', 'ziva-Azvas', 'zivA-Azvas' ];` [download] Now why this works and how this works is your job to figure out. (It sounds like you need to get some more advanced books on Perl, try Object Oriented Perl and Higher Order Perl) Try the consonant transposition problem on your own. If you get stuck after trying for a few days, post back, and I'll show you how. Finally, look how easy it is to handle unlimited numbers of rules etc: use strict; use Data::Dumper; my @z = functional_unglue ( target => 'boycowboycaddogdodogcaddyeyescatdogboycaddyecatt', lexicon => { cowboy => undef, caddy => undef, boytoy => undef, co => undef, cow => undef, boy => undef, cat => undef, cats => undef, do => undef, dog => undef, dye => undef, eddie => undef, eyes => undef, kowtow => undef, toy => undef, tow => undef, tyco => undef, yes => undef, ye => undef, }, pre_images => { concat => sub { my($x,$l) = @_; my @x; for(keys %$l){ my $s = $_; if($x =~ /^\Q$s\E/){ push @x, [$s,substr($x,length($s))]; } } @x }, simple => sub { my $x = shift; my @x; my %x = ( dd => [[qw(t d)]], db => [['d','']], td => [[qw(d t)]], td => [[qw(d d)]], tt => [['t','']], ye => [[qw(eye eye)]], caddye => [[qw(cats eyes)]], ); for(keys %x){ my $s = $_; if($x =~ /\Q$s\E/){ for my $i(0..$#{$x{$s}}){ my @y = ( $`.$x{$s}->[$i][0], $x{$s}->[$i][1].$' ); push @x, length($y[1]) ? [@y] : [$y[0]] } } } @x }, } ); @z = map{join('-',@$_)}@z; print Dumper([@z]); [download] Produces: 'boy-cowboy-cat-dog-do-dog-caddy-eyes-cat-dog-boy-cats-eyes-cat', 'boy-cowboy-cat-dog-do-dog-caddy-eyes-cat-dog-boy-cat-dye-cat', 'boy-cow-boy-cat-dog-do-dog-caddy-eyes-cat-dog-boy-cats-eyes-cat', 'boy-cow-boy-cat-dog-do-dog-caddy-eyes-cat-dog-boy-cat-dye-cat', 'boy-cowboy-cat-dog-do-dog-cats-eyes-yes-cat-dog-boy-cats-eyes-cat', 'boy-cowboy-cat-dog-do-dog-cats-eyes-yes-cat-dog-boy-cat-dye-cat', 'boy-cowboy-cat-dog-do-dog-cat-dye-yes-cat-dog-boy-cats-eyes-cat', 'boy-cowboy-cat-dog-do-dog-cat-dye-yes-cat-dog-boy-cat-dye-cat', 'boy-cow-boy-cat-dog-do-dog-cats-eyes-yes-cat-dog-boy-cats-eyes-cat', 'boy-cow-boy-cat-dog-do-dog-cats-eyes-yes-cat-dog-boy-cat-dye-cat', 'boy-cow-boy-cat-dog-do-dog-cat-dye-yes-cat-dog-boy-cats-eyes-cat', 'boy-cow-boy-cat-dog-do-dog-cat-dye-yes-cat-dog-boy-cat-dye-cat' In under 1/100th of a second. Happy hunting!	[reply] [d/l] [select]
Re: unglue words joined together by juncture rules by pc2 (Beadle) on Mar 26, 2008 at 16:10 UTC
thank you for the answers. unfortunately, we don't have resources for researching about finite state transducers (which we have already heard about). we are trying to find a solution. we have already posted a question similar to this one (http://www.perlmonks.org/?node_id=658691), and the user GrandFather has given a very interesting partial solution (which uses a recursive search to find all matching combinations of cowboycatdog, based on the lexicon cowboy, cow, boy, cat, at, do, dog): use strict; use warnings; my $target = "cowboycatdog"; my @partsList = qw(cowboy cow boy cat at do dog); my %partsLu; ++$partsLu{$_} for @partsList; search ($target, {%partsLu}, []); sub search { my ($target, $partsLu, $used) = @_; unless (length $target) { print join ("-", @$used), "\n"; return; } for my $part (keys %$partsLu) { next unless 0 == index $target, $part; my $remainder = substr $target, length $part; delete $partsLu->{$part} unless --$partsLu->{$part}; search ($remainder, {%$partsLu}, [@$used, $part]); } } [download] which prints: cow-boy-cat-dog cowboy-cat-dog the only limitation with this method regarding this problem is that it doesn't consider that the words may have been joined together by phonetic rules (for example, cowboycaddog). so, would it be a good idea trying to adapt the above code, or creating one that uses a similar method? any suggestions?	[reply] [d/l]
Re^2: unglue words joined together by juncture rules by grizzley (Chaplain) on Mar 27, 2008 at 10:46 UTC
You need only one more thing I think. And it is look-ahead assertion. Am I right, thinking, that you want match if next letter after 'cad' is 'd' or next phrase after 'namaH' is 'te'? #!perl -l use strict; use warnings; my $target = "cowboycaddog"; my @partsList = ('cowboy', 'cow', 'boy', 'cat', 'at', 'do', 'dog', 'ca +d(?=d)'); my %partsLu; ++$partsLu{$_} for @partsList; search ($target, {%partsLu}, []); sub search { my ($target, $partsLu, $used) = @_; unless (length $target) { print join ("-", @$used), "\n"; return; } for my $part (keys %$partsLu) { my $tmp = $target; my $re = qr/$part/; next unless $tmp=~s/^$re//; delete $partsLu->{$part} unless --$partsLu->{$part}; search ($tmp, {%$partsLu}, [@$used, $part]); } } [download] Update: Of course strings printed as result will include these assertions, but the problem how to remove `(?=something)` from output IMHO can be left as an exercise for the reader (always wanted to say that :) )	[reply] [d/l] [select]
Re: unglue words joined together by juncture rules by mobiusinversion (Beadle) on Mar 28, 2008 at 02:16 UTC
Consider the word: `'cowboycaddogcowcowboyboycatdodogcaddodocowboytoycaddye'` [download] the alphabet: `cowboy caddy boytoy co cow boy cat cats do dog dye eddie eyes kowtow toy tow tyco` [download] and the juncture rules: `t\|d => dd cats\|eyes => caddye` [download] There are 24 ungluings and I am posting them here. The algorithm I used is of a branch-and-bound genus, and employs a breadth first search on the decision space. Here are the details: Potential ungluings are added to a queue. Elements of the queue are rejected when no possible ungluings can be formed on their right hand side (i.e. their as-of-yet-not-unlgued-bit). The solution and runtime stats for this example are as follows: 0.0215 seconds 143 iterations largest queue memory usage: 15,612 bytes largest queue length: 24 items cow-boy-cat-dog-cow-cow-boy-boy-cat-do-dog-cat-do-do-cow-boy-toy-cats-eyes cowboy-cat-dog-cow-cow-boy-boy-cat-do-dog-cat-do-do-cow-boy-toy-cats-eyes cow-boy-cat-dog-cow-cowboy-boy-cat-do-dog-cat-do-do-cow-boy-toy-cats-eyes cow-boy-cat-dog-cow-cow-boy-boy-cat-do-dog-cat-do-do-cowboy-toy-cats-eyes cow-boy-cat-dog-cow-cow-boy-boy-cat-do-dog-cat-do-do-cow-boytoy-cats-eyes cowboy-cat-dog-cow-cowboy-boy-cat-do-dog-cat-do-do-cow-boy-toy-cats-eyes cowboy-cat-dog-cow-cow-boy-boy-cat-do-dog-cat-do-do-cowboy-toy-cats-eyes cowboy-cat-dog-cow-cow-boy-boy-cat-do-dog-cat-do-do-cow-boytoy-cats-eyes cow-boy-cat-dog-cow-cowboy-boy-cat-do-dog-cat-do-do-cowboy-toy-cats-eyes cow-boy-cat-dog-cow-cowboy-boy-cat-do-dog-cat-do-do-cow-boytoy-cats-eyes cow-boy-cat-dog-cow-cow-boy-boy-cat-do-dog-cat-do-do-cow-boy-toy-cat-dye cowboy-cat-dog-cow-cowboy-boy-cat-do-dog-cat-do-do-cowboy-toy-cats-eyes cowboy-cat-dog-cow-cowboy-boy-cat-do-dog-cat-do-do-cow-boytoy-cats-eyes cowboy-cat-dog-cow-cow-boy-boy-cat-do-dog-cat-do-do-cow-boy-toy-cat-dye cow-boy-cat-dog-cow-cowboy-boy-cat-do-dog-cat-do-do-cow-boy-toy-cat-dye cow-boy-cat-dog-cow-cow-boy-boy-cat-do-dog-cat-do-do-cowboy-toy-cat-dye cow-boy-cat-dog-cow-cow-boy-boy-cat-do-dog-cat-do-do-cow-boytoy-cat-dye cowboy-cat-dog-cow-cowboy-boy-cat-do-dog-cat-do-do-cow-boy-toy-cat-dye cowboy-cat-dog-cow-cow-boy-boy-cat-do-dog-cat-do-do-cowboy-toy-cat-dye cowboy-cat-dog-cow-cow-boy-boy-cat-do-dog-cat-do-do-cow-boytoy-cat-dye cow-boy-cat-dog-cow-cowboy-boy-cat-do-dog-cat-do-do-cowboy-toy-cat-dye cow-boy-cat-dog-cow-cowboy-boy-cat-do-dog-cat-do-do-cow-boytoy-cat-dye cowboy-cat-dog-cow-cowboy-boy-cat-do-dog-cat-do-do-cowboy-toy-cat-dye cowboy-cat-dog-cow-cowboy-boy-cat-do-dog-cat-do-do-cow-boytoy-cat-dye The code is relatively simple. Say the word and Ill post it. (In case you'd like to have the crack at it yourself). Could you share an actual example of your problem with us?	[reply] [d/l] [select]
Re^2: unglue words joined together by juncture rules by pc2 (Beadle) on Mar 30, 2008 at 12:00 UTC
salutations, thank you for the answer, but we are not sure we understood what you meant in the penultimate line of your answer. EDIT (Mar 31): in the expression between parenthesis ("In case you'd like to have the crack at it yourself").	[reply]
Re^3: unglue words joined together by juncture rules by mobiusinversion (Beadle) on Mar 30, 2008 at 23:01 UTC
Penultimate, meaning second to last? `Online Etymology Dictionary penultimate (adj.) 1677, from earlier penultima (n.) "the next to the last syllable of a +word or verse," from fem. of L. adj. penultimus "next-to-last," from +pęne "almost" + ultimus "final." Online Etymology Dictionary, - Douglas Harper` [download] So "say the word and Ill post it" was unclear??? If thats what you you are referring to than I meant that if you wanted the code that solves your problem, Id give it to you if you said so. I just didnt want to spoil the fun (the description of the code in the first few paragraphs is more than enough to solve it in under an hour) edit (april 01): regarding "to have a crack at it", that is both american and australian slang meaning "to try something". You can see the urban dictionary definition. Other tagged definitions in the urban dictionary include: give it a go have a try jump right in have a go to try	[reply] [d/l]
Re: unglue words joined together by juncture rules by BrowserUk (Patriarch) on Mar 27, 2008 at 14:25 UTC
This needs adapting to handle the multi-character morphems in your Sanscrit example and a post process stage to convert morphed words to their lexicon spellings: #! perl -slw use strict; use Data::Dump qw[ pp ]; sub deGlue (&\@\%@) { use re 'eval'; my $codeRef = shift; my $callback = sub { my $s = $_; my @words = map{ defined $-[ $_ ] && defined $+[ $_ ] ? substr( $s, $-[ $_ ], $+[ $_ ] - $-[ $_ ] ) : () } 1 .. $#-; $codeRef->( @words ); }; my @lex = @{ shift() }; my $morphRef = shift; for ( @lex ) { my( $pre, $last ) = m[(.*)(.)]; my $morph = $morphRef->{ $last } or next; $_ = "$pre(?:$last\|$morph->[ 0 ](?=$morph->[1]))" } my $re = qr[ ^ (?:( ${ \ join( ')\|(', @lex ) } ))+ $ (??{ $callback->() }) (?!) ]x; m[$re] for @_; return; } my %morphs = ( t => [ 'd' , 'd' ], ); my @lex = qw[cowboy cow boy cat do dog ]; my $input = 'cowboycaddog'; deGlue{ print join '-', @_ } @lex, %morphs, $input; [download] Produces: `c:\test>675520 cowboy-cad-dog cow-boy-cad-dog` [download] A longer sample input with the related morphems and lexicon clearly identified (I don't know Sanscrit :), would allow better testing. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply] [d/l] [select]
Re: unglue words joined together by juncture rules by pc2 (Beadle) on Mar 30, 2008 at 14:25 UTC
salutations, we shall give an actual example of the problem we are trying to solve, based on Sanskrit (which is the best language we can think of for this particular problem, for the many euphony rules it has). consider the lexicon of wordforms (which could be in the form of a hash, with associated meaning) where letter "A" (long vowel) is different from letter "a" (short vowel): `ziva => Shiva (a name for god) azvas => horse zivA => auspicious (f.) Azvas => equestrian` [download] also note that the words are in isolated forms, i. e. without any juncture rules. consider the following word: zivAzvaH and the phonetic rules, which occur between words and/or in the final of the sentence: `a\|a => A A\|a => A a\|A => A A\| => A \|A => A A\|A => A s\| => H` [download] so, for example, ziva + azvas would give zivAzvaH. zivA + azvas would give zivAzvaH. zivA + Azvas would give zivAzvaH. thus, the possible segmentations of zivAzvaH would be: `ziva-azvas #meaning: Shiva's horse zivA-azvas #meaning: auspicious' horse ziva-Azvas #meaning: Shiva's equestrian zivA-Azvas #meaning: auspicious' equestrian` [download] of course, we only want to separate the possible words; whether it makes sense or not in the language is another story. there is yet another example of something we want it to be able to do (NOTE: this second example may be left as something to work on later, maybe): consider a language (which is actually what we are willing to experiment) with a word "abaca", and which has the following rules for joining words: `a + a = A. last consonant of first word + first consonant of last word swap.` [download] exemplifying: `abaca + abaca = abaCa + aBaca = abaBa + aCaba (consonants swap) ababAcaba (final form)` [download] we would like to analyse "ababAcaba" and get: `abaca-abaca` [download] this second situation seems much more complicated, but is not prioritary, maybe we should first concentrate on the first one.	[reply] [d/l] [select]
Re^2: unglue words joined together by juncture rules by mobiusinversion (Beadle) on Mar 31, 2008 at 00:11 UTC
By the way, if anyone else is interested in transliterated Sanskrit, this post is most likely regarding the Harvard-Kyoto standard, which can be found here	[reply]
Re^2: unglue words joined together by juncture rules by BrowserUk (Patriarch) on Mar 30, 2008 at 18:09 UTC
These are the input to and outputs from my code somewhere above, for the 3 examples you;ve supplied so far: { my %morphs = ( t => { d => 'd' }, ); my @lex = qw[ cowboy cow boy cat do dog ]; my $input = 'cowboycaddog'; print "\n$input\n------------"; deGlue{ print join '-', @_ } @lex, %morphs, $input; } { my %morphs = ( aH => { o => 'dh', as => 't' }, as => { aH => '' }, ); my @lex = qw[ krishnaH dhaavati naH dhaa namaH te ]; my $input = 'Krishnodhaavatinamaste'; print "\n$input\n----------"; deGlue{ print join '-', @_ } @lex, %morphs, $input; } { my %morphs = ( A => { a => 'a', A => 'a', a => 'A', A => 'A', '' => 'A', 'A' +=> '' }, s => { H => '' }, ); my @lex = qw[ ziva Shiva azvas zivA Azvas ]; my $input = 'zivAzvaH'; print "\n$input\n-------------"; deGlue{ print join '-', @_ } @lex, %morphs, $input; } __END__ c:\test>675520 cowboycaddog ------------ cowboy-cad-dog cow-boy-cad-dog Krishnodhaavatinamaste ---------- Krishno-dhaavati-namas-te zivAzvaH ------------- ziv-AzvaH ziv-AzvaH-AzvaH ## I'M investigating this anomoly. [download] The main point of that code is that it constructs regexes to parse the data from the supplied lexicon and morpheme rules automatically. Incomplete yet, and currently leave work still to be done, but a starting point? The more examples it is tried with, the better the code generation can be tailored. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply] [d/l]
Re^3: unglue words joined together by juncture rules by pc2 (Beadle) on Apr 01, 2008 at 00:37 UTC
salutations, thank you for the answer. we will analyze your code, it seems to be a good starting point. of course, it will take some time for us because our Perl knowledge is not very advanced yet.	[reply]
Re^4: unglue words joined together by juncture rules by BrowserUk (Patriarch) on Apr 01, 2008 at 00:51 UTC
Re^2: unglue words joined together by juncture rules by mobiusinversion (Beadle) on Mar 31, 2008 at 05:21 UTC
Would you simply state the problem you are trying to solve, rather than progressively complex examples? Do that and Ill post the solution ;)	[reply]
Re^3: unglue words joined together by juncture rules by pc2 (Beadle) on Mar 31, 2008 at 22:26 UTC
salutations, thank you for the attention. the problem is that this problem seems difficult to explain without examples. but that examples we gave (the Sanskrit one and the abacAcaba one) are actually what we want to make. so, we thought it would be easier to formulate with several examples (of course, we were wrong, because several complex examples make it difficult to give only one solution that solves everything, right?). trying to state the actual problem, what we want is to be able to take a string of words (any words) joined by whatever rules of combination we may want to create between words (vowel joining, additional euphonic phoneme, consonant swapping, assimilation...) and then separate this string into the possible combinations of words and rules that may have formed it. maybe, based on a lexicon of the isolated wordforms that may have formed the phrase. do you have a solution for it?	[reply]
Re^4: unglue words joined together by juncture rules by mobiusinversion (Beadle) on Apr 01, 2008 at 04:21 UTC