in reply to unglue words joined together by juncture rules

salutations,

we shall give an actual example of the problem we are trying to solve, based on Sanskrit (the best language we can think of for this particular problem, because of its many euphony rules). consider a lexicon of wordforms (which could be stored as a hash, with associated meanings) in which the letter "A" (long vowel) is distinct from the letter "a" (short vowel):

ziva  => Shiva (a name for god)
azvas => horse
zivA  => auspicious (f.)
Azvas => equestrian
also note that the words are in isolated forms, i. e. without any juncture rules.

consider the following word: zivAzvaH

and the phonetic rules, which apply between words and/or at the end of the sentence:
a|a => A
A|a => A
a|A => A
A|  => A
|A  => A
A|A => A
s|  => H
so, for example, ziva + azvas would give zivAzvaH. zivA + azvas would give zivAzvaH. zivA + Azvas would give zivAzvaH. thus, the possible segmentations of zivAzvaH would be:
ziva-azvas #meaning: Shiva's horse
zivA-azvas #meaning: auspicious' horse
ziva-Azvas #meaning: Shiva's equestrian
zivA-Azvas #meaning: auspicious' equestrian
of course, we only want to separate the possible words; whether it makes sense or not in the language is another story.
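The forward direction of these rules, and a brute-force way of recovering the four segmentations, could be sketched in Perl roughly as follows. This is only a minimal sketch under stated assumptions: `join_pair`, `finalize` and `segment_pairs` are hypothetical names, only two-word phrases are considered, and the rules `A|` and `|A` (which leave a lone A unchanged) are left implicit.

```perl
use strict;
use warnings;

# Isolated wordforms from the lexicon above (glosses omitted).
my @lexicon = qw(ziva azvas zivA Azvas);

# Boundary rule: a or A meeting a or A across the boundary fuses into one long A.
sub join_pair {
    my ($left, $right) = @_;
    if ($left =~ /[aA]\z/ && $right =~ /\A[aA]/) {
        return substr($left, 0, -1) . 'A' . substr($right, 1);
    }
    return $left . $right;
}

# Sentence-final rule: s| => H
sub finalize {
    my ($phrase) = @_;
    $phrase =~ s/s\z/H/;
    return $phrase;
}

# Generate-and-test: join every ordered pair of wordforms and keep the pairs
# whose surface form equals the input.
sub segment_pairs {
    my ($surface, @words) = @_;
    my @found;
    for my $left (@words) {
        for my $right (@words) {
            push @found, "$left-$right"
                if finalize(join_pair($left, $right)) eq $surface;
        }
    }
    return @found;
}

print "$_\n" for segment_pairs('zivAzvaH', @lexicon);
```

Trying every pair is quadratic in the size of the lexicon, so this is meant to illustrate the rules rather than to scale.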

there is yet another example of something we want it to be able to do (NOTE: this second example may be left as something to work on later): consider a language (which is the one we actually want to experiment with) with a word "abaca", and which has the following rules for joining words:

a + a = A.
last consonant of first word + first consonant of last word swap.
exemplifying:
abaca + abaca
= abaCa + aBaca
= abaBa + aCaba (consonants swap)
= ababAcaba (final form)
we would like to analyse "ababAcaba" and get:
abaca-abaca
this second situation seems much more complicated, but it is not a priority; maybe we should concentrate on the first one first.

Re^2: unglue words joined together by juncture rules
by BrowserUk (Patriarch) on Mar 30, 2008 at 18:09 UTC

    These are the inputs to and outputs from my code somewhere above, for the 3 examples you've supplied so far:

    {
        my %morphs = (
            t => { d => 'd' },
        );
        my @lex = qw[ cowboy cow boy cat do dog ];
        my $input = 'cowboycaddog';
        print "\n$input\n------------";
        deGlue{ print join '-', @_ } @lex, %morphs, $input;
    }
    {
        my %morphs = (
            aH => { o => 'dh', as => 't' },
            as => { aH => '' },
        );
        my @lex = qw[ krishnaH dhaavati naH dhaa namaH te ];
        my $input = 'Krishnodhaavatinamaste';
        print "\n$input\n----------";
        deGlue{ print join '-', @_ } @lex, %morphs, $input;
    }
    {
        my %morphs = (
            A => { a => 'a', A => 'a', a => 'A', A => 'A', '' => 'A', 'A' => '' },
            s => { H => '' },
        );
        my @lex = qw[ ziva Shiva azvas zivA Azvas ];
        my $input = 'zivAzvaH';
        print "\n$input\n-------------";
        deGlue{ print join '-', @_ } @lex, %morphs, $input;
    }
    __END__
    c:\test>675520

    cowboycaddog
    ------------
    cowboy-cad-dog
    cow-boy-cad-dog

    Krishnodhaavatinamaste
    ----------
    Krishno-dhaavati-namas-te

    zivAzvaH
    -------------
    ziv-AzvaH
    ziv-AzvaH-AzvaH    ## I'm investigating this anomaly.

    The main point of that code is that it constructs regexes to parse the data from the supplied lexicon and morpheme rules automatically.
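To illustrate the general idea of building a regex from a lexicon (this is not the deGlue code referred to above, which is not shown in this reply), here is a minimal sketch. It assumes a lexicon with no juncture rules at all, uses the hypothetical helper name `tokenize`, and returns only one greedy, longest-first segmentation, so it can miss valid alternatives such as cow-boy.

```perl
use strict;
use warnings;

my @lex = qw[ cowboy cow boy cat do dog ];

# Build one alternation from the lexicon, longest words first, so that
# "cowboy" is tried before "cow" and "dog" before "do".
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) || $a cmp $b } @lex;
my $word_re = qr/(?:$alternation)/;

# Greedy tokenizer: \G makes each match start exactly where the previous one
# ended. Returns the words, or an empty list if the input is not fully consumed.
sub tokenize {
    my ($string, $re) = @_;
    my @words = $string =~ /\G($re)/g;
    return join('', @words) eq $string ? @words : ();
}

print join('-', tokenize('cowboycatdog', $word_re)), "\n";
```

Because the tokenizer never backtracks over an earlier word choice, enumerating every segmentation would need a recursive or backtracking search instead.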

    It is incomplete and still leaves work to be done, but perhaps it is a starting point? The more examples it is tried with, the better the code generation can be tailored.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      salutations,

      thank you for the answer. we will analyze your code, it seems to be a good starting point. of course, it will take some time for us because our Perl knowledge is not very advanced yet.

        You're welcome, hope it helps. The code is fairly advanced, so if an explanation of any of it would help you, please ask.

        The code could be improved by seeing a few more examples of the simple case. The consonant-swapping type will need a different approach, I think.


Re^2: unglue words joined together by juncture rules
by mobiusinversion (Beadle) on Mar 31, 2008 at 00:11 UTC
    By the way, if anyone else is interested in transliterated Sanskrit, this post is most likely regarding the Harvard-Kyoto standard, which can be found here
Re^2: unglue words joined together by juncture rules
by mobiusinversion (Beadle) on Mar 31, 2008 at 05:21 UTC
    Would you simply state the problem you are trying to solve, rather than giving progressively more complex examples?

    Do that and I'll post the solution ;)

      salutations,

      thank you for the attention.

      the difficulty is that this problem seems hard to explain without examples. but the examples we gave (the Sanskrit one and the ababAcaba one) are actually what we want to do. so, we thought it would be easier to formulate it with several examples (of course, we were wrong, because several complex examples make it difficult to give a single solution that solves everything, right?).

      trying to state the actual problem, what we want is to be able to take a string of words (any words) joined by whatever rules of combination we may want to create between words (vowel joining, additional euphonic phoneme, consonant swapping, assimilation...) and then separate this string into the possible combinations of words and rules that may have formed it. maybe, based on a lexicon of the isolated wordforms that may have formed the phrase.

      do you have a solution for it?
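One possible way to frame the general problem is a depth-first search that builds phrases word by word with a pluggable joining rule and prunes any candidate that can no longer match the input. The sketch below is only an illustration under assumptions: `segment`, `join_pair` and `finalize` are hypothetical names, the joining rule may alter only the last character of what has been built so far (true for the Sanskrit vowel rules, not for consonant swapping), and every joined word must make the phrase longer.

```perl
use strict;
use warnings;

# Example rule set (Sanskrit case): boundary a/A + a/A => A, final s => H.
sub join_pair {
    my ($left, $right) = @_;
    return substr($left, 0, -1) . 'A' . substr($right, 1)
        if $left =~ /[aA]\z/ && $right =~ /\A[aA]/;
    return $left . $right;
}
sub finalize { my ($phrase) = @_; $phrase =~ s/s\z/H/; return $phrase }

# Enumerate every word sequence whose joined, finalized form equals $surface.
sub segment {
    my ($surface, $lexicon, $join, $final) = @_;
    my @results;
    my $search;
    $search = sub {
        my ($built, @path) = @_;
        push @results, join('-', @path)
            if @path && $final->($built) eq $surface;
        for my $word (@$lexicon) {
            my $next = $join->($built, $word);
            next if length($next) <= length($built);    # must make progress
            next if length($next) >  length($surface);  # already too long
            # Only the last character may still change, so the rest must match.
            my $stable = length($next) - 1;
            next if substr($next, 0, $stable) ne substr($surface, 0, $stable);
            $search->($next, @path, $word);
        }
    };
    $search->('');
    undef $search;    # break the closure's reference to itself
    return @results;
}

my @lexicon = qw(ziva azvas zivA Azvas);
print "$_\n" for segment('zivAzvaH', \@lexicon, \&join_pair, \&finalize);
```

Passing the rules as code references keeps the search independent of any one language; a lexicon without juncture rules can be handled by passing plain concatenation and an identity function.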

        I'll provide it shortly. please stay tuned.

        in the meantime, I recommend getting comfortable with Perl's regular expression variables and the qr// operator (very useful!). Here is an excerpt from the Perl 5.10 documentation.
        VARIABLES
          $_        Default variable for operators to use
          $`        Everything prior to matched string
          $&        Entire matched string
          $'        Everything after to matched string
          $1,$2...  Hold the Xth captured expr
          $+        Last parenthesized pattern match
          $^N       Holds the most recently closed capture
          $^R       Holds the result of the last (?{...}) expr
          @-        Offsets of starts of groups.
          @+        Offsets of ends of groups.
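As a small illustration of a few of these variables, the following sketch matches a precompiled qr// pattern against the Sanskrit example string and reports the pieces. The helper name `match_parts` is hypothetical and exists only for this example.

```perl
use strict;
use warnings;

# Return the prematch, match, postmatch and match offsets for a pattern,
# or an empty list if the pattern does not match.
sub match_parts {
    my ($text, $re) = @_;
    return unless $text =~ $re;
    return (
        pre   => $`,       # everything before the match
        match => $&,       # the matched text itself
        post  => $',       # everything after the match
        group => $1,       # first captured group
        start => $-[0],    # offset where the whole match starts
        end   => $+[0],    # offset just past the end of the whole match
    );
}

my %m = match_parts('zivAzvaH', qr/A(zva)/);
print "$_ = $m{$_}\n" for qw(pre match post group start end);
```

The offsets in @- and @+ are handy when a matched span has to be cut out of the string or replaced with substr.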
        Here is an application you will need:
        use strict;
        use Data::Dumper;

        # Character classes: consonants, vowels, and their complements.
        my $con  = qr/[b-df-hj-np-tv-xz]/;
        my $vow  = qr/[aeiouy]/;
        my $ncon = qr/[^b-df-hj-np-tv-xz]/;
        my $nvow = qr/[^aeiouy]/;

        my $x = 'battlestar galactica';
        my $y = 'silly ahab';
        ($x, $y) = map { swap($_) } ($x, $y);
        print Dumper([$x, $y]);

        # Swap the last consonant of the first word with the
        # first consonant of the following word.
        sub swap {
            my $x = shift;
            if ($x =~ /(${con})(${ncon}*?\b)(${ncon}*?)(${con})/) {
                $x = $` . $4 . $2 . $3 . $1 . $';
            }
            $x
        }
        Produces:
        $VAR1 = [
                  'battlestag ralactica',
                  'silhy alab'
                ];