Sorry, I do use Kakasi as the main tool in my search engine now.
But I sometimes use Chasen for individual documents, since
I am under the impression that it is slower but more flexible and
more sophisticated.
I only mentioned Chasen because I remembered Nara and clustering,
and that brought Chasen to mind.
For those who are not familiar with either tool, they are
morphological analyzers of Japanese text. They are similar, and
both are generally used to split a chunk of text into individual
words (Japanese words are not usually separated by spaces) and
to get the phonetic reading of those words (usually in the roman
alphabet).
Obviously this is enabling technology. The name
of Kakasi is in fact a kind of wordplay: read backwards
phonetically, you get the name of a popular front end processor that
takes roman alphabet input and interactively picks the correct
characters based on that phonetic reading and the context.
I believe Kakasi is focused more on workaday speed and usability,
while Chasen might be more flexible. In particular, I seem to
remember some interesting use of Chasen in document clustering
work done in Nara and elsewhere. I couldn't find the exact page,
but Google will help you explore the field.
Personally, I use these tools in the custom search engines I
build, usually either completely in Perl or with plugins from
projects like the above. They seem to be mainly useful
for building an inverted index to search a lot of text quickly,
but I have a small (a few megabytes) Japanese database that
works fine with just (Japanese) regexes.
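
For anyone who has not seen an inverted index, here is a minimal
sketch, in Python rather than Perl. It does not call Kakasi or
Chasen: as a stand-in for real word splitting it indexes overlapping
character bigrams, which is a different and cruder technique. The
names bigrams, build_index and search are made up for illustration.

```python
from collections import defaultdict


def bigrams(text):
    """Return the set of overlapping two-character substrings of text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def build_index(docs):
    """Map each bigram to the set of document ids that contain it.

    docs is a dict of {doc_id: text}. A real system would index the
    words produced by a morphological analyzer instead of bigrams.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in bigrams(text):
            index[gram].add(doc_id)
    return index


def search(index, docs, query):
    """Return sorted ids of documents containing query as a substring."""
    grams = bigrams(query)
    if not grams:
        # Queries shorter than two characters have no bigrams,
        # so fall back to a plain linear scan.
        return sorted(d for d, text in docs.items() if query in text)
    # Only documents containing every bigram of the query can match.
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Bigram overlap alone can give false positives, so confirm each one.
    return sorted(d for d in candidates if query in docs[d])
```

The index narrows the search to a few candidate documents, and the
final substring check does the same job a regex scan would, only on
far less text. For a database of a few megabytes the scan alone is
usually fast enough, which matches my experience above.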
I think it would be very interesting if Perl programmers could
easily use state of the art computational linguistics or "A.I." algorithms
(beyond, I guess, what is already in Perl) to make Perl even
more intelligent and perhaps automate some of the programming
work. For example, someone just gave me three nasty scripts to
refactor together and update for 5.6.1. Maybe Perl could learn
to tell me "Yep, those are real nasty scripts, better rewrite from
scratch," or perhaps give me other insights into the code.
I am not a computational linguist, just interested. There is
an awful lot of science there, so if anybody has insights about
it, please share them with the rest of us.