PerlMonks  

Re: Re: Perl and Linguistics

by Hanamaki (Chaplain)
on May 26, 2002 at 16:21 UTC ( [id://169406]=note )


in reply to Re: Perl and Linguistics
in thread Perl and Linguistics

I have used C++ based tools on linux for Japanese morphological analysis, such as chasen. Such tools are critical in Japanese and are used in indexing for a search engine ...

While I use Chasen (which, by the way, has some rudimentary Perl bindings) almost every day, I am curious whether Chasen is really a good choice for a search engine. Considering its speed, and how difficult it is to update Chasen's dictionary and to tune it for specific (topic) domains, it would be nice to hear more about your experience with Chasen.

For simple search engines I usually prefer a simple longest-match algorithm, as provided by Kakasi or my own tools.
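For readers who have not met longest match before, here is a minimal sketch of the idea in Python. It is only an illustration: the function name `longest_match_segment` and the toy dictionary are made up for this example, and it is not how Kakasi itself is implemented. At each position it takes the longest dictionary word that starts there and falls back to a single character when nothing matches.

```python
def longest_match_segment(text, dictionary, max_len=None):
    """Greedy left-to-right segmentation using the longest dictionary match."""
    if max_len is None:
        # Longest entry in the dictionary bounds how far ahead we need to look.
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a word matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
        else:
            # No dictionary word starts here: emit one character as unknown.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical toy dictionary, for illustration only.
TOY_DICTIONARY = {"東京", "東京都", "京都", "都", "に", "住む"}

print(longest_match_segment("東京都に住む", TOY_DICTIONARY))
```

Note that the greedy choice picks 東京都 rather than 東京 followed by 都. That is the whole trick, and also its weakness: it never reconsiders a match, which is where a real morphological analyzer such as Chasen does more work.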

Re: Re: Re: Perl and Linguistics
by mattr (Curate) on May 28, 2002 at 09:10 UTC
    Sorry, I do use Kakasi as the main tool in my search engine now. But I sometimes use Chasen for individual documents, since I am under the impression that it is slower but more flexible and more sophisticated. I only mentioned Chasen because I remembered Nara and clustering, and that brought Chasen to mind.

    For those who are not familiar with either tool, they are morphological analyzers of Japanese text. They are similar, though, and are generally used to split a chunk of text into individual words (Japanese words are not usually separated by spaces) and to get the phonetic reading of those words (usually in the Roman alphabet).

    Obviously this is enabling technology. The name Kakasi is in fact a kind of reversed name: read backwards phonetically, it gives the name of a popular front-end processor which takes Roman-alphabet input and interactively picks the correct characters based on that phonetic reading and the context.

    I believe Kakasi is focussed more on workaday speed and usability, while Chasen might be more flexible. In particular, I seem to remember some interesting use of Chasen in document clustering work done in Nara and elsewhere. I couldn't find the exact page, but Google will help you look at the field. Personally, I use these tools in the custom search engines I build, usually either completely in Perl or with plugins from projects like the above. They seem to be mainly useful for building an inverted index to search a lot of text quickly, but I have a small (a few megabytes) Japanese database that works fine with just (Japanese) regexes.
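To make the inverted-index remark concrete, here is a rough sketch in Python. It assumes the documents have already been split into words by a segmenter such as Kakasi or Chasen; the function names and the three toy documents are hypothetical and only for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


def search(index, query_tokens):
    """Return the sorted ids of documents containing every query token."""
    if not query_tokens:
        return []
    # .get() avoids creating empty entries for unknown tokens.
    postings = [index.get(token, set()) for token in query_tokens]
    return sorted(set.intersection(*postings))


# Hypothetical pre-segmented documents, for illustration only.
DOCS = {
    1: ["東京都", "に", "住む"],
    2: ["京都", "に", "行く"],
    3: ["東京都", "に", "行く"],
}
INDEX = build_inverted_index(DOCS)

print(search(INDEX, ["東京都", "行く"]))
```

A lookup only touches the posting sets for the query words, which is why this scales to a lot of text, whereas a regex scan has to read every document on every query. For a few megabytes the scan is fast enough that the index is not worth the trouble.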

    I think it would be very interesting if Perl programmers could easily use state-of-the-art computational linguistics or "A.I." algorithms (besides, I guess, what is already in Perl) to make Perl even more intelligent and perhaps automate some programming tasks. For example, someone just gave me three nasty scripts to refactor together and update for 5.6.1. Maybe Perl could learn to tell me "Yep, those are real nasty scripts, better rewrite from scratch," or perhaps give me other insights into the code.

    I am not a computational linguist, just interested. There is an awful lot of science there, so if anybody has insights about it, please share them with the rest of us.
