comment on

Sorry I do use kakasi as main tool in search engine now. But I use chasen sometimes for individual documents since I am under the impression that it is slower, more flexible, more sophisticated. I just mentioned Chasen because I remembered Nara and clustering, and that gave me chasen.

For those who are not familiar with either tool, they are morphological analyzers of Japanese text. They are similar, though and generally are used to split a chunk of text into individual words (Japanese words are not usually separated by spaces) and to get the phonetic reading of those words (usually in roman alphabet).

Obviously this is enabling technology. The name of Kakasi in fact is a kind of palindrome, in that read backwards phonetically you get the name of a popular front end processor which will take roman alphabet input and interactively pick the correct characters based on that phonetic reading and the context.

I believe Kakasi is focussed more on workaday speed and useability while chasen might be more flexible. In particular there is some interesting use of chasen in document clustering work done in Nara and elsewhere I seem to remember. Couldn't find the exact page but google will help you look at the field. Personally where I use these tools is in custom search engines I build, usually either completely in Perl or with plugins from projects like the above. They are mainly useful it seems in building an inverted index to search a lot of text quickly but I have a small (a few megabytes) Japanese database that works fine just with (Japanese) regexes.

I think it would be very interesting if Perl programmers could easily use state of the art computational linguistics or "A.I." algorithms (besides I guess what are already in perl) to make perl even more intelligent and perhaps automate some of the programming task. For example someone just gave me three nasty scripts to refactor together and update for 5.6.1, maybe perl could learn to tell me "Yep, those are real nasty scripts, better rewrite from scratch," or perhaps give me other insights into the code.

I am no a computational linguist, just interested. There is an awful lot of science there, so if anybody has insights about it please share with the rest of us.

In reply to Re: Re: Re: Perl and Linguistics by mattr
in thread Perl and Linguistics by doonyakka

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


Do you know where your variables are?
	PerlMonks