PerlMonks  

Moby Dick in 5 secs

by stabu (Scribe)
on Sep 22, 2004 at 09:58 UTC

Hi!

On the subject of text mining, I'd like to draw your attention to this InfoWorld article, which features a text analysis solution for user forums, of all things, and claims to be able to "diagram" Moby Dick in 5 secs.

Sure, it smells a bit like FUD, but I would have thought it is also something well within Perl's reach.

So, is this old technology being dressed up for new needs? Or is it some sort of breakthrough?

Feelings, opinions, anyone?

Replies are listed 'Best First'.
Re: Moby Dick in 5 secs
by bwelch (Curate) on Sep 22, 2004 at 16:36 UTC
    I'm looking at similar ways to categorize and mine unstructured text. One commercial product we tried (not one of those mentioned in the article) worked for smaller sets of data (under one million) and under 2000 categories. Beyond that it didn't scale well at all: memory utilization grew considerably and CPU remained the primary bottleneck. One job with 10,000 categories that we started on a fairly new Sun box used 30GB RAM and was projected to take 570 days to finish!

    On a Linux cluster, this looks much more reasonable for significantly larger data sets. By giving sets of source articles to each cluster node and storing results in various database tables, the work of parsing, tagging, tokenizing, and categorizing can be divided into relatively independent jobs. The only dependency so far might be in database connections, but hopefully connection pooling can take care of that. One of these days I mean to write up the whole thing as a type of Perl data mining RFC. (Sound useful?)
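
    The division of work described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the names partition_articles, tokenize, and process_batch are hypothetical, the batches run one after another in a single process instead of on real cluster nodes, and the database storage step is left out entirely.

```perl
use strict;
use warnings;

# Deal the articles out round-robin into one batch per (hypothetical) node.
sub partition_articles {
    my ($articles, $node_count) = @_;
    my @batches = map { [] } 1 .. $node_count;
    for my $i (0 .. $#$articles) {
        push @{ $batches[ $i % $node_count ] }, $articles->[$i];
    }
    return \@batches;
}

# Tokenize one article: lowercase it and pull out word-like runs.
sub tokenize {
    my ($text) = @_;
    return [ lc($text) =~ /[a-z0-9']+/g ];
}

# What a single node would do with its batch. It needs nothing from the
# other batches, which is what makes the jobs independent.
sub process_batch {
    my ($batch) = @_;
    my %counts;
    for my $article (@$batch) {
        $counts{$_}++ for @{ tokenize($article) };
    }
    return \%counts;
}

my @articles = ('Call me Ishmael.', 'The whale, the whale!', 'Thar she blows');
my $batches  = partition_articles(\@articles, 2);
for my $node (0 .. $#$batches) {
    my $counts = process_batch($batches->[$node]);
    printf "node %d: %d distinct tokens\n", $node, scalar keys %$counts;
}
```

    Each batch only touches its own articles, so in a real setup the loop body is the part that would be handed to a node, with the per-batch counts written to a database table instead of printed.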

    Some of the ideas mentioned in the article are new to me. From the article:

    "PowerDrill takes the unstructured data, namely sentences, and diagrams the sentences placing each part of speech, such as noun phrase, verb phrase, and prepositional phrase, into a separate field, actor, action, and object which can then be used by a standard database to discover relationships and trends."

    So instead of, or maybe in addition to, using a term list, dictionary, and thesaurus, they are trying to extract knowledge based on sentence structure. That might be hard on some forums. Most teen forums I've seen involve many abbreviations, spelling errors, and slang. Some topic area forums aren't much better. That's a tough problem. In addition to that, written speech from books like Moby Dick is quite different from the written style used in most forums, just like written speech is very different from spoken speech.
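
    To make the actor/action/object idea from the quote a little more concrete, here is a deliberately naive sketch. It assumes a tiny hand-written verb list (%VERBS) and a hypothetical diagram_sentence function, and it simply splits a sentence around the first known verb. It is not how PowerDrill works, and a real system would need a proper part-of-speech tagger.

```perl
use strict;
use warnings;

# Hypothetical toy verb list; a real system would use a part-of-speech tagger.
my %VERBS = map { $_ => 1 } qw(chases hunts sinks sees);

# Split a simple sentence into actor / action / object around the first
# known verb. Returns a hash ref, or nothing when no known verb is found.
sub diagram_sentence {
    my ($sentence) = @_;
    (my $clean = $sentence) =~ s/[[:punct:]]+//g;   # drop punctuation
    my @words = split ' ', $clean;
    for my $i (0 .. $#words) {
        next unless $VERBS{ lc $words[$i] };
        return {
            actor  => join(' ', @words[0 .. $i - 1]),
            action => $words[$i],
            object => join(' ', @words[$i + 1 .. $#words]),
        };
    }
    return;
}

my $row = diagram_sentence('Captain Ahab hunts the white whale.');
print "$row->{actor} | $row->{action} | $row->{object}\n" if $row;
```

    The three hash keys play the role of the separate database fields in the quote. Forum slang and misspellings would defeat a fixed verb list like this one right away, which is exactly the difficulty described above.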

    Adding this kind of knowledge extraction to standard categorization and parsing would be a neat thing. It does lead to other questions:

    • Where is this useful, and what questions does it answer?
    • On what types of source data does this work well?
    • How scalable is their application as well as the technology in general?
    • What kind of hardware does it require?

    I would welcome stories from other monks about projects doing data mining, text processing, using clusters, or any related topics. I'm definitely still learning in this area, but with the contributions and help of people here we should be able to get a good handle on it all and explore interesting areas.

      I guess that a spatial analysis based on the language's structure is a good starting point!

      Not in the case of foreigners trying to express themselves in English, though, because they might be applying their mother tongue structure as they write.

      Even if you use words that don't appear in dictionaries, as long as you stick to your language's structure (as in some children's games), it is like an equation: given enough data, the unknown words can be guessed.

      What is still very far from good translation is the way these sentences build up into a logical thought. I imagine that belongs to a more psychological structure, one that is very far from being understood by a machine. A poem, for example.

      .{\('v')/}
      _`(___)' __________________________
      Wherever I lay my KNOPPIX disk, a new FREE LINUX nation could be established.
Re: Moby Dick in 5 secs
by CloneArmyCommander (Friar) on Sep 22, 2004 at 12:34 UTC
    It would be neat to see things like this done with Perl. If it works, it proves wrong the people who have negative opinions about scripting languages; if it does not work, then it was a learning experience. The only thing that worries me about this technology is my writing professor using it on my papers and realizing that I say little or nothing in so many words :).
Re: Moby Dick in 5 secs
by Anonymous Monk on Sep 22, 2004 at 15:25 UTC
    I work at that company, and I can tell you that you could do what it does in Perl; however, I doubt you could make it perform at nearly the same speed. As it stands, it is heavily optimized for speed because it has to run over massive amounts of data in minimal time.
Re: Moby Dick in 5 secs
by fletcher_the_dog (Friar) on Sep 22, 2004 at 15:44 UTC
    There is a short video describing the technology here.
Re: Moby Dick in 5 secs
by SpanishInquisition (Pilgrim) on Sep 23, 2004 at 17:50 UTC
    I can do it in milliseconds:

    print "a few guitars\n15-20 minute drum solo\nthe end\n";

    Wrong Moby Dick?
