in reply to Module namespace: OpenDirectory something?

As for the naming, I agree with ignatz - WWW::DMOZ is the best suggestion thus far.

However, what I really wanted to ask you about is the module itself - does it make live calls to DMOZ when you perform a search? I ask because a few years back I created an (unfortunately closed-source) search engine that indexed the DMOZ quasi-XML dumps and then answered searches with hierarchical XML documents. Is this similar to your system? If so, did you find it to be as monstrous a pain in the ass as I did?

-sam

Re: Re: Module namespace: OpenDirectory something?
by jplindstrom (Monsignor) on May 11, 2002 at 19:49 UTC
    I use the file structure.rdf.u8.gz available at http://dmoz.org/rdf.html to mine dmoz.org for links to harvest. The text + meta info + title of these links are gathered and analyzed for word frequency. This is the first pass. It takes a looong time.
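
    A stripped-down sketch of that first pass (not the real code - the element names are from memory of the dump format and may not match the actual file exactly):

        # First pass, roughly: stream the gzipped RDF dump and pull out
        # category ids and link URLs with a line-oriented regex parse.
        use strict;
        use warnings;

        open my $dump, '-|', 'gzip -dc structure.rdf.u8.gz'
            or die "can't read dump: $!";

        my $current_topic = '';
        my %links_in;                       # topic => [ urls ]

        while (my $line = <$dump>) {
            if ($line =~ /<Topic r:id="([^"]+)">/) {
                $current_topic = $1;
            }
            elsif ($line =~ /r:resource="(http[^"]+)"/) {
                push @{ $links_in{$current_topic} }, $1;
            }
        }
        close $dump;

        printf "harvested %d topics\n", scalar keys %links_in;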

    In the second pass, the key words minus stop words are connected to categories using a DB_File-tied hash. This takes a pretty long time.
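
    Something along these lines for the tied hash, where the "space-separated category list per keyword" layout is just one way to do it:

        # Second pass, roughly: for every non-stopword keyword seen in a
        # category's links, append that category to the keyword's entry in a
        # DB_File-tied hash on disk.
        use strict;
        use warnings;
        use DB_File;
        use Fcntl qw(O_CREAT O_RDWR);

        my %stop = map { $_ => 1 } qw(the a an and or of to in for is);

        tie my %kw2cat, 'DB_File', 'keyword2cat.db', O_CREAT | O_RDWR, 0644, $DB_HASH
            or die "can't tie keyword2cat.db: $!";

        # %$word_freq (word => count) comes out of the first pass.
        sub index_category {
            my ($category, $word_freq) = @_;
            for my $word (map { lc } keys %$word_freq) {
                next if $stop{$word};
                my $existing = $kw2cat{$word};
                $kw2cat{$word} = defined $existing
                    ? "$existing $category"
                    : $category;
            }
        }

        untie %kw2cat;    # flush everything to disk when done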

    The third pass is the matching. This takes between one and maybe ten seconds, depending on how much of the hash files is still in the disk cache. It's a pretty naive way of performing the match; ideas and suggestions are welcome :)
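
    To give an idea of how naive it is, the lookup can be sketched as a simple vote count per category (again using the space-separated layout from above):

        # Third pass, roughly: look each query word up in the tied hash and
        # give every category it maps to one vote, then rank by votes.
        use strict;
        use warnings;
        use DB_File;
        use Fcntl qw(O_RDONLY);

        tie my %kw2cat, 'DB_File', 'keyword2cat.db', O_RDONLY, 0644, $DB_HASH
            or die "can't tie keyword2cat.db: $!";

        my %stop = map { $_ => 1 } qw(the a an and or of to in for is);

        sub match {
            my ($query) = @_;
            my %score;
            for my $word (grep { length && !$stop{$_} } split /\W+/, lc $query) {
                my $cats = $kw2cat{$word} or next;
                $score{$_}++ for split / /, $cats;
            }
            return sort { $score{$b} <=> $score{$a} } keys %score;
        }

        print "$_\n" for match('perl module archive');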

    The XML parsing is home-grown, although there are modules for doing RDF stuff. It's not that difficult to get right anyway. The worst problems are a) spidering a million links takes time, and b) the dmoz editors keep changing the category structure all the time.
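
    The spidering itself can be sketched with LWP::UserAgent and a crude word counter - a real spider of course needs timeouts, robots.txt handling and proper HTML parsing (HTML::Parser or friends) rather than regex tag-stripping:

        # Fetch one harvested link and count word frequencies, very naively.
        use strict;
        use warnings;
        use LWP::UserAgent;

        my $ua = LWP::UserAgent->new(timeout => 20, agent => 'dmoz-harvester/0.1');

        sub word_freq_for {
            my ($url) = @_;
            my $res = $ua->get($url);
            return {} unless $res->is_success;

            my $text = $res->decoded_content;
            return {} unless defined $text;
            $text =~ s/<[^>]*>/ /g;    # strip tags - naive, for illustration only

            my %freq;
            $freq{ lc $_ }++ for $text =~ /(\w+)/g;
            return \%freq;
        }

        my $freq = word_freq_for('http://www.example.org/');
        my @top  = sort { $freq->{$b} <=> $freq->{$a} } keys %$freq;
        splice @top, 20 if @top > 20;
        printf "%-20s %d\n", $_, $freq->{$_} for @top;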

    /J

      This sounds very similar to my work, although I used the Glimpse full-text search engine instead of DB_File to store the text index. This allows for much faster searches (0.1 to 1 seconds for searches that match less than 10% of all records). Glimpse also supports partial word matching and stemming, and sports a broken regex implementation (woopee!). My system also supported boolean operators in complex searches, and included a peephole optimizer driven by the results of past searches.
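
      For anyone curious, the simplest way to drive Glimpse from Perl is just to shell out to the glimpse binary over an index built beforehand with glimpseindex. Flags, boolean syntax and output format vary between Glimpse versions, so treat this as a starting point rather than what we actually shipped:

          # Run a query against a Glimpse index and collect the grep-style
          # "file: matching line" output.  Invocation details are
          # Glimpse-version dependent - check the Glimpse docs.
          use strict;
          use warnings;

          sub glimpse_search {
              my ($pattern) = @_;
              open my $glimpse, '-|', 'glimpse', $pattern
                  or die "can't run glimpse: $!";
              my @hits = <$glimpse>;
              close $glimpse;
              return @hits;
          }

          print for glimpse_search('perl');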

      I remember that my biggest problems in the project were, first, that the .u8 files aren't really UTF-8 - there's a ton of eastern-European 8-bit encodings in there - so I had to do a number of passes just to clean the data. Second, I had to return complex result documents, which meant building an index on the result elements as well as the source elements.
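
      A stripped-down version of that kind of cleanup pass - keep the lines that really are UTF-8 and reinterpret the rest under a single 8-bit fallback (ISO-8859-2 here, standing in for whatever the offending records actually use):

          # Keep valid UTF-8 lines as-is; re-decode everything else from an
          # assumed 8-bit fallback encoding and write clean UTF-8 out.
          use strict;
          use warnings;
          use Encode qw(decode encode FB_CROAK);

          open my $in,  '<:raw', 'structure.rdf.u8'   or die $!;
          open my $out, '>:raw', 'structure.clean.u8' or die $!;

          while (my $line = <$in>) {
              my $bytes = $line;    # decode() with a CHECK may clobber its input
              my $text  = eval { decode('UTF-8', $bytes, FB_CROAK) };
              $text = decode('ISO-8859-2', $line) unless defined $text;
              print {$out} encode('UTF-8', $text);
          }
          close $in;
          close $out;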

      End result: the company went out of business without ever managing to sell their product. My code went straight to /dev/null, as far as I know!

      -sam