I have a set of modules dealing with OpenDirectory (dmoz.org) categories, and I'd like some help finding a proper namespace for them before going near CPAN.

Given a text or a URL, you can get an ordered list of dmoz categories that match. Hence, the text or URL is categorized. There are currently two ways of performing the match.

An example:

http://www.asktheheadhunter.com/

Top/Business/ (139)
  Employment/Careers (45)
  Employment/Job_Search/Interview_Advice (37)
  Employment/Job_Search (31)
  Industries/Arts_and_Entertainment/Sports/Employment (26)
Top/Society/ (53)
  Advice/Web_Columns (27)
  Law/Products/Self-Help/Family_Law/Name_Change_Kits (26)
Top/Arts/ (52)
  Music/Bands_and_Artists/T/Twisted_Sister/Ojeda,_Eddie (26)
  Animation/Voice_Actors/Information/FAQs (26)
Top/Recreation/ (27)
  Humor/Advice (27)
Top/Computers/ (26)
  Education/Hardware/Hardware_Courses (26)
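
The listing above appears to group matches under their top-level branch and show the sum of the member scores per group (45 + 37 + 31 + 26 = 139 for Top/Business/). A minimal Perl sketch of that grouping follows. It assumes the matcher hands back a hash of full category path to score; the function name group_by_top and the data layout are hypothetical and not the module's actual API.

```perl
use strict;
use warnings;

# Hypothetical input: full category path => score, as a matcher might return.
my %score = (
    'Top/Business/Employment/Careers'                     => 45,
    'Top/Business/Employment/Job_Search/Interview_Advice' => 37,
    'Top/Business/Employment/Job_Search'                  => 31,
    'Top/Society/Advice/Web_Columns'                      => 27,
    'Top/Recreation/Humor/Advice'                         => 27,
);

# Group paths under "Top/<branch>/" and sum the scores per group.
sub group_by_top {
    my ($score) = @_;
    my %group;
    for my $path (keys %$score) {
        # Skip anything that does not look like Top/<branch>/<rest>.
        my ($top, $rest) = $path =~ m{^(Top/[^/]+/)(.+)$} or next;
        $group{$top}{total} += $score->{$path};
        push @{ $group{$top}{members} }, [ $rest, $score->{$path} ];
    }
    return \%group;
}

my $group = group_by_top(\%score);

# Print groups by descending total, members by descending score.
for my $top (sort { $group->{$b}{total} <=> $group->{$a}{total} || $a cmp $b }
             keys %$group) {
    print "$top ($group->{$top}{total})\n";
    for my $m (sort { $b->[1] <=> $a->[1] || $a->[0] cmp $b->[0] }
               @{ $group->{$top}{members} }) {
        print "  $m->[0] ($m->[1])\n";
    }
}
```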

The module names I use right now are:

OpenDirectory
OpenDirectory::Category
OpenDirectory::Matcher
OpenDirectory::Matcher::Word
OpenDirectory::Matcher::Google
OpenDirectory::MatchResult
...

I guess the best thing would be to find a proper top-level name to put it in. Suggestions? Text:: perhaps? Or something with classification or taxonomy? I'm kind of lost here.

Current modules on CPAN having to do with dmoz:

Both of these seem pretty much unrelated to this module.

/J

Re: Module namespace: OpenDirectory something?
by ignatz (Vicar) on May 11, 2002 at 18:09 UTC
    How about WWW::DMOZ, since it does directly relate to the web? DMOZ is a lot more descriptive to me than OpenDirectory (I always have to remind myself what that is), and it has the added bonus of being shorter.

    This sounds really useful, BTW.

    ()-()
     \"/
      `                                                     
    
      The WWW:: prefix might be a good choice, but the more I use the OpenDirectory data, the less I think of it as "a web site" :) And while the data used is mined from the web, the matcher doesn't use the web while performing the match.

      However, most people probably think of dmoz.org as "just a web site" and the name should be set to match their expectations. And the _contents_ of the OpenDirectory is obviously WWW-related.

      The WWW::Search:: namespace seems a little off. As far as I can see, _all_ the modules below WWW::Search:: are subclassed to call the respective search engine. This one doesn't. Besides, there is already a WWW::Search::OpenDirectory :)

      /J - grateful for the feedback

        I think part of it results from not wanting to type 'OpenDirectory' when a four-letter alternative exists.

        The WWW::Search:: namespace seems a little off. As far as I can see, _all_ the modules below WWW::Search:: are subclassed to call the respective search engine. This one doesn't. Besides, there is already a WWW::Search::OpenDirectory :)

        Perhaps WWW::DMOZ will help to distinguish it in the human memory space, then. :)

        -----------------------
        You are what you think.

Re: Module namespace: OpenDirectory something?
by samtregar (Abbot) on May 11, 2002 at 19:30 UTC
    As for the naming, I agree with ignatz - WWW::DMOZ is the best suggestion thus far.

    However, what I really wanted to ask you about is the module: does it make live calls to DMOZ when you perform a search? I ask because a few years back I created an (unfortunately closed-source) search engine that indexed the DMOZ quasi-XML dumps and then answered searches with hierarchical XML documents. Is this similar to your system? If so, did you find it to be as monstrous a pain in the ass as I did?

    -sam

      I use the file structure.rdf.u8.gz available at http://dmoz.org/rdf.html to mine dmoz.org for links to harvest. The text + meta info + title of these links are gathered and analyzed for word frequency. This is the first pass. It takes a looong time.

      In the second pass, the key words minus stop words are connected to categories using a DB_File-tied hash. This takes a pretty long time.

      The third pass is the matching. This takes between one and maybe ten seconds depending on how much of the hash files is still in disk cache. It's a pretty naive way of performing the match; ideas and suggestions are welcome :)
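
The three passes described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual module code: the stop-word list, the sub names (word_freq, add_page, match) and the scoring rule (multiplying keyword counts and summing per category) are all made up for the example, and a plain in-memory hash stands in for the DB_File-tied hash.

```perl
use strict;
use warnings;

# Assumed stop-word list; a real one would be much longer.
my %stop = map { $_ => 1 } qw(a an and the of for to in is);

# Pass 1 (sketch): word frequencies for the text of one harvested page.
sub word_freq {
    my ($text) = @_;
    my %freq;
    $freq{$_}++ for grep { !$stop{$_} } map { lc $_ } $text =~ /(\w+)/g;
    return \%freq;
}

# Pass 2 (sketch): index keyword => { category => weight }.
# A plain hash stands in for the DB_File-tied hash mentioned in the post.
my %index;
sub add_page {
    my ($category, $text) = @_;
    my $freq = word_freq($text);
    $index{$_}{$category} += $freq->{$_} for keys %$freq;
}

# Pass 3 (sketch): score each category by the keywords it shares with the
# input text, then return [category, score] pairs, best first.
sub match {
    my ($text) = @_;
    my $freq = word_freq($text);
    my %score;
    for my $word (keys %$freq) {
        my $cats = $index{$word} or next;    # word not in the index
        $score{$_} += $cats->{$_} * $freq->{$word} for keys %$cats;
    }
    return map  { [ $_, $score{$_} ] }
           sort { $score{$b} <=> $score{$a} || $a cmp $b } keys %score;
}

add_page('Top/Business/Employment/Careers', 'career advice and job interview tips');
add_page('Top/Recreation/Humor/Advice',     'funny advice column');

my @ranked = match('How to prepare for a job interview');
print "$_->[0] ($_->[1])\n" for @ranked;
```

Note that a DB_File-tied hash stores flat string values, so the nested keyword-to-category structure used here would have to be serialized in some way before it could live in such a file.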

      The XML parsing is home-grown, although there are modules for doing RDF stuff. It's not that difficult to get right anyway. The worst problems are a) spidering a million links takes time, and b) the dmoz editors keep changing the category structure all the time.

      /J

        This sounds very similar to my work, although I used the Glimpse full-text search engine instead of DB_Files to store the text index. This allows for much faster searches (0.1 to 1 seconds in searches that match less than 10% of all records). Glimpse also supports partial word matching, stemming and sports a broken regex implementation (woopee!). My system also supported boolean operators in complex searches including a peephole optimizer driven by the results of past searches.

        I remember two big problems in the project. First, the .u8 files aren't really in UTF-8: there's a ton of Eastern European 8-bit encodings in there, so I had to do a number of passes just to clean the data. Second, I had to return complex result documents, which meant building an index on the result elements as well as the source elements.
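
One way such a cleaning pass might look, using the core Encode module: try a strict UTF-8 decode first and fall back to an 8-bit encoding when that fails. The sub name decode_line is hypothetical, and ISO-8859-2 as the fallback is purely a guess for illustration; the dumps may mix several encodings, which this sketch does not try to detect.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Decode a raw line as strict UTF-8; if that fails, fall back to an
# assumed 8-bit encoding (ISO-8859-2 by default, which is only a guess).
sub decode_line {
    my ($bytes, $fallback) = @_;
    $fallback ||= 'iso-8859-2';
    my $copy = $bytes;    # decode() with a CHECK value may modify its input
    my $text = eval { decode('UTF-8', $copy, FB_CROAK) };
    return defined $text ? $text : decode($fallback, $bytes);
}

my $good = decode_line("caf\xC3\xA9");      # valid UTF-8 bytes
my $bad  = decode_line("\xB3\xF3d\xBC");    # not valid UTF-8, decoded via fallback
```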

        End result: the company went out of business without ever managing to sell their product. My code went straight to /dev/null, as far as I know!

        -sam

Re: Module namespace: OpenDirectory something?
by belg4mit (Prior) on May 11, 2002 at 18:04 UTC
    Module relationships are less important than namespace hierarchy. I think fitting them into WWW::Search would be best.

    --
    perl -pew "s/\b;([mnst])/'$1/g"

      WWW::Search is a base class to a whole family of module frontends to search engines. This doesn't seem to fit into that, categories not being the same thing as a search result and all.

      Added after reply: agreed

      ()-()
       \"/
        `                                                     
      
        Categories are not search results "directly", but they are associated with them. And "::Matcher" and "::MatchResult" sure sound like they belong there.

        --
        perl -pew "s/\b;([mnst])/'$1/g"