dmitri,

I've long wanted to do exactly what you've proposed, but just haven't found the cycles before now. I would be excited to collaborate with you on it.

As for hosting, for the time being I can run the app at rectangular.com... and maybe we could set up a repository at code.google.com? ;)

In addition to the indexer and search applications, we'll need a spidering app that pulls down a local copy of each PerlMonks node. tye has granted permission to spider the site and suggested the PerlMonks XML node view for getting at the content (see "What XML generators are currently available on PerlMonks?" for info). Here's an XML rendering of your original post as an example.

In the initial pull, we'd iterate over each node numerically, probably saving individual XML files to the file system, 1000 nodes per directory. Some nodes will present problems — reaped nodes, for instance — but the responses will always contain sufficient information to dispatch sensibly.
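
To make the 1000-nodes-per-directory layout concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: the names node_path and save_node, the zero-padded directory names, and the "mirror" root are hypothetical, and the fetching step is left out entirely.

```python
from pathlib import Path

NODES_PER_DIR = 1000  # bucket size from the plan above


def node_path(root, node_id):
    """Map a numeric node id to its XML file path, 1000 nodes per directory.

    The zero-padded bucket name is an assumed convention, not a settled one.
    """
    bucket = node_id // NODES_PER_DIR
    return Path(root) / f"{bucket:06d}" / f"{node_id}.xml"


def save_node(root, node_id, xml_text):
    """Write one node's XML to its bucketed path, creating the directory if needed."""
    path = node_path(root, node_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(xml_text, encoding="utf-8")
    return path
```

With this layout, nodes 0 through 999 land in mirror/000000/, nodes 1000 through 1999 in mirror/000001/, and so on, so no single directory grows beyond 1000 files.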

Keeping the locally mirrored data up-to-date presents some problems, especially with regard to updated text and node rep fluctuations. These problems will be trivial to solve should the service move onto perlmonks.org directly; some of them are solvable even when running remotely, as the total volume of data is not very large. In any case, freshness issues will not have a major impact on the user experience and people will have no trouble making sensible comparisons between the old and the new.

Once we have a corpus, the indexing and search apps will present familiar challenges for us both. It will be fun to tinker with the ranking algorithms, and I expect that the extremely demanding user base will provide us with lots of high-quality feedback. :)

What say? Sound like a plan?

Cheers,

--
Marvin Humphrey
Rectangular Research ― http://www.rectangular.com

In reply to Re: Running SuperSearch off a fast full-text index. by creamygoodness
in thread Running SuperSearch off a fast full-text index. by dmitri
