isync has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks (and Monk Humphrey),

I have built an invindex of documents and now (just don't ask why..) I would like to iterate over *each* document stored in the index. Is that possible?

I would like to do it without hit-lists, so that I get not a "subset of documents" but a list of "all documents" to iterate over. (By the way, accessing all docs via hit-lists is too slow.)

So far I have poked around in (version 0.15 of) KinoSearch::Index::IndexReader, KinoSearch::Index::SegReader and KinoSearch::Document::Doc, without luck. It seems I am too unfamiliar with the internals to get very far.

Any KinoSearch users around?



(BTW: Congrats on a great piece of code!)

Replies are listed 'Best First'.
Re: KinoSearch - is there a way to iterate over all documents in an index?
by samtregar (Abbot) on Sep 11, 2007 at 21:43 UTC
    Something here doesn't add up - you want to iterate through all the documents in an index, but you can't use the most obvious method because it's too slow? What makes you think this operation can be performed in a way that won't be slow? Data structures in general tend to be optimized for either random-access (think hashes) or iteration (think arrays). I bet KinoSearch is more like the former than the latter!

    -sam

      In the inverted index, there is a certain order of documents and in a results set, there is another one.

      Now, iterating over the index in the order of a results set means mapping the order-of-relevance (from the results set) to the order-in-the-index (as sorted by KinoSearch). That involves quite a lot of repositioning of the read pointer (disk seeks) and slows things down.

      Ideally, I would read doc after doc in the order they are stored in the invindex, and not in the arbitrary order I get via a results subset. That would cut the seeks and speed things up.
      But it seems nobody here (including me) knows how to access KinoSearch inverted indexes directly to iterate over the docs in the order they are stored in the index.
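
      To illustrate the seek argument above, here is a rough sketch that compares how far a read pointer would have to jump when the same documents are visited in relevance order versus stored order. The document numbers are made up, and using the difference between document numbers as a measure of seek distance is only an assumption; real costs depend on how the index files are laid out on disk.

```perl
use strict;
use warnings;

# Sum of the jumps between consecutive document numbers, used here as a
# crude stand-in for disk-seek distance (an assumption, not a measurement).
sub total_jump {
    my @doc_nums = @_;
    my $total    = 0;
    for my $i ( 1 .. $#doc_nums ) {
        $total += abs( $doc_nums[$i] - $doc_nums[ $i - 1 ] );
    }
    return $total;
}

# Made-up hit list, in order of relevance.
my @by_relevance = ( 90, 3, 57, 12, 88, 41 );

# The same documents in the order they would sit in the index.
my @by_position = sort { $a <=> $b } @by_relevance;

printf "relevance order: %d\n", total_jump(@by_relevance);
printf "stored order:    %d\n", total_jump(@by_position);
```

      For this toy list the relevance order adds up to 309 and the stored order to 87, which is the kind of gap the seek argument is about.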
Re: KinoSearch - is there a way to iterate over all documents in an index?
by snowhare (Friar) on Sep 12, 2007 at 12:35 UTC

    I've never used KinoSearch, so this is just based on the docs and looking at the code. It should work, but I haven't actually compiled or run it.

    The place to start is KinoSearch::Store::InvIndex. That is the abstract class the various stores are built around (I'm going to assume you are using KinoSearch::Store::FSInvIndex). You want the list and slurp_file methods.

    As per the docs for KinoSearch::Store::FSInvIndex and KinoSearch::Store::InvIndex:

    my $invindex = KinoSearch::Store::FSInvIndex->new(
        path   => '/path/to/invindex',
        create => 0,
    );
    my @list_of_filenames = $invindex->list;
    foreach my $filename (@list_of_filenames) {
        my $file_data = $invindex->slurp_file($filename);
        # Do stuff with data
    }
    $invindex->close;
      A big *thanks* for the effort.
      But it did not get me far, as my index won't fit into RAM. So, since there's no method to handle big indexes, I am stuck. (Is it true that only the searcher can read large indexes chunk-wise?)
        Looking at the code, the KinoSearch::Store::FSInvIndex data store is just a simple directory filled with files. You should be able to open the directory and iterate over the files in it yourself. ;)
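
        A minimal sketch of that idea, using only core Perl: it walks the regular files in a directory and hands each one to a callback in fixed-size chunks, so no file has to fit into RAM as a whole. The names for_each_chunk, $chunk_size and $callback are made up for illustration and are not part of KinoSearch. Note that this only yields raw bytes; turning them back into documents would still require knowing the index file format.

```perl
use strict;
use warnings;

# Walk every regular file in $dir and pass its contents to $callback in
# chunks of at most $chunk_size bytes. Returns the number of files visited.
# All names here are illustrative; nothing in this sub is KinoSearch API.
sub for_each_chunk {
    my ( $dir, $chunk_size, $callback ) = @_;

    opendir( my $dh, $dir ) or die "Can't open directory '$dir': $!";
    my @names = sort grep { -f "$dir/$_" } readdir($dh);
    closedir($dh);

    for my $name (@names) {
        my $path = "$dir/$name";
        open( my $fh, '<', $path ) or die "Can't open '$path': $!";
        binmode($fh);    # treat the file as binary data

        my $offset = 0;
        while (1) {
            my $got = read( $fh, my $buffer, $chunk_size );
            die "Read error on '$path': $!" unless defined $got;
            last if $got == 0;    # end of file
            $callback->( $name, $offset, $buffer );
            $offset += $got;
        }
        close($fh);
    }
    return scalar @names;
}
```

        It could be called as for_each_chunk( '/path/to/invindex', 65536, sub { my ( $name, $offset, $buffer ) = @_; ... } ), with the chunk size picked to suit the available memory.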
Re: KinoSearch - is there a way to iterate over all documents in an index?
by Anonymous Monk on Apr 17, 2008 at 07:06 UTC

    I know this is kinda old but I actually had occasion to do precisely what the author is asking here.

    In a nutshell...
    1. Get yourself an IndexReader (of some kind) by calling IndexReader->open on your inverted index (check the API). Call it $reader.
    2. for $i = 1 to $reader->num_docs(), call $reader->fetch_doc_vec($i). The return value is a DocVector.
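
    The two steps above could be sketched roughly like this. The method names num_docs and fetch_doc_vec and the loop bounds are taken from the steps themselves; the StubReader class is purely hypothetical and only stands in for a real IndexReader so the loop can be run without an index. Whether document numbers start at 0 or 1, and how deleted documents are handled, may differ between KinoSearch versions, so treat those details as assumptions to check against the API.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an IndexReader. It mimics only the two methods
# named in the steps above and is not part of KinoSearch.
{
    package StubReader;

    sub new {
        my ( $class, @docs ) = @_;
        return bless { docs => [@docs] }, $class;
    }
    sub num_docs { my ($self) = @_; return scalar @{ $self->{docs} } }

    sub fetch_doc_vec {
        my ( $self, $i ) = @_;
        return $self->{docs}[ $i - 1 ];    # assumption: numbering starts at 1
    }
}

# Visit every document in document-number order, one at a time.
sub for_each_doc_vec {
    my ( $reader, $callback ) = @_;
    my $total = $reader->num_docs;
    for my $i ( 1 .. $total ) {
        my $doc_vec = $reader->fetch_doc_vec($i);
        next unless defined $doc_vec;    # assumption: a missing doc is undef
        $callback->( $i, $doc_vec );
    }
    return $total;
}
```

    With a real reader in place of the stub, only one document vector is held at a time, which is what makes this usable on an index that does not fit into memory.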
    Here you run into trouble..