KinoSearch - is there a way to iterate over all documents in an index?

isync has asked for the wisdom of the Perl Monks concerning the following question:

Re: KinoSearch - is there a way to iterate over all documents in an index? by samtregar (Abbot) on Sep 11, 2007 at 21:43 UTC
Something here doesn't add up - you want to iterate through all the documents in an index, but you can't use the most obvious method because it's too slow? What makes you think this operation can be performed in a way that won't be slow? Data structures in general tend to be optimized for either random access (think hashes) or iteration (think arrays). I bet KinoSearch is more like the former than the latter! -sam
Re^2: KinoSearch - is there a way to iterate over all documents in an index? by isync (Hermit) on Sep 12, 2007 at 11:23 UTC
In the inverted index, documents are stored in one order, and in a result set they come back in another. Iterating over the index in result-set order means mapping the order of relevance (from the result set) onto the order in the index (as sorted by KinoSearch). That involves a lot of repositioning of the read pointer (disk seeks), which slows things down. Ideally, I would read doc after doc in the order they are stored in the invindex, not in the arbitrary order I get from a result subset. That would cut the seeks and speed things up. But it seems nobody here (including me) knows how to access KinoSearch inverted indexes directly to iterate over the docs in the order they are stored in the index.
Re: KinoSearch - is there a way to iterate over all documents in an index? by snowhare (Friar) on Sep 12, 2007 at 12:35 UTC
I've never used KinoSearch, so this is just based on the docs and looking at the code. It should work, but I haven't actually compiled or run it. The place to start is KinoSearch::Store::InvIndex. That is the abstract class the various stores are built around (I'm going to assume you are using KinoSearch::Store::FSInvIndex). You want the `list` and `slurp_file` methods. As per the docs for KinoSearch::Store::FSInvIndex and KinoSearch::Store::InvIndex:

    my $invindex = KinoSearch::Store::FSInvIndex->new(
        path   => '/path/to/invindex',
        create => 0,
    );

    my @list_of_filenames = $invindex->list;
    foreach my $filename (@list_of_filenames) {
        my $file_data = $invindex->slurp_file($filename);
        # Do stuff with data
    }
    $invindex->close;
Re^2: KinoSearch - is there a way to iterate over all documents in an index? by isync (Hermit) on Sep 12, 2007 at 16:28 UTC
A big thanks for the effort. But it did not get me far, as my index won't fit into RAM. So, since there's no method to handle big indexes, I am stuck. (Is it true that only the searcher can read large indexes chunk-wise?)
Re^3: KinoSearch - is there a way to iterate over all documents in an index? by snowhare (Friar) on Sep 13, 2007 at 00:49 UTC
Looking at the code, the KinoSearch::Store::FSInvIndex data store is just a simple directory filled with files. You should be able to open the directory and iterate over the files in it yourself. ;)
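A minimal sketch of that "open the directory yourself" idea, using only core Perl. The helper name `each_file_chunk` and the chunk size are made up for illustration, and the sketch knows nothing about KinoSearch's on-disk format: it only hands raw bytes to a callback in fixed-size pieces, so a file never has to fit in RAM. Decoding those bytes into documents is a separate problem it does not solve.

```perl
use strict;
use warnings;
use File::Spec;

# Walk every plain file in $dir and pass its contents to $callback in
# chunks of at most $chunk_size bytes. Returns the number of files visited.
# (Hypothetical helper; not part of KinoSearch.)
sub each_file_chunk {
    my ($dir, $chunk_size, $callback) = @_;

    opendir(my $dh, $dir) or die "Cannot open directory '$dir': $!";
    # Keep plain files only; sort so the visiting order is repeatable.
    my @names = sort grep { -f File::Spec->catfile($dir, $_) } readdir($dh);
    closedir($dh);

    for my $name (@names) {
        my $path = File::Spec->catfile($dir, $name);
        open(my $fh, '<:raw', $path) or die "Cannot open '$path': $!";
        while (1) {
            my $read = read($fh, my $buffer, $chunk_size);
            die "Read error on '$path': $!" unless defined $read;
            last if $read == 0;          # end of file
            $callback->($name, $buffer); # raw bytes, format not interpreted
        }
        close($fh);
    }
    return scalar @names;
}
```

Calling `each_file_chunk('/path/to/invindex', 65536, sub { ... })` would visit each file in 64 kB pieces; the callback receives the file name and the current chunk.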
Re: KinoSearch - is there a way to iterate over all documents in an index? by Anonymous Monk on Apr 17, 2008 at 07:06 UTC
I know this is kinda old, but I actually had occasion to do precisely what the author is asking here. In a nutshell... Get yourself an IndexReader (of some kind) by calling IndexReader->open on your inverted index (check the API). Call it $reader. For $i = 1 to $reader->num_docs(), call $reader->fetch_doc_vec($i). The return value is a DocVector. Here you run into trouble...
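The loop described in that last post could be sketched roughly as below. The method names `num_docs` and `fetch_doc_vec` are taken from the post itself and have not been checked against any particular KinoSearch release; the helper `each_doc_vector` is hypothetical, and how to obtain `$reader` is left to the caller (the post says to check the API). Whether document numbers really start at 1, and how deleted documents behave, are also unverified assumptions.

```perl
use strict;
use warnings;

# Call $callback->($doc_num, $doc_vec) for document numbers 1 .. num_docs.
# ASSUMPTION: $reader responds to num_docs() and fetch_doc_vec($n), as the
# post describes; neither call is verified against a KinoSearch release.
sub each_doc_vector {
    my ($reader, $callback) = @_;

    my $total = $reader->num_docs;
    for my $doc_num (1 .. $total) {
        my $doc_vec = $reader->fetch_doc_vec($doc_num);
        next unless defined $doc_vec;    # skip numbers that yield nothing
        $callback->($doc_num, $doc_vec);
    }
    return $total;
}
```

Because documents are requested in ascending number order rather than in relevance order, this is the access pattern the original question was after; what to do with each DocVector is where the post says the trouble starts.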