I'm going to work on an application to index the contents of documents in an archive. Most of these documents are PDFs, and a lot of them are scans of paper documents (so I am using OCR to get the text out, which works well).

The files can be small or very large, and there really are a lot of them. Getting the data to index one item (extracting the content and analyzing it) is a CPU feast but light on memory. Each element can take minutes to analyze.

In the past, in situations where I just wanted some basic data, like mime type, location, etc., I've made use of my Metadata::ByInode indexer. That's fine when the index is remade at night and you just redo the entire thing, no big deal. It takes maybe 18 minutes to index 20k documents for something like mime type, some filename data, etc.

But for doing OCR on each document, where each can have hundreds of pages and there are, well, many thousands of documents, something else has to happen. The process has to be separated into steps: there has to be a QUEUE of files to be indexed, and a QUEUE of entries that are obsolete and must be deleted. The indexing must happen incrementally, the system must be able to crash without having to reindex everything, etc. etc.
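Something like this is what I'm picturing for the queues, as a rough sketch with DBI and SQLite (the database name, table names, and status values are just placeholders, not from any existing module):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Placeholder database; all names here are made up for illustration.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=index_queue.db', '', '',
    { RaiseError => 1, AutoCommit => 1 } );

$dbh->do(q{
    CREATE TABLE IF NOT EXISTS to_index (
        path   TEXT PRIMARY KEY,
        queued INTEGER NOT NULL,                 -- epoch time queued
        status TEXT NOT NULL DEFAULT 'pending'   -- pending|working|done
    )
});
$dbh->do(q{
    CREATE TABLE IF NOT EXISTS to_delete (
        path   TEXT PRIMARY KEY,
        queued INTEGER NOT NULL
    )
});

# A worker claims one pending file, OCRs it, and only marks the row
# 'done' afterward; a crash mid-job just leaves it 'working', so a
# cleanup pass can reset stale 'working' rows instead of reindexing all.
sub claim_next {
    my ($path) = $dbh->selectrow_array(
        q{SELECT path FROM to_index WHERE status = 'pending' LIMIT 1});
    return unless defined $path;
    $dbh->do( q{UPDATE to_index SET status = 'working' WHERE path = ?},
        undef, $path );
    return $path;
}
```

The point of the status column is that no row is ever removed before its work is finished, which is what makes the crash-without-full-reindex requirement possible.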

The first run will take days. That's fine. My concern is with maintaining the data in the database.

Option 1) I will need a maintenance procedure that runs perhaps twice a day and checks that the data is still accurate: that the files are still there and have not changed. So one of my tables records some kind of filestate. This is what I am unsure of: how to decide whether the filestate has changed (file gone or modified). My first temptation is an md5 sum, but my tests show it can take 7 minutes to get md5 sums for 200 large files on an x86_64 machine using Digest::MD5::File (memory is not an issue). So I am thinking md5 sums alone are not going to be good as the filestate check for 20k or 50k documents. Inode was my other temptation, but that says little about the actual data inside. Maybe a combination of inode and mtime, but that is only good for a file that stays inside one filesystem. These files may move around, be edited, who knows.
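What I'm leaning toward for the check itself is to use the cheap stat fields first, and only pay for an md5 when those differ. A rough sketch (the choice of dev:ino:size:mtime as the cheap signature is just my assumption about what's meaningful; the return values are placeholders):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5;

# Cheap signature: device, inode, size, mtime. Catches edits and
# truncations without reading a byte of the file.
sub stat_sig {
    my ($path) = @_;
    my @st = stat $path or return undef;   # undef => file is gone
    return join ':', @st[ 0, 1, 7, 9 ];    # dev:ino:size:mtime
}

# Expensive signature: content digest, for when stat_sig changed but
# we want to know whether the bytes really did.
sub md5_sig {
    my ($path) = @_;
    open my $fh, '<:raw', $path or return undef;
    my $md5 = Digest::MD5->new;
    $md5->addfile($fh);
    return $md5->hexdigest;
}

# Compare against the signatures recorded at index time.
sub file_state {
    my ( $path, $old_stat, $old_md5 ) = @_;
    my $now = stat_sig($path);
    return 'gone'      unless defined $now;
    return 'unchanged' if $now eq $old_stat;

    # stat says something moved; confirm with content before re-OCRing
    my $digest = md5_sig($path);
    return ( defined $digest && $digest eq $old_md5 )
        ? 'moved'      # same bytes, different location/metadata
        : 'changed';   # genuinely needs reindexing
}
```

This way the twice-a-day pass stats everything (fast) and only md5s the handful of files whose cheap signature moved, so the 7-minutes-per-200-files cost is paid rarely instead of on every sweep.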

Option 2) Maybe I should just do what Google does: hold on to the data until it is revisited, and perpetually reindex everything. Sounds deceptively simple.

Any thoughts on how to keep track of filestate (that is, detecting when a file, or the data within it, has changed)? Or a link to something already written and available for download?


In reply to (OT) indexing pdf archive content in a multiuser environment- how do i know when the content changed? by leocharre
