leocharre has asked for the wisdom of the Perl Monks concerning the following question:

I'm going to work on an application to index the contents of documents in an archive. Most of these documents are PDFs, and a lot of them are scans of paper documents (so I am using OCR to get the text out, which works well).

The files can be small or very large, and there really are a lot of them. Getting the data to index one item (extracting the content and analyzing it) is a CPU feast but light on memory. Each element can take minutes to analyze.

In the past, when I just wanted some basic data like mime type, location, etc., I've made use of my Metadata::ByInode indexer. That works fine when the index is remade at night and you just redo the entire thing; no big deal. It takes maybe 18 minutes to index 20k documents with something like mime type, some filename data, etc.

But for doing OCR on each document, where each can have hundreds of pages and there are, well, many thousands of documents, something else has to happen. The process has to be separated into steps: there has to be a QUEUE of files to be indexed and a QUEUE of entries that are obsolete and must be deleted. The indexing must happen incrementally, the system must be able to crash without having to reindex everything, etc.
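Roughly, the queue layout I have in mind looks like this (just a sketch using SQLite via DBI; the table and column names are invented for illustration):

    #!/usr/bin/perl
    # Rough sketch of the two queues, using SQLite via DBI.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=index.db', '', '',
        { RaiseError => 1 } );

    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS queue_index (
            path      TEXT PRIMARY KEY,  -- file waiting to be ocr'd and indexed
            queued_at INTEGER NOT NULL   -- epoch seconds
        )
    });
    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS queue_delete (
            path      TEXT PRIMARY KEY,  -- obsolete entry to purge from the index
            queued_at INTEGER NOT NULL
        )
    });

    # Each document is committed as it is processed, so a crash loses
    # only the item being worked on, not the whole run.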

The first run will take days. That's fine. My concern is with the maintenance of the data in the database.

Option 1) I will need a maintenance procedure that will, perhaps twice a day, check that the data is still accurate: that the files are still there and have not changed. So one of my tables records some kind of filestate. This is what I am unsure of: how to decide whether the filestate has changed (file gone or modified). My first temptation is an md5 sum. My tests show that it can take 7 minutes to get md5 sums for 200 large files on an x86_64 machine using Digest::MD5::File (memory is not an issue). I am thinking that using an md5 sum as the test for filestate change is not going to scale to 20k or 50k documents. Inode was my other temptation, but that says little about the actual data inside. Maybe a combination of inode and mtime, but that is only good for a file that stays inside one filesystem. These files may move around, be edited, who knows.
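By "filestate" I mean something like this minimal sketch (using core Digest::MD5 here; Digest::MD5::File does the same job from a path):

    #!/usr/bin/perl
    # Minimal sketch: record inode and mtime from stat(), plus an md5
    # of the contents. The filestate() name is just for illustration.
    use strict;
    use warnings;
    use Digest::MD5;

    sub filestate {
        my $path = shift;
        my ( $inode, $mtime ) = ( stat $path )[ 1, 9 ];

        open my $fh, '<', $path or die "can't open $path: $!";
        binmode $fh;
        my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
        close $fh;

        return { inode => $inode, mtime => $mtime, md5 => $md5 };
    }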

Option 2) Maybe I should just do what Google does: hold on to the data until it is revisited, and perpetually reindex everything? Sounds deceptively intuitive.

Any thoughts on how to keep track of filestate (that is, detecting that the file, or rather the data within it, has changed)? A link to somewhere this is already implemented and available for download?


Re: (OT) indexing pdf archive content in a multiuser environment- how do i know when the content changed?
by Joost (Canon) on May 25, 2007 at 23:22 UTC
    A good and reasonably fast checksum like MD5 sounds like a good idea.

    But first, I would just check the files' modification times (see -M and possibly -C). Depending on the OS and filesystem, it can be safe to assume that a file hasn't changed if neither of those has changed, at least if you check only once a day (since the change time may have only one-second resolution).

    Checking the -C and -M attributes will be a lot faster than reading a file's entire contents, especially for large files.
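    Something along these lines (a sketch only; $state stands in for whatever you recorded at the last scan):

        # Sketch: compare stored mtime/ctime against a fresh stat()
        # before bothering with a checksum. $state is a hashref of
        # the values recorded at the previous scan.
        sub maybe_changed {
            my ( $path, $state ) = @_;
            my ( $mtime, $ctime ) = ( stat $path )[ 9, 10 ];
            return 1 unless defined $mtime;    # file is gone: treat as changed
            return $mtime != $state->{mtime}
                || $ctime != $state->{ctime};
        }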

      You probably want to check not only the file modification time but also make sure that the file is not currently being modified.
      Usually it is sufficient to leave out files that have changed in the last minute or so. If you need to be absolutely sure that the file is not in use, then use a call to the *nix command lsof to list open files (it needs root privileges to see files system-wide).
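      For example (a sketch; lsof exits 0 when it finds open files):

          # Sketch: skip anything modified in the last minute, and
          # optionally ask lsof whether the file is currently open.
          sub safe_to_index {
              my $path  = shift;
              my $mtime = ( stat $path )[9];
              return 0 unless defined $mtime;      # gone; nothing to index
              return 0 if time() - $mtime < 60;    # still settling; next pass

              # lsof exits 0 if the file is open (root sees all users).
              my $rc = system qq{lsof -- \Q$path\E >/dev/null 2>&1};
              return $rc == 0 ? 0 : 1;
          }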

      That is actually very helpful to me. I can go over everything once and test mtime, queue the files whose modify times differ, and then md5 only those (sketched below). Very interesting.

      Of course there *is* a remote possibility that an absolute path will be reused for different data while the modify time stays the same. It *is* possible.

      Still, I like it. The only thing I can think of that would get around all of these problems would be to mess with the guts of the filesystem itself, which I am almost tempted to learn more about, someday.

      Thank you. This is helpful.
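      Something like the following is what I now have in mind (just a sketch; %db and queue_for_reindex() are stand-ins for my real tables and code):

          use strict;
          use warnings;
          use Digest::MD5;

          my %db;          # path => { mtime => ..., md5 => ... }, loaded elsewhere
          my @all_files;   # every document path in the archive, found elsewhere

          sub queue_for_reindex { print "reindex: $_[0]\n" }    # stand-in

          # Pass 1: cheap. stat() everything, keep files whose mtime moved.
          my @suspects = grep {
              my $mtime = ( stat $_ )[9];
                 !defined $mtime
              || !exists $db{$_}
              || $mtime != $db{$_}{mtime};
          } @all_files;

          # Pass 2: expensive. md5 only the suspects.
          for my $path (@suspects) {
              open my $fh, '<', $path or next;
              binmode $fh;
              my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
              close $fh;
              queue_for_reindex($path) if $md5 ne ( $db{$path}{md5} || '' );
          }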

        The only time the mtime AND ctime will be the same but the contents have changed is if the file was changed within the same time slice (the same second, or whatever resolution the filesystem uses) as the last check. You can safely ignore anything that hasn't changed for 2 checks.

        update: slightly stronger wording: ctime AND mtime
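        In other words (a sketch; $state is a hypothetical per-file record you keep between scans):

            # Only trust an unchanged mtime/ctime pair once it has been
            # seen on two consecutive scans.
            sub is_stable {
                my ( $path, $state ) = @_;
                my ( $mtime, $ctime ) = ( stat $path )[ 9, 10 ];
                return 0 unless defined $mtime;    # gone; not stable

                my $same = $mtime == ( $state->{mtime} || -1 )
                        && $ctime == ( $state->{ctime} || -1 );
                $state->{unchanged} = $same ? ( $state->{unchanged} || 0 ) + 1 : 0;
                @{$state}{qw(mtime ctime)} = ( $mtime, $ctime );
                return $state->{unchanged} >= 2;
            }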

Re: (OT) indexing pdf archive content in a multiuser environment- how do i know when the content changed?
by holli (Abbot) on May 26, 2007 at 19:34 UTC
    Inode was my other temptation, but that says little about the actual data inside. Maybe a combination of inode and mtime, but that is only good for a file that stays inside one filesystem. These files may move around, be edited, who knows.
    Just to add another perspective: how about getting rid of the filesystem and using some kind of Concurrent Versions System? Perhaps a web interface where users can download and upload documents? You could then, while you're at it, use your "some basic data" and provide your users with search functionality.

    Update:
    If you decide to go the filesystem route, you may find File::Monitor helpful.
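    In case it helps, the scan loop looks roughly like this (per the module's docs; the first scan() just establishes a baseline and reports nothing):

        use strict;
        use warnings;
        use File::Monitor;

        my $monitor = File::Monitor->new;
        $monitor->watch( { name => '/archive/documents', recurse => 1 } );
        $monitor->scan;    # baseline scan

        while (1) {
            sleep 60;
            for my $change ( $monitor->scan ) {
                print $change->name, " changed\n";
            }
        }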


    holli, /regexed monk/