spx2 has asked for the wisdom of the Perl Monks concerning the following question:

Keywords: SHA1, databases, Higher Order Perl, hashing, efficiency
Hello,

I have thought about this a couple of times, and now I can formulate what at first
was just a vague feeling.

We are given the following: a database with a table of three columns => (name_of_file, sha1_of_file, last_modified_time).
This database is built once, from all the files on the disk, and that takes some time.
Periodically one needs to update the database because some of the files might have been modified.
This is also time consuming; yes, it is less time consuming than building the
whole database from scratch again, but it is time consuming nonetheless.

So for each file it is necessary to get the file, compare its last_modified_time with the one in the
database and, if they differ, update its checksum in the database (because we know it has been
modified).
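
A minimal sketch of that per-file check, assuming a DBI handle and the table above; the SQL, the REPLACE statement and the SQLite/MySQL flavour are illustrative, not my actual code:

use strict;
use warnings;
use DBI;
use Digest::SHA;                 # provides SHA-1 as well as the newer digests

# Check one file against the database and rehash it only if its mtime changed.
sub update_file {
    my ($dbh, $path) = @_;
    my @st = stat $path or return;       # skip files that vanished meanwhile
    my $mtime = $st[9];

    my ($db_mtime) = $dbh->selectrow_array(
        'SELECT last_modified_time FROM files WHERE name_of_file = ?',
        undef, $path,
    );
    return if defined $db_mtime && $db_mtime == $mtime;   # unchanged, skip

    my $sha1 = Digest::SHA->new(1)->addfile($path)->hexdigest;
    $dbh->do(
        'REPLACE INTO files (name_of_file, sha1_of_file, last_modified_time)
         VALUES (?, ?, ?)',
        undef, $path, $sha1, $mtime,
    );
}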

However, I felt that for small files a "strange" phenomenon might take place.
Because they are small (we do not yet define "small" - we denote this unknown by (1)), the time it takes to
hash them will also be small (we also have no clear definition or function that
describes how the characteristics of a file influence its hashing speed - we denote this unknown by (2)).
So the question is: is hashing a small file faster than fetching its row from the database?
If it is, then for small enough files we can decide
not to look them up in the database at all and simply hash them directly, without any other reasoning
(this is somewhat similar to the efficiency of the hashes described in Higher Order Perl).
That would be a nice optimisation; I cannot tell in advance whether the gain would be big or small, but
the disk holds a lot of files (on the order of 10^6).
We need to know under what conditions this happens, but we are unable to determine that
because of (1) and (2).
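
One way to attack unknowns (1) and (2) empirically is to benchmark the two operations directly for a representative small file; a rough sketch, assuming an SQLite database named files.db with the table above (the break-even size will depend on your disk, database and cache state):

use strict;
use warnings;
use Benchmark qw(cmpthese);
use DBI;
use Digest::SHA;

my $dbh  = DBI->connect('dbi:SQLite:dbname=files.db', '', '', { RaiseError => 1 });
my $file = 'some/small/file.txt';      # pick a representative small file

# Run each alternative for at least 2 CPU seconds and compare the rates.
cmpthese(-2, {
    hash_file => sub {
        Digest::SHA->new(1)->addfile($file)->hexdigest;
    },
    db_lookup => sub {
        $dbh->selectrow_array(
            'SELECT last_modified_time FROM files WHERE name_of_file = ?',
            undef, $file,
        );
    },
});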
Any ideas or suggestions are very much appreciated. Thank you.
Note: this is related to a previous node.

Replies are listed 'Best First'.
Re: the sands of time(in search of an optimisation)
by BrowserUk (Patriarch) on Mar 04, 2008 at 03:14 UTC

    If you're dealing with Win32/NTFS systems, then there are a couple of vastly more efficient ways to achieve your aim.

    The first is to use the Change Notify mechanism (see Win32::ChangeNotify). This would involve creating a daemon or service to register your interest in changes to the file system. It runs perpetually in the background, blocking on a change notification, and you have it update the database as the changes occur. Very efficient, but it requires a permanently running process.
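
    A bare-bones sketch of such a daemon, assuming Win32::ChangeNotify; note that the notification only says "something under the directory changed", so rescan_and_update() is a stub standing in for your own code that stats and rehashes the affected files:

    use strict;
    use warnings;
    use Win32::ChangeNotify;

    my $dir    = 'C:\\data';
    my $notify = Win32::ChangeNotify->new($dir, 1, 'FILE_NAME LAST_WRITE')
        or die "Cannot watch $dir: $^E";

    while (1) {
        my $rc = $notify->wait(10_000);      # block up to 10s for a change
        if ($rc == 1) {
            rescan_and_update($dir);
            $notify->reset;                  # re-arm the notification
        }
    }

    # Placeholder for the real work: stat changed files, rehash, UPDATE the rows.
    sub rescan_and_update {
        my ($dir) = @_;
        warn "would rescan $dir and update the database here\n";
    }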

    The second is to use Change Journaling. Basically this involves asking the system to record all filesystem changes in a journal (file). You can then periodically retrieve those changes, update your DB and reset the journal.

    The following comes from (MS) Change Journals:

    An automatic backup application is one example of a program that must check for changes to the state of a volume to perform its task. The brute force method of checking for changes in directories or files is to scan the entire volume. However, this is often not an acceptable approach because of the decrease in system performance it would cause. Another method is for the application to register a directory notification (by calling the FindFirstChangeNotification or ReadDirectoryChangesW functions) for the directories to be backed up. This is more efficient than the first method, however, it requires that an application be running at all times. Also, if a large number of directories and files must be backed up, the amount of processing and memory overhead for such an application might also cause the operating system's performance to decrease.

    To avoid these disadvantages, the NTFS file system maintains a change journal. When any change is made to a file or directory in a volume, the change journal for that volume is updated with a description of the change and the name of the file or directory.

    The first method has a *nix equivalent mechanism. And I believe some *nix filesystems are capable of the second.
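
    For the *nix side, a sketch of the first method using Linux::Inotify2 (a wrapper around the kernel's inotify API, so no kernel module is needed); update_row() is a stub for the stat/rehash/UPDATE code, and note that watches are per-directory, not recursive:

    use strict;
    use warnings;
    use Linux::Inotify2;

    my $inotify = Linux::Inotify2->new
        or die "Cannot create inotify object: $!";

    # Fire the callback whenever a file under /data is written, moved in or deleted.
    $inotify->watch('/data', IN_CLOSE_WRITE | IN_MOVED_TO | IN_DELETE, sub {
        my $event = shift;
        update_row($event->fullname);
    });

    1 while $inotify->poll;                  # block and dispatch events forever

    # Placeholder for the real work: stat the file, SHA1 it, UPDATE its row.
    sub update_row {
        my ($path) = @_;
        warn "would rehash and update $path\n";
    }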


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Hello,

      Thank you.
      I have talked with some people on IRC and they suggest a similar approach: hooking the system's low-level
      API so that you are notified when a file has changed.
      In particular, on Linux I think I would have to hook the fclose function by writing a kernel module that
      "overwrites" it with a user-defined one.
Re: the sands of time(in search of an optimisation)
by dragonchild (Archbishop) on Mar 04, 2008 at 01:57 UTC
    Define "takes some time". If it's something that can be done overnight and it takes less than 8 hours, you're fine. In other words, don't optimize without demonstrating a need to do so.

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
      Hello,

      At the moment, some benchmarks put the time it takes to build the database for 110,000 files
      at 110 minutes.
      Whether that counts as "some time" or "little time" is arguable.
      My plan is to imagine this scaling to a server with, say, 100,000,000 files and still do well.
      That is why I am searching for optimisations like this one.
      I have also made some benchmarks of the software; here they are.
        So, it takes you 1 minute for every 1000 files you work with. Put another way, you can process about 17 files/second. To me, this means you're hitting the fundamental limits of Perl. Perl is, frankly, a very slow language from a CPU perspective. That's not what it was optimized for. It's been optimized for developer speed.

        So, I would put forward that you really have two options:

        • Rewrite in C - should give you a 10-1000x speed improvement.
        • Fork. A lot.
        I would try the forking option first. Look at Parallel::ForkManager. There are a number of ways you can divide the iteration among the children, depending on how your directories and files are laid out. But that's what I'd do first.
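
        A minimal sketch of that forking approach, assuming Parallel::ForkManager; the chunk size, process count and hash_and_update() are placeholders to adapt to your own layout (each child should also open its own database handle):

        use strict;
        use warnings;
        use List::Util qw(min);
        use Parallel::ForkManager;

        my @files = @ARGV;                          # or however you gather the file list
        my $pm    = Parallel::ForkManager->new(8);  # at most 8 children at once

        my $chunk = 1000;                           # files per child
        for (my $i = 0; $i < @files; $i += $chunk) {
            my @batch = @files[ $i .. min($i + $chunk - 1, $#files) ];
            $pm->start and next;                    # parent: queue the next batch
            hash_and_update($_) for @batch;         # child: does the actual work
            $pm->finish;
        }
        $pm->wait_all_children;

        # Placeholder for the real work: stat the file, SHA1 it, UPDATE its row.
        sub hash_and_update {
            my ($file) = @_;
            warn "would hash and update $file\n";
        }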

        My criteria for good software:
        1. Does it work?
        2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?