in reply to Super fast file creation needed

  1. What constitutes a duplicate? It seems to me that each line would have to be unique, given the timestamps and the chronological nature of log files. (Perhaps these are overlapping logfiles?)

  2. If the logfiles are overlapping, then instead of creating all the files, why not just calculate a checksum for each line and then stuff that into a hash? You could then check for the checksum's existence to see if that line has already been seen.
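
A minimal sketch of idea 2, assuming the core Digest::MD5 module is an acceptable checksum. The sub name `dedupe_by_digest` and the sample log lines are made up for illustration; they are not from the original question.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Hypothetical helper: return only the lines whose digest has not been
# seen before. The hash stores the 16-byte binary digest as the key,
# never the line itself.
sub dedupe_by_digest {
    my @lines = @_;
    my %seen;
    return grep { !$seen{ md5($_) }++ } @lines;
}

# Invented sample data: the third line repeats the second, as it might
# if two logfiles overlapped.
my @log = (
    "2007-10-18 23:59:58 GET /index.html\n",
    "2007-10-18 23:59:59 GET /about.html\n",
    "2007-10-18 23:59:59 GET /about.html\n",
);

my @unique = dedupe_by_digest(@log);
print @unique;
```

For files too large to slurp, the same test would presumably sit inside a `while (my $line = <$fh>)` loop, printing each line whose digest is new.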

I apologize in advance if this is an oversimplification of the problem. Perhaps you can provide more detail?


Where do you want *them* to go today?

Re^2: Super fast file creation needed
by dsheroh (Monsignor) on Oct 19, 2007 at 04:46 UTC
    For #2, I don't really see the point of hashing the lines (with a checksum) and then storing the results in a hash (which will hash the checksums again). The only reason I can think of would be to speed up lookups by using the checksum as a shorter hash key, but I would expect the extra time spent computing the checksums to outweigh any gain in lookup time. There's also the possibility of collisions between checksums, which would have to be handled separately, even though Perl hashes already handle collisions among their own hashed keys.

    Just using the log lines as your hash keys directly seems simpler, faster, and more reliable, unless I'm missing something here. Am I?
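
    For comparison, a minimal sketch of the direct approach, with the line itself as the hash key. The sub name `dedupe_by_line` and the sample strings are hypothetical.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: the whole line is the hash key, so Perl's own
    # hashing and collision handling do all the work.
    sub dedupe_by_line {
        my @lines = @_;
        my %seen;
        return grep { !$seen{$_}++ } @lines;
    }

    my @unique = dedupe_by_line("a\n", "b\n", "a\n");
    print @unique;
    ```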

      I understood that the logfiles were huge, which, IMHO, makes storing the entire lines as hash keys impractical due to memory considerations.

      Sure, computing checksums/digests might slow things down somewhat, but it is one way to identify whether a line has been seen or not. With the proper digest length, digest collisions could be virtually eliminated.
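
      A rough way to put numbers on "proper digest length", assuming the digest behaves like a uniformly random function. Under that assumption the expected number of colliding pairs among n distinct lines is about n*(n-1)/2 divided by 2**bits, which is also an upper bound on the chance of any collision. The sub name and the figure of one billion lines are illustrative assumptions.

      ```perl
      use strict;
      use warnings;

      # Expected number of colliding pairs among $n distinct lines for a
      # digest of $bits bits, assuming uniformly distributed digests.
      # A result well below 1 also bounds the collision probability.
      sub collision_estimate {
          my ($n, $bits) = @_;
          return $n * ($n - 1) / 2 / 2**$bits;
      }

      # Assumed workload: one billion distinct lines.
      printf "%3d-bit digest: %.3g\n", $_, collision_estimate(1e9, $_)
          for 32, 64, 128;
      ```

      On these assumptions a 32-bit checksum collides constantly at that scale, a 64-bit one leaves a few percent of risk, and a 128-bit digest makes an accidental collision negligible.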

      In this case, I think the memory considerations outweigh the speed considerations, but it would certainly be prudent to benchmark both ways to see which one works better.
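
      A sketch of such a benchmark using the core Benchmark module. The synthetic lines, the sub names, and the iteration count are all assumptions; real log data would give more meaningful numbers.

      ```perl
      use strict;
      use warnings;
      use Benchmark qw(cmpthese);
      use Digest::MD5 qw(md5);

      # Synthetic stand-in for real log data: 10,000 distinct lines,
      # each appearing twice.
      my @lines;
      for my $i (1 .. 10_000) {
          my $line = sprintf "2007-10-19 04:46:%02d host%05d GET /page/%d\n",
              $i % 60, $i, $i;
          push @lines, $line, $line;
      }

      # Count distinct lines using the full line as the hash key.
      sub count_by_line {
          my %seen;
          $seen{$_}++ for @lines;
          return scalar keys %seen;
      }

      # Count distinct lines using the 16-byte MD5 digest as the hash key.
      sub count_by_digest {
          my %seen;
          $seen{ md5($_) }++ for @lines;
          return scalar keys %seen;
      }

      # Both strategies must agree before their timings mean anything.
      die "strategies disagree" unless count_by_line() == count_by_digest();

      # Run each strategy 50 times and print a comparison table.
      cmpthese(50, {
          line_keys   => \&count_by_line,
          digest_keys => \&count_by_digest,
      });
      ```

      This only measures speed. For the memory side, something like `total_size` from the CPAN module Devel::Size could be pointed at each `%seen` hash.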


      Where do you want *them* to go today?
        Ah, OK. Fair enough. Thanks for clearing that up!