philiph has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing a script that archives mail messages for my entire site. Thus, the volume could potentially be pretty high. This script is called from a .forward file, one invocation per message. I do some processing on each message (mainly deciding whether or not to archive it). If I do archive the message, I write it to a file (one file per message). My question is what is the best way to avoid filename collisions? Right now I generate a name for each message file like this:
    use Time::HiRes qw(gettimeofday);
    use POSIX qw(strftime);

    my $ArchiveDir = "/tmp/archive/";
    ( -w $ArchiveDir ) or die "archive dir $ArchiveDir not writable";

    # Get the number of microseconds.
    (undef, $usec) = gettimeofday;

    # Now create the archive filename and open that file. Use
    # sub-second resolution for name to make sure it's unique.
    $ArchiveFile = strftime "%Y%m%d%H%M%S", localtime;

    # make sure we get the full 6 digits on microseconds.
    $ArchiveFile .= "." . sprintf("%06d",$usec);

    # Combine $ArchiveDir with the filename.
    $ArchiveFile = $ArchiveDir . $ArchiveFile;

    # open the file.
    open(ArchiveFH,">$ArchiveFile") or die "failed to open ArchiveFile: $!";
So this creates files of the format 20050502133658.410018. Now to anticipate several questions: since I'm using microseconds here, it seems highly unlikely that two of my archive files could get the same name. But is it technically possible? What is the best way to generate unique files? Should I be using File::Temp? One annoyance with that is that it enforces certain constraints on the filename.

Replies are listed 'Best First'.
Re: What's the best way to avoid name collisions when creating files?
by jeffa (Bishop) on May 02, 2005 at 20:29 UTC
      I concur. :)

      You want a truly unique identifier? "Universally Unique Identifier" pretty much describes it.
        Although Data::UUID looks like an interesting solution, one big annoyance for me is it doesn't come standard in my Perl distribution (Fedora Core 1).
Re: What's the best way to avoid name collisions when creating files?
by suaveant (Parson) on May 02, 2005 at 20:00 UTC
    Well... there is always the old date, time and pid combination... that works fine as long as the script doesn't handle multiple messages in a loop. Of course, if it does handle messages in a loop, it is easy to tell what the last name you used was and increment a counter if it is the same.

    There is always file locking with something like flock...

    And I believe you can also use sysopen to create files if they aren't there and error if they are, but not 100% sure on that... something with O_CREAT and O_EXCL maybe...

                    - Ant
                    - Some of my best work - (1 2 3)

      And I believe you can also use sysopen to create files if they aren't there and error if they are,

      You can, and O_CREAT and O_EXCL are exactly the flags the OP needs.

      C:\t>set DIRCMD=/b
      C:\t>dir newfile
      File Not Found
      C:\t>perl -MFcntl -e "sysopen F, 'newfile', O_EXCL | O_CREAT or die"
      C:\t>dir newfile
      newfile
      C:\t>perl -MFcntl -e "sysopen F, 'newfile', O_EXCL | O_CREAT or die"
      Died at -e line 1.

      There's a caveat to flock though: it doesn't work across the network. If you know that you're always going to use the local file system, then great. However, if you move to a NAS, flock may stop working.

      I ran into this with DBD::CSV. DBD::CSV will use flock under the hood to ensure that it has exclusive access to the file it is reading/writing. However, if the file is on a NAS (Network Attached Storage) and accessed with NFS, then DBD::CSV will fail to open the file.
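For reference, a minimal sketch of the flock approach on a local filesystem, using only the core Fcntl module. The helper name `with_lock` and the idea of a separate lock file are assumptions for illustration, not something from the thread. As noted above, this may not hold up on NFS.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Hypothetical helper: run $code while holding an exclusive advisory lock
# on $lockfile. Other cooperating processes block in flock() until the
# lock is released. Returns the scalar result of $code.
sub with_lock {
    my ($lockfile, $code) = @_;
    open(my $lock_fh, '>>', $lockfile)
        or die "cannot open lock file $lockfile: $!";
    flock($lock_fh, LOCK_EX)
        or die "cannot lock $lockfile: $!";
    my $result = $code->();
    close($lock_fh);    # closing the handle releases the lock
    return $result;
}
```

The lock is advisory: it only protects against other processes that also call flock on the same file.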

      Soon to be unemployed!
      Well... there is always the old date,time and pid combination... that works fine as long as the script doesn't handle multiple messages in a loop.
      Several people here have mentioned using the "pid", and I just wanted to make sure it's understood that the "$$" special variable always contains the process id. It's a very common idiom to name a temp file something like "/tmp/my_scripts_temp.$$", so that if you've got two instances of the same script running on the same machine at the same time they won't stomp on each other.

      Also, I've been known to do things like this:

      my $filename = "$ENV{HOME}/tmp/temp.$$";    # Perl does not expand "~"
      while (-e $filename) {
          $filename .= '.' . chr(int(rand(26)) + 65);    # append a random A-Z
      }
      though this, of course, presumes that you don't have to worry too much about race conditions.

Re: What's the best way to avoid name collisions when creating files?
by scmason (Monk) on May 02, 2005 at 21:19 UTC
    My first instinct would be to grab the md5 sum of the message (including headers). You should be pretty safe there. As mentioned above, no matter what method you use, you should always try to detect filename collisions and perhaps alter the name based on that. Most programs tend to change filename to filename-2 in the case of a collision.
      But surely there must be a relatively reliable way to create filenames that are guaranteed unique, right? I suppose it depends how reliable you want it to be. I like the md5sum idea or something similar. For example, could I use the Message-Id? That is guaranteed unique, right? I suppose there are messages out there that screw that one up (in particular, spam).

      Also I would have to analyze any computational approach (such as doing the md5sum) to make sure it didn't slow down the process too much. What if I had to save a 10MB message? What's the time required to run an md5sum on that?

        One of MD5's nice properties is that it's ridiculously fast.

        One benchmark I've seen was something like 90 megabytes per second on a quite modest machine. MD4 is a little faster at around 100 megabytes per second, but most of the more cryptographically secure digests are slower, many MUCH slower.

        So much so that the recommendations are to keep using MD5 for non-sensitive stuff, despite recommendations to start moving away for signatures.
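A rough sketch of the MD5 naming idea, using the core Digest::MD5 module. The helper name `md5_name_for` is hypothetical, and it assumes the message has already been written to a file; a script reading from STDIN would pass that handle to `addfile` instead.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper: return the hex MD5 digest of a file's contents.
# addfile() reads from the handle in chunks, so a large message is not
# slurped into memory in one piece.
sub md5_name_for {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "cannot open $path: $!";
    binmode($fh);
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close($fh);
    return $digest;
}
```

Two byte-identical messages produce the same digest, so a name built this way still needs a collision check when the file is created.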
Re: What's the best way to avoid name collisions when creating files?
by Fletch (Bishop) on May 02, 2005 at 20:55 UTC

    Might take a look at how qmail's maildir format handles a similar application.
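A loose sketch of the general maildir idea as commonly described: build a name from time, pid and hostname, write the message under tmp/, then link it into new/. This is an illustration under those assumptions, not a faithful implementation of the maildir specification, and the helper name `deliver_maildir_style` is hypothetical.

```perl
use strict;
use warnings;
use Fcntl qw(O_WRONLY O_CREAT O_EXCL);
use Sys::Hostname qw(hostname);

# Hypothetical helper: deliver $message into a maildir-like directory that
# already contains tmp/ and new/ subdirectories. Returns the final path.
sub deliver_maildir_style {
    my ($maildir, $message) = @_;
    my $name = join '.', time(), $$, hostname();
    my $tmp  = "$maildir/tmp/$name";
    my $new  = "$maildir/new/$name";

    # O_EXCL makes the create fail instead of overwriting an existing file.
    sysopen(my $fh, $tmp, O_WRONLY | O_CREAT | O_EXCL, 0600)
        or die "cannot create $tmp: $!";
    print {$fh} $message or die "write to $tmp failed: $!";
    close($fh) or die "close of $tmp failed: $!";

    # link() fails if $new already exists, so nothing is overwritten.
    link($tmp, $new) or die "cannot link $tmp to $new: $!";
    unlink($tmp);
    return $new;
}
```

Because the file only appears in new/ once it is completely written, a reader of new/ never sees a partial message.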

Re: What's the best way to avoid name collisions when creating files?
by Cody Pendant (Prior) on May 03, 2005 at 04:58 UTC
    I just use random strings, with this sub I found by searching PerlMonks:
    sub rndlc{local$"=''; "@{[map{chr(97+int rand 26)} 1 .. shift]}" };
    my $filename = rndlc(10);
    You can always attach the random string at the end of a date-time string to get more human-friendly filenames.

    Of course there's no guarantee you won't get the same string twice, but the odds are 26^10 against it...
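To put a number on those odds: 10 lowercase letters give 26**10 possible names, and the usual birthday approximation estimates the chance that any two of n names collide. The sketch below assumes rand is uniformly distributed, and the function name `collision_probability` is made up for illustration.

```perl
use strict;
use warnings;

# Approximate chance that at least two of $n random names collide when
# each name is drawn uniformly from $space possibilities
# (birthday approximation: 1 - exp(-n(n-1) / 2N)).
sub collision_probability {
    my ($n, $space) = @_;
    return 1 - exp(-$n * ($n - 1) / (2 * $space));
}

my $space = 26 ** 10;    # number of distinct 10-letter lowercase names
printf "%d names: %.6f\n", $_, collision_probability($_, $space)
    for 1_000, 1_000_000;
```

The odds for a single pair of names are tiny, but the chance of some collision grows roughly with the square of the number of files in the directory.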



    ($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
    =~y~b-v~a-z~s; print
      You are assuming that the random number generator's results are evenly distributed. You might like to check the quality of rand's results using Statistics::ChiSquare.
Re: What's the best way to avoid name collisions when creating files?
by inman (Curate) on May 03, 2005 at 09:23 UTC
    How about creating one directory per day with sub-directories (File::Path) that reflect the structure of the information that you are trying to archive (mailbox name etc.). You should be able to add the e-mails to this structure with much less chance of collision (using File::Temp to make sure). When your daily archiving task finishes, the directory can be zipped (Archive::Zip) for long term storage.

    Although this isn't substantially different from the previous suggestions, the extra structure will benefit you in the long term. At some point, someone will want to retrieve an e-mail from the archive. Storing by date allows you to restrict the scope of the text searching that you do later.
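A minimal sketch of the per-day layout, using the core modules File::Path, File::Temp and POSIX. The helper name `open_daily_archive_file`, the `msgXXXXXXXX` template and the `.eml` suffix are assumptions for illustration, and $mailbox is assumed to be a safe directory name.

```perl
use strict;
use warnings;
use File::Path qw(make_path);
use File::Temp qw(tempfile);
use POSIX qw(strftime);

# Hypothetical helper: create (if needed) $root/YYYYMMDD/$mailbox and open
# a new, uniquely named file inside it. Returns (filehandle, filename).
sub open_daily_archive_file {
    my ($root, $mailbox) = @_;
    my $dir = join '/', $root, strftime('%Y%m%d', localtime), $mailbox;
    make_path($dir);    # does nothing if the directory already exists
    # UNLINK => 0 keeps the file around after the program exits.
    return tempfile('msgXXXXXXXX', DIR => $dir, SUFFIX => '.eml', UNLINK => 0);
}
```

A call such as `my ($fh, $file) = open_daily_archive_file('/tmp/archive', 'inbox');` would then hand back a handle ready for writing the message.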

Re: What's the best way to avoid name collisions when creating files?
by philiph (Acolyte) on May 04, 2005 at 12:41 UTC
    Thanks for all the excellent suggestions. Right now I'm using filenames of the format <unix time>.<pid>.<counter>. I'm opening the files in a loop with O_CREAT and O_EXCL and if the initial open fails I increment the counter and keep trying until it works. I think that will be more than adequate for my needs.
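The loop described above might look roughly like this. It is a sketch, not the actual script: the helper name `open_unique`, the file mode and the upper bound on the counter are all assumptions.

```perl
use strict;
use warnings;
use Fcntl qw(O_WRONLY O_CREAT O_EXCL);
use Errno qw(EEXIST);

# Hypothetical helper: open a new file named <unix time>.<pid>.<counter>
# in $dir, bumping the counter until the exclusive create succeeds.
# Returns (filehandle, filename).
sub open_unique {
    my ($dir) = @_;
    my $base = join '.', time(), $$;
    for my $counter (0 .. 9999) {    # arbitrary upper bound for the sketch
        my $file = "$dir/$base.$counter";
        if (sysopen(my $fh, $file, O_WRONLY | O_CREAT | O_EXCL, 0600)) {
            return ($fh, $file);
        }
        # Only "file exists" means try the next name; anything else is fatal.
        die "cannot create $file: $!" unless $! == EEXIST;
    }
    die "gave up looking for a free name in $dir";
}
```

Because the existence check and the create happen in one sysopen call, there is no window between testing a name and claiming it.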
      And it is nice to have useful info in the filename, I use date time pid for my maildir stuff and it lets me easily identify message dates and times... actually been useful a couple of times for things like spam stats.

                      - Ant
                      - Some of my best work - (1 2 3)

Re: What's the best way to avoid name collisions when creating files?
by wizkid (Initiate) on May 02, 2005 at 22:39 UTC
    File::Temp is not an option, for it tries to delete the file as soon as you don't need it anymore. I truly believe you will get along fine with a combination of time and pid.
      Huh?!?

      From File::Temp:

      $tmp = new File::Temp( UNLINK => 0, SUFFIX => '.dat' );

      not to mention several other options, like

      $unopened_file = mktemp( $template );
        ok, i forgot that ;)