What's the best way to avoid name collisions when creating files?

philiph has asked for the wisdom of the Perl Monks concerning the following question:

Re: What's the best way to avoid name collisions when creating files? by jeffa (Bishop) on May 02, 2005 at 20:29 UTC
How about Data::UUID? It was featured on the 10th day of last year's Perl Advent Calendar: http://perladvent.org/2004/10th.
Re^2: What's the best way to avoid name collisions when creating files? by adamk (Chaplain) on May 03, 2005 at 12:02 UTC
I concur. :) You want a truly unique identifier? "Universally Unique Identifier" pretty much describes it.
Re^3: What's the best way to avoid name collisions when creating files? by philiph (Acolyte) on May 03, 2005 at 13:36 UTC
Although Data::UUID looks like an interesting solution, one big annoyance for me is that it doesn't come standard with my Perl distribution (Fedora Core 1).
Re^4: What's the best way to avoid name collisions when creating files? by adamk (Chaplain) on May 03, 2005 at 15:16 UTC
Re^5: What's the best way to avoid name collisions when creating files? by philiph (Acolyte) on May 03, 2005 at 15:42 UTC
Re^4: What's the best way to avoid name collisions when creating files? by dragonchild (Archbishop) on May 03, 2005 at 13:49 UTC
Re^5: What's the best way to avoid name collisions when creating files? by halley (Prior) on May 03, 2005 at 15:59 UTC
Re: What's the best way to avoid name collisions when creating files? by suaveant (Parson) on May 02, 2005 at 20:00 UTC
Well... there is always the old date,time and pid combination... that works fine as long as the script doesn't handle multiple messages in a loop. Of course, if it does handle messages in a loop it is easy to tell what the last name you used was and increment a counter if it is the same. There is always file locking with something like flock... And I believe you can also use sysopen to create files if they aren't there and error if they are, but not 100% sure on that... something with O_CREAT and O_EXCL maybe...
Re^2: What's the best way to avoid name collisions when creating files? by bmann (Priest) on May 02, 2005 at 20:31 UTC
"And I believe you can also use sysopen to create files if they aren't there and error if they are," You can, and O_CREAT and O_EXCL are exactly the flags the OP needs. `C:\t>set DIRCMD=/b C:\t>dir newfile File Not Found C:\t>perl -MFcntl -e "sysopen F, 'newfile', O_EXCL | O_CREAT or die" C:\t>dir newfile newfile C:\t>perl -MFcntl -e "sysopen F, 'newfile', O_EXCL | O_CREAT or die" Died at -e line 1.`
Re^2: What's the best way to avoid name collisions when creating files? by osunderdog (Deacon) on May 02, 2005 at 21:11 UTC
There's a caveat to `flock`, though: it may not work across the network. If you know that you're always going to use the local file system, then great. However, if you move to a NAS, flock may stop working. I ran into this with `DBD::CSV`, which uses flock under the hood to ensure that it has exclusive access to the file it is reading or writing. If the file is on a NAS (Network Attached Storage) and accessed over NFS, `DBD::CSV` will fail to open the file.
Re^2: What's the best way to avoid name collisions when creating files? by doom (Deacon) on May 03, 2005 at 21:20 UTC
"Well... there is always the old date,time and pid combination... that works fine as long as the script doesn't handle multiple messages in a loop." Several people here have mentioned using the "pid", and I just wanted to make sure it's understood that the "$$" special variable always contains the process id. It's a very common idiom to name a temp file something like "/tmp/my_scripts_temp.$$", so that if you've got two instances of the same script running on the same machine at the same time they won't stomp on each other. Also, I've been known to do things like this: `my $filename = "$ENV{HOME}/tmp/temp.$$"; while (-e $filename) { $filename .= '.' . chr(int(rand(26)) + 65); }` though this, of course, presumes that you don't have to worry too much about race conditions.
Re: What's the best way to avoid name collisions when creating files? by scmason (Monk) on May 02, 2005 at 21:19 UTC
My first instinct would be to grab the MD5 sum of the message (including headers). You should be pretty safe there. As mentioned above, whatever method you use, you should always try to detect filename collisions and perhaps alter the name when one occurs. Most programs change filename to filename-2 in the case of a collision.
Re^2: What's the best way to avoid name collisions when creating files? by philiph (Acolyte) on May 03, 2005 at 13:40 UTC
But surely there must be a relatively reliable way to create filenames that are guaranteed unique, right? I suppose it depends how reliable you want it to be. I like the md5sum idea or something similar. For example, could I use the Message-Id? That is guaranteed unique, right? I suppose there are messages out there that screw that one up (in particular, spam). Also I would have to analyze any computational approach (such as doing the md5sum) to make sure it didn't slow down the process too much. What if I had to save a 10MB message? What's the time required to run an md5sum on that?
Re^3: What's the best way to avoid name collisions when creating files? by adamk (Chaplain) on May 03, 2005 at 15:21 UTC
One of MD5's nice properties is that it's ridiculously fast. One benchmark I've seen was something like 90 megabytes per second on a quite modest machine. MD4 is a little faster at around 100 megabytes per second, but most of the more cryptographically secure digests are slower, many MUCH slower. So much so that the usual advice is to keep using MD5 for non-sensitive stuff, despite recommendations to start moving away from it for signatures.
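The digest-based naming discussed in this subthread can be sketched roughly as follows. This is only an illustration: `digest_name` and the `.eml` suffix are hypothetical choices, not something from the thread, and it assumes the message is already in memory as a string of bytes. Digest::MD5 ships with Perl 5.8 and later.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: derive a file name from the raw message text
# (headers included). The .eml suffix is an arbitrary choice.
sub digest_name {
    my ($message) = @_;
    return md5_hex($message) . '.eml';
}

print digest_name("From: a\@example.com\n\nhello\n"), "\n";
```

Note that two byte-identical messages produce the same name, so this approach still needs a collision check when the file is created.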
Re: What's the best way to avoid name collisions when creating files? by Fletch (Bishop) on May 02, 2005 at 20:55 UTC
Might take a look at how qmail's maildir format handles a similar application.
Re: What's the best way to avoid name collisions when creating files? by Cody Pendant (Prior) on May 03, 2005 at 04:58 UTC
I just use random strings, with this sub I found by searching PerlMonks: `sub rndlc{local$"=''; "@{[map{chr(97+int rand 26)} 1 .. shift]}" }; my $filename = rndlc(10);` You can always attach the random string at the end of a date-time string to get more human-friendly filenames. Of course there's no guarantee you won't get the same string twice, but the odds are 26¹⁰ to 1 against it...
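For readers who find the compact sub above hard to parse, here is an expanded sketch that should behave the same way. The name `random_lowercase` is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical, more readable equivalent of the rndlc sub above:
# returns $length random characters from 'a' .. 'z'.
sub random_lowercase {
    my ($length) = @_;
    # chr(97) is 'a'; int(rand 26) yields an integer from 0 to 25.
    return join '', map { chr(97 + int rand 26) } 1 .. $length;
}

my $filename = random_lowercase(10);
print "$filename\n";
```

As with the original, nothing here prevents a repeat, so it is still worth checking for an existing file before writing.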
Re^2: What's the best way to avoid name collisions when creating files? by DrHyde (Prior) on May 03, 2005 at 09:03 UTC
You are assuming that the random number generator's results are evenly distributed. You might like to check the quality of `rand`'s results using Statistics::ChiSquare.
Re: What's the best way to avoid name collisions when creating files? by inman (Curate) on May 03, 2005 at 09:23 UTC
How about creating one directory per day, with sub-directories (File::Path) that reflect the structure of the information that you are trying to archive (mailbox name etc.)? You should be able to add the e-mails to this structure with much less chance of collision (using File::Temp to make sure). When your daily archiving task finishes, the directory can be zipped (Archive::Zip) for long-term storage. Although this isn't substantially different from the previous suggestions, the extra structure will benefit you in the long term. At some point, someone will want to retrieve an e-mail from the archive. Storing by date allows you to restrict the scope of the text searching that you do later.
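A minimal sketch of the per-day directory layout described above, assuming a root archive directory and a mailbox name as inputs. The helper name `daily_dir` and the `YYYY-MM-DD` directory format are assumptions for illustration. `make_path` comes from newer versions of File::Path; on older versions the equivalent call is `mkpath`.

```perl
use strict;
use warnings;
use File::Path qw(make_path);
use File::Spec;
use POSIX qw(strftime);

# Hypothetical helper: build, and create if needed,
# <root>/<YYYY-MM-DD>/<mailbox>, then return the path.
sub daily_dir {
    my ($root, $mailbox, $when) = @_;
    $when = time unless defined $when;
    my $day = strftime('%Y-%m-%d', localtime $when);
    my $dir = File::Spec->catdir($root, $day, $mailbox);
    make_path($dir);    # does nothing if the directory already exists
    return $dir;
}
```

A call such as `daily_dir('/var/archive', 'inbox')` (the path is just an example) would return today's directory for that mailbox, and the files inside it can then be created with any of the collision-safe methods from this thread.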
Re: What's the best way to avoid name collisions when creating files? by philiph (Acolyte) on May 04, 2005 at 12:41 UTC
Thanks for all the excellent suggestions. Right now I'm using filenames of the format <unix time>.<pid>.<counter>. I'm opening the files in a loop with O_CREAT and O_EXCL and if the initial open fails I increment the counter and keep trying until it works. I think that will be more than adequate for my needs.
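The actual script isn't shown in this thread, so the following is only a sketch of what the loop described above might look like. The helper name `create_unique_file`, the 0600 permissions, and the upper bound on the counter are all assumptions.

```perl
use strict;
use warnings;
use Fcntl qw(O_WRONLY O_CREAT O_EXCL);
use Errno qw(EEXIST);

# Hypothetical helper: create a new file named <unix time>.<pid>.<counter>
# in $dir without ever overwriting an existing file.
sub create_unique_file {
    my ($dir) = @_;
    my $stamp = time;
    for my $counter (0 .. 9999) {    # arbitrary upper bound for this sketch
        my $path = "$dir/$stamp.$$.$counter";
        # O_EXCL makes the open fail if the file already exists, so the
        # existence check and the creation happen in a single step.
        if (sysopen(my $fh, $path, O_WRONLY | O_CREAT | O_EXCL, 0600)) {
            return ($fh, $path);
        }
        # Any error other than "file exists" is a real failure.
        die "Cannot create $path: $!" unless $! == EEXIST;
    }
    die "Gave up looking for a free name in $dir";
}
```

Calling `my ($fh, $path) = create_unique_file($dir);` returns an open write handle plus the name that was chosen. Because the check and the creation are one system call, this avoids the race that a separate `-e` test followed by `open` would have.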
Re^2: What's the best way to avoid name collisions when creating files? by suaveant (Parson) on May 04, 2005 at 14:25 UTC
And it is nice to have useful info in the filename. I use date, time and pid for my maildir stuff, and it lets me easily identify message dates and times. That has actually been useful a couple of times for things like spam stats.
Re: What's the best way to avoid name collisions when creating files? by wizkid (Initiate) on May 02, 2005 at 22:39 UTC
File::Temp is not an option, because it tries to delete the file as soon as you don't need it anymore. I truly believe you will get along fine with a combination of time and pid.
Re^2: What's the best way to avoid name collisions when creating files? by blazar (Canon) on May 03, 2005 at 08:14 UTC
Huh?!? From File::Temp: `$tmp = new File::Temp( UNLINK => 0, SUFFIX => '.dat' );` not to mention several other options, like `$unopened_file = mktemp( $template );`
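To illustrate the point that File::Temp can leave its files in place, here is a small sketch using the module's `tempfile` function. The helper name `new_message_file`, the `msgXXXXXXXX` template, and the `.dat` suffix are made up for this example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: create a uniquely named file in $dir that is
# NOT deleted when the program exits.
sub new_message_file {
    my ($dir) = @_;
    # The trailing X characters in the template are replaced with random
    # characters; UNLINK => 0 asks File::Temp to keep the file.
    my ($fh, $path) = tempfile(
        'msgXXXXXXXX',
        DIR    => $dir,
        SUFFIX => '.dat',
        UNLINK => 0,
    );
    return ($fh, $path);
}
```

The returned handle is already open for writing, and the file stays on disk until something else removes it.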
Re^3: What's the best way to avoid name collisions when creating files? by wizkid (Initiate) on May 05, 2005 at 09:07 UTC
ok, i forgot that ;)