Re: efficient way to find most recently modified files?
by davido (Cardinal) on May 08, 2006 at 06:19 UTC
Just keep a separate database that tracks the most recent upload per user. Whenever bobby uploads a file, its filename replaces whatever filename was in bobby's database row previously. And whenever someone accesses bobby's page, the database is checked to see which file was most recently uploaded by bobby.
It's not a good solution to sort a list only to obtain the single newest. It's a better solution to either keep track of which one is newest, or do a linear search for the newest. In this particular case, I think you're just better off keeping track from the outset so you never have to search through 500,000 files.
For simple database solutions, you could have a look at DBD::SQLite, and of course, DBI. Or perhaps your web server provider already has some other database installed that they'll let you use too.
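By way of illustration, here is a minimal sketch of that idea with DBI and DBD::SQLite. The database file, table, and column names are placeholders of my own, not anything from your setup:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:SQLite:dbname=uploads.db', '', '',
    { RaiseError => 1, AutoCommit => 1 } );

$dbh->do(<<'SQL');
CREATE TABLE IF NOT EXISTS latest_upload (
    user     TEXT PRIMARY KEY,
    filename TEXT NOT NULL
)
SQL

# Call this from the upload script: one row per user, newest file wins.
sub record_upload {
    my ( $user, $filename ) = @_;
    $dbh->do(
        'INSERT OR REPLACE INTO latest_upload (user, filename) VALUES (?, ?)',
        undef, $user, $filename );
}

# Call this when rendering a user's page.
sub latest_for {
    my ($user) = @_;
    my ($filename) = $dbh->selectrow_array(
        'SELECT filename FROM latest_upload WHERE user = ?', undef, $user );
    return $filename;
}

record_upload( 'bobby', 'bobby2006-05-08.txt' );
print latest_for('bobby'), "\n";    # prints bobby2006-05-08.txt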
Re: efficient way to find most recently modified files?
by Zaxo (Archbishop) on May 08, 2006 at 06:20 UTC
If you're concerned about stuffed directories producing huge in-memory lists or arrays, opendir will give you a handle from which you can read one name at a time. Since you are looking for maximum times, you can keep memory requirements low and independent of population.
Your requirement for a creation time is tougher because it's not a portable statistic. Unix filesystems don't have any such thing. Think about that requirement, though. Is the intent to advertise newest content? Then an edited file ought to be as good as a newly uploaded one. In that case, mtime is what you want, and it is quite portable.
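For example, a sketch of that streaming approach ('somedir' is a placeholder directory): only the current winner is held in memory, so memory use stays constant no matter how many files are in the directory.

use strict;
use warnings;

my $dir = 'somedir';
opendir my $dh, $dir or die "Cannot open '$dir': $!";

my ( $newest_name, $newest_mtime ) = ( undef, -1 );
while ( defined( my $name = readdir $dh ) ) {
    next if $name eq '.' or $name eq '..';
    my $mtime = ( stat "$dir/$name" )[9];    # mtime is field 9 from stat
    next unless defined $mtime;
    if ( $mtime > $newest_mtime ) {
        ( $newest_name, $newest_mtime ) = ( $name, $mtime );
    }
}
closedir $dh;

print "Newest file: $newest_name\n" if defined $newest_name;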
System touch and Perl utime can alter file times, and touch can create empty files. Filesystem time stamps should not be regarded as definitive of anything.
You should take a look at File::Find. It's not the easiest thing to use, but it can handle any problem of this kind.
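A minimal File::Find sketch under the same assumption (placeholder directory name); the callback keeps only the newest mtime seen, and recursion into subdirectories comes for free:

use strict;
use warnings;
use File::Find;

my ( $newest_name, $newest_mtime ) = ( undef, -1 );
find(
    sub {
        return unless -f;               # plain files only
        my $mtime = ( stat _ )[9];      # reuse the stat buffer from -f
        if ( $mtime > $newest_mtime ) {
            ( $newest_name, $newest_mtime ) = ( $File::Find::name, $mtime );
        }
    },
    'somedir'
);
print "Newest file: $newest_name\n" if defined $newest_name;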
Update: ++davido's solution is best if you know that all uploads come in through your script. With shell or ftp access in the mix, you'd need to do a lot of maintenance in cron jobs.
Re: efficient way to find most recently modified files?
by jonadab (Parson) on May 08, 2006 at 11:35 UTC
Hello, just wondering: if you read a directory with, let's say, 500,000+ files... what happens? Will it take forever?
How long it takes depends somewhat on the filesystem. If it's smbfs, for instance, it'll take longer than you like.
One thing you could do to make it more efficient is to put each user's files in their own directory, so that you'd have bobby/2006-05-08.txt instead of bobby2006-05-08.txt. However, if looking for the latest file for a certain user is a frequent operation, then you should do as davido suggests and keep track of that explicitly. (You may not need a database for that, though; for instance, you could just have the uploader make susan-latest.txt a symlink to susan's last uploaded file, as in the sketch below. If the filesystem you are using does not support symbolic links, then you could just store the filename of george's latest file in george-latest-upload-filename.txt or somesuch.)
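A minimal sketch of the symlink idea, assuming the upload handler knows the name of the file it just wrote (the filenames here are just the examples above):

use strict;
use warnings;

my $uploaded = 'susan2006-05-08.txt';    # the file that was just uploaded
my $link     = 'susan-latest.txt';       # stable name that always points at it

unlink $link;                            # drop any previous "latest" link
symlink $uploaded, $link
    or die "Cannot symlink '$link' -> '$uploaded': $!";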
Sanity? Oh, yeah, I've got all kinds of sanity. In fact, I've developed whole new kinds of sanity. Why, I've got so much sanity it's driving me crazy.
Re: efficient way to find most recently modified files?
by BrowserUk (Patriarch) on May 08, 2006 at 14:14 UTC
Whatever filesystem you are on, storing 500,000+ files in a single directory is a really bad idea if you are regularly going to need access to some small subset of those files from short runtime scripts (eg. cgi scripts).
Given that you have a nice regular format for the filenames, it would be much better to segregate the files into smaller subsets by using some part of that filename as a directory name. Example: You might use
.\bobby\bobby2006-05-08.txt
or
.\2006\05\08\bobby2006-05-08.txt
Which of these schemes makes the most sense for your application will depend upon the details of the usage patterns, but the basic idea is to allow you to reduce the search space by going directly to some subset of files quickly.
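For instance, a sketch of the second, date-based scheme, assuming the filename format from the question (File::Path and File::Copy are core modules; the source and target paths here are placeholders):

use strict;
use warnings;
use File::Path qw(mkpath);
use File::Copy qw(move);

my $file = 'bobby2006-05-08.txt';
if ( my ( $user, $y, $m, $d ) = $file =~ /^(\D+)(\d{4})-(\d\d)-(\d\d)\.txt$/ ) {
    my $dir = "$y/$m/$d";               # or just "$user" for the first scheme
    mkpath $dir unless -d $dir;         # create 2006/05/08 as needed
    move( $file, "$dir/$file" ) or die "Cannot move '$file': $!";
}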
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: efficient way to find most recently modified files?
by mildside (Friar) on May 08, 2006 at 05:23 UTC
Please don't be offended, but is this homework? Since at this stage I suspect it might be, I won't go into too much detail, but I suggest you look at glob and sort to start with.
mildside
lol, not homework... I just really want to do this for my little "member blog" site.
Just building stuff out of curiosity, which is how I sorta learn.
OK, this is untested, but try something like:
my @files_sorted = sort { substr($a,-14,10) cmp substr($b,-14,10) } (glob('yourpath/bobby*.txt'));
my $latest_file = $files_sorted[-1];
This assumes that all '.txt' files have the name format you mention. Cheers
mildside
Re: efficient way to find most recently modified files?
by jwkrahn (Abbot) on May 08, 2006 at 11:08 UTC
Assuming that you only want files with the date embedded in the name you should be able to do something like this:
my %file = ( name => 'name', date => '0000-00-00' );
my $dir = 'somedir';
opendir DIR, $dir or die "Cannot open '$dir' $!";
while ( defined( my $file = readdir DIR ) ) {
    next unless $file =~ /(\d{4}-\d\d-\d\d)\.txt$/;
    if ( $1 gt $file{ date } ) {
        @file{ 'name', 'date' } = ( $file, $1 );
    }
}
closedir DIR;
print "Newest file is $file{name}.\n" if $file{ date } ne '0000-00-00';
Re: efficient way to find most recently modified files?
by Anonymous Monk on May 08, 2006 at 15:59 UTC
Hello,
I read some of the suggestions. thanks for all the replies!
For davido's suggestion: it's not possible to always keep track of the most recently uploaded blog, since I intend in the future for some of the users to upload via FTP and not just through the web. Sorry for not mentioning this in my first post.
And as for sub-directories for each member, that solution can work, but it would add an extra step for the uploader. It's the uploader's job to download each text file from about 5-10 users every 1-2 days and then upload all the text files to a directory. If I made it a sub-directory, the uploader would always have to navigate to /bobby/, /amanda/, etc. and upload each text file individually, as opposed to doing the job in one step by uploading all 5-10 text files to one directory.
As for the 500,000+... that's just my curiosity. The application I'm building now will probably never reach the point where it reads a directory with 500,000+ files, but if it ever does, I can keep these suggestions in mind.
Thanks!
nick
And for sub directories for each member, that solution can work but will make an extra step for the uploader.
A simple solution to that would be to have the files uploaded to a different directory and have a background script that monitors the upload directory and moves the files into the appropriate directory structure as they appear.
The only complication is ensuring that the file is fully uploaded before you attempt to move it. If you can open the file for exclusive access that's not a problem. If not, you might have to retain a list of files in the directory along with their sizes and only move them once the size hasn't changed for a minute or two.
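A sketch of that watcher, assuming filenames like bobby2006-05-08.txt land in a flat incoming directory ('incoming' and 'users' are placeholder paths), using the fallback size-stability test described above:

use strict;
use warnings;
use File::Copy qw(move);

my $incoming = 'incoming';
my $outroot  = 'users';
my %last_size;                          # filename => size seen on last pass

while (1) {
    opendir my $dh, $incoming or die "Cannot open '$incoming': $!";
    while ( defined( my $name = readdir $dh ) ) {
        my $path = "$incoming/$name";
        next unless -f $path;
        my $size = -s $path;
        if ( defined $last_size{$name} && $last_size{$name} == $size ) {
            # Size unchanged since last pass: assume the upload is complete.
            if ( my ($user) = $name =~ /^(\D+)\d{4}-\d\d-\d\d\.txt$/ ) {
                mkdir "$outroot/$user" unless -d "$outroot/$user";
                move( $path, "$outroot/$user/$name" )
                    or warn "Cannot move '$name': $!";
                delete $last_size{$name};
            }
        }
        else {
            $last_size{$name} = $size;  # not stable yet; check next pass
        }
    }
    closedir $dh;
    sleep 60;                           # one pass per minute
}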
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: efficient way to find most recently modified files?
by TedPride (Priest) on May 08, 2006 at 21:41 UTC
Just upload the files to the proper user directory as they come in. Stuffing everything into one directory is messy and decreases file system efficiency for large numbers of files (since it has to search among u×x files rather than first u directories and then x files). Plus, depending on your file system, you may not be able to handle more than a certain number of files per directory anyway.
Or if you have to work in batch, just move the files from /userdate.txt to /user/date.txt.