Re: efficient way to find most recently modified files?
by davido (Cardinal) on May 08, 2006 at 06:19 UTC
Just keep a separate database that tracks the most recent upload per user. Whenever bobby uploads a file, its filename replaces whatever filename was in bobby's database row previously. And whenever someone accesses bobby's page, the database is checked to see which file was most recently uploaded by bobby.
It's not a good solution to sort a list only to obtain the single newest. It's a better solution to either keep track of which one is newest, or do a linear search for the newest. In this particular case, I think you're just better off keeping track from the outset so you never have to search through 500,000 files.
For simple database solutions, you could have a look at DBD::SQLite, and of course, DBI. Or perhaps your web server provider already has some other database installed that they'll let you use too.
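By way of illustration, here is a minimal sketch of that idea with DBI and DBD::SQLite. The database file, table, and column names are placeholders of my own, not anything from your setup:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:SQLite:dbname=uploads.db', '', '',
    { RaiseError => 1, AutoCommit => 1 } );

$dbh->do(<<'SQL');
CREATE TABLE IF NOT EXISTS latest_upload (
    user     TEXT PRIMARY KEY,
    filename TEXT NOT NULL
)
SQL

# Call this from the upload script: one row per user, newest file wins.
sub record_upload {
    my ( $user, $filename ) = @_;
    $dbh->do(
        'INSERT OR REPLACE INTO latest_upload (user, filename) VALUES (?, ?)',
        undef, $user, $filename );
}

# Call this when rendering a user's page.
sub latest_for {
    my ($user) = @_;
    my ($filename) = $dbh->selectrow_array(
        'SELECT filename FROM latest_upload WHERE user = ?', undef, $user );
    return $filename;
}

record_upload( 'bobby', 'bobby2006-05-08.txt' );
print latest_for('bobby'), "\n";    # prints bobby2006-05-08.txt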
Re: efficient way to find most recently modified files?
by Zaxo (Archbishop) on May 08, 2006 at 06:20 UTC
If you're concerned about stuffed directories producing huge in-memory lists or arrays, opendir will give you a handle from which you can read one name at a time. Since you are looking for maximum times, you can keep memory requirements low and independent of population.
Your requirement for a creation time is tougher because it's not a portable statistic. Unix filesystems don't have any such thing. Think about that requirement, though. Is the intent to advertise newest content? Then an edited file ought to be as good as a newly uploaded one. In that case, mtime is what you want, and it is quite portable.
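For example, a sketch of that streaming approach ('somedir' is a placeholder directory): only the current winner is held in memory, so memory use stays constant no matter how many files are in the directory.

use strict;
use warnings;

my $dir = 'somedir';
opendir my $dh, $dir or die "Cannot open '$dir': $!";

my ( $newest_name, $newest_mtime ) = ( undef, -1 );
while ( defined( my $name = readdir $dh ) ) {
    next if $name eq '.' or $name eq '..';
    my $mtime = ( stat "$dir/$name" )[9];    # mtime is field 9 from stat
    next unless defined $mtime;
    if ( $mtime > $newest_mtime ) {
        ( $newest_name, $newest_mtime ) = ( $name, $mtime );
    }
}
closedir $dh;

print "Newest file: $newest_name\n" if defined $newest_name;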
System touch and Perl utime can alter file times, and touch can create empty files. Filesystem time stamps should not be regarded as definitive of anything.
You should take a look at File::Find. It's not the easiest thing to use, but it can handle any problem of this kind.
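A minimal File::Find sketch under the same assumption (placeholder directory name); the callback keeps only the newest mtime seen, and recursion into subdirectories comes for free:

use strict;
use warnings;
use File::Find;

my ( $newest_name, $newest_mtime ) = ( undef, -1 );
find(
    sub {
        return unless -f;               # plain files only
        my $mtime = ( stat _ )[9];      # reuse the stat buffer from -f
        if ( $mtime > $newest_mtime ) {
            ( $newest_name, $newest_mtime ) = ( $File::Find::name, $mtime );
        }
    },
    'somedir'
);
print "Newest file: $newest_name\n" if defined $newest_name;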
Update: ++davido's solution is best if you know that all uploads come in through your script. With shell or ftp access in the mix, you'd need to do a lot of maintenance in cron jobs.
Re: efficient way to find most recently modified files?
by jonadab (Parson) on May 08, 2006 at 11:35 UTC
Hello, just wondering: if you read a directory with, let's say, 500,000+ files... what happens? Will it take forever?
How long it takes depends somewhat on the filesystem. If it's smbfs, for instance, it'll take longer than you like.
One thing you could do to make it more efficient is to put each user's files in their own directory, so that you'd have bobby/2006-05-08.txt instead of bobby2006-05-08.txt. However, if looking for the latest file for a certain user is a frequent operation, then you should do as davido suggests and keep track of that explicitly. (You may not need a database for that, though; for instance, you could just have the uploader make susan-latest.txt a symlink to susan's last uploaded file, as in the sketch below. If the filesystem you are using does not support symbolic links, then you could just store the filename of george's latest file in george-latest-upload-filename.txt or somesuch.)
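A minimal sketch of the symlink idea, assuming the upload handler knows the name of the file it just wrote (the filenames here are just the examples above):

use strict;
use warnings;

my $uploaded = 'susan2006-05-08.txt';    # the file that was just uploaded
my $link     = 'susan-latest.txt';       # stable name that always points at it

unlink $link;                            # drop any previous "latest" link
symlink $uploaded, $link
    or die "Cannot symlink '$link' -> '$uploaded': $!";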
Sanity? Oh, yeah, I've got all kinds of sanity. In fact, I've developed whole new kinds of sanity. Why, I've got so much sanity it's driving me crazy.
Re: efficient way to find most recently modified files?
by BrowserUk (Patriarch) on May 08, 2006 at 14:14 UTC
Whatever filesystem you are on, storing 500,000+ files in a single directory is a really bad idea if you are regularly going to need access to some small subset of those files from short runtime scripts (eg. cgi scripts).
Given that you have a nice regular format for the filenames, it would be much better to segregate the files into smaller subsets by using some part of that filename as a directory name. Example: You might use
.\bobby\bobby2006-05-08.txt
or
.\2006\05\08\bobby2006-05-08.txt
Which of these schemes makes the most sense for your application will depend upon the details of the usage patterns, but the basic idea is to allow you to reduce the search space by going directly to some subset of files quickly.
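For instance, a sketch of the second, date-based scheme, assuming the filename format from the question (File::Path and File::Copy are core modules; the source and target paths here are placeholders):

use strict;
use warnings;
use File::Path qw(mkpath);
use File::Copy qw(move);

my $file = 'bobby2006-05-08.txt';
if ( my ( $user, $y, $m, $d ) = $file =~ /^(\D+)(\d{4})-(\d\d)-(\d\d)\.txt$/ ) {
    my $dir = "$y/$m/$d";               # or just "$user" for the first scheme
    mkpath $dir unless -d $dir;         # create 2006/05/08 as needed
    move( $file, "$dir/$file" ) or die "Cannot move '$file': $!";
}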
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: efficient way to find most recently modified files?
by mildside (Friar) on May 08, 2006 at 05:23 UTC
Please don't be offended, but is this homework? Since at this stage I suspect it might be, I won't go into too much detail, but I suggest you look at glob and sort to start with.
mildside
lol, not homework... I just really want to do this for my little "member blog" site.
Just building stuff out of curiosity, which is how I sorta learn.
OK, this is untested, but try something like:
my @files_sorted = sort { substr($a,-14,10) cmp substr($b,-14,10) } (glob('yourpath/bobby*.txt'));
my $latest_file = $files_sorted[-1];
This assumes that all '.txt' files have the name format you mention. Cheers
mildside
Re: efficient way to find most recently modified files?
by jwkrahn (Abbot) on May 08, 2006 at 11:08 UTC
Assuming that you only want files with the date embedded in the name you should be able to do something like this:
my %file = ( name => 'name', date => '0000-00-00' );
my $dir = 'somedir';
opendir DIR, $dir or die "Cannot open '$dir' $!";
while ( defined( my $file = readdir DIR ) ) {
    next unless $file =~ /(\d{4}-\d\d-\d\d)\.txt$/;
    if ( $1 gt $file{ date } ) {
        @file{ 'name', 'date' } = ( $file, $1 );
    }
}
closedir DIR;
print "Newest file is $file{name}.\n" if $file{ date } ne '0000-00-00';
Re: efficient way to find most recently modified files?
by Anonymous Monk on May 08, 2006 at 15:59 UTC
Hello,
I read some of the suggestions. thanks for all the replies!
For davido's suggestion: it's not possible to always keep track of the most recently uploaded blog, since I intend in the future for some of the users to upload via FTP and not just through the web. Sorry for not mentioning this in my first post.
And as for sub-directories for each member, that solution can work, but it would add an extra step for the uploader. It's the uploader's job to download each text file from about 5-10 users every 1-2 days and then upload all the text files to a directory. If I made it a sub-directory, the uploader would always have to navigate to /bobby/, /amanda/, etc. and upload each text file individually, as opposed to doing the job in one step by uploading all 5-10 text files to one directory.
As for the 500,000+... that's just my curiosity. The application I'm building now will probably never reach the point where it reads a directory with 500,000+ files, but if it ever does, I can keep these suggestions in mind.
Thanks!
nick
And for sub directories for each member, that solution can work but will make an extra step for the uploader.
A simple solution to that would be to have the files uploaded to a different directory and have a background script that monitors the upload directory and moves the files into the appropriate directory structure as they appear.
The only complication is ensuring that the file is fully uploaded before you attempt to move it. If you can open the file for exclusive access that's not a problem. If not, you might have to retain a list of files in the directory along with their sizes and only move them once the size hasn't changed for a minute or two.
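A sketch of that watcher, assuming filenames like bobby2006-05-08.txt land in a flat incoming directory ('incoming' and 'users' are placeholder paths), using the fallback size-stability test described above:

use strict;
use warnings;
use File::Copy qw(move);

my $incoming = 'incoming';
my $outroot  = 'users';
my %last_size;                          # filename => size seen on last pass

while (1) {
    opendir my $dh, $incoming or die "Cannot open '$incoming': $!";
    while ( defined( my $name = readdir $dh ) ) {
        my $path = "$incoming/$name";
        next unless -f $path;
        my $size = -s $path;
        if ( defined $last_size{$name} && $last_size{$name} == $size ) {
            # Size unchanged since last pass: assume the upload is complete.
            if ( my ($user) = $name =~ /^(\D+)\d{4}-\d\d-\d\d\.txt$/ ) {
                mkdir "$outroot/$user" unless -d "$outroot/$user";
                move( $path, "$outroot/$user/$name" )
                    or warn "Cannot move '$name': $!";
                delete $last_size{$name};
            }
        }
        else {
            $last_size{$name} = $size;  # not stable yet; check next pass
        }
    }
    closedir $dh;
    sleep 60;                           # one pass per minute
}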
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: efficient way to find most recently modified files?
by TedPride (Priest) on May 08, 2006 at 21:41 UTC
Just upload the files to the proper user directory as they come in. Stuffing everything into one directory is messy and decreases file system efficiency for large numbers of files (since it has to search among u×x files rather than first u directories and then x files). Plus, depending on your file system, you may not be able to handle more than a certain number of files per directory anyway.
Or if you have to work in batch, just move the files from /userdate.txt to /user/date.txt.