Designing storage of uploaded files

hacker has asked for the wisdom of the Perl Monks concerning the following question:

Hello again fellow monks...

One of the phases of the portal system I've been working on is designed to let users upload files (Palm documents, ebooks in Palm format, etc.) so that other users can download and use them. I have the file upload portion of my script working at a very rudimentary level, and it looks like this:

use strict;
use Env;
use CGI qw(:standard);
use Digest::MD5  qw(md5 md5_hex md5_base64);
use POSIX qw(strftime);
use Date::Manip;

$ENV{'PATH'} = '/usr/bin:/bin:';   

my $query       = CGI->new;
my $modtime     = scalar(gmtime(time - (3600 * 8))) . " GMT";
my $exptime     = scalar(gmtime) . " GMT";

######################################################
#
# Form stuff asking for the upload, title, filename, 
# author, and other relevant file-specific data here
#
######################################################

# ...

sub print_results {
        my $query = shift;
        my ($length, $filename, $filetype);
        my $directory = "/tmp/palm";
        my $file_name = $query->upload('pl_upload');

        if (!$file_name) {
                print "No file received..\n";
                return;
        }

        $file_name =~ s/.*[\/\\](.*)/$1/;
        print h3("File Name"), $file_name;
        my $md5file = md5_hex($file_name);
        open(SAVEPDB,">$directory/${file_name}_${md5file}")
             or die $!;

        while (<$file_name>) {
                print SAVEPDB $_;
                $length += length($_);
        }
        close SAVEPDB;

        # Print file data here, size, title, etc.
}

From here, I simply print the results of the file's size, type, title, and other form elements entered for diagnostics. This part works perfectly.

Thanks to help and suggestions from tye, ChemBoy, and ferrency earlier today, I will be replacing my current system() call with a newer construct built on IPC::Open2 to retrieve the compression type stored inside the binary file itself (DOC or zlib).

Note that I'm saving the file as an md5sum'd filename, to avoid collisions with duplicate documents being uploaded, etc.

Here's the rub: I need a workable design that lets thousands of users upload files and content in this format to the server, where the uploads will sit in an approval queue before being made "live" on the site for others to download.

What is the best approach to doing this? Blob them in a MySQL database? Store on the filesystem? Both? And in doing so, how do I track which filename belongs with which "actual" file, so that when the file(s) are listed on a webpage for download, the title is something human readable, not 'e3206099b8ad73408762ab0ea5e8f1f2'?

I've never done something like this before (tracking, storing persistent files/file data), so I'm a bit green. I'd eventually like the whole process of approval to be web-based, but for now I can deal with some manual intervention at the filesystem or database level. My concerns are:

Filename persistence and reducing collisions with duplicate filenames
Allowing users to upload "newer" versions of the same file, such as an "updated" version, which will overwrite/supersede the existing one they may have previously uploaded
I/O on the disk side of things: if the directory has 400,000 files in it, I'd still like to be able to respond within a reasonable timespan
Separation of approved/non-approved data (different directories? a boolean flag in the database?)
Security of the directories, without having to copy files into and out of private and public locations

Each file uploaded will have several bits of information associated with it, such as:

Date of submittal
REMOTE_ADDR (ip)
REMOTE_HOST (hostname)
USER_AGENT (remote browser, help to define OS)
Document title
Submitter's name
Copyright (bool, yes/no)
Image depth (0bpp..16bpp)
Category (categories provided by my form)
Description
Filename

To that end, I've come up with the following basic schema for a table to hold this data:

CREATE TABLE sample_uploads (
  sample_id tinyint(4) NOT NULL default '0',
  sample_submit_date datetime NOT NULL default '0000-00-00 00:00:00',
  sample_remote_addr text NOT NULL,
  sample_remote_host text NOT NULL,
  sample_remote_ua text NOT NULL,
  sample_db_title text NOT NULL,
  sample_user_name text NOT NULL,
  sample_copyright tinyint(4) NOT NULL default '0',
  sample_image_depth tinyint(4) NOT NULL default '0',
  sample_category tinyint(4) NOT NULL default '0',
  sample_description text NOT NULL,
  sample_filename text NOT NULL,
  PRIMARY KEY (sample_id)
) TYPE=MyISAM;

Does this approach hold water? The only bits I think I'm missing are what to do with the file(s) sent, how to access them, and how to make sure the user sees "human" content (filenames, titles), while the system sees "protected" (md5sum) content.

Constructive ideas and architecture approaches are welcome. Thanks.


Replies are listed 'Best First'.
(jeffa) Re: Designing storage of uploaded files by jeffa (Bishop) on Aug 20, 2002 at 03:06 UTC
With thousands of potential users uploading files to a common area, you will run into collisions. Why not allow them to have their own directory (named after their user name) and store their files there? Personally, i don't like BLOBs - just use the filesystem. I would adopt the system that PAUSE uses for colliding file names as well:

"Please, make sure your filename contains a version number. For security reasons you will never be able to upload a file with the same name again (not even after deleting it)."

Sounds a lot easier than trying to juggle cats, which is what you might end up doing with the system you propose.

UPDATE (10 or so hours later) ...

Last night in the CB we discussed this further and you mentioned that your site does not currently let users register for accounts. If it were up to me, i would concentrate on getting user accounts up and running first. Here is why:

First, it's not that hard. maverick, JackHammer, bliz, and myself (under the management of eduardo) built an authentication site (that also offered customizable authorization) in 3 days using nothing but CPAN modules from the Apache::Auth family. It does not take a terribly long time to do. (And to be honest, instead of that, i would consider using a content management tool such as Slash or even PostNuke.)

Second, you mentioned that you plan on adding this functionality at a later point in time. Why not do it now? This is something i keep asking you time and time again. What is going to happen to this system when you do add users to your site? You are going to have to modify your file upload code to accommodate them. Why not just do it now?

I have witnessed your frustration at trying to port your existing CGI apps over to Apache::Registry when kind folks here at PM suggested you go straight to mod_perl. In the time it took you to work out countless bugs (and claim that the tool was broken) you could have first done some research and testing with smaller 'Hello world' type examples and had your site running under a more robust system.

I have witnessed your frustration trying to coax CGI into outputting the HTML you wanted when i myself kept recommending a templating solution. In the time it took you to work out the kinks and intricacies of CGI.pm, you could have done a little homework and been up and running with HTML::Template, a little more and you could already have had your site running with TT2.

I recommend you drop this file upload feature and add users instead. It will be time well spent.

So, is there anything wrong with allowing anonymous users to upload files to your site? Of course not, not with the proper precautions. But you are making more work for yourself in the long run, and you are not practicing the art of True Laziness.

I wish you well and hope your site is a success.

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)
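For what it's worth, the per-user directory plus "never reuse a name" idea could be sketched roughly like this. It is only a sketch under assumptions: `store_once`, the name whitelist, and the temporary root directory are all made up for illustration, and it only refuses names that still exist on disk. Refusing names "even after deleting" the way PAUSE does would need a persistent record of used names, which is not shown.

```perl
use strict;
use warnings;
use Fcntl qw(O_WRONLY O_CREAT O_EXCL);
use File::Path qw(make_path);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: store $data as $root/$user/$name, refusing to
# replace a file that already exists. Returns the path on success, or
# undef if the name is already taken.
sub store_once {
    my ($root, $user, $name, $data) = @_;

    # Assumption: user names and file names are limited to a safe set.
    for my $part ($user, $name) {
        die "unsafe name: $part\n"
            unless $part =~ /\A[A-Za-z0-9][A-Za-z0-9._-]*\z/;
    }

    my $dir = File::Spec->catdir($root, $user);
    make_path($dir);                          # no-op if it already exists
    my $path = File::Spec->catfile($dir, $name);

    # O_EXCL makes the create fail if the file exists, so the existence
    # check and the creation happen as one atomic step.
    my $fh;
    unless (sysopen($fh, $path, O_WRONLY | O_CREAT | O_EXCL, 0644)) {
        return undef if $!{EEXIST};           # name already taken
        die "cannot create $path: $!\n";
    }
    binmode $fh;                              # uploads are binary data
    print {$fh} $data or die "write to $path failed: $!\n";
    close $fh         or die "close of $path failed: $!\n";
    return $path;
}

# Usage: the second upload with the same name is refused.
my $root   = tempdir(CLEANUP => 1);
my $first  = store_once($root, 'hacker', 'guide-1.0.pdb', "first upload");
my $second = store_once($root, 'hacker', 'guide-1.0.pdb', "second upload");
print defined $first  ? "stored $first\n" : "refused\n";
print defined $second ? "stored again\n"  : "refused duplicate name\n";
```

An updated document would then go up as `guide-1.1.pdb` rather than overwriting the old one, which is the point of asking for a version number in the filename.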
Re: Designing storage of uploaded files by thraxil (Prior) on Aug 20, 2002 at 15:00 UTC
security-wise, the lines:

`$file_name =~ s/.*[\/\\](.*)/$1/;`

and

`open(SAVEPDB,">$directory/${file_name}_${md5file}") or die $!;`

concern me. it's a good idea to use taint mode and do something more like:

`if($file_name =~ /(\w+\.?\w+)$/) { $file_name = $1; } else { die "invalid and possibly dangerous characters in filename." }`

where you explicitly limit the characters that can be in the filename.

anders pearson
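A slightly stricter variant of the same whitelist idea, as a sketch only: `safe_basename` is a made-up helper, and the allowed character set and the 64-character limit are assumptions to tune for your own site. It strips any directory part first, then accepts the name only if the whole remainder matches the whitelist. Returning the regex capture is also what untaints the value when the script runs under -T.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a client-supplied name to a safe basename,
# or return undef if nothing acceptable is left.
sub safe_basename {
    my ($raw) = @_;
    return undef unless defined $raw;

    # Drop any directory part, whether the client used / or \ separators.
    (my $base = $raw) =~ s{.*[/\\]}{}s;

    # Whitelist: start with a letter or digit, then letters, digits,
    # underscore, dot or hyphen, 64 characters at most. \A and \z anchor
    # the whole string, so a trailing newline or a stray ';' is rejected.
    return $base =~ /\A([A-Za-z0-9][A-Za-z0-9_.-]{0,63})\z/ ? $1 : undef;
}

# Usage: try a few names a browser (or an attacker) might send.
for my $name ('C:\\docs\\guide.pdb', '../../etc/passwd',
              'guide.pdb; rm -rf /', '.htaccess') {
    my $safe = safe_basename($name);
    printf "%-24s => %s\n", $name, defined $safe ? $safe : '(rejected)';
}
```

Requiring a leading letter or digit keeps out dotfiles and names like `..`, and anchoring both ends means the check fails instead of quietly picking a "good" tail out of a bad name.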
Re: Designing storage of uploaded files by blokhead (Monsignor) on Aug 20, 2002 at 16:36 UTC
Using a hash function to avoid namespace collisions won't work if you just hash the filename. If 'H' is your hash function, H(x) == H(y) if x == y. You will still get the same hash value from identical filenames. To mix things up, perhaps hash the filename concatenated with something non-static like the return value of localtime (`md5_hex($filename . localtime)`). You could also use crypt()'s hash function and add random salt each time a file is uploaded.

Of course, this is all assuming you're intent on using hashes. However, I see no point in doing so. You will not be able to go backwards from the hashed value to the original filename. A lookup works only if the user must enter a filename for your script to retrieve (so you can hash that and look for the hash value in the DB), and even that will only work if you do not add salt to the filename to prevent collisions. So a hash seems silly to me.

I would highly suggest -- especially for a large-scale project like this -- storing the files in SQL blobs instead of in real files. A malicious user can put whatever characters they want in the filename, but you don't need to worry with an SQL implementation. Using the hash is a noble way to create filenames that are "safe", but like I said, the hash will be one way, and it seems like you want the filename back. Plus, you really have to be on your guard when you let CGI scripts write to files and especially create new files.

May I suggest altering your table so that the sample_id field is AUTO_INCREMENT -- let SQL take care of the primary key for you. This way, you can have multiple files with identical names; just refer to them always by their unique id (myscript.pl?file=42). You wouldn't have to try to avoid namespace collisions (unless there were other reasons for doing so).

If you really need a directory structure for these files, create a column sample_is_folder (boolean) and a sample_parent column so you can set up a tree-ish structure. Well, there would obviously be more to it than that, but hopefully you get the idea.

.... Oh yes, and of course add the BLOB/LONGBLOB column for the uploaded files if you choose that route. Good luck!
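To make the hashing point concrete, here is a small sketch. Everything beyond `md5_hex` is an assumption for illustration: `upload_key` is a made-up helper, the choice of time, process id, counter and rand as the "non-static" part is just one option, and the `%original_name_for` hash merely stands in for a database column that maps the stored key back to the human-readable name.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hashing only the name: two uploads called "guide.pdb" get the same key.
my $name  = 'guide.pdb';
my $key_a = md5_hex($name);
my $key_b = md5_hex($name);
print "name only:   ", ($key_a eq $key_b ? "same key" : "different keys"), "\n";

# Hypothetical helper: mix per-upload values into the hash input so that
# identical names produce different keys. The counter guarantees the
# input differs between calls inside one process; time, pid and rand
# make it differ across processes as well.
my $counter = 0;
sub upload_key {
    my ($filename) = @_;
    return md5_hex(join '|', $filename, time, $$, ++$counter, rand);
}

# The key is one-way, so the original name has to be recorded next to it.
# This hash stands in for a table column holding the original filename.
my %original_name_for;
my @keys;
for my $upload ($name, $name) {
    my $key = upload_key($upload);
    $original_name_for{$key} = $upload;
    push @keys, $key;
}
print "with extras: ", ($keys[0] eq $keys[1] ? "same key" : "different keys"), "\n";
print "$_ is shown to users as $original_name_for{$_}\n" for @keys;
```

With an AUTO_INCREMENT id as the primary key, the id itself can play the role of the key and no hashing is needed at all, which is the simpler route suggested above.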

