rjahrman has asked for the wisdom of the Perl Monks concerning the following question:

I need to store a whole lot of binary data . . . possibly 100GB or more spread across 100,000+ files. I need to be able to search through one of the files in such a way that it reads 4 bytes, checks for a match, takes the next 4 bytes if a match is found, or otherwise skips 4 bytes and repeats. I figure that I have two options for storing this data:
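
Roughly, the per-file loop I have in mind looks something like this (just a sketch; the file path and the 4-byte value being matched are placeholders):

    use strict;
    use warnings;

    my $file   = 'data/item42.bin';        # placeholder path
    my $target = pack 'N', 0xDEADBEEF;     # placeholder 4-byte value to match

    open my $fh, '<:raw', $file or die "open $file: $!";
    while (read($fh, my $chunk, 4) == 4) {
        if ($chunk eq $target) {
            # match: take the next 4 bytes and use them
            last unless read($fh, my $next, 4) == 4;
            # ... do something with $next ...
        }
        else {
            # no match: skip 4 bytes and repeat
            last unless seek $fh, 4, 1;    # 1 == SEEK_CUR
        }
    }
    close $fh;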

1) Use binary files. I could give each set of binary data its own file and put all 100K of the files in one directory. I would open a file and read in the data n bytes at a time. The benefit is that I would only have to read as far into the file as I need to, but my filesystem would crap out with that many files in one directory. To get around this I could make 256 or so directories and spread the files across them.
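
For the fan-out, I'm picturing something like this, bucketing on a hash of the ID (again just a sketch; the 'data' base directory and the ID-as-filename convention are my own assumptions):

    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);
    use File::Path  qw(mkpath);

    # Bucket each ID's file into one of 256 subdirectories, keyed by the
    # first two hex digits of an MD5 of the ID.
    sub path_for_id {
        my ($id) = @_;
        my $bucket = substr md5_hex($id), 0, 2;   # '00' .. 'ff'
        mkpath("data/$bucket") unless -d "data/$bucket";
        return "data/$bucket/$id.bin";
    }

    # e.g. path_for_id('item42') returns something like 'data/3f/item42.bin'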

2) Use a MySQL database. I could make a table with two columns. The first would be the identifier (what would otherwise have been the filename), and the second would be the data that would have gone in the file. The ID would be a primary key, so lookups would be fast. I know that MySQL can handle 100K rows. It also wouldn't waste the leftover space that filesystem blocks do (e.g., every file's size rounded up to the nearest 4KB). However, even if the script or program is on the same computer, wouldn't it be slowed down by having to pull a BLOB of up to 100MB into RAM all at once, even though it would often only need the first kilobyte (or less!) of it?
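
This is roughly the layout I'm picturing (table and column names are made up). I know I could pull just a prefix with SUBSTRING() instead of the whole column, but I'm not sure how much that really saves:

    use strict;
    use warnings;
    use DBI;

    # Connection parameters and table/column names are made up.
    my $dbh = DBI->connect('dbi:mysql:database=bindata', 'user', 'password',
                           { RaiseError => 1 });

    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS blobs (
            id   VARCHAR(64) NOT NULL PRIMARY KEY,  -- what would otherwise be the filename
            data LONGBLOB    NOT NULL
        )
    });

    # Pull only the first 1KB of one row's BLOB rather than the whole thing.
    my ($prefix) = $dbh->selectrow_array(
        'SELECT SUBSTRING(data, 1, 1024) FROM blobs WHERE id = ?',
        undef, 'item42');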

Any suggestions? BTW, I may or may not do this in Perl. (Opinions on that? I assume C would be faster...) I just posted here because I couldn't think of a good MySQL forum with enough traffic to get an answer to this type of question.

Thanks in advance!


Re: Huge Table of BLOBs or Binary Flat-File Database?
by jfroebe (Parson) on Jun 14, 2004 at 04:03 UTC

    Hi,

    My guess is that the files are multimedia of some sort. If this is a search of the binary data that will occur on a recurring basis, you might as well go with MySQL or some other DBMS.

    As this appears to be a regressive search pattern, you will have to determine how many iterations of the 4-byte chunks you are going to store in the database. Preferably you will want enough iterations stored for each file's record to be as unique as possible. There really isn't an answer any of us here can give you as to how many iterations... it will depend highly on the files and the search patterns your application generates.

    In DBMS terms, you are looking to create indexes of binary files. NONE of the DBMSs do this directly themselves (text files and a limited set of other document types, yes), so you will have to spend some time designing the 'homemade' index. Naturally, you will put an index on this index to avoid table scans. My recommendation is to store the 4-byte chunks (nibbles) as characters rather than as raw 4-byte binary values, since you will be able to create useful indexes on those columns.
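
    Something along these lines is what I mean (a sketch only; table and column names are invented, and the number of chunks per file is up to you):

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:mysql:database=bindata', 'user', 'password',
                               { RaiseError => 1 });

        # The 'homemade' index: each 4-byte chunk stored hex-encoded in an
        # indexed CHAR column that points back at the file it came from.
        $dbh->do(q{
            CREATE TABLE IF NOT EXISTS chunk_index (
                chunk_hex CHAR(8)      NOT NULL,   -- 4 bytes as 8 hex characters
                position  INT UNSIGNED NOT NULL,   -- which iteration of the chunk
                file_id   VARCHAR(64)  NOT NULL,   -- identifier (or path) of the file
                INDEX (chunk_hex, position)
            )
        });

        # Indexing one file: record its first N 4-byte chunks.
        my $ins = $dbh->prepare(
            'INSERT INTO chunk_index (chunk_hex, position, file_id) VALUES (?, ?, ?)');
        my $raw    = "\xDE\xAD\xBE\xEF\x01\x02\x03\x04";   # stand-in for real file data
        my @chunks = unpack '(H8)*', $raw;                 # hex-encode 4 bytes at a time
        $ins->execute($chunks[$_], $_, 'item42') for 0 .. $#chunks;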

    You will have to make a trade-off: either adding another binary file is painful (slow), because you scan the existing table entries (not the files) for duplicate nibbles and, if a duplicate exists, add another nibble to the search pattern and retest the new entry against the 'duplicate'; or the main search is painful, because all entries have, say, 10 nibbles stored in the DBMS, and scanning two entries that turn out to be duplicates requires reading the two files in question (more I/O on the read).

    Generally, if your users are going to be primarily reading the documents and not adding new ones, make the insert more I/O-intensive - you will save time on retrieval.

    Generally, if your users are going to be primarily using it for storing, say, video files, then a hard-coded nibble count is probably best.

    You will, of course, test both methods (and probably others) to determine which is best... or you could build a hybrid system, but the design, testing, and maintenance effort is likely to be much higher as the complexity increases.

    You may want to consider using one of the AI Perl modules or similar to attempt to find recognizable patterns in what your application generates...

    Storing the files (BLOBs) in the database is generally a bad idea: neither MySQL nor any of the other DBMSs has ever been very good at processing and storing large BLOBs with performance anywhere near that of storing them on the filesystem itself. Small BLOBs are okay, but remember that the larger the data row (or text/image datatype, if you are used to Sybase), the more physical and logical I/O is required. You will always want to avoid unnecessary I/O (logical or physical).

    If these are multimedia files, you might be able to get away with simply storing the metadata (name, video codec used, audio codec used, author, etc.)... it would probably be a good idea even if you don't plan on searching on the metadata just yet... that way you won't have to recreate the table definitions later.
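
    For example (column names are only illustrative, and the files themselves stay on the filesystem):

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:mysql:database=bindata', 'user', 'password',
                               { RaiseError => 1 });

        # Metadata only; the media files live on disk at 'path'.
        $dbh->do(q{
            CREATE TABLE IF NOT EXISTS file_meta (
                file_id     VARCHAR(64)  NOT NULL PRIMARY KEY,
                name        VARCHAR(255) NOT NULL,
                video_codec VARCHAR(32),
                audio_codec VARCHAR(32),
                author      VARCHAR(255),
                path        VARCHAR(255) NOT NULL   -- where the file lives on disk
            )
        });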

    Jason L. Froebe

    No one has seen what you have seen, and until that happens, we're all going to think that you're nuts. - Jack O'Neil, Stargate SG-1

Re: Huge Table of BLOBs or Binary Flat-File Database?
by graff (Chancellor) on Jun 14, 2004 at 05:07 UTC
    I'm with jfroebe on the point about avoiding BLOBs -- there are better ways.

    Whether you end up using Perl or C should hinge on how comfortable you are with using pack/unpack in Perl vs. using ints or structs in C.
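
    In Perl, reading a file 4 bytes at a time and treating each chunk as an unsigned int is about this much code (a sketch; the file name is a placeholder, and big-endian data is assumed):

        use strict;
        use warnings;

        # 'some.bin' is a placeholder; 'N' assumes big-endian 32-bit values
        # (use 'V' for little-endian).
        open my $fh, '<:raw', 'some.bin' or die "open: $!";
        while (my $n = read($fh, my $chunk, 4)) {
            last if $n < 4;                    # ignore a trailing partial chunk
            my $value = unpack 'N', $chunk;    # same bits a C uint32_t would hold
            # ... compare $value against the search key here ...
        }
        close $fh;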

    jfroebe seemed to be familiar with what you might be doing, but I was puzzled by your description:

    I need to be able to search through one of the files in such a way that it reads 4 bytes, checks for a match, takes the next 4 bytes if a match is found, or otherwise skips 4 bytes and repeats.

    Well, whatever the nature of the search task, I think the better approach to indexing and searching the file data would be something like this:

    • partition the 100K files into a sensible directory tree, to strike a good balance between tree depth and number of entries per directory (e.g. 40 directories, each with 50 subdirectories, each with 50 files, or maybe even something that represents "semantic" differences in file content, such as date, source or whatever).
    • use a database (SQL-based RDBMS or even any sort of DBM file approach) to store tuples of "index_data, path/filename", where the index data (the search/key value) is the minimum needed to uniquely identify the content of each file (see the sketch after this list).
    This means most of the effort goes into building the index data, but you only do that once, and from then on, the actual searches have a lot less reading to do to reach their targets.
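
    As a simplified sketch of that second point, a plain DBM file mapping key values to paths would do. How the key is derived from a file's contents is the real work; here it is just the file's first 16 bytes, hex-encoded, as a placeholder, and 'data' stands in for the top of your directory tree:

        use strict;
        use warnings;
        use Fcntl;
        use DB_File;
        use File::Find;

        # Map each file's key value to its path in a DBM file.
        tie my %index, 'DB_File', 'file_index.db', O_CREAT|O_RDWR, 0644, $DB_HASH
            or die "tie: $!";

        find(sub {
            return unless -f $_;
            open my $fh, '<:raw', $_ or return;
            read $fh, my $key, 16;             # placeholder key extraction
            close $fh;
            $index{ unpack 'H*', $key } = $File::Find::name;
        }, 'data');                            # 'data' is the top of the tree

        untie %index;

        # A later lookup is then just:  my $path = $index{$some_hex_key};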

    Of course, if any of the files differ only slightly in their contents -- and only near the end -- you may need to divide the search into "stages". You would group files into directories based on similarity, so that the first one, two or three stages of the search/match process serve to select the proper directory path, and the final stage of searching only has to worry about a small number of files.

    I can't make any more detailed suggestions because I don't understand the task well enough. In fact, I'm wondering if my suggestions are completely off the mark.

Re: Huge Table of BLOBs or Binary Flat-File Database?
by jZed (Prior) on Jun 14, 2004 at 03:55 UTC
    How about creating a three-column table - with the key (filename) in the first column, the first 1K or so of the blob in the second column, and the full blob in the third column? Seems like you could arrange it so that you very rarely, if ever, had to search the full blob. Then you could arrange your queries so you never fetch more than x rows at a time (where x is however many blobs of that size your RAM can manage).
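
    Something like this, say (names invented; the 1KB prefix size is arbitrary):

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:mysql:database=bindata', 'user', 'password',
                               { RaiseError => 1 });

        $dbh->do(q{
            CREATE TABLE IF NOT EXISTS file_blobs (
                filename  VARCHAR(255) NOT NULL PRIMARY KEY,
                head_1k   BLOB         NOT NULL,   -- first 1KB of the data
                full_blob LONGBLOB     NOT NULL
            )
        });

        # Store: peel the first 1KB off into the small column.
        my $ins = $dbh->prepare(
            'INSERT INTO file_blobs (filename, head_1k, full_blob) VALUES (?, ?, ?)');
        # $ins->execute($name, substr($data, 0, 1024), $data);

        # Search against head_1k first; only pull full_blob for rows that survive.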