Hi,

My guess is that the files are multimedia of some sort. If this is a search of the binary data that will occur on a reoccurring basis, you might as well go with MySQL or some other DBMS.

As this appears to be regressive search pattern, you will have to determine how many iterations of the 4 bytes you are going to store in the database. Preferrably you will want enough iterations stored for the file record to be as unique as possible. There really isn't an answer that any of us here can give you as to how many iterations... as it will depend highly on the files and the search patterns you are generating with the application.

In DBMS terms, you are looking to create indexes of binary files. NONE of the DBMSs do this directly themselves (text files and limited other documents .. yes) so you will have to spend some time and design the 'homemade' index.. naturally, you will have an index on this index to avoid table scanning. My recommendation is to store the 4 byte chunks (nibble) as characters rather than actual byte(4) as you will be able to create useful indexes on the columns.

You will have to make a trade off... whether you want it to be painful (slow) to add another binary file (scan the existing table entries - not files - for duplicate nibbles.. if duplicate entry exists, add another nibble to the search pattern and retest the new against the 'duplicate'.),
or painful on the main search (all entries will have say '10' nibbles stored in the dbms - scanning two entries that are duplicates would require reading of the two files in question - more i/o on the read)

generally if your users are going to be primarily reading the documents and not adding new ones, then make the insert more i/o intensive - you will save time on the retrieval.

generally if your users are going to be primary using it for storing, say video files, then a hard coded nibble count is probably best.

you will, of course, test both methods (and probably others) to determine which is best... or you could make a hybrid system but design, testing and maintenance is likely going to be much higher as the complexity increases.

You may want to consider using one of the AI perl modules or similar to attempt to find recognizable patterns that your application generates...

Storing the files (blobs) in the database is generally a bad idea as it neither MySQL or any of the other DBMSs have ever been very good at processing and storing large BLOBS with any reasonable amount of performance hit in comparison of storing them on the filesystem itself. Small blobs are okay, but remember that the larger the data row (or text/image datatype if you are used to Sybase) the more physical and logical i/o is required. You will always want to avoid unnecessary i/o (logical or physical).

If these are multimedia files, you might be able to get away with simply storing the metadata (name, video codec used, audio codec used, author, etc.)... it probably would be a good idea even if you don't plan on searching on the meta data just yet... that way you won't have to recreate the table definitions, etc.

Jason L. Froebe

No one has seen what you have seen, and until that happens, we're all going to think that you're nuts. - Jack O'Neil, Stargate SG-1


In reply to Re: Huge Table of BLOBs or Binary Flat-File Database? by jfroebe
in thread Huge Table of BLOBs or Binary Flat-File Database? by rjahrman

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.