Re: Huge Table of BLOBs or Binary Flat-File Database?

I'm with jfroebe on avoiding BLOBs -- there are better ways.

Whether you end up using Perl or C should hinge on how comfortable you are with using "pack/unpack" vs. using ints or structs in C.

jfroebe seemed to be familiar with what you might be doing, but I was puzzled by your description:

"I need to be able to search through one of the files in such a way that it takes 4 bytes, checks for a match, take the next 4 bytes if it's found it, or if not skip 4 bytes and repeat."
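
If I'm reading that right (and I may not be), each file is a run of 8-byte records: a 4-byte key followed by a 4-byte value. You compare the key, take the value on a match, and otherwise skip the value and move on to the next record. Here is a rough C sketch of that reading. The record layout and the name find_value are my guesses, not anything from your post:

```c
#include <stddef.h>
#include <string.h>

/* ASSUMED layout: a run of 8-byte records, each one a 4-byte key
   followed by a 4-byte value. This is a guess at the file format. */
enum { KEY_LEN = 4, VAL_LEN = 4, REC_LEN = KEY_LEN + VAL_LEN };

/* Scan buf for the first record whose key equals `key`.
   Returns a pointer to that record's 4-byte value, or NULL if no key matches.
   A trailing partial record (fewer than 8 bytes) is ignored. */
const unsigned char *find_value(const unsigned char *buf, size_t len,
                                const unsigned char key[KEY_LEN])
{
    for (size_t off = 0; off + REC_LEN <= len; off += REC_LEN) {
        if (memcmp(buf + off, key, KEY_LEN) == 0)
            return buf + off + KEY_LEN;   /* match: take the next 4 bytes */
        /* no match: the loop step skips this record's value too */
    }
    return NULL;
}
```

Comparing raw bytes with memcmp sidesteps byte-order questions. If the keys are really integers, you would read them into a uint32_t instead and need to know the file's endianness.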

Well, whatever the nature of the search task, I think the better approach to indexing and searching the file data would be something like this:

1. Partition the 100K files into a sensible directory tree, to strike a good balance between tree depth and number of entries per directory (e.g. 40 directories, each with 50 subdirectories, each with 50 files, or maybe even something that represents "semantic" differences in file content, such as date, source or whatever).
2. Use a database (SQL-based RDBMS or even any sort of DBM file approach) to store tuples of "index_data, path/filename", where the index data (the search/key value) is the minimum needed to uniquely identify the content of each file.
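
To make the 40 x 50 x 50 layout concrete, here is a minimal C sketch that maps a file number to a relative path, which is the string you would store next to the key in the index. It assumes the files can be numbered 0..99999; the name file_path, the two-digit directory names and the ".bin" extension are all made up for illustration:

```c
#include <stdio.h>

/* 40 top-level dirs * 50 subdirs * 50 files = 100,000 files */
enum { TOP_DIRS = 40, SUB_DIRS = 50, FILES_PER_DIR = 50 };

/* Map a file number in 0..99999 to a path such as "07/31/12.bin".
   The naming scheme is hypothetical. Returns 0 on success, or -1 if
   n is out of range or the output buffer is too small. */
int file_path(unsigned n, char *out, size_t outlen)
{
    if (n >= (unsigned)TOP_DIRS * SUB_DIRS * FILES_PER_DIR)
        return -1;

    unsigned file = n % FILES_PER_DIR;               /* position within the subdir */
    unsigned sub  = (n / FILES_PER_DIR) % SUB_DIRS;  /* subdir within the top dir */
    unsigned top  = n / (FILES_PER_DIR * SUB_DIRS);  /* top-level dir */

    int w = snprintf(out, outlen, "%02u/%02u/%02u.bin", top, sub, file);
    return (w > 0 && (size_t)w < outlen) ? 0 : -1;
}
```

A purely numeric split like this only balances directory sizes. If you go the "semantic" route instead, the path components would come from the date, source or whatever, and the function above would not apply.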

This means most of the effort goes into building the index data, but you only do that once, and from then on, the actual searches have a lot less reading to do to reach their targets.

Of course, if any of the files differ only slightly in their contents -- and only near the end -- you may need to divide the search into "stages". You would group files into directories based on similarity, so that the first one, two or three stages of the search/match process select the proper directory path, and the final stage only has to deal with a small number of files.

I can't make any more detailed suggestions because I don't understand the task well enough. In fact, I'm wondering if my suggestions are completely off the mark.
