I'm with jfroebe on the point about avoiding BLOBS -- there are better ways.

Whether you end up using Perl or C should hinge on how comfortable you are with using "pack/unpack" vs. using ints or structs in C.
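To give a feel for the "pack/unpack" side of that choice, here is a minimal sketch in Perl. It assumes the file is nothing but 4-byte unsigned big-endian integers (the "N" template), which is only my guess at the layout; find_record_offsets is a made-up name for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the byte offset of every 4-byte record in
# $path whose value equals $target. ASSUMPTION: records are 32-bit
# unsigned big-endian integers ("N" in pack/unpack terms).
sub find_record_offsets {
    my ($path, $target) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    my @offsets;
    my $offset = 0;
    while (1) {
        my $got = read($fh, my $buf, 4);
        die "Read error on $path: $!" unless defined $got;
        last if $got < 4;    # end of file, or a trailing partial record
        my ($value) = unpack('N', $buf);
        push @offsets, $offset if $value == $target;
        $offset += 4;
    }
    close $fh;
    return @offsets;
}
```

The C equivalent would fread() into a uint32_t and compare directly, so the real question is which of the two idioms you would rather maintain.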

jfroebe seemed to be familiar with what you might be doing, but I was puzzled by your description:

"I need to be able to search through one of the files in such a way that it takes 4 bytes, checks for a match, take the next 4 bytes if it's found it, or if not skip 4 bytes and repeat."

Well, whatever the nature of the search task, I think the better approach to indexing and searching the file data would be something like this:

- Partition the 100K files into a sensible directory tree, to strike a good balance between tree depth and number of entries per directory (e.g. 40 directories, each with 50 subdirectories, each with 50 files, or maybe even something that represents "semantic" differences in file content, such as date, source or whatever).
- Use a database (SQL-based RDBMS or even any sort of DBM file approach) to store tuples of "index_data, path/filename", where the index data (the search/key value) is the minimum needed to uniquely identify the content of each file.

This means most of the effort goes into building the index data, but you only do that once, and from then on, the actual searches have a lot less reading to do to reach their targets.
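A rough sketch of the DBM variant, assuming files are numbered 0 to 99999 and spread over the 40 x 50 x 50 layout mentioned above. The names path_for and open_index, the ".bin" suffix and the numbering scheme are all invented for illustration; SDBM_File is used only because it ships with core Perl, and any DBM module (or an RDBMS table) would serve the same role.

```perl
use strict;
use warnings;
use Fcntl;        # O_RDWR, O_CREAT
use SDBM_File;    # core module; swap in DB_File, GDBM_File, etc. as needed

# Hypothetical layout: 50 files per subdirectory, 50 subdirectories per
# top-level directory, so file 12345 lands in "04/46/12345.bin".
sub path_for {
    my ($n) = @_;
    my $sub = int($n / 50);
    return sprintf('%02d/%02d/%d.bin', int($sub / 50), $sub % 50, $n);
}

# Tie a hash to an on-disk DBM file mapping index_data => path/filename.
# Note that SDBM limits each key+value pair to roughly 1 KB, so the
# index data has to stay small.
sub open_index {
    my ($dbm_file) = @_;
    tie(my %index, 'SDBM_File', $dbm_file, O_RDWR | O_CREAT, 0666)
        or die "Cannot tie $dbm_file: $!";
    return \%index;
}
```

Building the index is then one pass of `$index->{$key} = path_for($n)` over all the files, and a search becomes a single `$index->{$key}` lookup followed by opening the one file it names.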

Of course, if any of the files differ only slightly in their contents -- and only near the end -- you may need to divide the search into "stages". You would group files into directories based on similarity, so that the first one, two or three stages of the search/match process serve to select the proper directory path, and the final stage of searching only has to consider a small number of files.

I can't make any more detailed suggestions because I don't understand the task well enough. In fact, I'm wondering if my suggestions are completely off the mark.

In reply to Re: Huge Table of BLOBs or Binary Flat-File Database? by graff
in thread Huge Table of BLOBs or Binary Flat-File Database? by rjahrman
