When sorting data, there are often multiple 'keys' by which
you want to reference that data. For instance, you might
want to refer to data by the week number it was produced,
but you also want to refer to it by a grouped category, such
as a test pack name.

In these cases, one way to sort out a pile of data is to save
it to a set of files, whose names are the keys to the data
itself.

For instance, one file might be named "rawdata.11.Aspen", which
contains all extracted data for the Aspen test pack for week 11.
The word 'rawdata' would indicate that this data has simply been
sorted into file buckets, and still needs to be worked on.
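
A minimal sketch of that bucketing step might look like this
(the record format, with the week and pack name as the first
two fields, is just an assumption for illustration):

use strict;
use warnings;

# Append each record to the bucket file named by its keys.
# The "week pack data..." input format is made up here.
while (my $line = <STDIN>) {
    chomp $line;
    my ($week, $pack, $data) = split ' ', $line, 3;
    open my $fh, '>>', "rawdata.$week.$pack"
        or die "can't append to rawdata.$week.$pack: $!";
    print {$fh} "$data\n";
    close $fh;
}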

There are lots of benefits to sorting this way. The main one is
that the file system becomes your hashtable.

For example, suppose you have sorted all your data into
these raw files. To get the files for every test pack run in
week 11, all you have to do is:

my @week11 = <rawdata.11.*>;

Or, if you want all of the Aspen test data for all weeks:

my @aspen = <rawdata.*.Aspen>;

If you want to process your raw data files, there's no reason
you can't process them as they come, so:

process($_) foreach <rawdata.*>;

And your &process() subroutine can create processed-data
files, such as processed.Aspen.11 and totals.Aspen.11, and
so on for each test pack and each week. Then a &total()
subroutine can read in <totals.*> and sum up the totals.
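
A rough sketch of those two subroutines (only the file-naming
scheme comes from above; the "processing" here is just a line
count standing in for the real work):

use strict;
use warnings;

# One pass per raw bucket: count its records and write a
# matching totals file.
sub process {
    my ($raw) = @_;
    my (undef, $week, $pack) = split /\./, $raw;
    open my $in, '<', $raw or die "can't read $raw: $!";
    my $count = 0;
    $count++ while <$in>;
    close $in;
    # (a real version would also write processed.$pack.$week)
    open my $out, '>', "totals.$pack.$week"
        or die "can't write totals.$pack.$week: $!";
    print {$out} "$count\n";
    close $out;
}

# Sum every per-bucket total into one grand total.
sub total {
    my $sum = 0;
    for my $file (<totals.*>) {
        open my $in, '<', $file or die "can't read $file: $!";
        chomp(my $n = <$in>);
        $sum += $n;
        close $in;
    }
    return $sum;
}

process($_) foreach <rawdata.*>;
print total(), " records in all\n";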


Why bother using the filesystem? Why not use hash tables
and arrays?

Use hashtables when your key is simple or easy to fabricate,
or when you don't need to search for particular keys or key
groups. I don't think you can easily glob on hashtable keys
(can you?). Use hashtables and arrays when the data set is
small.
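
(The closest in-memory equivalent I can think of is a grep
over the keys, which is clumsier than a glob. A sketch, where
the %raw hash keyed as "week.pack" is purely an assumption:)

my %raw = (
    '11.Aspen' => [],
    '11.Birch' => [],
    '12.Aspen' => [],
);
my @week11_keys = grep { /^11\./ } keys %raw;   # like <rawdata.11.*>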

Additionally, use the filesystem to avoid writing a series of
nested loops over the whole data set on every run. Nested
loops are the bane of any cron job; bucketing into files
spends (perhaps) less-precious disk space to save that
precious machine time.

Don't use hashtables and arrays when your data set is huge.
50 megabytes is huge. 1 megabyte might not be huge. Your
mileage may vary. Perhaps the real test is how much strain
the data puts on your computer's RAM.

All the things above which are not facts are my opinions.

Rob

Re: Sometimes, the File System is my hashtable...
by jepri (Parson) on Nov 27, 2001 at 16:36 UTC
    You can take the metaphor further and use:

    /datadir/key1/key2/key3/key4/.../file
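
    A sketch of dropping a record into such a tree (mkpath is
    from the core File::Path module; the key values here are
    placeholders):

    use File::Path qw(mkpath);

    # build the nested-key directory on demand, then write the
    # record into a file named by the last key
    my ($key1, $key2, $key3) = ('11', 'Aspen', 'run42');
    my $dir = "/datadir/$key1/$key2";
    mkpath($dir);
    open my $fh, '>', "$dir/$key3" or die "can't write $dir/$key3: $!";
    print {$fh} "raw data goes here\n";
    close $fh;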

    If you look at most high-capacity mail routers (if you have Linux, you have one of these), they use a similar scheme, with an index to get to the files quickly. There are people around who are doing mini-projects to have database-backed file systems, and file system databases. Expect an explosion when the Hurd is released and there really is no difference between a file system and a database.

    Certainly this would be your most efficient choice for sorting a huge volume of information, if you have a good file system to do it on.

    There have been some hurtful words bandied around about file system databases vs. real database backing for different applications. I don't have links to the nodes but there are some interesting discussions about the best choice for a mailer, etc.

    ____________________
    Jeremy
    I didn't believe in evil until I dated it.

      Jeremy,

      Thanks for the tip. I've been living under a rock,
      so I didn't know this was widely used. The directory
      structure itself is an interesting database format, one
      I vaguely remember using many years ago (it's a very
      convenient way of archiving data... isn't a database
      just a wrapper for hierarchical storage?). I'll have
      to revisit the idea. Thanks for the reminder!

      Rob