in reply to tie multiple files to a single array?
Whilst it wouldn't be too hard to write a Tie::Files module to access the lines of multiple read-only files as a single array, the benefits are dubious. It would be necessary to (internally) read the whole of file1 before it could allow you to access file2, as this is the only way to determine how many records are in file1, and therefore at which array element file2 starts. The same is true for reading all of file2 before being able to access file3, and so on.
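To illustrate why that up-front pass is unavoidable, here is a minimal sketch of the bookkeeping such a module would need. Tie::Files itself does not exist, and `count_lines` and `locate` are hypothetical names made up for this example. It assumes newline-terminated records.

```perl
use strict;
use warnings;

# Count the lines in each file. This full pass over every file is the
# start-up cost: nothing can be mapped until it has finished.
sub count_lines {
    my @files = @_;
    my @counts;
    for my $file (@files) {
        open my $fh, '<', $file or die "Cannot open $file: $!";
        my $n = 0;
        $n++ while <$fh>;
        close $fh;
        push @counts, $n;
    }
    return @counts;
}

# Translate a global array index into (file number, line number within
# that file), both zero-based. Returns an empty list if out of range.
sub locate {
    my ($index, @counts) = @_;
    for my $i (0 .. $#counts) {
        return ($i, $index) if $index < $counts[$i];
        $index -= $counts[$i];
    }
    return;
}
```

Note that `locate` only tells you which line of which file you want. With variable-length records you would still have to read that file from the top to reach it.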
Unless your records are of fixed length, even if you needed to only access a few records from each of the constituent files, you would still need to read every record to do it. And if you need to access the records out of their original order, every record still has to be processed in its original order once before you could randomly access the records. With 2-8 GB, this is going to impose a huge start-up delay.
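For contrast, if the records were of fixed length (an assumption, nothing in the original question says they are), the position of any record could be computed rather than discovered. A rough sketch, with `read_fixed_record` being a hypothetical helper:

```perl
use strict;
use warnings;

# Fetch record number $n (zero-based) from a file of fixed-length
# records, each $reclen bytes long including any terminator.
# Returns undef if the file has no complete record at that position.
sub read_fixed_record {
    my ($fh, $reclen, $n) = @_;
    seek $fh, $n * $reclen, 0 or die "seek failed: $!";
    my $got = read $fh, my $buf, $reclen;
    die "read failed: $!" unless defined $got;
    return $got == $reclen ? $buf : undef;
}
```

No start-up pass is needed in this case, as the record count per file is just the file size divided by the record length.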
That said, for random access to make sense, you would need to know which of the approx. 100 million records you need to access before opening the files. That seems unlikely given that the data is coming from a third-party source, but if it is true, then why not ask the third party to supply just the records you need rather than the whole lot? :)
The only way this really makes sense is if the dataset constitutes a sequentially numbered set of records used as a lookup table, in which case you'd probably be wise to think about importing the dataset into a database and accessing it that way.
If the dataset changes frequently, or if you only use it once or a few times before discarding it, or if you have many such datasets that you prefer to access directly off the CDs to save storage, then I could see the benefit in creating an index to the dataset(s) as a separate file and using that to access the dataset randomly.
Creating and accessing this index efficiently would be an interesting project. I'd probably think about using a tied array to a file of binary record positions, using 1 or 2 bytes to indicate the file/CD on which it starts (or CD and file within that CD) and as many bytes as needed to indicate the offset within that file at which the record starts.
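A minimal sketch of such an index, using plain functions rather than a full tied-array class. The names `build_index` and `fetch_record` and the 6-byte entry layout are my own assumptions for illustration: 2 bytes for the file number and 4 bytes for the offset, which assumes no single file exceeds 4 GB. It also assumes newline-terminated records.

```perl
use strict;
use warnings;

use constant ENTRY_FMT => 'n N';   # assumed layout: 16-bit file no, 32-bit offset
use constant ENTRY_LEN => 6;       # bytes per index entry

# Write one fixed-size entry per record: which file it lives in and the
# byte offset at which it starts. Returns the number of records indexed.
sub build_index {
    my ($index_file, @files) = @_;
    open my $idx, '>', $index_file or die "Cannot create $index_file: $!";
    binmode $idx;
    my $records = 0;
    for my $file_no (0 .. $#files) {
        open my $fh, '<', $files[$file_no]
            or die "Cannot open $files[$file_no]: $!";
        my $pos = tell $fh;
        while (defined(my $line = <$fh>)) {
            print {$idx} pack(ENTRY_FMT, $file_no, $pos);
            $pos = tell $fh;    # start of the next record
            $records++;
        }
        close $fh;
    }
    close $idx or die "Cannot close $index_file: $!";
    return $records;
}

# Look up record $n via the open index handle $idx, then read that one
# record from the data file it points to. $files is an array ref.
sub fetch_record {
    my ($idx, $files, $n) = @_;
    seek $idx, $n * ENTRY_LEN, 0 or die "seek failed: $!";
    my $got = read $idx, my $entry, ENTRY_LEN;
    return unless defined $got && $got == ENTRY_LEN;
    my ($file_no, $pos) = unpack(ENTRY_FMT, $entry);
    open my $fh, '<', $files->[$file_no]
        or die "Cannot open $files->[$file_no]: $!";
    seek $fh, $pos, 0 or die "seek failed: $!";
    my $line = <$fh>;
    close $fh;
    return $line;
}
```

Because every index entry is the same size, the index itself can be accessed randomly with a single seek, and the body of `fetch_record` is roughly what the FETCH method of a tied array would contain. Building the index still costs one full pass over the data, but only once per dataset.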
The upshot is that without a clearer picture of the nature of the data, or at least of how it is to be accessed, there are simply too many possibilities to reach any useful conclusions.
Re: Re: tie multiple files to a single array?
by Anonymous Monk on Jul 07, 2003 at 14:13 UTC | |
by BrowserUk (Patriarch) on Jul 08, 2003 at 00:09 UTC |