Whilst it wouldn't be too hard to write a Tie::Files module to access the lines of multiple read-only files as a single array, the benefits are dubious. It would be necessary to (internally) read the whole of file1 before it could allow you to access file2, as this is the only way to determine how many records are in file1, and therefore at which array element file2 starts. The same is true for reading all of file2 before being able to access file3, and so on.
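To make the bookkeeping concrete, here is a rough sketch of how such a module might map a combined array index onto a file and a line within it. The names count_lines, build_starts and locate are made up for illustration, and the two temp files merely stand in for file1 and file2; a real tie class would wrap this in FETCH and FETCHSIZE.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Count the lines in one file. This is the unavoidable full read.
sub count_lines {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $n = 0;
    $n++ while <$fh>;
    close $fh;
    return $n;
}

# Record the combined-array index at which each file starts.
sub build_starts {
    my @paths = @_;
    my @starts;
    my $total = 0;
    for my $path (@paths) {
        push @starts, $total;
        $total += count_lines($path);
    }
    return ( \@starts, $total );
}

# Map a combined index to (file number, line number within that file).
sub locate {
    my ( $starts, $index ) = @_;
    my $file = $#$starts;
    $file-- while $file > 0 && $starts->[$file] > $index;
    return ( $file, $index - $starts->[$file] );
}

# Demo data: two small temp files standing in for file1 and file2.
my @paths;
for my $lines ( [qw(a b c)], [qw(d e)] ) {
    my ( $fh, $name ) = tempfile( UNLINK => 1 );
    print {$fh} "$_\n" for @$lines;
    close $fh;
    push @paths, $name;
}

my ( $starts, $total ) = build_starts(@paths);
my ( $file, $line ) = locate( $starts, 3 );
print "element 3 is line $line of file $file ($total records in all)\n";
```

Note that build_starts has to finish before locate can answer anything about the later files, which is exactly the start-up cost described above.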

Unless your records are of fixed length, even if you only needed to access a few records from each of the constituent files, you would still need to read every record to do it. And if you need to access the records out of their original order, every record still has to be processed in its original order once before you could randomly access any of them. With 2-8 GB, this is going to impose a huge start-up delay.
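For contrast, if the records did happen to be of fixed length, no scan would be needed at all: the byte offset of record N is simply N times the record length, and the record count falls out of the file size. A minimal sketch, assuming 8-byte records (the length, the sub names and the temp file are all made up for the demo):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $RECLEN = 8;    # assumed fixed record length in bytes, newline included

# The record count comes straight from the file size; nothing is read.
sub record_count {
    my ($path) = @_;
    return ( -s $path ) / $RECLEN;
}

# Fetch one record by seeking directly to index * record length.
sub fetch_fixed {
    my ( $path, $index ) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    seek $fh, $index * $RECLEN, 0 or die "seek failed: $!";
    my $got = read $fh, my $record, $RECLEN;
    close $fh;
    die "No record $index in $path" unless defined $got && $got == $RECLEN;
    return $record;
}

# Demo data: five records of exactly 8 bytes each.
my ( $out, $path ) = tempfile( UNLINK => 1 );
binmode $out;
printf {$out} "rec%04d\n", $_ for 0 .. 4;
close $out;

my $count  = record_count($path);
my $record = fetch_fixed( $path, 3 );
print "$count records; record 3 is $record";
```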

That said, for random access to make sense, you would need to know which of the approx. 100 million records you need before opening the files. That seems unlikely given that the data is coming from a third-party source, but if it is true, then why not ask the third party to supply just the records you need rather than the whole lot? :)

The only way this really makes sense is if the dataset constitutes a sequentially numbered set of records used as a lookup table, in which case you'd probably be wise to think about importing the dataset into a database and accessing it that way.

If the dataset changes frequently, or if you only use each dataset once or a few times before discarding it, or if you have many such datasets that you prefer to access directly from the CDs to save storage, then I could see the benefit in creating an index to the dataset(s) as a separate file and using that to access the dataset randomly.

Creating and accessing this index efficiently would be an interesting project. I'd probably think about using a tied array to a file of binary record positions, using 1 or 2 bytes to indicate the file/CD on which it starts (or CD and file within that CD) and as many bytes as needed to indicate the offset within that file at which the record starts.
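As a rough illustration of that index layout (not a finished design), the sketch below packs each entry as one byte for the file number plus an 8-byte big-endian offset. The 'C Q>' template assumes a perl built with 64-bit integer support; build_index and fetch_record are hypothetical names, and plain subs stand in for the tied array.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $ENTRY  = 'C Q>';    # 1-byte file number, 8-byte big-endian offset
my $ENTLEN = length( pack( $ENTRY, 0, 0 ) );    # 9 bytes per entry

# Scan every data file once, writing one packed entry per record.
sub build_index {
    my ( $index_path, @data_paths ) = @_;
    open my $idx, '>:raw', $index_path or die "Cannot write $index_path: $!";
    my $count = 0;
    for my $file_no ( 0 .. $#data_paths ) {
        open my $fh, '<:raw', $data_paths[$file_no]
            or die "Cannot open $data_paths[$file_no]: $!";
        my $offset = tell $fh;
        while ( defined( my $line = <$fh> ) ) {
            print {$idx} pack( $ENTRY, $file_no, $offset );
            $count++;
            $offset = tell $fh;
        }
        close $fh;
    }
    close $idx or die "Cannot close $index_path: $!";
    return $count;
}

# Fetch record $n: read its index entry, then seek into the data file.
sub fetch_record {
    my ( $index_path, $data_paths, $n ) = @_;
    open my $idx, '<:raw', $index_path or die "Cannot open $index_path: $!";
    seek $idx, $n * $ENTLEN, 0 or die "seek failed: $!";
    my $got = read $idx, my $entry, $ENTLEN;
    close $idx;
    die "No index entry for record $n" unless defined $got && $got == $ENTLEN;
    my ( $file_no, $offset ) = unpack $ENTRY, $entry;
    open my $fh, '<:raw', $data_paths->[$file_no]
        or die "Cannot open $data_paths->[$file_no]: $!";
    seek $fh, $offset, 0 or die "seek failed: $!";
    my $line = <$fh>;
    close $fh;
    return $line;
}

# Demo data: two variable-length files and one index over both.
my @data;
for my $lines ( [ 'alpha', 'be' ], [ 'gamma ray', 'd', 'epsilon' ] ) {
    my ( $fh, $name ) = tempfile( UNLINK => 1 );
    binmode $fh;
    print {$fh} "$_\n" for @$lines;
    close $fh;
    push @data, $name;
}
my ( $ifh, $index_path ) = tempfile( UNLINK => 1 );
close $ifh;

my $records = build_index( $index_path, @data );
my $third   = fetch_record( $index_path, \@data, 3 );
print "$records records indexed; record 3 is $third";
```

The index still costs one full pass over the data to build, but it is paid once per dataset rather than on every run, and because its own entries are of fixed size, any record can then be reached with two seeks.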

The upshot is that without a clearer picture of the nature of the data, or at least how it is to be accessed, there are simply too many possibilities to reach any useful conclusions.


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller



In reply to Re: tie multiple files to a single array? by BrowserUk
in thread tie multiple files to a single array? by Anonymous Monk
