Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Oh wise monks, I appeal to thee.

This is my task. I often receive a stack of CD-ROMs from clients with several files on them. The files all make up a single dataset, but they are always split into multiple files (my guess is that there is some sort of limitation on the export tools from their database system).

Anyway, what I have been doing is cat'ing the files all into a single massive file (often between 2 and 8 gig). This is all well and good, but I would rather be able to simply access the entire set as a single file. I have used Tie::File for array-based access on single files, but is there any perl-ish way to treat file1.txt, file2.txt, ... fileN.txt as a single file for the purposes of tying?

Many thanks!

Replies are listed 'Best First'.
Re: tie multiple files to a single array?
by BrowserUk (Patriarch) on Jul 07, 2003 at 08:48 UTC

    Whilst it wouldn't be too hard to write a Tie::Files module to access the lines of multiple read-only files as a single array, the benefits are dubious. It would be necessary to (internally) read the whole of file1 before it could allow you to access file2, as this is the only way to determine how many records are in file1, and therefore at which array element file2 starts. The same is true for reading all of file2 before being able to access file3, and so on.

    Unless your records are of fixed length, even if you only need to access a few records from each of the constituent files, you would still need to read every record to locate them. And if you need to access the records out of their original order, every record still has to be processed in its original order once before you can randomly access them. With 2-8 GB, this is going to impose a huge start-up delay.
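    To illustrate the contrast: with fixed-length records, fetching record $n is pure arithmetic, no scanning at all. A minimal sketch, assuming 80-byte records in a hypothetical file1.txt:

        use strict;
        use warnings;

        # Fixed-length records: the position of element $n is just arithmetic.
        my $reclen = 80;        # assumed record length
        my $n      = 12_345;    # the record we want
        open my $fh, '<', 'file1.txt' or die "Can't open file1.txt: $!";
        binmode $fh;
        seek $fh, $n * $reclen, 0 or die "seek: $!";
        read $fh, my $record, $reclen;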

    That said, for random access to make sense, you would need to know which of the approximately 100 million records you need before opening the files. That seems unlikely given the data comes from a third party, but if it is true, then why not ask that third party to supply just the records you need rather than the whole lot? :)

    The only way this really makes sense is if the dataset constitutes a sequentially numbered set of records used as a lookup table, in which case you'd probably be wise to think about importing the dataset into a database and accessing it that way.
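    For illustration, a hedged sketch of that route using DBI (here with DBD::SQLite; the database, table, and file names are all invented):

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:SQLite:dbname=dataset.db', '', '',
                               { RaiseError => 1, AutoCommit => 0 });
        $dbh->do('CREATE TABLE records (recno INTEGER PRIMARY KEY, line TEXT)');

        # Load every file once, numbering the records as we go.
        my $sth   = $dbh->prepare('INSERT INTO records VALUES (?, ?)');
        my $recno = 0;
        for my $file (map { "file$_.txt" } 1 .. 10) {
            open my $in, '<', $file or die "Can't open $file: $!";
            while (defined(my $line = <$in>)) {
                chomp $line;
                $sth->execute($recno++, $line);
            }
            close $in;
        }
        $dbh->commit;

        # Thereafter, any record is a single indexed lookup away.
        my ($line) = $dbh->selectrow_array(
            'SELECT line FROM records WHERE recno = ?', undef, 12_345);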

    If the dataset changes frequently, or if you only use each one once or a few times before discarding it, or if you have many such datasets that you prefer to access directly off the CDs to save storage, then I could see the benefit in creating an index to the dataset(s) as a separate file and using that to access the dataset randomly.

    Creating and accessing this index efficiently would be an interesting project. I'd probably think about using a tied array to a file of binary record positions, using 1 or 2 bytes to indicate the file/CD on which it starts (or CD and file within that CD) and as many bytes as needed to indicate the offset within that file at which the record starts.
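    A minimal sketch of what I mean, assuming one record per line, ten files, and fixed-width 5-byte index entries (a 1-byte file number packed with a 4-byte offset; 4 bytes is plenty, since no single file on a CD can approach 4GB). The file names here are invented:

        use strict;
        use warnings;

        my @files = map { "file$_.txt" } 1 .. 10;   # assumed names

        # Build the index: one fixed-width 5-byte entry per record.
        open my $idx, '>', 'dataset.idx' or die "Can't write index: $!";
        binmode $idx;
        for my $fileno (0 .. $#files) {
            open my $in, '<', $files[$fileno]
                or die "Can't open $files[$fileno]: $!";
            while (1) {
                my $offset = tell $in;              # position before the record
                defined(my $line = <$in>) or last;
                print $idx pack 'C N', $fileno, $offset;
            }
            close $in;
        }
        close $idx;

        # Random access: seek straight to entry $n, then to the record.
        sub fetch_record {
            my ($n) = @_;
            open my $idx, '<', 'dataset.idx' or die "Can't read index: $!";
            binmode $idx;
            seek $idx, $n * 5, 0 or die "seek: $!";
            read($idx, my $entry, 5) == 5 or return undef;
            my ($fileno, $offset) = unpack 'C N', $entry;
            open my $in, '<', $files[$fileno]
                or die "Can't open $files[$fileno]: $!";
            seek $in, $offset, 0 or die "seek: $!";
            return scalar <$in>;
        }

    Because the entries are fixed width, the index itself never needs scanning, and once built it can live on your hard disk while the data stays on the CDs.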

    The upshot is that without a clearer picture of the nature of the data, or at least of how it is to be accessed, there are simply too many possibilities to reach any useful conclusions.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


      Yeah, I need random access to the file contents. I will know in advance which lines I need, so I am now thinking I should try this: 1) figure out which rows I need; 2) figure out the number of lines in each file; 3) build a module that wraps Tie::File but switches the tied filehandle according to the requested row number; 4) close everything up. Thanks for the help!
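      Something like this minimal sketch is what I have in mind (the file names and the fetch_row helper are made up for illustration):

          use strict;
          use warnings;
          use Fcntl 'O_RDONLY';
          use Tie::File;

          my @files = map { "file$_.txt" } 1 .. 10;   # assumed names
          my (@tied, @start);    # $start[$i] = first global row held in file $i
          my $total = 0;

          for my $file (@files) {
              tie my @lines, 'Tie::File', $file, mode => O_RDONLY
                  or die "Can't tie $file: $!";
              push @tied,  \@lines;
              push @start, $total;
              $total += @lines;  # forces Tie::File to scan the whole file
          }

          # Map a global row number to the right tied array.
          sub fetch_row {
              my ($row) = @_;
              for my $i (reverse 0 .. $#start) {
                  return $tied[$i][ $row - $start[$i] ] if $row >= $start[$i];
              }
              return undef;
          }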

        Tie::File is a great module for the purpose for which it was designed, essentially in-place editing of huge files, but the very features that make it so useful for that are likely to get in the way and slow your application down. That your files (being on CD-ROM) are read-only just means all the clever code in there for caching and deferred writing would be redundant.

        Of course, creating this index only makes sense if you're going to need to use it more than once, and that brings me back to the final point I made in my last post. Deciding which of the many possibilities is the 'best' approach to solving this problem really requires a good description of how the application is going to access the files, and how frequently. These are a few questions I would ask myself before deciding which way to do this.

        • How often will the application run?

          If it will only run once, there is little point in making an index, or in worrying about efficiency.

        • How important is a timely result?

          If the application will run interactively, with a user (or another machine) sitting there waiting for the result, then re-indexing 8GB of data each time it runs (essentially what Tie::File would have to do) would be incredibly wasteful and interminably slow.

        • How random does the random access need to be?

          If you have a list of record numbers that you need to extract, and there are no dependencies between them, then sorting that list and processing the files sequentially, counting and extracting the records as you go, is probably about as efficient as it is likely to get if you only need to do it once (see the sketch after this list).

          Conversely, if you need to access the records in a random order, and especially if there are dependencies and/or the order can vary at runtime, then it would probably make sense to build an index to the records.

          If you're going to process the files more than once, it definitely would.
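        As a concrete example of that sequential approach (the file names and record numbers below are stand-ins):

            use strict;
            use warnings;

            my @files  = map { "file$_.txt" } 1 .. 10;                # assumed names
            my @wanted = sort { $a <=> $b } (12, 99_000, 4_000_000);  # stand-in rows

            # One pass over the files, in order, plucking out the wanted records.
            my $recno = 0;
            FILE: for my $file (@files) {
                open my $in, '<', $file or die "Can't open $file: $!";
                while (defined(my $line = <$in>)) {
                    if (@wanted and $recno == $wanted[0]) {
                        shift @wanted;
                        print $line;                 # or stash it for later
                        last FILE unless @wanted;    # stop once we have them all
                    }
                    $recno++;
                }
                close $in;
            }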

        There are also various ways that you could build the index, with the usual trade-offs between size and speed applying.

        Without greater insight into the nature of the application it's pointless to speculate further, but given the composite size and read-only nature of the files involved, and the need to wrap code around Tie::File to achieve your purpose, I'm pretty sure that there is a better way to go than that.

        Good luck.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


Re: tie multiple files to a single array?
by sauoq (Abbot) on Jul 07, 2003 at 05:13 UTC

    I don't know if something like that is currently available on CPAN. I doubt it, but then again, I wouldn't be surprised... It wouldn't be too hard to write a "Tie::MultiFile" (or whatever) for read-only access though. If you were to allow updates, it might be hard to decide which file to update when an insertion occurred on the boundary between two files.
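    If memory were no object, the read-only case could be as simple as this bare-bones sketch (Tie::MultiFile is just the hypothetical name from above; for 2-8GB you would store file offsets rather than the lines themselves):

        package Tie::MultiFile;
        use strict;
        use warnings;

        # Slurp every file's lines up front so FETCH and FETCHSIZE are trivial.
        sub TIEARRAY {
            my ($class, @files) = @_;
            my @lines;
            for my $file (@files) {
                open my $in, '<', $file or die "Can't open $file: $!";
                push @lines, <$in>;
                close $in;
            }
            chomp @lines;
            return bless { lines => \@lines }, $class;
        }

        sub FETCH     { $_[0]{lines}[ $_[1] ] }
        sub FETCHSIZE { scalar @{ $_[0]{lines} } }

        package main;
        tie my @records, 'Tie::MultiFile', 'file1.txt', 'file2.txt';
        print $records[42], "\n";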

    Do you really need random access to the lines in the files? If not, can you get by with populating @ARGV and just using the empty diamond operator?
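    That is, something along these lines (the file names are assumed):

        use strict;
        use warnings;

        # <> reads every file named in @ARGV as one continuous stream.
        @ARGV = map { "file$_.txt" } 1 .. 10;

        while (<>) {
            chomp;
            # process each line here; $ARGV holds the current file's name
        }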

    -sauoq
    "My two cents aren't worth a dime.";
    
Re: tie multiple files to a single array?
by waswas-fng (Curate) on Jul 07, 2003 at 07:14 UTC
    I guess it depends on what you are doing with the tied array. If you are just processing each line, it may be easier to:
    foreach my $filename (@all_file_names) {
        open my $curfile, '<', $filename
            or die "Can't open $filename: $!";
        while (<$curfile>) {
            chomp;
            # do stuff with each line here...
        }
        close $curfile;
    }


    -Waswas