in reply to Creating Archives for Files > 4GB

What kind of data is it? In particular, are you compressing 6 GB files into zip archives? Or do you have a collection of much smaller files that are collectively 6 GB?

Assuming you have the disk space, what if the information your operators are extracting was just sitting in directories? That may not be a solution, but if your system could function that way, then maybe you can think about other ways to approach what you're doing. A database would be another way, except then you'd have to maintain a database...

--marmot

Re^2: Creating Archives for Files > 4GB
by hoffy (Acolyte) on Jul 26, 2010 at 00:06 UTC

    Thanks for the reply

    I am talking about one file of 6 GB of data. These files are just straight ASCII characters (financial institution statement files), so when using something like 7-Zip, they compress down reasonably well. I would use 7-Zip, but an overzealous auditor and manager has decreed that Open Source is bad.... (You wouldn't believe how much hassle I had to go through to get Perl approved! I am not even allowed to download modules from the CPAN :-( )

      Ye Gods! My condolences. :-(

      Well...hmmmm. My thoughts all go to breaking your file up into chunks. I mean, on the face of it, you've got a system where your data source drops 6GB files and actually expects someone to use them. But if you can't get Perl installed easily, you're probably not in a position to re-engineer your company's processes ("Excuse me sir, I think I'm smarter than you and...what's that?...yes, I like working here...oh...sorry...").

      So, for a chunk-wise example, you can use sysread() to read a monster text file (perhaps this one) in smallish chunks (512 KB, 4 MB, whatever). You write a function like get_next_record() that manages the chunk-reading as needed, finding the start and end of the current "record", as defined by you. Then you write your main function with a while(get_next_record()){} loop, and it never has to know about sysread() or chunks at all.
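
      Roughly the shape I have in mind (untested; it assumes newline-terminated records and a made-up file name, so adjust for your actual statement layout):

          use strict;
          use warnings;

          my $CHUNK_SIZE = 4 * 1024 * 1024;   # read 4 MB per sysread() call
          my $buffer     = '';

          open my $fh, '<', 'statements.txt' or die "open: $!";

          # Hands back one complete record at a time; the caller never
          # sees sysread() or chunk boundaries.
          sub get_next_record {
              while ( $buffer !~ /\n/ ) {
                  my $bytes = sysread( $fh, $buffer, $CHUNK_SIZE, length $buffer );
                  die "sysread: $!" unless defined $bytes;
                  last if $bytes == 0;              # end of file
              }
              return undef if $buffer eq '';
              $buffer =~ s/^(.*?\n|.+)//s;          # peel off one record (or the tail)
              return $1;
          }

          while ( defined( my $record = get_next_record() ) ) {
              # ... do whatever your operators need with one record ...
          }
          close $fh;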

      So now you abstract this a bit further. In some pre-process, you break your data into chunks (size dependent on memory and performance) and zip them separately. Then your get_next_record() function uses Archive::Zip or the IO::Uncompress modules (from the IO-Compress distribution) to read and decompress each chunk.
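
      Off the top of my head, the two halves might look something like this (untested; the chunk size, file names, and one-record-per-line assumption are all placeholders):

          use strict;
          use warnings;
          use IO::Compress::Zip     qw($ZipError);
          use IO::Uncompress::Unzip qw($UnzipError);

          # Pre-process: split the monster file into separately zipped chunks.
          my $RECORDS_PER_CHUNK = 100_000;
          open my $in, '<', 'statements.txt' or die "open: $!";
          my ( $chunk, $count, $zip ) = ( 0, 0, undef );
          while ( my $line = <$in> ) {
              if ( !$zip or $count >= $RECORDS_PER_CHUNK ) {
                  $zip->close if $zip;
                  $zip = IO::Compress::Zip->new(
                      sprintf( 'chunk_%03d.zip', $chunk++ ), Name => 'records.txt' )
                      or die "zip: $ZipError";
                  $count = 0;
              }
              $zip->print($line);
              $count++;
          }
          $zip->close if $zip;
          close $in;

          # Reader: get_next_record() walks the chunk files and decompresses
          # them one at a time, so only one chunk is ever open at once.
          my @chunks = sort glob 'chunk_*.zip';
          my $unzip;

          sub get_next_record {
              while (1) {
                  if ( !$unzip ) {
                      my $file = shift @chunks or return undef;
                      $unzip = IO::Uncompress::Unzip->new($file)
                          or die "unzip: $UnzipError";
                  }
                  my $line = $unzip->getline;
                  return $line if defined $line;
                  $unzip = undef;    # chunk exhausted, move on to the next one
              }
          }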

      It might require a bit of glue in the middle of your current process, but this is where I'd start. I realize I'm talking through my hat here because I don't know anything about the structure of your files or what you're doing with them.

      Cheers!
      --marmot

      Update: Got back from an inspection and clarified my comments.

        Again, thanks for your input there, marmot

        In the end, though, the simplest options are often the best. IO::Compress::Zip is probably the easiest to use for what I need. It appears that my intended audience has the right tools to deal with Zip64 files (always a good thing to do the research in the first place, instead of heading into tangent land).
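
        For anyone who finds this later, the one-shot interface is about as simple as it gets (file names are just examples; Zip64 => 1 is the bit that lets the archive hold members over 4 GB):

            use strict;
            use warnings;
            use IO::Compress::Zip qw(zip $ZipError);

            # Compress the big statement file into a Zip64-enabled archive.
            zip 'statement.txt' => 'statement.zip',
                Name  => 'statement.txt',
                Zip64 => 1
                or die "zip failed: $ZipError\n";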

        So, on that note, I am assuming that this is case closed! Thanks to everyone who gave me their input!

        hoffy