in reply to Re: Strategy for randomizing large files via sysseek
in thread Strategy for randomizing large files via sysseek
First reply: I can't imagine Tie::File can be any faster than using sysseek. That's pretty low-level already.
Second reply:
It's the second case: all files are randomized to produce many other files. I've even tried combining all the input files into one big file to make things easier to handle. Having 50 filehandles open on (relatively) smaller files seems to be a little faster, however.
This is going to be imported elsewhere, so I cannot use the STDOUT trick.
> I guess your increasing mem usage comes from the record of which lines you already output. You could return some lines (perhaps 10%), then re-write the source files without the used up lines, rinse repeat...

Rewriting the source files could get pretty ugly, but that might be a possible solution. I think the memory must be leaking somewhere though, as even a giant hash should not be taking up that much memory. I hope.
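If the record of used lines really is what is eating the memory, one alternative to a hash (not something tried in this thread, just a sketch) is a packed bit string managed with Perl's built-in `vec`. It stores one bit per line number, where a hash stores a full key and value per entry. The helper names `mark_used` and `is_used` are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers (not from the thread): track used line numbers in a
# packed bit string, one bit per line, instead of one hash entry per line.

# Set the bit for line $n. vec() grows the string as needed.
sub mark_used {
    my ($bits, $n) = @_;
    vec($$bits, $n, 1) = 1;
    return;
}

# Return 1 if line $n has been marked, 0 otherwise.
# Reading past the end of the string yields 0.
sub is_used {
    my ($bits, $n) = @_;
    return vec($$bits, $n, 1);
}

my $used = '';
mark_used(\$used, $_) for 3, 1_000_000;
printf "line 3: %d, line 4: %d, bytes held: %d\n",
    is_used(\$used, 3), is_used(\$used, 4), length $used;
```

The lookup stays a constant-time in-memory operation, so it should not add to the speed bottleneck, though whether it helps overall depends on where the memory is actually going.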
> You could clobber the start of the lines you have used with some unique token XXX-Clobbered-XXX, then you do not need to store the used lines record in RAM. Of course, as you get past 50% your random line will hit a clobbered one more often than not, so a re-write of the source data purging the used lines would still be required at some point.

I can't rewrite the original files: that would lose the fixed-record-ness. It would also mean that instead of a hash lookup, I would have to sysseek, sysread, check whether the record has the XXX token, and try again. The bottleneck is really speed at this point; the memory drain is just worrisome.
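For reference, the hash-lookup approach described above can be sketched roughly as follows. This is a minimal illustration, not the poster's actual code: the helper names `read_record` and `next_random_record`, and the parameters `$reclen` (record length in bytes, newline included) and `$nrecs` (record count), are assumptions. The filehandle is assumed to be opened raw, with no encoding layers.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Hypothetical helper: read record number $n (0-based) from a file of
# fixed-length records. $reclen is the record length in bytes.
sub read_record {
    my ($fh, $n, $reclen) = @_;
    defined sysseek($fh, $n * $reclen, SEEK_SET)
        or die "sysseek failed: $!";
    my $got = sysread($fh, my $buf, $reclen);
    die "sysread failed: $!" unless defined $got;
    die "short read at record $n" unless $got == $reclen;
    return $buf;
}

# Hypothetical helper: return a random record that has not been returned
# before, or nothing once every record has been used. %$seen is the
# in-memory record of used record numbers.
sub next_random_record {
    my ($fh, $nrecs, $reclen, $seen) = @_;
    return if keys(%$seen) >= $nrecs;

    my $n;
    do { $n = int rand $nrecs } while $seen->{$n};   # hash lookup, no disk I/O
    $seen->{$n} = 1;
    return read_record($fh, $n, $reclen);
}
```

A caller would open the file with something like `open my $fh, '<:raw', $path`, keep one `%seen` hash per file, and call `next_random_record($fh, $nrecs, $reclen, \%seen)` in a loop. The retry loop only touches the hash, so a rejected pick costs no seek; with the clobber-token scheme, every rejected pick would cost a sysseek and a sysread.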
> You could put in more RAM or faster disk :)

Faster disk, perhaps. I thought about splitting the files across different physical disks to make sysseek faster, but I don't have enough disks to really make a difference. These are big files: by a rough calculation, I would need well over 4 gigs of RAM to store all the information in memory.