in reply to 15 billion row text file and row deletes - Best Practice?
Unless you have a second disk on which to write the output file, I wouldn't use two files.

Besides the huge intermediate storage requirement, which can be a problem for smaller systems, writing to a disk (and having the OS locate space to write to) at the same time as you are reading from that disk can severely slow down the process, especially if the disk is more than, say, 50% full and/or fragmented.
I'd use a two-pass process.

If the kill list is small, load it all into a hash.

If it's too big for memory, load the kill list into a hash in chunks and write the record positions to another file (see the sketch below). The second pass only needs the positions one at a time (once they are sorted).
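For the too-big-for-memory case, the first pass might look something like this. This is a minimal sketch of the chunking idea, not code from the original post: the file names, the one-key-per-line record format, and the chunk size are all assumptions, and note that each chunk costs one full scan of the big file.

```perl
#! perl -w
use strict;

## Hypothetical file names and chunk size -- adjust to taste.
my $bigName   = 'bigfile.txt';
my $killName  = 'kill_list.txt';
my $posName   = 'positions.dat';
my $chunkSize = 1_000_000;          ## keys held in memory per chunk

open my $fhKill, '<', $killName or die "$killName: $!";
open my $fhPos,  '>', $posName  or die "$posName: $!";

while( !eof $fhKill ) {
    ## Load the next chunk of the kill list into a hash
    my %kills;
    while( <$fhKill> ) {
        chomp;
        $kills{ $_ } = undef;
        last if keys( %kills ) >= $chunkSize;
    }

    ## Scan the big file once per chunk, recording the byte range
    ## (start & end) of every record matching this chunk
    open my $fhBig, '<', $bigName or die "$bigName: $!";
    my $lastPos = 0;
    while( <$fhBig> ) {
        chomp;
        print $fhPos join( ',', $lastPos, tell( $fhBig ) ), "\n"
            if exists $kills{ $_ };
        $lastPos = tell( $fhBig );
    }
    close $fhBig;
}
close $fhKill;
close $fhPos;

## Sort the ranges numerically before pass two, e.g. externally:
##     sort -t, -k1,1n positions.dat > positions.sorted
```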
This is a little tricky to envisage, so a diagram might help.
```
The records to be deleted:              A,J,O,R,Y,Z
The initial state of the bigfile:       L,J,E,M,X,T,A,Z,G,U,W,Y,Q,R,C,K,O,P,V,I,D,H,N,S,F,B
The big file without the kill list:     L,  E,M,X,T,    G,U,W,  Q,  C,K,  P,V,I,D,H,N,S,F,B
The desired result after 'compaction':  L,E,M,X,T,G,U,W,Q,C,K,P,V,I,D,H,N,S,F,B
```
The following code uses a ramfile with single-character records and ',' as the record separator to demonstrate the logic required:
```perl
#! perl -slw
use strict;
use Data::Dumper;
use List::Util qw[ shuffle ];

## Dummy up a hash of lines to kill.
my %kills = map{ $_ => undef } ( shuffle( 'A' .. 'Z' ) )[ 0 .. 5 ];
print sort keys %kills;

## And a 'big file'
my $bigfile = join ',', shuffle 'A' .. 'Z';
print $bigfile;

## Scan the bigfile recording the file positions
## where records are to be deleted.
open my $fhRead, '<', \$bigfile;
{
    local $/ = ',';
    my $lastPos = 0;
    while( <$fhRead> ) {
        chomp;
        ## Store the range (start & end) of each record to delete
        $kills{ $_ } = [ $lastPos, tell $fhRead ] if exists $kills{ $_ };
        $lastPos = tell $fhRead;
    }
}

## Sort the ranges into ascending order
my @posns = sort{ $a->[ 0 ] <=> $b->[ 0 ] } values %kills;

## Open a second, write handle to the file
open my $fhWrite, '+<', \$bigfile;
{
    local $/ = ',';
    local $\;

    ## Move the file pointers for reading and writing
    ## to the end and start positions respectively
    my( $w1, $r1 ) = @{ shift @posns };
    seek $fhWrite, $w1, 0;
    seek $fhRead,  $r1, 0;

    while( @posns ) {
        ## Get the next set of positions
        my( $w2, $r2 ) = @{ shift @posns };

        ## 'Move' the records up to the start of the next record to be deleted
        print $fhWrite scalar <$fhRead> while tell( $fhRead ) < $w2;

        ## Advance the read head over the section to be removed.
        seek $fhRead, $r2, 0;
    }

    ## Copy the remaining records over
    print $fhWrite $_ while <$fhRead>;

    ## Truncate the write filehandle.
    # truncate $fhWrite, tell $fhWrite; ## truncate doesn't work on ramfiles!
    $bigfile = substr $bigfile, 0, tell $fhWrite;
}
close $fhRead;
close $fhWrite;

print $bigfile;

__END__
C:\test>junk
HMNPRS
L,R,J,Z,P,H,M,K,T,A,D,Q,B,Y,S,E,W,F,U,C,I,X,G,O,N,V
L,J,Z,K,T,A,D,Q,B,Y,E,W,F,U,C,I,X,G,O,V

C:\test>junk
ABGKOT
G,O,Q,C,U,H,P,I,R,M,S,J,V,W,X,Y,Z,D,E,A,B,L,N,T,K,F
Q,C,U,H,P,I,R,M,S,J,V,W,X,Y,Z,D,E,L,N,F

C:\test>junk
AKLPTZ
J,F,Q,S,E,X,I,B,U,K,R,A,M,W,Z,G,D,L,H,C,N,Y,O,V,T,P
J,F,Q,S,E,X,I,B,U,R,M,W,G,D,H,C,N,Y,O,V,

C:\test>junk
AJORYZ
L,J,E,M,X,T,A,Z,G,U,W,Y,Q,R,C,K,O,P,V,I,D,H,N,S,F,B
L,E,M,X,T,G,U,W,Q,C,K,P,V,I,D,H,N,S,F,B
```
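The same two-handle compaction carries over to a real file on disk, where truncate does work. Another minimal sketch, assuming newline-terminated records and an ascending 'start,end' positions file like the one produced by the chunked first pass above (file names again hypothetical):

```perl
#! perl -w
use strict;

## Hypothetical file names; positions.sorted holds 'start,end' pairs,
## one per doomed record, sorted ascending by start position.
open my $fhPos,   '<',  'positions.sorted' or die $!;
open my $fhRead,  '<',  'bigfile.txt'      or die $!;
open my $fhWrite, '+<', 'bigfile.txt'      or die $!;

## Prime the pointers: write head at the first doomed record,
## read head just past it.
my $first = <$fhPos>;
chomp $first;
my( $w, $r ) = split /,/, $first;
seek $fhWrite, $w, 0;
seek $fhRead,  $r, 0;

while( <$fhPos> ) {
    chomp;
    my( $w2, $r2 ) = split /,/;

    ## 'Move' the surviving records up to the next doomed one
    print $fhWrite scalar <$fhRead> while tell( $fhRead ) < $w2;

    ## Skip the read head over the doomed record
    seek $fhRead, $r2, 0;
}

## Copy everything after the last doomed record
print $fhWrite $_ while <$fhRead>;

## Unlike a ramfile, a real file can be truncated in place
truncate $fhWrite, tell $fhWrite;

close $fhPos;
close $fhRead;
close $fhWrite;
```

Note that nothing is physically moved until the first doomed record, so records before it are never rewritten; and because the write head always trails the read head, the copy is safe to do in place.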
Replies are listed 'Best First'.

- Re^2: 15 billion row text file and row deletes - Best Practice? by jynx (Priest) on Dec 01, 2006 at 22:44 UTC
  - by BrowserUk (Patriarch) on Dec 01, 2006 at 23:20 UTC