Re: 15 billion row text file and row deletes - Best Practice?

by BrowserUk (Patriarch)
on Dec 01, 2006 at 17:28 UTC [id://587241]


in reply to 15 billion row text file and row deletes - Best Practice?

Unless you have a second disk on which to write the output file, I wouldn't use two files.

Besides the huge intermediate storage requirement, which can be a problem for smaller systems, writing to a disk (and having the OS locate space to write to) at the same time as you are reading from that disk can severely slow down the process, especially if the disk is more than, say, 50% full and/or fragmented.

I'd use a two pass process.

  1. Pass 1 reads the file and records the start and end byte positions of each record to be deleted.

    If the kill list is small, load it all into a hash.

    If it's too big for memory, load the kill list into a hash in chunks and write the positions to another file (see the sketch just after this list).

    The second pass only needs the positions one at a time (once they are sorted).

  2. The second pass sorts the positions and then opens two filehandles to the data file: one for reading, one for writing. It then reads via the former and writes via the latter, overwriting the sections that need deleting, and finally truncates the file (via the write handle) to its new, shorter length.
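
For the kill-list-too-big-for-memory case, pass 1 might look something like the sketch below. It is only a sketch: the filenames, the chunk size, and the assumption that the kill key is the first comma-separated field of each record are mine for illustration, not from the OP.

#! perl -slw
use strict;

## Load the kill list into a hash one chunk at a time and append
## the byte ranges of matching records to a positions file.
my $CHUNK = 10_000_000;

open my $fhKills, '<', 'kills.txt' or die $!;
open my $fhPosns, '>', 'posns.txt' or die $!;

until( eof $fhKills ) {
    ## Next chunk of the kill list.
    my %kills;
    while( <$fhKills> ) {
        chomp;
        $kills{ $_ } = undef;
        last if keys( %kills ) >= $CHUNK;
    }

    ## One scan of the big file per chunk, recording the start
    ## and end positions of each record to be deleted.
    open my $fhBig, '<', 'bigfile.dat' or die $!;
    my $lastPos = 0;
    while( <$fhBig> ) {
        my( $key ) = split ',', $_, 2;
        print $fhPosns join ',', $lastPos, tell $fhBig
            if exists $kills{ $key };
        $lastPos = tell $fhBig;
    }
    close $fhBig;
}
close $fhKills;
close $fhPosns;

## Sort the ranges numerically for the second pass. An external
## sort (GNU sort shown) avoids holding them all in memory:
system qw[ sort -t, -k1,1n -o posns.txt posns.txt ];

Each chunk costs one full scan of the big file, so the chunk size wants to be as large as memory comfortably allows.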

The overwriting in the second pass is a little tricky to envisage, so a diagram might help.

The records to be deleted:           A,J,O,R,Y,Z

The initial state of the bigfile:    L,J,E,M,X,T,A,Z,G,U,W,Y,Q,R,C,K,O,P,V,I,D,H,N,S,F,B

The big file without the kill list:  L,  E,M,X,T,    G,U,W,  Q,  C,K,  P,V,I,D,H,N,S,F,B

The desired result after 'compaction': L,E,M,X,T,G,U,W,Q,C,K,P,V,I,D,H,N,S,F,B

The following code uses a ramfile with single-character records and ',' as the record separator to demonstrate the logic required:

#! perl -slw
use strict;
use Data::Dumper;
use List::Util qw[ shuffle ];

## Dummy up a hash of lines to kill.
my %kills = map{ $_ => undef } ( shuffle( 'A' .. 'Z' ) )[ 0 .. 5 ];
print sort keys %kills;

## And a 'big file'
my $bigfile = join ',', shuffle 'A' .. 'Z';
print $bigfile;

## Scan the bigfile recording the file positions
## where records are to be deleted.
open my $fhRead, '<', \$bigfile;
{
    local $/ = ',';
    my $lastPos = 0;
    while( <$fhRead> ) {
        chomp;
        ## Store the ranges (start & end) of each record to delete
        $kills{ $_ } = [ $lastPos, tell $fhRead ]
            if exists $kills{ $_ };
        $lastPos = tell $fhRead;
    }
}

## Sort the ranges into ascending order
my @posns = sort{ $a->[ 0 ] <=> $b->[ 0 ] } values %kills;

## Open a second, write handle to the file
open my $fhWrite, '+<', \$bigfile;
{
    local $/ = ',';
    local $\;

    ## Move the file pointers for writing and reading
    ## to the start and end positions respectively
    my( $w1, $r1 ) = @{ shift @posns };
    seek $fhWrite, $w1, 0;
    seek $fhRead,  $r1, 0;

    while( @posns ) {
        ## Get the next set of positions
        my( $w2, $r2 ) = @{ shift @posns };

        ## 'Move' the records up to the start of the next record to be deleted
        print $fhWrite scalar <$fhRead> while tell( $fhRead ) < $w2;

        ## Advance the read head over the section to be removed.
        seek $fhRead, $r2, 0;
    }

    ## Copy the remaining records over
    print $fhWrite $_ while <$fhRead>;

    ## Truncate the file to its new length via the write filehandle.
    # truncate $fhWrite, tell $fhWrite; ## truncate doesn't work on ramfiles!
    $bigfile = substr $bigfile, 0, tell $fhWrite;
}
close $fhRead;
close $fhWrite;

print $bigfile;

__END__
C:\test>junk
HMNPRS
L,R,J,Z,P,H,M,K,T,A,D,Q,B,Y,S,E,W,F,U,C,I,X,G,O,N,V
L,J,Z,K,T,A,D,Q,B,Y,E,W,F,U,C,I,X,G,O,V

C:\test>junk
ABGKOT
G,O,Q,C,U,H,P,I,R,M,S,J,V,W,X,Y,Z,D,E,A,B,L,N,T,K,F
Q,C,U,H,P,I,R,M,S,J,V,W,X,Y,Z,D,E,L,N,F

C:\test>junk
AKLPTZ
J,F,Q,S,E,X,I,B,U,K,R,A,M,W,Z,G,D,L,H,C,N,Y,O,V,T,P
J,F,Q,S,E,X,I,B,U,R,M,W,G,D,H,C,N,Y,O,V,

C:\test>junk
AJORYZ
L,J,E,M,X,T,A,Z,G,U,W,Y,Q,R,C,K,O,P,V,I,D,H,N,S,F,B
L,E,M,X,T,G,U,W,Q,C,K,P,V,I,D,H,N,S,F,B
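
On a real file on disk (rather than the ramfile used here to demonstrate), the final step is simply the truncate call that is commented out above; something like:

## Chop the data file to its new, shorter length
## via the write handle.
truncate $fhWrite, tell $fhWrite
    or die "truncate failed: $!";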

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^2: 15 billion row text file and row deletes - Best Practice?
by jynx (Priest) on Dec 01, 2006 at 22:44 UTC

    Would it be faster to do relative seeks? I'm just thinking that toward the end of the file you're seeking across 300+ GB of data every time you delete a single line. Would
    seek $fhRead, $r2 - $w2, 1;
    work?

    Just a thought...

    Edit: While it's obvious, I think it's also worth noting that if each record is the same size (which seems likely) then it can be done in one pass instead of two (or at least one pass per chunk of the kill file, depending on the kill file's size).
      Would it be faster to do relative seeks?

      It would, but I think almost all modern implementations of seek would automatically translate an absolute seek into a relative seek anyway. I don't know for sure, but I find it hard to believe that an absolute seek would ever cause the OS to 'go back to the beginning of the file'.

      In reality, seeking simply adjusts a number somewhere, and it's not until the next read or write that it has any effect at all.

      The third parameter to seek is simply a throwback to when files stored on tape were commonplace.

      With regard to the possibility that the file has fixed-length records: you're right, it would be much easier and faster if that were the case. But from the (scant) information provided by the OP, it seems unlikely. Whilst he shows the serial number padded with leading zeros, he also shows a name field, which is empty in the given example; names vary in length, so unless they are all empty, these are probably variable-length records.
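
      If the records did turn out to be fixed length, the first pass disappears entirely, because record numbers convert straight to byte offsets. A minimal sketch of that one-pass variant; the record length, filename and kill list below are hypothetical, purely for illustration:

      use strict;
      use warnings;

      my $RECLEN = 80;                       ## assumed fixed record length
      my @kills  = ( 6, 9, 14, 17, 24, 25 ); ## record numbers to drop; 0-based, sorted

      open my $fhRead,  '<',  'big.dat' or die $!;
      open my $fhWrite, '+<', 'big.dat' or die $!;

      ## Park the write head at the first hole; the read head skips over it.
      my $hole = shift( @kills ) * $RECLEN;
      seek $fhWrite, $hole, 0;
      seek $fhRead,  $hole + $RECLEN, 0;

      while( @kills ) {
          my $next = shift( @kills ) * $RECLEN;
          ## Copy the kept records between this hole and the next.
          while( tell( $fhRead ) < $next ) {
              read $fhRead, my $rec, $RECLEN or die $!;
              print $fhWrite $rec;
          }
          ## Hop the read head over the next record to be deleted.
          seek $fhRead, $next + $RECLEN, 0;
      }

      ## Copy whatever remains, then chop the file to its new length.
      while( read $fhRead, my $rec, $RECLEN ) {
          print $fhWrite $rec;
      }
      truncate $fhWrite, tell $fhWrite or die $!;
      close $fhRead;
      close $fhWrite;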


