
Unless you have a second disk on which to write the output file, I wouldn't use two files.

Besides the huge intermediate storage requirement, which can be a problem for smaller systems, writing to a disk (and having the OS locate space to write to) at the same time as you are reading from that disk can severely slow down the process, especially if the disk is more than, say, 50% full and/or fragmented.

I'd use a two pass process.

Pass 1 reads the file and records the start and end positions of each record to be deleted.
If the kill list is small, load it all into a hash.
If it's too big for memory, load the kill list into a hash in chunks, scanning the file once per chunk, and write the positions to another file.
The second pass only needs the positions one at a time (once they are sorted).

Pass 2 sorts the positions and then opens two filehandles to the data file: one for reading, one for writing. It then reads via the former and writes via the latter, overwriting the bits that need deleting. Finally, it truncates the file (via the write handle) to its final length.
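
For the chunked variant of pass 1, something along these lines might do. It is only a sketch under assumptions that are mine, not from the thread: records and kill keys are "\n"-terminated, one per line, the files are read as raw bytes, and the names record_positions and $chunkSize are made up for illustration. Note that each chunk of the kill list costs one full scan of the data file.

```perl
#! perl
use strict;
use warnings;
use File::Temp qw[ tempdir ];

## Hypothetical helper (not from the thread): for each chunk of the
## kill list, scan the data file once and append the "start end" byte
## offsets of every matching record to the positions file.
sub record_positions {
    my( $dataPath, $killPath, $posPath, $chunkSize ) = @_;
    $chunkSize >= 1 or die "chunk size must be at least 1";

    open my $fhKill, '<:raw', $killPath or die "$killPath: $!";
    open my $fhPos,  '>:raw', $posPath  or die "$posPath: $!";

    until( eof( $fhKill ) ) {
        ## Load the next chunk of the kill list into a hash
        my %kills;
        while( keys( %kills ) < $chunkSize ) {
            my $key = <$fhKill>;
            last unless defined $key;
            chomp $key;
            $kills{ $key } = undef;
        }

        ## One full scan of the data file per chunk
        open my $fhRead, '<:raw', $dataPath or die "$dataPath: $!";
        my $lastPos = 0;
        while( my $rec = <$fhRead> ) {
            chomp $rec;
            my $pos = tell( $fhRead );
            print $fhPos "$lastPos $pos\n" if exists $kills{ $rec };
            $lastPos = $pos;
        }
        close $fhRead;
    }
    close $fhKill;
    close $fhPos or die "$posPath: $!";
}

## Demo on throwaway files: five records, two kills, chunks of one key
my $dir  = tempdir( CLEANUP => 1 );
my %demo = ( data => "L\nJ\nE\nM\nA\n", kill => "A\nJ\n" );
for my $name ( keys %demo ) {
    open my $fh, '>:raw', "$dir/$name" or die "$dir/$name: $!";
    print $fh $demo{ $name };
    close $fh or die "$dir/$name: $!";
}
record_positions( "$dir/data", "$dir/kill", "$dir/pos", 1 );

## The positions come out in chunk order, not file order,
## which is why pass 2 has to sort them first.
open my $fhIn, '<:raw', "$dir/pos" or die "$dir/pos: $!";
print <$fhIn>;
```

The positions file is plain text so that, if it too outgrows memory, it can be sorted with an external sort before pass 2.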

This is a little tricky to envisage, so a diagram might help.

The records to be deleted: A,J,O,R,Y,Z

The initial state of the bigfile: L,J,E,M,X,T,A,Z,G,U,W,Y,Q,R,C,K,O,P,V,I,D,H,N,S,F,B

The big file without the kill list: L, E,M,X,T, G,U,W, Q, C,K, P,V,I,D,H,N,S,F,B

The desired result after 'compaction': L,E,M,X,T,G,U,W,Q,C,K,P,V,I,D,H,N,S,F,B

The following code uses a ramfile with single-character records and ',' as the record separator to demonstrate the logic required:

#! perl -slw
use strict;
use Data::Dumper;
use List::Util qw[ shuffle ];

## Dummy up a hash of lines to kill.
my %kills = map{ $_ => undef } ( shuffle( 'A' .. 'Z' ) )[ 0 .. 5 ];
print sort keys %kills;

## And a 'big file'
my $bigfile = join ',',  shuffle 'A' .. 'Z';
print $bigfile;

## Scan the bigfile recording the file positions
## where records are to be deleted.
open my $fhRead, '<', \$bigfile;
{
    local $/ = ',';
    my $lastPos = 0;
    while( <$fhRead> ) {
        chomp;
        ## Store the ranges (start & end) of each record to delete
        $kills{ $_ } = [ $lastPos, tell $fhRead ] if exists $kills{ $_ };
        $lastPos = tell $fhRead;
    }
}

## Sort the ranges into ascending order
my @posns = sort{ $a->[ 0 ] <=> $b->[ 0 ] } values %kills;

## Open a second write handle to the file
open my $fhWrite, '+<', \$bigfile;
{
    local $/ = ','; local $\;

    ## Move the file pointers for reading and writing
    ## to the end and start positions respectively
    my( $w1, $r1 ) = @{ shift @posns };
    seek $fhWrite, $w1, 0;
    seek $fhRead, $r1, 0;

    while( @posns ) {
        ## Get the next set of positions
        my( $w2, $r2 ) = @{ shift @posns };

        ## 'Move' the records up to the start of the next record to be deleted
        print $fhWrite scalar <$fhRead> while tell( $fhRead ) < $w2;

        ## Advance the read head over the section to be removed.
        seek $fhRead, $r2, 0;
    }
    ## Copy the remaining records over
    print $fhWrite $_ while <$fhRead>;

    ## truncate the write filehandle.
    # truncate $fhWrite, tell $fhWrite;

    ## truncate doesn't work on ramfiles!
    $bigfile = substr $bigfile, 0, tell $fhWrite;
}
close $fhRead;
close $fhWrite;

print $bigfile;

__END__
C:\test>junk
HMNPRS
L,R,J,Z,P,H,M,K,T,A,D,Q,B,Y,S,E,W,F,U,C,I,X,G,O,N,V
L,J,Z,K,T,A,D,Q,B,Y,E,W,F,U,C,I,X,G,O,V

C:\test>junk
ABGKOT
G,O,Q,C,U,H,P,I,R,M,S,J,V,W,X,Y,Z,D,E,A,B,L,N,T,K,F
Q,C,U,H,P,I,R,M,S,J,V,W,X,Y,Z,D,E,L,N,F

C:\test>junk
AKLPTZ
J,F,Q,S,E,X,I,B,U,K,R,A,M,W,Z,G,D,L,H,C,N,Y,O,V,T,P
J,F,Q,S,E,X,I,B,U,R,M,W,G,D,H,C,N,Y,O,V,

C:\test>junk
AJORYZ
L,J,E,M,X,T,A,Z,G,U,W,Y,Q,R,C,K,O,P,V,I,D,H,N,S,F,B
L,E,M,X,T,G,U,W,Q,C,K,P,V,I,D,H,N,S,F,B
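
Since truncate doesn't work on ramfiles, the demo above has to fake the last step with substr. On a real disk file the second pass might look roughly like this. Treat it as a sketch: compact_file is a made-up name, the ranges are assumed not to overlap, and it copies the surviving bytes in blocks with read rather than record by record as the demo does.

```perl
#! perl
use strict;
use warnings;
use File::Temp qw[ tempfile ];

## Hypothetical helper (not from the thread): remove the byte ranges in
## @$ranges ( [ start, end ] pairs, end exclusive, assumed not to
## overlap) from the file at $path, in place.
sub compact_file {
    my( $path, $ranges ) = @_;
    my @posns = sort{ $a->[ 0 ] <=> $b->[ 0 ] } @$ranges;
    return unless @posns;

    open my $fhRead,  '<:raw',  $path or die "$path: $!";
    open my $fhWrite, '+<:raw', $path or die "$path: $!";

    ## Writing starts where the first deleted record began
    seek $fhWrite, $posns[ 0 ][ 0 ], 0;

    while( my $range = shift @posns ) {
        ## Advance the read head over the record to be removed
        seek $fhRead, $range->[ 1 ], 0;

        ## Copy up to the next deleted record, or to the end of the file
        my $stop = @posns ? $posns[ 0 ][ 0 ] : -s $fhRead;
        while( ( my $left = $stop - tell( $fhRead ) ) > 0 ) {
            my $want = $left < 65536 ? $left : 65536;
            my $got  = read $fhRead, my $buf, $want;
            die "$path: $!" unless defined $got;
            last unless $got;
            print $fhWrite $buf;
        }
    }

    ## Cut the file off just after the last byte written
    truncate( $fhWrite, tell( $fhWrite ) ) or die "$path: $!";
    close $fhWrite or die "$path: $!";
    close $fhRead;
}

## Demo: the file from the diagram, with the byte ranges of A,J,O,R,Y,Z
my( $fh, $path ) = tempfile( UNLINK => 1 );
binmode $fh;
print $fh join ',', qw[ L J E M X T A Z G U W Y Q R C K O P V I D H N S F B ];
close $fh or die "$path: $!";

compact_file( $path, [ [12,14], [2,4], [32,34], [26,28], [22,24], [14,16] ] );

open my $fhCheck, '<:raw', $path or die "$path: $!";
print <$fhCheck>, "\n";   ## L,E,M,X,T,G,U,W,Q,C,K,P,V,I,D,H,N,S,F,B
```

The read handle always stays at or ahead of the write handle, so nothing is overwritten before it has been read.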

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re: 15 billion row text file and row deletes - Best Practice? by BrowserUk
in thread 15 billion row text file and row deletes - Best Practice? by awohld
