Unless you have a second disk on which to write the output file, I wouldn't use two files.

Besides the huge intermediate storage requirement, which can be a problem for smaller systems, writing to a disk (and having the OS locate space to write to) at the same time as you are reading from that disk can severely slow down the process, especially if the disk is more than, say, 50% full and/or fragmented.

I'd use a two pass process.

  1. Pass 1 reads the file and records the start and end positions of each record to be deleted.

    If the kill list is small, load it all in a hash.

    If it's too big for memory, load the kill list into a hash in chunks and write the positions to another file.

    The second pass only needs the positions one at a time (once they are sorted).

  2. The second pass sorts the positions and then opens two filehandles to the data file, one for reading and one for writing. It then reads via the former and writes via the latter, overwriting the parts that need deleting. Finally, it truncates the file (via the write handle) to its final length.
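For the case where the kill list is too big for memory, pass 1 might look something like the sketch below. It is only an illustration: the name record_positions, its arguments and the "start end" line format of the positions file are all made up here, and it assumes newline-terminated records where a kill-list line matches a whole record. Note that it rescans the data file once per chunk of the kill list.

```perl
use strict;
use warnings;

## Hypothetical helper (not from the original post): for each chunk of the
## kill list, scan the data file and append "start end" byte offsets of the
## matching records to a positions file.
## Assumes newline-terminated records and whole-record matches.
sub record_positions {
    my( $data_path, $kill_path, $pos_path, $chunk_size ) = @_;

    open my $fhKill, '<', $kill_path or die "$kill_path: $!";
    open my $fhPos,  '>', $pos_path  or die "$pos_path: $!";

    while( 1 ) {
        ## Load the next chunk of the kill list into a hash
        my %kills;
        while( keys( %kills ) < $chunk_size
                and defined( my $key = <$fhKill> ) ) {
            chomp $key;
            $kills{ $key } = undef;
        }
        last unless %kills;

        ## Scan the data file, noting where each doomed record starts and ends
        open my $fhRead, '<', $data_path or die "$data_path: $!";
        my $lastPos = 0;
        while( my $rec = <$fhRead> ) {
            my $end = tell $fhRead;
            chomp $rec;
            print {$fhPos} "$lastPos $end\n" if exists $kills{ $rec };
            $lastPos = $end;
        }
        close $fhRead;
    }
    close $fhPos or die "$pos_path: $!";
    return;
}
```

The positions file can then be sorted numerically on its first field (externally if need be), so that the second pass only has to read one range at a time.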

This is a little tricky to envisage, so a diagram might help.

The records to be deleted:             A,J,O,R,Y,Z

The initial state of the bigfile:      L,J,E,M,X,T,A,Z,G,U,W,Y,Q,R,C,K,O,P,V,I,D,H,N,S,F,B

The big file without the kill list:    L,  E,M,X,T,    G,U,W,  Q,  C,K,  P,V,I,D,H,N,S,F,B

The desired result after 'compaction': L,E,M,X,T,G,U,W,Q,C,K,P,V,I,D,H,N,S,F,B

The following code uses a ramfile (an in-memory file) with single-character records and ',' as the record separator to demonstrate the logic required:

#! perl -slw
use strict;
use Data::Dumper;
use List::Util qw[ shuffle ];

## Dummy up a hash of lines to kill.
my %kills = map{ $_ => undef } ( shuffle( 'A' .. 'Z' ) )[ 0 .. 5 ];
print sort keys %kills;

## And a 'big file'
my $bigfile = join ',', shuffle 'A' .. 'Z';
print $bigfile;

## Scan the bigfile recording the file positions
## where records are to be deleted.
open my $fhRead, '<', \$bigfile;
{
    local $/ = ',';
    my $lastPos = 0;
    while( <$fhRead> ) {
        chomp;
        ## Store the ranges (start & end) of each record to delete
        $kills{ $_ } = [ $lastPos, tell $fhRead ] if exists $kills{ $_ };
        $lastPos = tell $fhRead;
    }
}

## Sort the ranges into ascending order
my @posns = sort{ $a->[ 0 ] <=> $b->[ 0 ] } values %kills;

## Open a second write handle to the file
open my $fhWrite, '+<', \$bigfile;
{
    local $/ = ',';
    local $\;

    ## Move the file pointers for reading and writing
    ## to the end and start positions respectively
    my( $w1, $r1 ) = @{ shift @posns };
    seek $fhWrite, $w1, 0;
    seek $fhRead, $r1, 0;

    while( @posns ) {
        ## Get the next set of positions
        my( $w2, $r2 ) = @{ shift @posns };

        ## 'Move' the records up to the start of the next record to be deleted
        print $fhWrite scalar <$fhRead> while tell( $fhRead ) < $w2;

        ## Advance the read head over the section to be removed.
        seek $fhRead, $r2, 0;
    }

    ## copy the residual remaining over
    print $fhWrite $_ while <$fhRead>;

    ## truncate the write filehandle.
    # truncate $fhWrite, tell $fhWrite; ## truncate doesn't work on ramfiles!
    $bigfile = substr $bigfile, 0, tell $fhWrite;
}

close $fhRead;
close $fhWrite;

print $bigfile;

__END__
C:\test>junk
HMNPRS
L,R,J,Z,P,H,M,K,T,A,D,Q,B,Y,S,E,W,F,U,C,I,X,G,O,N,V
L,J,Z,K,T,A,D,Q,B,Y,E,W,F,U,C,I,X,G,O,V

C:\test>junk
ABGKOT
G,O,Q,C,U,H,P,I,R,M,S,J,V,W,X,Y,Z,D,E,A,B,L,N,T,K,F
Q,C,U,H,P,I,R,M,S,J,V,W,X,Y,Z,D,E,L,N,F

C:\test>junk
AKLPTZ
J,F,Q,S,E,X,I,B,U,K,R,A,M,W,Z,G,D,L,H,C,N,Y,O,V,T,P
J,F,Q,S,E,X,I,B,U,R,M,W,G,D,H,C,N,Y,O,V,

C:\test>junk
AJORYZ
L,J,E,M,X,T,A,Z,G,U,W,Y,Q,R,C,K,O,P,V,I,D,H,N,S,F,B
L,E,M,X,T,G,U,W,Q,C,K,P,V,I,D,H,N,S,F,B
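On a real disk file, truncate can be used directly. The following is a rough sketch of how the second pass might look there, not a tested production routine: compact_file and its arguments are hypothetical names, the [ start, end ] byte ranges are assumed not to overlap, and it copies the surviving stretches in fixed-size blocks with read rather than record by record.

```perl
use strict;
use warnings;

## Hypothetical helper (not from the original post): compact a disk file
## in place. $posns is a reference to an array of [ start, end ] byte
## ranges to delete, assumed non-overlapping.
sub compact_file {
    my( $data_path, $posns ) = @_;

    ## Sort the ranges into ascending order
    my @posns = sort{ $a->[ 0 ] <=> $b->[ 0 ] } @$posns;
    return unless @posns;

    ## Two handles to the same file: one for reading, one for writing
    open my $fhRead,  '<',  $data_path or die "$data_path: $!";
    open my $fhWrite, '+<', $data_path or die "$data_path: $!";
    binmode $_ for $fhRead, $fhWrite;
    local $\;

    my $size = -s $fhRead;

    ## Write pointer to the start of the first range, read pointer to its end
    my( $w, $r ) = @{ shift @posns };
    seek $fhWrite, $w, 0 or die "seek: $!";

    while( 1 ) {
        ## Copy everything up to the next range to delete (or end of file)
        my $stop = @posns ? $posns[ 0 ][ 0 ] : $size;
        seek $fhRead, $r, 0 or die "seek: $!";
        my $left = $stop - $r;
        while( $left > 0 ) {
            my $want = $left < 65536 ? $left : 65536;
            my $got  = read $fhRead, my $buf, $want;
            die "read: $!" unless $got;
            print {$fhWrite} $buf or die "write: $!";
            $left -= $got;
        }
        last unless @posns;

        ## Advance the read position over the section to be removed
        $r = ( shift @posns )->[ 1 ];
    }

    ## Chop the file off at the final write position
    my $len = tell $fhWrite;
    truncate $fhWrite, $len or die "truncate: $!";
    close $fhRead;
    close $fhWrite or die "close: $!";
    return $len;
}
```

Because the write position never overtakes the read position, every byte is read before the slot it occupies is overwritten.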

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re: 15 billion row text file and row deletes - Best Practice? by BrowserUk
in thread 15 billion row text file and row deletes - Best Practice? by awohld
