Utrecht has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks! I have a question concerning the following:

I have 49 very long ASCII files (hundreds of thousands of lines each). For each of these 49 files, I want to copy the first 1024 lines to a new file (say temp files 1 through 49), use those temp files for some processing, then copy the second batch of 1024 lines from each file to 49 new temp files, and so on until I've reached the end of the long ASCII files (which all have exactly the same length).

I already have some code working, but the problem is that it is much too slow... For your understanding of the following code (the index subs were found on some man page): $span is a constant, @stations has length 49, $lower_bound and $upper_bound are each increased by 1024 ($nr_samples) on every pass through the while loop, and $length is the total number of lines in each of the 49 huge ASCII files.

As you can see, for every line (the loop from $lower_bound to $upper_bound) the files need to be opened and closed and both subs need to be called. There must be a faster way, I think, so some advice would be highly appreciated :)

while(){
    if($cnt % $nr_samples == 0){
        $lower_bound = $lower_bound + $nr_samples;
        $upper_bound = $upper_bound + $nr_samples;
    }
    $cnt++;
    last if $upper_bound > ($length - 1);

    foreach my $station (@stations){
        open OUT, ">$span.$station.alpha.sac.data";
        print OUT "2 $nr_samples\n";

        for my $seeking ($lower_bound .. $upper_bound){
            my $eval_file_2 = $suffix
                ? sprintf "%s%s_%s_%s", $prefix, $suffix, $station, $span
                : sprintf "%s_%s_%s",   $prefix, $station, $span;

            open(FILE, "< $eval_file_2")
                or die "Can't open $eval_file_2 for reading: $!\n";
            open(INDEX, "+>$eval_file_2.idx")
                or die "Can't open $eval_file_2.idx for read/write: $!\n";

            build_index(*FILE, *INDEX);
            my $line = line_with_index(*FILE, *INDEX, $seeking);
            close FILE;
            close INDEX;

            chomp $line;
            my($time, $value) = split(/\s+/, $line);
            printf OUT "%.3f %.10f\n", $time, $value;
        }
        close OUT;
    }
}
The subroutines:
sub build_index {
    my $data_file  = shift;
    my $index_file = shift;
    my $offset     = 0;

    while (<$data_file>) {
        print $index_file pack("N", $offset);
        $offset = tell($data_file);
    }
}

sub line_with_index {
    my $data_file   = shift;
    my $index_file  = shift;
    my $line_number = shift;

    my $size;      # size of an index entry
    my $i_offset;  # offset into the index of the entry
    my $entry;     # index entry
    my $d_offset;  # offset into the data file

    $size     = length(pack("N", 0));
    $i_offset = $size * ($line_number - 1);
    seek($index_file, $i_offset, 0) or return;
    read($index_file, $entry, $size);
    $d_offset = unpack("N", $entry);
    seek($data_file, $d_offset, 0);
    return scalar(<$data_file>);
}

Re: Accessing files at certain line number
by ikegami (Patriarch) on Sep 21, 2009 at 14:05 UTC

    There's already a tool for that: split

    But you're right that the slowdown comes from constantly opening files.

    open(my $in_fh, '<', $in_qfn)
        or die("Can't open input file \"$in_qfn\": $!\n");

    my $out_fh;
    my $file_num = 0;
    while (<$in_fh>) {
        if ($. % $max_file_size == 1) {
            undef $out_fh;
        }

        if (!defined($out_fh)) {
            my $out_qfn = sprintf('part%04d', $file_num++);
            open($out_fh, '>', $out_qfn)
                or die("Can't create output file \"$out_qfn\": $!\n");
        }

        print $out_fh $_;
    }
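
    For the 49-files-in-lockstep case described in the question, the same idea (open everything once, read sequentially, never reopen) might look roughly like this. This is an untested sketch; the input file names, station list and the process_batch() call are placeholders, not code from this thread:

    use strict;
    use warnings;

    my $nr_samples = 1024;
    my @stations   = (1 .. 49);               # placeholder station list

    # Open every input file exactly once, up front.
    my %in_fh;
    for my $station (@stations) {
        my $in_qfn = "input_$station.txt";    # placeholder file name
        open($in_fh{$station}, '<', $in_qfn)
            or die("Can't open input file \"$in_qfn\": $!\n");
    }

    my $batch = 0;
    # All files have exactly the same length, so checking one for EOF is enough.
    until (eof($in_fh{$stations[0]})) {
        for my $station (@stations) {
            my $out_qfn = sprintf('batch%04d.station%02d', $batch, $station);
            open(my $out_fh, '>', $out_qfn)
                or die("Can't create output file \"$out_qfn\": $!\n");
            for (1 .. $nr_samples) {
                my $line = readline($in_fh{$station});
                last unless defined $line;
                print $out_fh $line;
            }
            close $out_fh;
        }
        # process_batch($batch);   # placeholder for the per-batch processing step
        $batch++;
    }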
Re: Accessing files at certain line number
by Corion (Patriarch) on Sep 21, 2009 at 14:08 UTC

    I see two possibilities for optimizing the speed of your program by reducing the number of file accesses it makes:

    1. Load your index into memory instead of reading it from disk every time. You do one seek call and one read call on your index file per line read - you can reduce that number to one read and one unpack overall, at the price of some memory. Also, you don't really need to store the offsets of each line but only the offsets of each 1024th line, or, if your program advances through the file anyway, only the offset after which you want to continue.
    2. Instead of seeking in your data file for every line, just seek once to your start point and then read the 1024 lines from that point. This will save you another call to seek for every line read. (A rough sketch combining both points follows after this list.)
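
    A rough illustration of both points, assuming an index file in the pack("N", ...) format produced by build_index above (an untested sketch; the file names and the $lower_bound value are made up for the example):

    use strict;
    use warnings;

    my $data_qfn    = 'station_01.dat';    # placeholder file names
    my $index_qfn   = "$data_qfn.idx";
    my $nr_samples  = 1024;
    my $lower_bound = 0;                   # 0-based line number where the batch starts

    open(my $data_fh,  '<', $data_qfn)  or die "Can't open $data_qfn: $!\n";
    open(my $index_fh, '<', $index_qfn) or die "Can't open $index_qfn: $!\n";

    # Point 1: read the whole index into memory once and unpack it in one go.
    my $index_data = do { local $/; <$index_fh> };
    my @offsets    = unpack("N*", $index_data);
    close $index_fh;

    # Point 2: seek once to the start of the batch, then read the 1024 lines sequentially.
    seek($data_fh, $offsets[$lower_bound], 0) or die "seek failed: $!";
    for (1 .. $nr_samples) {
        my $line = <$data_fh>;
        last unless defined $line;
        # ... process $line ...
    }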

    In addition to these two points, you might want to consider whether you actually need exactly 1024 lines per batch, or whether "roughly" 1024 lines per batch is acceptable. In that case you can simply read the first (say) 10_000 lines and use their average length to split the file into batches of roughly 1024 lines. Whenever a batch boundary ends up in the middle of a line, you move it towards the beginning of the file, and likewise for the end position of the batch. This saves you from reading through the lines just to count them, but it may or may not be an overall speed gain, since you will need to read the whole file line by line at least once anyway. A rough sketch of this idea follows below.
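
    One way such an estimate could look (an untested sketch, not code from this reply; for simplicity it snaps each boundary forward to the next full line by discarding the partial line after the seek, rather than moving backwards):

    use strict;
    use warnings;

    my $data_qfn   = 'station_01.dat';    # placeholder file name
    my $nr_samples = 1024;
    my $sample_n   = 10_000;              # lines used to estimate the average length

    open(my $fh, '<', $data_qfn) or die "Can't open $data_qfn: $!\n";

    # Estimate the average line length from the first 10_000 lines.
    my ($bytes, $lines) = (0, 0);
    while (defined(my $line = <$fh>)) {
        $bytes += length $line;
        last if ++$lines >= $sample_n;
    }
    my $batch_bytes = int($bytes / $lines * $nr_samples);   # approximate bytes per batch

    # Jump to the estimated start of batch $n, snap to the next full line,
    # and read lines until we cross the estimated start of batch $n + 1.
    my $n = 3;                                               # example batch number
    seek($fh, $n * $batch_bytes, 0) or die "seek failed: $!";
    <$fh> if $n > 0;                 # discard the partial line we landed in
    my $end = ($n + 1) * $batch_bytes;
    while (tell($fh) <= $end and defined(my $line = <$fh>)) {
        # ... process $line ...
    }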

Re: Accessing files at certain line number
by Fletch (Bishop) on Sep 21, 2009 at 14:10 UTC

    Presuming some variant of *NIX and a shell that groks {00..49} notation (zsh, and I believe bash):

    mkdir out
    for i in infiles{01..49} ; do
        ( cd out ; split -l 1024 -a 5 ../$i ${i:s,infiles,,} )
    done
    for i in $(perl -le 'print for "aaaaa".."aaazz"') ; do
        cat out/{01..49}$i > segment_$i
    done
    rm -rf ./out
    for i in segment_* ; do
        processing $i
    done

    Modifying to automagically determine the maximum split partition name left as an exercise for the reader.

      {01..49}{0,1,2,3,4}{0,1,2,3,4,5,6,7,8,9}