reds has asked for the wisdom of the Perl Monks concerning the following question:

Greetings. I would describe myself as an intermediate Perl coder, although I am sure that is subject to much interpretation. Anyhow, my question is one of optimization.

The scenario: I have quite a few (up to around 100) large text files (400K-800K each). Each contains time-series data organized by row. The job is to find matching "types" of lines across the files, calculate averages of their data columns, and write the results to an output file.

The process that I have in place now goes like this:
  1. Bring all of the files into memory as arrays within one central organizing array.
    i.e. $results[$rep] = [<RESULTS>];
  2. Look at line 0 of file 0 and find entries in each of the other files that match the time step and type columns.
  3. When you have the list of lines, calculate the average of their data columns and output these results to a file
  4. Remove these lines from their arrays (splice)
  5. Move onto line 1 of file 0
  6. Lather, rinse & repeat

This only works because I haven't been testing with very many of these text files (I've only been bringing about 6 into memory at a time - and even this is quite arduous).

What I would like to have is some sort of streaming file reader. That way I could stream through each of the 100 files in parallel, calculating averages as I go. This would all but eliminate the time it takes to drag these files into memory to be worked on, as well as reduce the memory footprint of the process.

The question is, how can I stream through many files in parallel when their structures aren't exactly parallel (some files have different time step sequences, or different "type" lines within a given time step)?

Also, as an aside, is the Benchmark module the standard tool to time code?

Re: Tabulating Data Across Multiple Large Files
by Thelonius (Priest) on Mar 22, 2003 at 22:04 UTC
    The general idea of "cosequential processing" is that the files are sorted on the key or keys you are matching on. That way you only need to keep one record per file in memory at a time, although you may have to rewind if you are doing a many-to-many join. More details of what you are doing now would be helpful.

    One method that might be useful (hard to tell from your description) is to merge all of the files into one (using sort on Unix or sort -m if they are already sorted). Then all the keys that match will be together in the file and it is simple to write a program to process them. You may need to pre-process the files to make them suitable for this method.

    Updated: Yes, now that I see your data, it looks like this method would be appropriate. You can run sort -t, -k 1n,2 -k 4,5 file1 file2 file3 ... > sorted. Then it's a simple matter to process the sorted file:

    #!perl -w
    use strict;

    my @sum     = ();
    my $prevkey = "";

    while (<>) {
        chomp;
        my @data = split /,/, $_;
        next if $data[0] == 1;    # skip headers
        my $key = join(",", @data[0, 1, 3, 4]);
        if ($key eq $prevkey) {
            for (0 .. $#data - 5) { $sum[$_] += $data[$_ + 5] }
        }
        else {
            dumpsums();
            $prevkey = $key;
            @sum     = @data[5 .. $#data];
        }
    }
    dumpsums();

    sub dumpsums {
        if ($prevkey) {
            print "$prevkey,", join(",", @sum), "\n";
        }
    }
Re: Tabulating Data Across Multiple Large Files
by BrowserUk (Patriarch) on Mar 22, 2003 at 23:17 UTC

    Rather than loading a hundred files into memory or even trying to process 100 files in lockstep, you can use the first file to create a hash using the key fields TIME, CASE, FUSION & TYPE as the primary key, and two secondary keys:

    1. Totals, which is an array of accumulated totals of the numbers
    2. Count, a count of the number of fields accumulated

    and then process each of the files one at a time, matching each record against the hash by primary key, accumulating the totals and counting the records contributing to them.

    When you've totalled them all, use a final loop to calculate the averages and output them to the report.

    This should keep your memory requirements and run times predictable and reasonable, as only one file will be in memory at any one time.

    How you handle records in the subsequent files that were not in the first will depend on your needs, but you can either accumulate them or discard them as your requirements dictate.

    #! perl -slw
    use strict;
    use Inline::Files;
    use Data::Dumper;

    use constant TIME     => 0;
    use constant CASE     => 1;
    use constant ITER     => 2;
    use constant FUSION   => 3;
    use constant TYPE     => 4;
    use constant TANKS    => 5;
    use constant AFVS     => 6;
    use constant ADAS     => 7;
    use constant IFVS     => 8;
    use constant UAVS     => 9;
    use constant UNKNOWN  => 10;
    use constant TOTAL    => 11;
    use constant LATENCY  => 12;
    use constant DECOYS   => 13;
    use constant FALSENEG => 14;
    use constant FALSEPOS => 15;

    my %data;

    <FILE_0>;    # discard the header line
    while (<FILE_0>) {
        local $" = ',';    #"
        my @fields = split ',';
        $data{"@fields[TIME,CASE,FUSION,TYPE]"}{Totals} = [ @fields[TANKS .. FALSEPOS] ];
        $data{"@fields[TIME,CASE,FUSION,TYPE]"}{Count}  = 1;
    }

    <FILE_1>;    # discard the header line
    while (<FILE_1>) {
        local $" = ',';    #"
        my @fields = split ',';
        if (exists $data{"@fields[TIME,CASE,FUSION,TYPE]"}) {
            $data{"@fields[TIME,CASE,FUSION,TYPE]"}{Totals}[$_ - TANKS] += $fields[$_] || 0
                for TANKS .. FALSEPOS;
            $data{"@fields[TIME,CASE,FUSION,TYPE]"}{Count}++;
        }
        # else create a new record if that is the requirement.
    }

    print "Time,Case,Fusion,Type, Tanks,AFVs,ADAs,IFVs,UAVS,Unknown,Total, Latency, Decoys ,FalseNeg, FalsePos\n";

    for my $key (keys %data) {
        printf '%35s:' . ('%5.2f ' x 11) . $/,
            $key,
            map { $data{$key}{Totals}[$_ - TANKS] / $data{$key}{Count} } TANKS .. FALSEPOS;
    }

    __FILE_0__
    1,Case,Iter,Fusion,Type,Tanks,AFVs,ADAs,IFVs,UAVS,Unknown,Total, Latency, Decoys ,FalseNeg, FalsePos
    32,A2,1,UE_Battle_Bde,TRUTH,6,3,7,8,5,8,7,4,0
    32,A2,1,UE_Battle_Bde,PERCEIVED,3,4,2,3,1,2,8,4,4,9,0
    32,A2,1,UE_Battle_Bde,FREQUENCIES,7,3,7,6,0
    32,A2,1,UA1,TRUTH,0,8,6,2,3,2,1,2,0
    32,A2,1,UA1,PERCEIVED,2,6,5,9,2,1,1,4,7,8,0
    32,A2,1,UA1,FREQUENCIES,6,4,1,1,0
    35,A2,1,UE_Battle_Bde,TRUTH,8,7,3,9,6,1,9,3,0
    35,A2,1,UE_Battle_Bde,PERCEIVED,2,9,6,8,7,2,5,2,2,8,0
    35,A2,1,UE_Battle_Bde,FREQUENCIES,0,3,4,0,0
    __FILE_1__
    1,Case,Iter,Fusion,Type,Tanks,AFVs,ADAs,IFVs,UAVS,Unknown,Total, Latency, Decoys ,FalseNeg, FalsePos
    32,A2,1,UE_Battle_Bde,TRUTH,2,5,1,8,4,8,0,1,0
    32,A2,1,UE_Battle_Bde,PERCEIVED,0,9,6,8,1,7,3,6,9,7,0
    32,A2,1,UE_Battle_Bde,FREQUENCIES,1,0,8,4,0
    32,A2,1,UA1,TRUTH,8,2,3,2,1,4,8,3,0
    32,A2,1,UA1,PERCEIVED,3,9,1,6,7,3,4,2,6,0,0
    32,A2,1,UA1,FREQUENCIES,6,3,2,6,0
    35,A2,1,UE_Battle_Bde,TRUTH,1,1,2,6,5,0,7,3,0
    35,A2,1,UE_Battle_Bde,PERCEIVED,5,6,2,0,3,2,7,6,5,6,0
    35,A2,1,UE_Battle_Bde,FREQUENCIES,6,7,1,2,0
    __OUTPUT__
    C:\test>245214
    Time,Case,Fusion,Type, Tanks,AFVs,ADAs,IFVs,UAVS,Unknown,Total, Latency, Decoys ,FalseNeg, FalsePos
    32,A2,UE_Battle_Bde,PERCEIVED:    1.50  6.50  4.00  5.50  1.00  4.50  5.50  5.00  6.50  8.00  0.00
    32,A2,UE_Battle_Bde,TRUTH:        4.00  4.00  4.00  8.00  4.50  8.00  3.50  2.50  0.00  0.00  0.00
    32,A2,UA1,FREQUENCIES:            6.00  3.50  1.50  3.50  0.00  0.00  0.00  0.00  0.00  0.00  0.00
    35,A2,UE_Battle_Bde,PERCEIVED:    3.50  7.50  4.00  4.00  5.00  2.00  6.00  4.00  3.50  7.00  0.00
    35,A2,UE_Battle_Bde,FREQUENCIES:  3.00  5.00  2.50  1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
    32,A2,UA1,PERCEIVED:              2.50  7.50  3.00  7.50  4.50  2.00  2.50  3.00  6.50  4.00  0.00
    32,A2,UA1,TRUTH:                  4.00  5.00  4.50  2.00  2.00  3.00  4.50  2.50  0.00  0.00  0.00
    35,A2,UE_Battle_Bde,TRUTH:        4.50  4.00  2.50  7.50  5.50  0.50  8.00  3.00  0.00  0.00  0.00
    32,A2,UE_Battle_Bde,FREQUENCIES:  4.00  1.50  7.50  5.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
    C:\test>

    Examine what is said, not who speaks.
    1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
    2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
    3) Any sufficiently advanced technology is indistinguishable from magic.
    Arthur C. Clarke.
Re: Tabulating Data Across Multiple Large Files
by Zaxo (Archbishop) on Mar 22, 2003 at 22:47 UTC

    I suspect you would improve performance by reading the data one line at a time from each file. Your strategy of collecting all the data at once runs the risk of driving the machine into swap, with the attendant time consumption of I/O with the swap file system.

    You can produce an array of open file handles, one for each data source, and place the first line of each in another array. You may need to adjust the number of open file handles allowed to you for this. Then, keeping track of which index provided each line, sort by timestamp. Extract data from the soonest line, run it through its processing to update the statistics you want, and replace that line from the corresponding file handle. Sort again in some way (an insertion sort may be called for) and, as you say, lather, rinse and repeat.
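
    Roughly, and untested, assuming the timestamp is the first comma-separated field and that a hypothetical process_line() does whatever statistics you need, the merge loop might look like this:

        #!/usr/bin/perl -w
        use strict;

        my @files = @ARGV;
        my (@fh, @current);

        # open every file and prime @current with its first line
        for my $i (0 .. $#files) {
            open $fh[$i], '<', $files[$i] or die "$files[$i]: $!";
            $current[$i] = readline $fh[$i];
        }

        while (grep { defined } @current) {
            # index of the line with the smallest (soonest) timestamp
            my ($soonest) =
                sort { (split /,/, $current[$a])[0] <=> (split /,/, $current[$b])[0] }
                grep { defined $current[$_] } 0 .. $#current;

            process_line($current[$soonest]);            # update the statistics
            $current[$soonest] = readline $fh[$soonest]; # refill from that file
        }

        sub process_line {
            my ($line) = @_;
            # accumulate totals and counts keyed on the matching columns here
        }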

    You should give a more specific description of your data and the statistics you want. With that we could give better advice on the streaming you want. I see that you have done that, but I'm sorry to admit that I still don't understand how you need to parse times out of that.

    After Compline,
    Zaxo

Re: Tabulating Data Across Multiple Large Files
by jdporter (Paladin) on Mar 22, 2003 at 23:13 UTC
    I'm still unclear on some things.
    1. Is there actually anything special about the timestep field?
    2. Does it actually make any difference which rows are in which files?
    In my solution following, I am assuming the answer to both questions is "no".
    my @fields = qw(
        1 Case Iter Fusion Type
        Tanks AFVs ADAs IFVs UAVS Unknown Total Latency Decoys FalseNeg FalsePos
    );
    my @key_fields  = @fields[0,1,3,4];
    my @data_fields = @fields[2,5..15];

    my %n; # key="@key_field_vals"; val = count
    my %r; # key="@key_field_vals"; val = hashref: key=field, val=sum (and later, average)

    while (<>) # read all files, in sequential order (not in parallel)
    {
        chomp;
        my %rec;
        @rec{@fields} = split /,/;
        my $key = join ",", @rec{@key_fields};
        $n{$key}++;
        for my $f ( @data_fields )
        {
            $r{$key}{$f} += $rec{$f};
        }
    }

    # now each $r{$key}{$f} is the sum of that column for that key
    # convert them to averages.
    for my $key ( keys %r )
    {
        for my $f ( sort keys %{$r{$key}} )
        {
            $r{$key}{$f} /= $n{$key};
        }
    }

    # now you can convert the results to normal-looking records:
    my @averages; # one per unique key-vector value.
    for ( keys %r )
    {
        my %rec;
        @rec{ @key_fields } = split /,/;
        @rec{ keys %{$r{$_}} } = values %{$r{$_}};
        push @averages, \%rec;
    }
    # now @averages is an array of records that look exactly like
    # the data rows you read in, except that the data column values
    # are averages, and the key field value vectors are unique.
    (Caution: Untested.)

    jdporter
    The 6th Rule of Perl Club is -- There is no Rule #6.

Re: Tabulating Data Across Multiple Large Files
by Aristotle (Chancellor) on Mar 22, 2003 at 21:50 UTC

    Yes to your last question.
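
    For instance (untested; the two subs are just stand-ins for whatever routines you want to compare), Benchmark's cmpthese will run each snippet for a given amount of CPU time and print a comparison table:

        use strict;
        use Benchmark qw(cmpthese);

        # hypothetical subs standing in for two ways of doing the same job
        sub splice_version { my @a = (1 .. 1000); splice @a, 500, 1 }
        sub undef_version  { my @a = (1 .. 1000); $a[500] = undef   }

        # a negative count means "run each for at least that many CPU seconds"
        cmpthese( -3, {
            splice => \&splice_version,
            undef  => \&undef_version,
        });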

    As for your task - I'm not sure what to tell you, given that you show us neither sample code nor data. Is file 0 the only file you consult to decide what data to look for? I.e. is it your "control"/config file? If so, it shouldn't need anything but a doubly nested while loop (in a naive approach, anyway).

    Getting more fancy, you would probably load the data from the control file into a hash and use it as a reference, while linearly processing the rest of your files in a single loop.

    Makeshifts last the longest.

      The data looks like this:
      1,Case,Iter,Fusion,Type,Tanks,AFVs,ADAs,IFVs,UAVS,Unknown,Total, Latency, Decoys ,FalseNeg, FalsePos
      32,A2,1,UE_Battle_Bde,TRUTH,0,0,0,0,0,0,0,0,0
      32,A2,1,UE_Battle_Bde,PERCEIVED,0,0,0,0,0,0,0,0,0,0,0
      32,A2,1,UE_Battle_Bde,FREQUENCIES,0,0,0,0,0
      32,A2,1,UA1,TRUTH,0,0,0,0,0,0,0,0,0
      32,A2,1,UA1,PERCEIVED,0,0,0,0,0,0,0,0,0,0,0
      32,A2,1,UA1,FREQUENCIES,0,0,0,0,0
      35,A2,1,UE_Battle_Bde,TRUTH,0,0,0,0,0,0,0,0,0
      35,A2,1,UE_Battle_Bde,PERCEIVED,0,0,0,0,0,0,0,0,0,0,0
      35,A2,1,UE_Battle_Bde,FREQUENCIES,0,0,0,0,0
      

      Within the time steps (32 and 35 shown here), I am matching rows that share the same values in the columns titled Case, Fusion and Type. Then, across the lines that have the same Time, Case, Fusion and Type, I am averaging all the other columns (0s here).

        Are the files guaranteed to be sorted by time steps? That would make things pretty easy. Also, the question about what function file 0 has still stands.

        Makeshifts last the longest.

Re: Tabulating Data Across Multiple Large Files
by CountZero (Bishop) on Mar 23, 2003 at 08:41 UTC

    I would say that this task is best solved by using a database. Selecting, grouping, averaging, ... are functions found natively in most databases, and if you have a MySQL or similar database server lying idle nearby, I should give it a try.

    Perl's DBI/DBD combo can be put to good use to import the raw data and then to extract what you need.
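
    For instance (untested, and the table and column names here are made up), once the rows are loaded into a table the whole matching-and-averaging step collapses into a single GROUP BY query:

        #!/usr/bin/perl -w
        use strict;
        use DBI;

        # connection parameters are placeholders; adjust to your server
        my $dbh = DBI->connect('dbi:mysql:database=simdata', 'user', 'password',
                               { RaiseError => 1 });

        my $sth = $dbh->prepare(q{
            SELECT time_step, case_id, fusion, type,
                   AVG(tanks), AVG(afvs), AVG(adas), AVG(ifvs), AVG(uavs)
            FROM   results
            GROUP  BY time_step, case_id, fusion, type
        });
        $sth->execute;

        while (my @row = $sth->fetchrow_array) {
            print join(',', @row), "\n";
        }
        $dbh->disconnect;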

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: Tabulating Data Across Multiple Large Files
by Jaap (Curate) on Mar 22, 2003 at 21:54 UTC
    It would help if you posted some snippets of the code.

    General advice:
    Don't use splice unless you have to; it's quite expensive (perhaps you could just make that array element empty instead).
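
    Something like this (untested), using your $results structure and a hypothetical line index $i: mark the row as consumed instead of splicing it out, and skip the holes on later passes:

        # instead of: splice @{ $results[$rep] }, $i, 1;
        $results[$rep][$i] = undef;    # mark the line as consumed

        # later passes just skip the empty slots
        for my $line ( grep { defined } @{ $results[$rep] } ) {
            # ... match and accumulate ...
        }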
Re: Tabulating Data Across Multiple Large Files
by CountZero (Bishop) on Mar 23, 2003 at 08:47 UTC

    Not that I want to be too curious, but the data seems to be vaguely military-related. Perhaps some radar or other sensor data?

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law