reds has asked for the wisdom of the Perl Monks concerning the following question:

Greetings. I would describe myself as an intermediate Perl coder, although I am sure that is subject to much interpretation. Anyhow, my question is one of optimization.

The scenario: I have quite a few (up to around 100) large text files (400K-800K each). Each contains time-series data organized by row. The job is to find matching "types" of lines across the files, calculate averages of their data columns, and write the results to an output file.

The process that I have in place now goes like this:
  1. Bring all of the files into memory as arrays within one central organizing array.
    i.e. $results[$rep] = [<RESULTS>];
  2. Look at line 0 of file 0 and find entries in each of the other files that match the time step and type columns.
  3. When you have the list of lines, calculate the average of their data columns and output these results to a file
  4. Remove these lines from their arrays (splice)
  5. Move onto line 1 of file 0
  6. Lather, rinse & repeat

This only works because I haven't been testing with very many of these text files (I've only been bringing about 6 into memory at a time - and even this is quite arduous).

What I would like to have is some sort of streaming file reader. That way I could stream through each of the 100 files in parallel, calculating averages as I go. This would all but eliminate the time it takes to drag these files into memory to be worked on, as well as reduce the memory footprint of the process.

The question is, how can I stream through many files in parallel when their structures aren't exactly parallel (some files have different time step sequences, or different "type" lines within a given time step)?

Also, as an aside, is the Benchmark module the standard tool to time code?

Re: Tabulating Data Across Multiple Large Files
by Thelonius (Priest) on Mar 22, 2003 at 22:04 UTC
    The general idea of "cosequential processing" is that the files are sorted on the key or keys you are matching on. That way you only need to keep one record per file in memory at a time, although you may have to rewind if you are doing a many-to-many join. More details of what you are doing now would be helpful.

    One method that might be useful (hard to tell from your description) is to merge all of the files into one (using sort on Unix or sort -m if they are already sorted). Then all the keys that match will be together in the file and it is simple to write a program to process them. You may need to pre-process the files to make them suitable for this method.

    Updated: Yes, now that I see your data, it looks like this method would be appropriate. You can run sort -t, -k 1n,2 -k 4,5 file1 file2 file3 ... > sorted. Then it's a simple matter to process the sorted file:

    #!perl -w
    use strict;

    my @sum     = ();
    my $prevkey = "";

    while (<>) {
        chomp;
        my @data = split /,/, $_;
        next if $data[0] == 1;    # skip headers
        my $key = join(",", @data[0, 1, 3, 4]);
        if ($key eq $prevkey) {
            for (0 .. $#data - 5) { $sum[$_] += $data[$_ + 5] }
        }
        else {
            dumpsums();
            $prevkey = $key;
            @sum     = @data[5 .. $#data];
        }
    }
    dumpsums();

    sub dumpsums {
        if ($prevkey) {
            print "$prevkey,", join(",", @sum), "\n";
        }
    }
Re: Tabulating Data Across Multiple Large Files
by BrowserUk (Patriarch) on Mar 22, 2003 at 23:17 UTC

    Rather than loading a hundred files into memory or even trying to process 100 files in lockstep, you can use the first file to create a hash using the key fields TIME, CASE, FUSION & TYPE as the primary key, and two secondary keys:

    1. Totals, which is an array of accumulated totals of the numbers
    2. Count, a count of the number of fields accumulated

    and then process each of the files one at a time, matching each record against the hash by primary key, accumulating the totals and counting the records contributing to them.

    When you've totalled them all, use a final loop to calculate the averages and output them to the report.

    This should keep your memory requirements and run times predictable and reasonable, as only one file will be in memory at any one time.

    How you handle records in the subsequent files that were not in the first will depend on your needs, but you can either accumulate them or discard them as your requirements dictate.

    #! perl -slw
    use strict;
    use Inline::Files;
    use Data::Dumper;

    use constant TIME     => 0;
    use constant CASE     => 1;
    use constant ITER     => 2;
    use constant FUSION   => 3;
    use constant TYPE     => 4;
    use constant TANKS    => 5;
    use constant AFVS     => 6;
    use constant ADAS     => 7;
    use constant IFVS     => 8;
    use constant UAVS     => 9;
    use constant UNKNOWN  => 10;
    use constant TOTAL    => 11;
    use constant LATENCY  => 12;
    use constant DECOYS   => 13;
    use constant FALSENEG => 14;
    use constant FALSEPOS => 15;

    my %data;

    <FILE_0>;    # discard the header line
    while (<FILE_0>) {
        local $" = ',';    #"
        my @fields = split ',';
        $data{"@fields[TIME,CASE,FUSION,TYPE]"}{Totals} = [ @fields[TANKS .. FALSEPOS] ];
        $data{"@fields[TIME,CASE,FUSION,TYPE]"}{Count}  = 1;
    }

    <FILE_1>;    # discard the header line
    while (<FILE_1>) {
        local $" = ',';    #"
        my @fields = split ',';
        if (exists $data{"@fields[TIME,CASE,FUSION,TYPE]"}) {
            $data{"@fields[TIME,CASE,FUSION,TYPE]"}{Totals}[$_ - TANKS] += $fields[$_] || 0
                for TANKS .. FALSEPOS;
            $data{"@fields[TIME,CASE,FUSION,TYPE]"}{Count}++;
        }
        # else create a new record if that is the requirement.
    }

    print "Time,Case,Fusion,Type, Tanks,AFVs,ADAs,IFVs,UAVS,Unknown,Total, Latency, Decoys ,FalseNeg, FalsePos\n";

    for my $key (keys %data) {
        printf '%35s:' . ('%5.2f ' x 11) . $/,
            $key,
            map { $data{$key}{Totals}[$_ - TANKS] / $data{$key}{Count} } TANKS .. FALSEPOS;
    }

    __FILE_0__
    1,Case,Iter,Fusion,Type,Tanks,AFVs,ADAs,IFVs,UAVS,Unknown,Total, Latency, Decoys ,FalseNeg, FalsePos
    32,A2,1,UE_Battle_Bde,TRUTH,6,3,7,8,5,8,7,4,0
    32,A2,1,UE_Battle_Bde,PERCEIVED,3,4,2,3,1,2,8,4,4,9,0
    32,A2,1,UE_Battle_Bde,FREQUENCIES,7,3,7,6,0
    32,A2,1,UA1,TRUTH,0,8,6,2,3,2,1,2,0
    32,A2,1,UA1,PERCEIVED,2,6,5,9,2,1,1,4,7,8,0
    32,A2,1,UA1,FREQUENCIES,6,4,1,1,0
    35,A2,1,UE_Battle_Bde,TRUTH,8,7,3,9,6,1,9,3,0
    35,A2,1,UE_Battle_Bde,PERCEIVED,2,9,6,8,7,2,5,2,2,8,0
    35,A2,1,UE_Battle_Bde,FREQUENCIES,0,3,4,0,0
    __FILE_1__
    1,Case,Iter,Fusion,Type,Tanks,AFVs,ADAs,IFVs,UAVS,Unknown,Total, Latency, Decoys ,FalseNeg, FalsePos
    32,A2,1,UE_Battle_Bde,TRUTH,2,5,1,8,4,8,0,1,0
    32,A2,1,UE_Battle_Bde,PERCEIVED,0,9,6,8,1,7,3,6,9,7,0
    32,A2,1,UE_Battle_Bde,FREQUENCIES,1,0,8,4,0
    32,A2,1,UA1,TRUTH,8,2,3,2,1,4,8,3,0
    32,A2,1,UA1,PERCEIVED,3,9,1,6,7,3,4,2,6,0,0
    32,A2,1,UA1,FREQUENCIES,6,3,2,6,0
    35,A2,1,UE_Battle_Bde,TRUTH,1,1,2,6,5,0,7,3,0
    35,A2,1,UE_Battle_Bde,PERCEIVED,5,6,2,0,3,2,7,6,5,6,0
    35,A2,1,UE_Battle_Bde,FREQUENCIES,6,7,1,2,0
    __OUTPUT__
    C:\test>245214
    Time,Case,Fusion,Type, Tanks,AFVs,ADAs,IFVs,UAVS,Unknown,Total, Latency, Decoys ,FalseNeg, FalsePos
    32,A2,UE_Battle_Bde,PERCEIVED:    1.50  6.50  4.00  5.50  1.00  4.50  5.50  5.00  6.50  8.00  0.00
    32,A2,UE_Battle_Bde,TRUTH:        4.00  4.00  4.00  8.00  4.50  8.00  3.50  2.50  0.00  0.00  0.00
    32,A2,UA1,FREQUENCIES:            6.00  3.50  1.50  3.50  0.00  0.00  0.00  0.00  0.00  0.00  0.00
    35,A2,UE_Battle_Bde,PERCEIVED:    3.50  7.50  4.00  4.00  5.00  2.00  6.00  4.00  3.50  7.00  0.00
    35,A2,UE_Battle_Bde,FREQUENCIES:  3.00  5.00  2.50  1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
    32,A2,UA1,PERCEIVED:              2.50  7.50  3.00  7.50  4.50  2.00  2.50  3.00  6.50  4.00  0.00
    32,A2,UA1,TRUTH:                  4.00  5.00  4.50  2.00  2.00  3.00  4.50  2.50  0.00  0.00  0.00
    35,A2,UE_Battle_Bde,TRUTH:        4.50  4.00  2.50  7.50  5.50  0.50  8.00  3.00  0.00  0.00  0.00
    32,A2,UE_Battle_Bde,FREQUENCIES:  4.00  1.50  7.50  5.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
    C:\test>

    Examine what is said, not who speaks.
    1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
    2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
    3) Any sufficiently advanced technology is indistinguishable from magic.
    Arthur C. Clarke.
Re: Tabulating Data Across Multiple Large Files
by Zaxo (Archbishop) on Mar 22, 2003 at 22:47 UTC

    I suspect you would improve performance by reading the data one line at a time from each file. Your strategy of collecting all the data at once runs the risk of driving the machine into swap, with the attendant time consumption of I/O with the swap file system.

    You can produce an array of open file handles, one for each data source, and place the first line of each in another array. You may need to adjust the number of open file handles allowed to you for this. Then, keeping track of which index provided each line, sort by timestamp. Extract data from the soonest line, run it through its processing to update the statistics you want, and replace that line from the corresponding file handle. Sort again in some way (an insertion sort may be called for) and, as you say, lather, rinse and repeat.
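
    Roughly, and untested, assuming the timestamp is the first comma-separated field and that a hypothetical process_line() does whatever statistics you need, the merge loop might look like this:

        #!/usr/bin/perl -w
        use strict;

        my @files = @ARGV;
        my (@fh, @current);

        # open every file and prime @current with its first line
        for my $i (0 .. $#files) {
            open $fh[$i], '<', $files[$i] or die "$files[$i]: $!";
            $current[$i] = readline $fh[$i];
        }

        while (grep { defined } @current) {
            # index of the line with the smallest (soonest) timestamp
            my ($soonest) =
                sort { (split /,/, $current[$a])[0] <=> (split /,/, $current[$b])[0] }
                grep { defined $current[$_] } 0 .. $#current;

            process_line($current[$soonest]);            # update the statistics
            $current[$soonest] = readline $fh[$soonest]; # refill from that file
        }

        sub process_line {
            my ($line) = @_;
            # accumulate totals and counts keyed on the matching columns here
        }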

    You should give a more specific description of your data and the statistics you want. With that we could give better advice on the streaming you want. I see that you have done that, but I'm sorry to admit that I still don't understand how you need to parse times out of that.

    After Compline,
    Zaxo

Re: Tabulating Data Across Multiple Large Files
by jdporter (Paladin) on Mar 22, 2003 at 23:13 UTC
    I'm still unclear on some things.
    1. Is there actually anything special about the timestep field?
    2. Does it actually make any difference which rows are in which files?
    In my solution following, I am assuming the answer to both questions is "no".
    my @fields = qw(
        1 Case Iter Fusion Type
        Tanks AFVs ADAs IFVs UAVS Unknown Total Latency Decoys FalseNeg FalsePos
    );
    my @key_fields  = @fields[0,1,3,4];
    my @data_fields = @fields[2,5..15];

    my %n; # key="@key_field_vals"; val = count
    my %r; # key="@key_field_vals"; val = hashref: key=field, val=sum (and later, average)

    while (<>) # read all files, in sequential order (not in parallel)
    {
        chomp;
        my %rec;
        @rec{@fields} = split /,/;
        my $key = join ",", @rec{@key_fields};
        $n{$key}++;
        for my $f ( @data_fields )
        {
            $r{$key}{$f} += $rec{$f};
        }
    }

    # now each $r{$key}{$f} is the sum of that column for that key
    # convert them to averages.
    for my $key ( keys %r )
    {
        for my $f ( sort keys %{$r{$key}} )
        {
            $r{$key}{$f} /= $n{$key};
        }
    }

    # now you can convert the results to normal-looking records:
    my @averages; # one per unique key-vector value.
    for ( keys %r )
    {
        my %rec;
        @rec{ @key_fields } = split /,/;
        @rec{ keys %{$r{$_}} } = values %{$r{$_}};
        push @averages, \%rec;
    }
    # now @averages is an array of records that look exactly like
    # the data rows you read in, except that the data column values
    # are averages, and the key field value vectors are unique.
    (Caution: Untested.)

    jdporter
    The 6th Rule of Perl Club is -- There is no Rule #6.

Re: Tabulating Data Across Multiple Large Files
by Aristotle (Chancellor) on Mar 22, 2003 at 21:50 UTC

    Yes to your last question.
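
    For instance (untested; the two subs are just stand-ins for whatever routines you want to compare), Benchmark's cmpthese will run each snippet for a given amount of CPU time and print a comparison table:

        use strict;
        use Benchmark qw(cmpthese);

        # hypothetical subs standing in for two ways of doing the same job
        sub splice_version { my @a = (1 .. 1000); splice @a, 500, 1 }
        sub undef_version  { my @a = (1 .. 1000); $a[500] = undef   }

        # a negative count means "run each for at least that many CPU seconds"
        cmpthese( -3, {
            splice => \&splice_version,
            undef  => \&undef_version,
        });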

    As for your task - I'm not sure what to tell you, given that you show us neither sample code nor data. Is file 0 the only file you consult to decide what data to look for? I.e. is it your "control"/config file? If so, it shouldn't need anything but a doubly nested while loop (in a naive approach, anyway).

    Getting more fancy, you would probably load the data from the control file into a hash and use it as a reference, while linearly processing the rest of your files in a single loop.

    Makeshifts last the longest.

      The data looks like this:
      1,Case,Iter,Fusion,Type,Tanks,AFVs,ADAs,IFVs,UAVS,Unknown,Total, Latency, Decoys ,FalseNeg, FalsePos
      32,A2,1,UE_Battle_Bde,TRUTH,0,0,0,0,0,0,0,0,0
      32,A2,1,UE_Battle_Bde,PERCEIVED,0,0,0,0,0,0,0,0,0,0,0
      32,A2,1,UE_Battle_Bde,FREQUENCIES,0,0,0,0,0
      32,A2,1,UA1,TRUTH,0,0,0,0,0,0,0,0,0
      32,A2,1,UA1,PERCEIVED,0,0,0,0,0,0,0,0,0,0,0
      32,A2,1,UA1,FREQUENCIES,0,0,0,0,0
      35,A2,1,UE_Battle_Bde,TRUTH,0,0,0,0,0,0,0,0,0
      35,A2,1,UE_Battle_Bde,PERCEIVED,0,0,0,0,0,0,0,0,0,0,0
      35,A2,1,UE_Battle_Bde,FREQUENCIES,0,0,0,0,0
      

      Within the time steps (32 and 35 shown here), I am matching rows that share the same values in the columns titled Case, Fusion and Type. Then, across the lines that have the same Time, Case, Fusion and Type, I am averaging all the other columns (0s here).

        Are the files guaranteed to be sorted by time steps? That would make things pretty easy. Also, the question about what function file 0 has still stands.

        Makeshifts last the longest.

Re: Tabulating Data Across Multiple Large Files
by CountZero (Bishop) on Mar 23, 2003 at 08:41 UTC

    I would say that this task is best solved by using a database. Selecting, grouping, averaging, ... are functions found natively in most databases, and if you have a MySQL or similar database server lying idle nearby, I should give it a try.

    Perl's DBI/DBD combo can be put to good use to import the raw data and then to extract what you need.
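
    For instance (untested, and the table and column names here are made up), once the rows are loaded into a table the whole matching-and-averaging step collapses into a single GROUP BY query:

        #!/usr/bin/perl -w
        use strict;
        use DBI;

        # connection parameters are placeholders; adjust to your server
        my $dbh = DBI->connect('dbi:mysql:database=simdata', 'user', 'password',
                               { RaiseError => 1 });

        my $sth = $dbh->prepare(q{
            SELECT time_step, case_id, fusion, type,
                   AVG(tanks), AVG(afvs), AVG(adas), AVG(ifvs), AVG(uavs)
            FROM   results
            GROUP  BY time_step, case_id, fusion, type
        });
        $sth->execute;

        while (my @row = $sth->fetchrow_array) {
            print join(',', @row), "\n";
        }
        $dbh->disconnect;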

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: Tabulating Data Across Multiple Large Files
by Jaap (Curate) on Mar 22, 2003 at 21:54 UTC
    It would help if you posted some snippets of the code.

    General advice:
    Don't use splice unless you have to; it's quite expensive (perhaps you could just make that array element empty instead).
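
    Something like this (untested), using your $results structure and a hypothetical line index $i: mark the row as consumed instead of splicing it out, and skip the holes on later passes:

        # instead of: splice @{ $results[$rep] }, $i, 1;
        $results[$rep][$i] = undef;    # mark the line as consumed

        # later passes just skip the empty slots
        for my $line ( grep { defined } @{ $results[$rep] } ) {
            # ... match and accumulate ...
        }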
Re: Tabulating Data Across Multiple Large Files
by CountZero (Bishop) on Mar 23, 2003 at 08:47 UTC

    Not that I want to be too curious, but the data seems to be vaguely military-related. Perhaps some radar or other sensor data?

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law