Aggregation of Similar Lines

tboyd has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I have a file that contains 12 fields. Many rows appear multiple times in the file because the event date changes. I.e...

Field1 , Field2 , Field3
12/30/02 , 2584311 , 12
12/29/02 , 2584311 , 12

Since the row is basically repeated 2 times (because there are two date entries), I'd like to divide Field2 by the number of times Field1 appears.

Said another way, if I iterate over a CSV file and encounter the same line with the only difference being this one field, I'd like to capture the number of repititions and then divide by it. Eventually perhaps write the same number of lines to a new file with the Field2 being the only one that changes.

Any guidance would be much appreciated.

Merl Ponk

Comment on Aggregation of Similar Lines

Replies are listed 'Best First'.
Re: Aggregation of Similar Lines by gjb (Vicar) on Dec 30, 2002 at 22:18 UTC
You could use the line minus the date as a hash key, incrementing the value each time you encounter the line (with a different date). `my %count; while (<>) { chomp($_); $count{substr($_, 9)}++; }` [download] Now iterate a second time to divide by the appropriate values in `%count` and write those to a file. It is assumed that the lines are identical in format/whitespace and that the dates are always 8 characters wide (if not, use a regexp to extract). Hope this helps, -gjb-	[reply] [d/l] [select]
Re: Aggregation of Similar Lines by Paladin (Vicar) on Dec 30, 2002 at 22:14 UTC
You may want to create a data structure similar to the following: `$data{2584311} = [ [ "12/30/02", "12/29/02" ], [ 12, 12 ] ];` [download] That is, a hash, keyed on Field2, with the data being a list of lists of the Fields 1, and 3 (of course you will extend it to how ever many fields you need). After you create this, it is easy enough to go through it and divide by however many elements there are, and print out the data to a new file. You never state how big your file is, so I don't know if reading the entire thing into memory will be a problem or not. This example also assumes it is field 2 that you use to decide if the 2 lines are the same or not. If you use some other field or combination of fields to decide that, use it as the key.	[reply] [d/l]
Re: Aggregation of Similar Lines by jdporter (Paladin) on Dec 30, 2002 at 22:24 UTC
I'd like to divide Field2 by the number of times Field1 appears. Not sure what you mean by "divide by". I'm guessing what you really mean is you'd like the resulting file to have only unique values in field2, i.e. eliminate any repetition in field2. If so, that's certainly easy enough to do. Here's one way. Note that here I'm simply ignoring the value of field1 for the purposes of determining row uniqueness. `my %seen; while (<>) { my $key = $_; $key =~ s/^\S+\s+//; # eliminate first field unless ( $seen{$key}++ ) { print; } }` [download] That is, if we haven't seen this line before (ignoring the first field, as I said), then print it on through. You can do something else with the line, if printing isn't appropriate. jdporter The 6th Rule of Perl Club is -- There is no Rule #6.	[reply] [d/l]
Re: Aggregation of Similar Lines by poj (Abbot) on Dec 30, 2002 at 22:47 UTC
Perhaps you can adapt something along the lines of this ; `# input #12/30/02 , 2584311 , 12, etc #12/29/02 , 2584311 , 12, etc use strict; use warnings; open (IN,"in.txt") or die "in.txt : $!"; my %data=(); my %count=(); while (<IN>){ my ($date,@f) = split ',',$_; $data{$f[0]} = join ',',@f; ++$count{$f[0]}; } open (OUT,">out.txt") or die "out.txt : $!"; foreach (sort keys %data){ print OUT "$count{$_} : $data{$_}"; }` [download] poj	[reply] [d/l]