tboyd has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I have a file that contains 12 fields. Many rows appear multiple times in the file because the event date changes. I.e...

Field1 , Field2 , Field3
12/30/02 , 2584311 , 12
12/29/02 , 2584311 , 12


Since the row is basically repeated 2 times (because there are two date entries), I'd like to divide Field2 by the number of times Field1 appears.

Said another way, if I iterate over a CSV file and encounter the same line with the only difference being this one field, I'd like to capture the number of repititions and then divide by it. Eventually perhaps write the same number of lines to a new file with the Field2 being the only one that changes.

Any guidance would be much appreciated.

Merl Ponk

Replies are listed 'Best First'.
Re: Aggregation of Similar Lines
by gjb (Vicar) on Dec 30, 2002 at 22:18 UTC

    You could use the line minus the date as a hash key, incrementing the value each time you encounter the line (with a different date).

    my %count; while (<>) { chomp($_); $count{substr($_, 9)}++; }
    Now iterate a second time to divide by the appropriate values in %count and write those to a file.

    It is assumed that the lines are identical in format/whitespace and that the dates are always 8 characters wide (if not, use a regexp to extract).

    Hope this helps, -gjb-

Re: Aggregation of Similar Lines
by Paladin (Vicar) on Dec 30, 2002 at 22:14 UTC
    You may want to create a data structure similar to the following:
    $data{2584311} = [ [ "12/30/02", "12/29/02" ], [ 12, 12 ] ];

    That is, a hash, keyed on Field2, with the data being a list of lists of the Fields 1, and 3 (of course you will extend it to how ever many fields you need). After you create this, it is easy enough to go through it and divide by however many elements there are, and print out the data to a new file.

    You never state how big your file is, so I don't know if reading the entire thing into memory will be a problem or not. This example also assumes it is field 2 that you use to decide if the 2 lines are the same or not. If you use some other field or combination of fields to decide that, use it as the key.

Re: Aggregation of Similar Lines
by jdporter (Paladin) on Dec 30, 2002 at 22:24 UTC
    I'd like to divide Field2 by the number of times Field1 appears.
    Not sure what you mean by "divide by". I'm guessing what you really mean is you'd like the resulting file to have only unique values in field2, i.e. eliminate any repetition in field2. If so, that's certainly easy enough to do.

    Here's one way. Note that here I'm simply ignoring the value of field1 for the purposes of determining row uniqueness.
    my %seen; while (<>) { my $key = $_; $key =~ s/^\S+\s+//; # eliminate first field unless ( $seen{$key}++ ) { print; } }
    That is, if we haven't seen this line before (ignoring the first field, as I said), then print it on through. You can do something else with the line, if printing isn't appropriate.

    jdporter
    The 6th Rule of Perl Club is -- There is no Rule #6.

Re: Aggregation of Similar Lines
by poj (Abbot) on Dec 30, 2002 at 22:47 UTC
    Perhaps you can adapt something along the lines of this ;
    # input #12/30/02 , 2584311 , 12, etc #12/29/02 , 2584311 , 12, etc use strict; use warnings; open (IN,"in.txt") or die "in.txt : $!"; my %data=(); my %count=(); while (<IN>){ my ($date,@f) = split ',',$_; $data{$f[0]} = join ',',@f; ++$count{$f[0]}; } open (OUT,">out.txt") or die "out.txt : $!"; foreach (sort keys %data){ print OUT "$count{$_} : $data{$_}"; }

    poj