bpthatsme has asked for the wisdom of the Perl Monks concerning the following question:

Hello oh wise and wonderful monks!

I am having some difficulty with a small script that is designed to sort CSV-formatted input and run some calculations. The number of input lines is not fixed.

The input is formatted as such.

123456789,123,LOGGED IN MONKEY GET (PHP),200,OK,LOGIN PHP & JAVA PILLA +RS 1-1,text,true,12345,123

Note that this is a single entry, the entire file will have potentially dozens of entries. Also important, the area "LOGGED IN MONKEY GET (PHP)" will change and can include any number of potential categories.

The scope is to grab the values from columns 1, 3, & 8 (counting from 1; after a split on comma these are array indices 0, 2, & 7).

It will be necessary to apply a regex to tear just the 'MONKEY' portion out from column 3.

The calculation will run by first counting the number of instances of each value in column 3. Then, for each of those, count the number of 'False' entries in column 8. Finally, for that set, it will run some calculations (avg, median, etc.) based on the values in column 1.

I am still new to data structures and am having a hard time writing this with any sense of scalability. (I can approach it by searching for matches on column 3, but given that the number of potential strings in that column is in the hundreds, that does not make sense.) Any tips that would point me in the right direction would be uber appreciated.

Thanks so much as always for your guidance!

-bp-

Re: Hashes, Arrays, and Confusion -- In a bit over my head!
by Riales (Hermit) on Mar 29, 2012 at 22:00 UTC

    I'd say you're on the right track. As long as you apply the regex to tear the 'MONKEY' part out of column 3 before you check to see if you've seen it before, you should do just fine.

    I would make one pass through each line to build a hash with the value from column 3 as the key pointing to a hashref with num_false (a count) and col1_vals (an arrayref). Then I would go through each key of that hash and run the calculations you want to run. I'm thinking something like this (assuming your lines are in @lines):

    my %seen = ();

    foreach my $line (@lines) {
        my @values = split ',', $line;

        my $col1_val      = $values[0];
        my $key           = $values[2];
        my $true_or_false = $values[7];

        # Skip lines that do not match, so a stale $1 is never used as a key.
        next unless $key =~ /^LOGGED IN (\w+) GET/;
        $key = $1;

        $seen{$key}->{num_false}++ if $true_or_false eq 'false';
        push @{$seen{$key}->{col1_vals}}, $col1_val;
    }

    foreach my $key (keys %seen) {
        # calculate_stuff() is a placeholder for whatever statistics you need.
        my $calcd     = calculate_stuff($seen{$key}->{col1_vals});
        my $num_false = $seen{$key}->{num_false} || 0;
        print <<HERE;
    Key: $key
    Calc'd values: $calcd
    Num false: $num_false
    HERE
    }

    I just took a stab at guessing the proper regex (for example: are spaces allowed?) because I don't know what all your input looks like, but the basic idea holds.
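
If spaces are allowed in the category, one possible variation is a non-greedy capture that stops at the request verb. This is only a sketch: `extract_category` is a hypothetical helper name, the assumption that the verb is always GET or POST is mine, and the sample strings other than the one from the original post are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the category out of a column-3 value such as
# "LOGGED IN MONKEY GET (PHP)". Assumes the category sits between the
# "LOGGED IN " prefix and a GET or POST verb, and may contain spaces.
sub extract_category {
    my ($field) = @_;
    return $1 if $field =~ /^LOGGED IN (.+?) (?:GET|POST)\b/;
    return;    # no match: the caller decides what to do
}

for my $field (
    'LOGGED IN MONKEY GET (PHP)',
    'LOGGED IN HOWLER MONKEY POST (JAVA)',    # made-up sample
    'SOMETHING ELSE',                         # made-up sample
) {
    my $category = extract_category($field);
    print defined $category ? "$category\n" : "no match: $field\n";
}
```

Returning nothing on a failed match lets the read loop skip or log odd lines instead of filing them under a wrong key.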

      Thanks Riales!

      I am going to work with that a bit and update my code here for reference should anyone else stumble into it (and for criticism).

      In the meantime, further suggestions are ALWAYS welcomed!

      -bp-

        Ok, I had a few minutes to jump into this and have some more questions.

        I used the following code (pardon the ugly vim line numbers):

        183 my %seen = ();
        184
        185 open (FH, "< $resultsFile");
        186 my @lines = <FH>;
        187 foreach my $line (@lines) {
        188     my @values = split ',', $line;
        189
        190     my $col1_val = $values[0];
        191     my $key = $values[2];
        192     my $errors = $values[7];
        193
        194     $key =~ s/LOGGED IN //;
        195     $key =~ s/ GET //;
        196     $key =~ s/ POST //;
        197
        198 #   print "$key\n";
        199
        200     $key = $1;
        201
        202     $seen{$key}->{error_count}++ if $errors eq 'false';
        203     push @{$seen{$key}->{col1_vals}}, $col1_val;
        204
        205 }
        206
        207 foreach my $key (keys %seen) {
        208     my $error_count = $seen{$key}->{error_count};
        209 #   my $calcd =
        210     print <<HERE;
        211 Key: $key
        212 Errors: $error_count
        213 HERE
        214 }
        215
        216 close (FH);
        217
        218 die;

        This does not print out any results for key or errors (lines 211 and 212). Perhaps something is being read wrong? The commented out line on 198 does return the values as intended however.
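
One possible cause, offered as a guess from the listing rather than a confirmed diagnosis: the substitutions on lines 194 to 196 have no capture groups, so they never set $1, and line 200 then overwrites the cleaned-up key with an undefined value. A minimal sketch of keeping the substituted string instead; `clean_key` is a hypothetical helper name, the second sample string is made up, and dropping everything after the verb is my assumption.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the substitution approach in the listing.
sub clean_key {
    my ($key) = @_;
    $key =~ s/^LOGGED IN //;
    $key =~ s/ (?:GET|POST)\b.*//;    # assumption: drop the verb and what follows
    # Neither pattern above has a capture group, so they do not set $1;
    # returning $1 here would hand back an undefined value.
    return $key;
}

my %seen;
for my $raw ('LOGGED IN MONKEY GET (PHP)', 'LOGGED IN GIBBON POST (JAVA)') {
    my $key = clean_key($raw);
    $seen{$key}{count}++;
}
print "Key: $_\n" for sort keys %seen;
```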

        My second question is regarding the calculations. How do I grab all of the col1_vals together in order to run an average since it is going through line by line?

        Thanks as always!

        -bp-
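
On the second question: each line pushes its column-1 value onto the col1_vals array for its key, so the complete list for a key is available once the read loop has finished, and the average can be computed in the second loop. A minimal sketch, assuming the column-1 values are numeric; `summarize` is a hypothetical helper name and the data below is made up to match the shape the read loop builds.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: takes an arrayref of numbers, returns (mean, median).
sub summarize {
    my ($vals) = @_;
    my @sorted = sort { $a <=> $b } @$vals;
    return unless @sorted;    # nothing to summarize
    my $mean   = sum(@sorted) / @sorted;
    my $mid    = int(@sorted / 2);
    my $median = @sorted % 2
        ? $sorted[$mid]                              # odd count: middle value
        : ($sorted[$mid - 1] + $sorted[$mid]) / 2;   # even count: mean of middle two
    return ($mean, $median);
}

# Made-up data in the same shape the read loop produces.
my %seen = (
    MONKEY => { col1_vals => [ 10, 20, 60 ], error_count => 1 },
    GIBBON => { col1_vals => [ 4, 8 ] },
);

for my $key (sort keys %seen) {
    my ($mean, $median) = summarize($seen{$key}{col1_vals});
    my $errors = $seen{$key}{error_count} || 0;    # no errors seen means 0, not undef
    print "Key: $key  mean: $mean  median: $median  errors: $errors\n";
}
```

Sorting a copy inside the helper leaves the stored values in their original order, in case later calculations depend on it.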