Performing Mathematical Operation on Specific Column of text File

Bama_Perl has asked for the wisdom of the Perl Monks concerning the following question:

Re: Performing Mathematical Operation on Specific Column of text File by kcott (Archbishop) on May 14, 2015 at 22:27 UTC
G'day Bama_Perl,

You have two major problems with this line: `my ($name, $time) = (split /\s+/, $file)[1,9];`

1. `$file` is the filehandle! You want to split the line that was read from this filehandle. That line will be in `$_`.
2. Arrays are zero-based! To get the first and last elements, your array slice needs to specify `[0,8]` (not `[1,9]`).

-- Ken
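A minimal sketch of both points, assuming a record with nine whitespace-separated fields. The helper name `first_and_last` and the sample record are illustrative only, not taken from the original script:

```perl
use strict;
use warnings;

# Hypothetical helper: return the first and last of nine
# whitespace-separated fields from one line of text.
sub first_and_last {
    my ($line) = @_;
    # Split the line itself (not the filehandle); indices start at 0.
    return (split ' ', $line)[0, 8];
}

# In a read loop each line arrives in $_, so the call would look like:
#     while (<$fh>) { my ($name, $time) = first_and_last($_); ... }
my ($name, $time) = first_and_last(
    "ZJ.KNYN -0.7065 0.0072 0.9760 0.0133 0 KNYN.BHZ 301.3372 -1.9249\n"
);
print "$name $time\n";
```

Using `split ' '` rather than `split /\s+/` also discards any leading whitespace, which keeps the indices stable.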
Re^2: Performing Mathematical Operation on Specific Column of text File by Bama_Perl (Acolyte) on May 14, 2015 at 22:32 UTC
Hi Ken,

Thanks for your comment. I'll be sure to split the line that was read from the filehandle. As for the second comment, I understand that in Perl, arrays are zero-based. However, in this case, the `$name` is in the second column (1), and the last column is 9. The first column is where "MCCCC... " is located.
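One possible explanation for the differing indices, offered as an assumption because the thread does not preserve the file's exact whitespace: if a record begins with whitespace, `split /\s+/` returns an empty leading field, which shifts every column up by one, whereas `split ' '` skips leading whitespace. A small sketch with a made-up, indented record:

```perl
use strict;
use warnings;

# Assumed input: a record that starts with a space.
my $line = " ZJ.KNYN -0.7065 0.0072";

my @by_regex = split /\s+/, $line;   # leading empty field is kept
my @by_space = split ' ',   $line;   # leading whitespace is skipped

print scalar(@by_regex), " fields, name at index 1: $by_regex[1]\n";
print scalar(@by_space), " fields, name at index 0: $by_space[0]\n";
```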
Re^3: Performing Mathematical Operation on Specific Column of text File by kcott (Archbishop) on May 14, 2015 at 23:50 UTC
"However, in this case, the $name is in the second column (1), and the last column is 9. The first column is where "MCCCC... " is located." I suspect there's something here you haven't understood but I don't know what that might be. The " `MCCC processed: ...`" and "`station, ...`" lines have already been read in the `for` loop; they're not read again in the `while` loop. The first line read in the `while` loop will be: `ZJ.GRAW -0.7964 0.0051 0.9690 0.0139 0 GRAW.BHZ 301 +.1263 -1.8041` [download] When `split` on whitespace, element zero will be "`ZJ.GRAW`". In your next `split`, element zero is the only element where you're likely to capture "`ZJ`". Your `split` pattern is wrong here but I'll assume that's progated from earlier errors (obviously, you want to split on `/[.]/` — not `/\s+/`). Beyond all these issues, `split`ting on whitespace, and then trying to recreate the line, without knowing how much whitespace originally existed will not work: you won't retain the "same formatting" you state you want. So, I suggest you sit back and think of another approach. Assuming you've provided representative data, here's how I might have tackled the logic. (Note: I'm assuming "remove the mean" indicates subtracting the mean: if not, modify the calculation in `output_recalc_zj_lines()` to suit.) #!/usr/bin/env perl use strict; use warnings; print scalar <DATA> for 1 .. 2; my @zj_lines; while (<DATA>) { if (/ \A Mean_arrival_time: \s+ ( \S+ )/x) { output_recalc_zj_lines(\@zj_lines, $1); print; last; } push @zj_lines, $_; } print <DATA>; sub output_recalc_zj_lines { my ($zj_lines, $mean) = @_; for (@$zj_lines) { s/ ( \S+ ) ( \s+ ) \z / $1 - $mean . 
$2 /ex; print; } } __DATA__ MCCC processed: unknown event at: Tue, 14 Oct 2014 12:02:26 CST station, mccc delay, std, cc coeff, cc std, pol , t0_times + , delay_times ZJ.GRAW -0.7964 0.0051 0.9690 0.0139 0 GRAW.BHZ 301 +.1263 -1.8041 ZJ.KNYN -0.7065 0.0072 0.9760 0.0133 0 KNYN.BHZ 30 +1.3372 -1.9249 ZJ.LEON 0.9675 0.0072 0.9548 0.0292 0 LEON.BHZ 30 +1.2611 -0.1749 ZJ.RKST -0.2061 0.0114 0.9404 0.0383 0 RKST.BHZ 30 +1.3500 -1.4374 ZJ.SHRD 0.4382 0.0051 0.9542 0.0351 0 SHRD.BHZ 30 +1.7360 -1.1791 ZJ.SPLN 0.3033 0.0051 0.9785 0.0126 0 SPLN.BHZ 30 +1.0760 -0.6541 Mean_arrival_time: 300.1187 No weighting of equations. Window: 2.23 Inset: 1.17 Shift: 0.25 Variance: 0.00645 Coefficient: 0.96215 Sample rate: 40.000 Taper: 0.28 Phase: P PDE 2013 7 15 14 6 58.00 -60.867 -25.143 31.0 0.0 7.3 [download] Output: $ pm_1126698_split_record.pl MCCC processed: unknown event at: Tue, 14 Oct 2014 12:02:26 CST station, mccc delay, std, cc coeff, cc std, pol , t0_times + , delay_times ZJ.GRAW -0.7964 0.0051 0.9690 0.0139 0 GRAW.BHZ 301 +.1263 -301.9228 ZJ.KNYN -0.7065 0.0072 0.9760 0.0133 0 KNYN.BHZ 30 +1.3372 -302.0436 ZJ.LEON 0.9675 0.0072 0.9548 0.0292 0 LEON.BHZ 30 +1.2611 -300.2936 ZJ.RKST -0.2061 0.0114 0.9404 0.0383 0 RKST.BHZ 30 +1.3500 -301.5561 ZJ.SHRD 0.4382 0.0051 0.9542 0.0351 0 SHRD.BHZ 30 +1.7360 -301.2978 ZJ.SPLN 0.3033 0.0051 0.9785 0.0126 0 SPLN.BHZ 30 +1.0760 -300.7728 Mean_arrival_time: 300.1187 No weighting of equations. Window: 2.23 Inset: 1.17 Shift: 0.25 Variance: 0.00645 Coefficient: 0.96215 Sample rate: 40.000 Taper: 0.28 Phase: P PDE 2013 7 15 14 6 58.00 -60.867 -25.143 31.0 0.0 7.3 [download] Note how the original formatting is retained exactly (including the space that is presumably missing between `ZJ.GRAW` and `-0.7964` which, if present, would have aligned that record's format with the other `ZJ` records). -- Ken	[reply] [d/l] [select]
Re: Performing Mathematical Operation on Specific Column of text File by aaron_baugher (Curate) on May 14, 2015 at 23:34 UTC
The first thing to recognize is that your program will need to see every line of the input before it begins writing the output, since it has to see all the frequencies before it can calculate their average so it can be applied to each line. So you'll need to either:

1. Loop through the input file twice, accumulating numbers on the first time through, then making the edits and printing the output on the second time through.
2. Loop through the input file once, but save each line in an array while you do the calculations, so that your second loop can go through the array instead of hitting the filesystem again.

I'd say the second solution will generally be the best unless the file is so large that putting it all in an array will cause memory problems. So let's do that.

This loops through the file, adding the frequencies to an accumulator ($total) when a line matches the pattern, keeping track of how many ($howmany) frequencies it adds. Those two values will be divided to get the mean. It also saves each line in an array, along with a flag ($fixlater) to show whether that line is one containing a frequency. That way when I loop through the lines again, I don't have to split the ones that don't contain a frequency to change; the other lines can just be printed out.

This does literally what you said: subtracts the mean from each frequency. Since the mean is negative, subtracting it actually adds a positive, which may or may not be what you really want. If it's not, try to describe what you really want in more detail, and give us a couple examples of input and output frequencies.

    #!/usr/bin/env perl
    use 5.010;
    use strict;
    use warnings;

    my @lines;
    my $total = 0;
    my $howmany = 0;
    my $output_separator = "\t"; # your choice

    # First loop: accumulate the sum and remember every line.
    while(<DATA>){
        chomp;
        my($n, $f) = (split)[0,8];   # first and ninth fields
        my $fixlater;
        if ($n =~ /\A[A-Z]{2}\.[A-Z]{4}\Z/ and
            $f =~ /\A-\d\.\d{4}\Z/ ){
            $total += $f;
            $howmany++;
            $fixlater = 1;
        }
        push @lines, [$fixlater,$_];
    }

    die "No matching lines found\n" unless $howmany;
    my $mean = $total/$howmany;

    # Second loop: rewrite flagged lines, print the rest unchanged.
    for (@lines){
        if($_->[0]){ # fix this line
            my @f = (split ' ', $_->[1]);
            $f[8] = sprintf "%0.4f", $f[8] - $mean;
            say join $output_separator, @f;
        } else {
            say $_->[1];
        }
    }

    __DATA__
    MCCC processed: unknown event at: Tue, 14 Oct 2014 12:02:26 CST
    station, mccc delay, std, cc coeff, cc std, pol , t0_times , delay_times
    ZJ.GRAW -0.7964 0.0051 0.9690 0.0139 0 GRAW.BHZ 301.1263 -1.8041
    ZJ.KNYN -0.7065 0.0072 0.9760 0.0133 0 KNYN.BHZ 301.3372 -1.9249
    ZJ.LEON 0.9675 0.0072 0.9548 0.0292 0 LEON.BHZ 301.2611 -0.1749
    ZJ.RKST -0.2061 0.0114 0.9404 0.0383 0 RKST.BHZ 301.3500 -1.4374
    ZJ.SHRD 0.4382 0.0051 0.9542 0.0351 0 SHRD.BHZ 301.7360 -1.1791
    ZJ.SPLN 0.3033 0.0051 0.9785 0.0126 0 SPLN.BHZ 301.0760 -0.6541
    Mean_arrival_time: 300.1187
    No weighting of equations.
    Window: 2.23 Inset: 1.17 Shift: 0.25
    Variance: 0.00645 Coefficient: 0.96215 Sample rate: 40.000
    Taper: 0.28
    Phase: P
    PDE 2013 7 15 14 6 58.00 -60.867 -25.143 31.0 0.0 7.3

Aaron B.
Available for small or large Perl jobs and *nix system administration; see my home node.
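For comparison, the first option (reading the input twice) might be sketched as follows. This is an illustration under assumptions, not code from the thread: the sub name `subtract_mean_two_pass`, the record pattern, and the shortened sample input are all made up, and an in-memory filehandle stands in for a real file.

```perl
use strict;
use warnings;

# Sketch of option 1: read the same filehandle twice.
# Assumes data records look like "XX.YYYY ... <number>" with the value
# to adjust in the last of nine fields.
sub subtract_mean_two_pass {
    my ($fh) = @_;
    my $is_record = qr/\A[A-Z]{2}\.[A-Z]{4}\s/;

    # Pass 1: accumulate the sum and count of the last column.
    my ($total, $count) = (0, 0);
    while (my $line = <$fh>) {
        next unless $line =~ $is_record;
        $total += (split ' ', $line)[8];
        $count++;
    }
    die "no data records found\n" unless $count;
    my $mean = $total / $count;

    # Pass 2: rewind, then rewrite only the data records.
    seek $fh, 0, 0 or die "cannot rewind: $!";
    my @out;
    while (my $line = <$fh>) {
        if ($line =~ $is_record) {
            my @f = split ' ', $line;
            $f[8] = sprintf '%.4f', $f[8] - $mean;
            $line = join("\t", @f) . "\n";
        }
        push @out, $line;
    }
    return @out;
}

# Usage with an in-memory file standing in for a real one.
my $text = <<'END';
station, delay_times
ZJ.GRAW -0.7964 0.0051 0.9690 0.0139 0 GRAW.BHZ 301.1263 -1.8041
ZJ.KNYN -0.7065 0.0072 0.9760 0.0133 0 KNYN.BHZ 301.3372 -1.9249
Phase: P
END
open my $fh, '<', \$text or die "cannot open in-memory file: $!";
print subtract_mean_two_pass($fh);
close $fh;
```

The trade-off is the one described above: two passes keep memory use flat, at the cost of reading the input a second time.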
Re^2: Performing Mathematical Operation on Specific Column of text File by Bama_Perl (Acolyte) on May 15, 2015 at 19:52 UTC
Hi Aaron,

I think this approach may be more complicated than it's worth. I think I am going to try another approach, in which I will loop through a list of files, extract the 9th column of times, sum the ninth column, provide a counter to count the number of lines in the column that match the conditions I need, then find the mean by taking the total (sum) and dividing by that counter. The logic is provided below:

    $total = 0;
    $count = 0;
    for ($j = 2; $j < @tableb; $j++) {
        chomp ($tableb[$j]);
        ($netsta,$delay_time) = (split /\s+/,$tableb[$j])[1,9];
        ($net,$sta) = (split /\./, $netsta)[0,1];
        if ($net eq "ZJ") {
            $count = $count + 1;
            $total = $total + $delay_time;
            $mean = $total/$count;
            print $mean, "\n";
        }
    }

The for loop is looping through the lines of a file held in @tableb, and if $net in the first column equals "ZJ", it adds to the counter, and then adds the delay_time. Then I want to get the mean, and I output the mean. When printing out the mean, I get:

    -0.9188 -1.0063 -0.585466666666667 -0.705775 -0.80838 -0.80595 -0.722071428571429 -0.6714 -0.773888888888889 -0.84067 -0.9097 -0.958375

    -0.7386 -0.7877 -0.784433333333333 -0.69155 -0.78836 -0.779766666666667 -0.820314285714286 -0.8476 -0.802544444444444 -0.88008 -0.9104 -0.916916666666666 -0.962815384615385 -1.0093

where the LAST value before each line break (-0.958375 and -1.0093) is the mean that I need; these are the total means. Now the question I have is, how do I extract that last value, set it to a variable, and then later on, subtract it from the $delay_time when I need to print it out (which isn't provided here)?

TLDR: When printing out the means, the running mean is updated on every iteration, and I need to extract the final mean once the for loop has finished looping through each line. That final mean (the total mean) will then be stored in a variable to be used for subtraction later. Does that make sense? I apologize if it's not clear.

One option I found was using the -1 option to extract the last line of each output. Would that work?
Re^3: Performing Mathematical Operation on Specific Column of text File by aaron_baugher (Curate) on May 16, 2015 at 02:07 UTC
I'm not clear on everything you're trying to do. But to your main question, how to save the last calculated mean so it can be used after the loop: just make sure you declare the variable outside the loop before it starts, like this:

    my $mean;
    my $total = 0;
    my $count = 0;
    for (my $j = 2; $j < @tableb; $j++) {
        # do other calculations
        if(this_line_matches()){
            $mean = $total/$count; # use $mean from outside the loop
        }
    }
    print $mean; # now contains the last value calculated inside the loop

If you don't actually need to calculate the mean for each loop, you could move that calculation to after the loop and only do it once.
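A sketch of that last suggestion, computing the mean once after the loop. The sub name `mean_delay` and the sample lines are hypothetical, and the field positions (name in field 0, delay in field 8) are assumptions based on the nine-field records shown earlier in the thread:

```perl
use strict;
use warnings;

# Hypothetical helper: mean delay time of the records whose network code
# matches $want_net. Assumes the name is field 0 and the delay is field 8.
sub mean_delay {
    my ($want_net, @lines) = @_;
    my ($total, $count) = (0, 0);
    for my $line (@lines) {
        my ($netsta, $delay) = (split ' ', $line)[0, 8];
        next unless defined $delay;          # skip short lines
        my ($net) = split /\./, $netsta;
        next unless $net eq $want_net;
        $total += $delay;
        $count++;
    }
    # Divide once, after the loop; undef signals "no matching records".
    return $count ? $total / $count : undef;
}

my @tableb = (
    "ZJ.GRAW -0.7964 0.0051 0.9690 0.0139 0 GRAW.BHZ 301.1263 -1.8041",
    "ZJ.KNYN -0.7065 0.0072 0.9760 0.0133 0 KNYN.BHZ 301.3372 -1.9249",
    "Phase: P",
);
my $mean = mean_delay('ZJ', @tableb);
printf "%.4f\n", $mean;    # $mean stays available for later subtraction
```

Because the result is returned rather than printed inside the loop, there is no need to pick the last line out of the output afterwards.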
Re: Performing Mathematical Operation on Specific Column of text File by NetWallah (Canon) on May 14, 2015 at 22:58 UTC
    >perl -ane "next unless $F[0]=~/^ZJ/;$count++;$val+=$F[8]}{printf qq|%d values; mean=%.4f\n|,$count,$val/$count" source.file.name.txt
    #OUTPUT:
    6 values; mean=-1.1958

Use single quotes on Linux. Format, and redirect output, as desired.

See "perl --help" for the meanings of -a, -n, -e, and @F. "}{" is the Perl "eskimo greeting".

"You're only given one little spark of madness. You mustn't lose it." - Robin Williams
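Written out in full, the one-liner does roughly the following. This is an approximation for illustration rather than the exact code Perl generates for `-n` and `-a`; the sub name `count_and_mean` is made up, and the guard against empty lines is an addition:

```perl
use strict;
use warnings;

# Roughly what "perl -ane '...}{...'" expands to, written as a sub so it
# can be called on any filehandle.
sub count_and_mean {
    my ($fh) = @_;
    my ($count, $val) = (0, 0);
    while (my $line = <$fh>) {            # -n: loop over the input lines
        my @F = split ' ', $line;         # -a: autosplit each line into @F
        next unless @F && $F[0] =~ /^ZJ/; # only records starting with ZJ
        $count++;
        $val += $F[8];                    # ninth field, zero-based index 8
    }                                     # "}{" closes the loop at this point
    # Everything after "}{" runs once, when the input is exhausted.
    return ($count, $count ? $val / $count : undef);
}

my $text = <<'END';
station, delay_times
ZJ.GRAW -0.7964 0.0051 0.9690 0.0139 0 GRAW.BHZ 301.1263 -1.8041
ZJ.KNYN -0.7065 0.0072 0.9760 0.0133 0 KNYN.BHZ 301.3372 -1.9249
Phase: P
END
open my $fh, '<', \$text or die "cannot open in-memory file: $!";
my ($count, $mean) = count_and_mean($fh);
close $fh;
printf "%d values; mean=%.4f\n", $count, $mean;
```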