divide multi-column input file into sub-files depending on specific column's value

angela2 has asked for the wisdom of the Perl Monks concerning the following question:

Hi all! I have some data that looks like the sample below, after some formatting. Thanks for helping me fix my code (in another thread) and format my data nicely.

-59.077    89.301   115.664       7
-61.251    77.435   117.760      -6
-60.950    71.712   116.061      -7
-56.247    83.685   114.576       1
-59.263    76.107   112.555      -2
-59.895    65.296   111.185       3
-60.141    63.694   111.257      -3
-61.667    63.707   116.937       2
-58.722    60.429   111.307      -1 
-57.511    42.922   112.108       6

(^ 10 lines)

Now what I want to do is the following:

1. Split my data file into separate data files, depending on the absolute value of the last column. As an example, from the input above I want to get the following files:

File 1: output_7.txt:

 7   -59.077    89.301   115.664
-7   -60.950    71.712   116.061

File 2: output_6.txt:

 -6  -61.251    77.435   117.760  
  6  -57.511    42.922   112.108

File 3: output_1.txt:

-1   -58.722    60.429   111.307
 1   -56.247    83.685   114.576

And so on, for all the values of the last column.

Notes:

- The number of columns is not fixed, and

- I don't have a standard list of the last column's values; however, they always come in pairs of positive and negative integers, as described above.

2. I also want to print the line number, so the above outputs would be (second column in the output below is the line number from the input file above)

File 1: output_7.txt:

 7   1   -59.077    89.301   115.664
-7   3   -60.950    71.712   116.061

File 2: output_6.txt:

-6   2  -61.251    77.435   117.760  
 6  10  -57.511    42.922   112.108

File 3: output_1.txt:

-1   9   -58.722    60.429   111.307
 1   4   -56.247    83.685   114.576

And so on.
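The two requirements above can be sketched as a small grouping routine. This is only an illustration, not code from the thread: `group_by_abs_last` is a hypothetical helper, the widths in the `sprintf` format are arbitrary, and the sample lines are the first three rows of the input shown earlier.

```perl
use strict;
use warnings;

# Hypothetical helper: group lines by the absolute value of their last
# column. Returns a hash ref { abs_value => [ formatted lines ] }.
sub group_by_abs_last {
    my @lines = @_;
    my %groups;
    my $line_no = 0;
    for my $line (@lines) {
        $line_no++;
        my @cols = split ' ', $line;    # works for any number of columns
        next unless @cols;              # skip blank lines
        my $id = pop @cols;             # signed value from the last column
        push @{ $groups{ abs($id) } },
            sprintf( '%2d %3d %s', $id, $line_no, "@cols" );
    }
    return \%groups;
}

my @sample = (
    '-59.077    89.301   115.664       7',
    '-61.251    77.435   117.760      -6',
    '-60.950    71.712   116.061      -7',
);
my $groups = group_by_abs_last(@sample);

for my $key ( sort { $a <=> $b } keys %$groups ) {
    print "output_$key.txt:\n";
    print "$_\n" for @{ $groups->{$key} };
}
```

The hash key is the absolute value, while the signed value is what gets printed, so matching and printing never have to use the same variable.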

This is what I've done so far:

Code 1 looks for a specific number, and it works fine. Well, it's probably very awkwardly written, but it works. As a test, I matched against the value "-7" to see if my code works.

#!/usr/bin/perl
use warnings;
use strict;

open my $target, '>', "test-out-1" or die $!;

open my $FILE, '<', 'input_file' or die $!;
while (<$FILE>) {

  chomp;
  my @columns = unpack('a8 a8 a8 a6');
  #print join(" ",map {$_} @columns), "\n";
  #print "@columns[3] \n";
    
  foreach (@columns[$#columns]) {
   print "$_ \n";
      if ($_ =~ /-7/) {
     
      my $ID = $_;
      my $IDform = sprintf ("%4s", $ID);
      
      my $currentline = $.;
      my $currentlineform = sprintf ("%7s", $currentline);

      my @selection = (@columns[0..$#columns-1]);
      my $layout = "%10s"x(@selection) . "\n";
      printf $target $IDform . $currentlineform . $layout, @selection;
      }
  }
}

This is part of my output:

   -7    418   -17.459    -3.557   123.002
   -7    419   -19.119    -2.327   121.948
   -7    421   -18.172    -5.439   122.677
   -7    423   -21.239    -5.003   128.245
   -7    424   -17.575    -3.567   124.891
   -7    425   -19.519     1.088   136.199
   -7    426   -17.135    -5.042   124.510
   -7    427   -19.539    -2.356   127.619
   -7    429   -16.867     0.671   123.725
   -7    430   -19.638     8.992   126.487
   -7    431   -19.731    13.090   129.183
   -7    432   -17.846    15.834   128.342
   -7    440   -20.265    16.101   127.072

First column: the value of the input file's fourth column.

Second column: the line number where the matching 4th column pattern was, in the input file.

Rest of the columns: the rest of the columns of my input file, that correspond to the 4th column value.

Now I want code that works for all possible values of the input file's 4th column. Because the values always come in positive/negative pairs, and because it's highly unlikely that they will fall outside the range 1-7, I broadened the range a bit to be safe and made an array with values from -10 to 10 (this is @match), in an attempt to have my code work for every possible value.
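As a side note on building that list: Perl's range operator can produce the same 21 values without typing each one. The sketch below is an illustration only; `@abs_values` is a name introduced here, not something from the original code.

```perl
use strict;
use warnings;

# The range operator yields -10, -9, ..., 9, 10.
my @match = ( -10 .. 10 );

# The values come in +/- pairs, so only the distinct absolute values
# are needed to decide which output file a line belongs to.
my %seen;
my @abs_values = grep { !$seen{$_}++ } map { abs($_) } @match;

print scalar(@match), " values, ", scalar(@abs_values), " distinct absolute values\n";
```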

This is what I've done so far:

#!/usr/bin/perl
use warnings;
use strict;

my $match;
my @match = ();

push (@match, $match); # I fear this is what I'm doing wrong - I'm not putting $match in @match correctly.

@match = (-10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10);

my $value = abs($match); # Here I'm trying to establish a variable that will basically be the absolute value of the fourth column of my input file, this is what I want to match.

open my $target, '>', "test-out-$value" or die $!;

open my $FILE, '<', 'input_file' or die $!;
while (<$FILE>) {

  chomp;
  my @columns = unpack('a8 a8 a8 a6');
  #print join(" ",map {$_} @columns), "\n";
  #print "@columns[3] \n";
    
  foreach (@columns[$#columns]) {

      if ($_ =~ /$value/) {
     
      my $ID = $_; 
      # Also here I'm confused and I can't think about how to write what I want.
      # I want to match the absolute value (a few lines above), but print the
      # actual value (in my output file), not the absolute one. This worked
      # easily in my first code (where I matched for specific number) but now
      # I can't think how to write the general version.
      my $IDform = sprintf ("%4s", $ID);
     
      my $currentline = $.;
      my $currentlineform = sprintf ("%7s", $currentline);##

      my @selection = (@columns[0..$#columns-1]);
      my $layout = "%10s"x(@selection) . "\n";
      printf $target $IDform . $currentlineform . $layout, @selection;
     }
  }
}

From this, I get an uninitialized variable error for $match. I have tried to work out how to fix this, but I'm doing it wrong and can't figure out what's wrong. Also, the output I'm getting is a file of size 0 named "test-out-0".

This is a bit too much to be edited, I know, but if someone could at least let me know how I'm managing to populate the array wrong? I checked out the array functions (shift, unshift, pop, etc) and "push" seemed like the right way to go. If I get that corrected, I may be able to continue to fix the whole thing. Thank you so much for taking the time to read this and looking forward to some kindly offered hints and suggestions.
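On the array question itself, here is a minimal sketch of what the `push` line and the `abs($match)` line do, as far as can be told from the code as posted. The `no warnings` block is added only to keep the demonstration quiet, and the loop at the end is one possible way to visit every value, not the only one.

```perl
use strict;
use warnings;

my $match;                     # declared but never assigned: its value is undef
my @match = ();

push @match, $match;           # appends that single undef; it does not link $match to @match
my $pushed = scalar @match;    # one element so far

@match = ( -10 .. 10 );        # replaces the contents; $match itself is still undef

# abs(undef) evaluates to 0 (with an "uninitialized value" warning when
# warnings are enabled), which would explain an output file named test-out-0.
my $value = do { no warnings 'uninitialized'; abs($match) };
print "test-out-$value\n";

# To work with every value, loop over the array instead of a lone scalar.
for my $m (@match) {
    next if $m <= 0;           # one file per +/- pair
    print "would open test-out-$m\n";
}
```

A scalar and an array that share a name (`$match` and `@match`) are unrelated variables in Perl, so filling one never fills the other.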


Re: divide multi-column input file into sub-files depending on specific column's value by BrowserUk (Patriarch) on Jul 05, 2016 at 10:58 UTC
This does what you asked for in the first screenful of your post. I wrote and tested it before noticing there was a lot more to the post, so here it is; maybe it is useful to you:

#! perl -sw
use strict;

my %fhs;

while( <DATA> ) {
    my @bits = split;
    my $abs = abs( $bits[ 3 ] );
    open $fhs{ $abs }, '>', 'output_' . $abs or die $! unless exists $fhs{ $abs };
    print { $fhs{ $abs } } $_;
}

__DATA__
-59.077    89.301   115.664       7
-61.251    77.435   117.760      -6
-60.950    71.712   116.061      -7
-56.247    83.685   114.576       1
-59.263    76.107   112.555      -2
-59.895    65.296   111.185       3
-60.141    63.694   111.257      -3
-61.667    63.707   116.937       2
-58.722    60.429   111.307      -1
-57.511    42.922   112.108       6

Produces:

C:\test>type output_*

output_1
-56.247    83.685   114.576       1
-58.722    60.429   111.307      -1

output_2
-59.263    76.107   112.555      -2
-61.667    63.707   116.937       2

output_3
-59.895    65.296   111.185       3
-60.141    63.694   111.257      -3

output_6
-61.251    77.435   117.760      -6
-57.511    42.922   112.108       6

output_7
-59.077    89.301   115.664       7
-60.950    71.712   116.061      -7

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
In the absence of evidence, opinion is indistinguishable from prejudice.
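For readers new to the idioms in that reply, the following sketch spells out the two less obvious parts: a hash that maps each absolute value to an open filehandle, and the block form of `print` needed when the handle lives in a hash element. The scratch directory from `File::Temp` and the three short sample lines are assumptions made for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir( CLEANUP => 1 );    # scratch directory, used only for this demo
my %fhs;                              # absolute value => open filehandle

for my $line ( "a 7\n", "b -6\n", "c -7\n" ) {
    my $abs = abs( ( split ' ', $line )[-1] );

    # Open each output file only the first time its key is seen.
    unless ( exists $fhs{$abs} ) {
        open $fhs{$abs}, '>', "$dir/output_$abs"
            or die "Cannot open output_$abs: $!";
    }

    # A handle stored in a hash element must be wrapped in a block for print.
    print { $fhs{$abs} } $line;
}

for my $fh ( values %fhs ) {
    close $fh or die "Cannot close: $!";
}
```

Opening in `'>'` mode truncates, which is why each file must be opened once and kept open rather than reopened per line.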
Re^2: divide multi-column input file into sub-files depending on specific column's value by angela2 (Sexton) on Jul 05, 2016 at 11:08 UTC
Hi, thanks for your time. I know, it was a really long post :( I was just testing your answer, it does indeed do part of what I want. My problem is that it's a bit too smart for my Perl knowledge :P As you can see I tried to approach it in a very lengthy way because I could understand it better. I'll try to understand exactly what's happening in the code you posted, it is indeed very useful.

Can I ask one more favour? Could you please maybe explain, when/if you find some time to read the rest of my post, what I'm doing wrong with populating the array? I've been googling for hours and don't seem to be able to get it. All I'm trying to write is that all the $match variables are meant to belong in the @match array, or does that make no sense? In my mind, I'm trying to find a way to print $match and have all the values from -10 to 10 printed. Does that sound correct or am I completely off?

Update: Ok this is what I did and it's working! :) I modified my code from my original post (code attempt #2) by adding and editing your contribution. I believe it works correctly for my purpose, I'm now confirming that my output files are correct.

#!/usr/bin/perl
use warnings;
use strict;

my %fhs;
my $molecule = "1kc4";

open my $FILE, '<', 'input_file' or die $!;
while (<$FILE>) {

  chomp;
  my @columns = unpack('a8 a8 a8 a6');
  #print join(" ",map {$_} @columns), "\n";
  #print "@columns[3] \n";

  foreach (@columns[$#columns]) {

      my $abs = abs( $columns[ $#columns ] );
      open $fhs{ $abs }, '>', "${molecule}_cluster_" . $abs or die $! unless exists $fhs{ $abs };
      my $file = $fhs{ $abs };

      my $ID = $_;
      my $IDform = sprintf ("%4s", $ID);

      my $currentline = $.;
      my $currentlineform = sprintf ("%7s", $currentline);##

      my @selection = (@columns[0..$#columns-1]);
      my $layout = "%10s"x(@selection) . "\n";
      printf $file $IDform . $currentlineform . $layout, @selection;
  }
}

The output filenames are correct and one of them looks like this:

    8    109   -42.129   -57.475    94.651
    8    110   -45.520   -62.056    90.318
    8    111   -49.196   -63.045    92.577
    8    112   -46.086   -71.753    88.267
   -8    113   -48.146   -76.799    77.638
    8    114   -41.865   -62.567    86.437

It would be great if you could take a look and let me know if you see something erroneous in my code. Thank you again for your time :)
Re^3: divide multi-column input file into sub-files depending on specific column's value by BrowserUk (Patriarch) on Jul 05, 2016 at 14:37 UTC
Are you not getting loads of warnings when you run your code? For example:

Scalar value @columns[$#columns] better written as $columns[$#columns] at ...
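The warning quoted above is about sigils: a single array element takes `$`, while `@` with a subscript is a slice meant for several elements. A short sketch, using the first input row as sample values:

```perl
use strict;
use warnings;

my @columns = ( '-59.077', '89.301', '115.664', '7' );

my $last = $columns[-1];                     # one element: $ sigil, -1 means "last"
my @rest = @columns[ 0 .. $#columns - 1 ];   # several elements: a slice keeps @

print "last=$last rest=@rest\n";
```

With the last column in a plain scalar, the single-pass `foreach (@columns[$#columns])` loop is no longer needed.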
Re: divide multi-column input file into sub-files depending on specific column's value by Marshall (Canon) on Jul 05, 2016 at 10:55 UTC
Here is a rather simple formulation of an approach. I leave it to you to get the formatting right and put in the appropriate file opens... $. is the line number of the current file.

#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;

my %hash;   # make a hash of array based upon last column

while (my $line = <DATA>)
{
   my @tokens = split ' ', $line;
   my $abs_lastcol = abs $tokens[-1];
   my $newline = "$tokens[-1] $. $tokens[0] $tokens[1] $tokens[2]";
   push @{$hash{$abs_lastcol}}, $newline;
}

foreach my $file (sort keys %hash)
{
   print "generate file: $file...\n";
   print "$_\n" for @{$hash{$file}};
}

=prints
generate file: 1...
1 4 -56.247 83.685 114.576
-1 9 -58.722 60.429 111.307
generate file: 2...
-2 5 -59.263 76.107 112.555
2 8 -61.667 63.707 116.937
generate file: 3...
3 6 -59.895 65.296 111.185
-3 7 -60.141 63.694 111.257
generate file: 6...
-6 2 -61.251 77.435 117.760
6 10 -57.511 42.922 112.108
generate file: 7...
7 1 -59.077 89.301 115.664
-7 3 -60.950 71.712 116.061
=cut

__DATA__
-59.077    89.301   115.664       7
-61.251    77.435   117.760      -6
-60.950    71.712   116.061      -7
-56.247    83.685   114.576       1
-59.263    76.107   112.555      -2
-59.895    65.296   111.185       3
-60.141    63.694   111.257      -3
-61.667    63.707   116.937       2
-58.722    60.429   111.307      -1
-57.511    42.922   112.108       6
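The file-writing step that this reply leaves to the reader might look roughly like the sketch below. `write_groups`, the `output_` prefix, and the scratch directory are assumptions for illustration; the hash has the same shape as `%hash` in the reply (absolute value mapped to an array of lines).

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: write each group to "<dir>/<prefix><key>.txt".
sub write_groups {
    my ( $hash, $dir, $prefix ) = @_;
    for my $key ( sort { $a <=> $b } keys %$hash ) {
        my $path = "$dir/$prefix$key.txt";
        open my $out, '>', $path or die "Cannot open $path: $!";
        print {$out} "$_\n" for @{ $hash->{$key} };
        close $out or die "Cannot close $path: $!";
    }
    return;
}

my %hash = (
    7 => [ '7 1 -59.077 89.301 115.664',  '-7 3 -60.950 71.712 116.061' ],
    6 => [ '-6 2 -61.251 77.435 117.760', '6 10 -57.511 42.922 112.108' ],
);

my $dir = tempdir( CLEANUP => 1 );    # scratch directory, used only for this demo
write_groups( \%hash, $dir, 'output_' );
```

Collecting everything first and writing afterwards keeps only one file open at a time, at the cost of holding all lines in memory.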
Re: divide multi-column input file into sub-files depending on specific column's value by perldigious (Priest) on Jul 05, 2016 at 21:15 UTC
You would need to modify this for the formatting you want (I split your data on whitespace, manipulate it, and then join it again with simple tabs in between), but I believe the following code is one way to do what you are asking.

#!/usr/bin/perl
use warnings;
use strict;

open(my $FILE, '<', 'input_file.txt') or die $!;
my %output_files;
my $line_number = 1;

while (<$FILE>)
{
    chomp;
    my @columns = split;
    my $value = abs($columns[$#columns]);

    if(!exists $output_files{$value})
    {
        open($output_files{$value}, '>', "test-out-$value.txt") or die $!;
    }

    my $fh = $output_files{$value};
    splice @columns, 0, 0, $columns[$#columns], $line_number;
    pop @columns;
    my $output_line = join "\t", @columns;
    print $fh "$output_line\n";
    $line_number++;
}

EDIT: Didn't notice you moved the last column to be the first column right away, updated code for that.
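The `splice` and `pop` pair in that reply moves the last column to the front and adds the line number after it. An equivalent rearrangement, shown here only as an alternative sketch on one sample row, uses `pop` followed by `unshift`:

```perl
use strict;
use warnings;

my @columns     = qw(-59.077 89.301 115.664 7);
my $line_number = 1;    # stands in for the counter (or $.) in the real loop

# Take the last column off the end, then put it and the line number
# at the front, in that order.
my $id = pop @columns;
unshift @columns, $id, $line_number;

print join( "\t", @columns ), "\n";
```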
Re: divide multi-column input file into sub-files depending on specific column's value by Anonymous Monk on Jul 05, 2016 at 19:54 UTC
Just write straight to the files:

#!/usr/bin/perl
# http://perlmonks.org/?node_id=1167222
use strict;
use warnings;

my %files;

while(<DATA>)
  {
  /(.*\S)\s+(\S+)/ or next;
  $files{abs $2} or open $files{abs $2}, '>', "output_@{[abs $2]}.txt" or die;
  printf {$files{abs $2}} "%3d %3d %s\n", $2, $., $1;
  }
close $_ for values %files;

system 'more output_*.txt | cat'; # for debugging

__DATA__
-59.077    89.301   115.664       7
-61.251    77.435   117.760      -6
-60.950    71.712   116.061      -7
-56.247    83.685   114.576       1
-59.263    76.107   112.555      -2
-59.895    65.296   111.185       3
-60.141    63.694   111.257      -3
-61.667    63.707   116.937       2
-58.722    60.429   111.307      -1
-57.511    42.922   112.108       6
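To see what the regex in that reply captures, here is a small sketch on the first input row. It assumes the pattern was meant to read `(.*\S)\s+(\S+)`, that is, with a greedy `.*` in the first group; the line number 1 is hard-coded for the demonstration.

```perl
use strict;
use warnings;

my $line = "-59.077    89.301   115.664       7\n";
my $out  = '';

# (.*\S)  greedy: everything up to the last non-space before the final gap
# \s+     the whitespace in front of the last column
# (\S+)   the last column itself
if ( $line =~ /(.*\S)\s+(\S+)/ ) {
    my ( $rest, $id ) = ( $1, $2 );
    $out = sprintf "%3d %3d %s\n", $id, 1, $rest;
}
print $out;
```

Because the first group is captured as one string, the original spacing between the leading columns is preserved, however many columns there are.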