in reply to Split a large text file by columns

It would be helpful if you could post a small subset of some actual data. I guess you have something like this:

    row_nameA:col2:col3:col4:col5:col6...col200
    row_nameB:col2:col3:col4:col5:col6...col200
It is completely unclear what separates the columns (above, I used ':'). This detail does matter. I suggest you give us, say, the first 7 columns x 3 rows and put that data within <code></code> tags. Then show us your "best go" at this problem so far in Perl.

Can you give more info about the size of this input file? How many rows? I suspect that the entire input file will fit comfortably in memory and that generating the 70 or so output files can proceed in a straightforward way. There are a number of techniques for doing this. Your question is still too general to get a concrete answer other than "heck yes, Perl can do it!". Also, mention whether performance is of any concern at all; I don't expect it to be an issue here, because most of the time will be spent on I/O while generating the plethora of output files.

I guess there is an additional question, at least for my own curiosity: why are you doing this? Your application seems just odd enough (200 columns, 3 columns per file, about 70 output files) that perhaps there is a better way to do whatever it is that you are trying to do. This might be what is known as an X-Y problem.

Update:
Below is some code for one way to do this; there are other ways.

The code reads the input file and makes a 2-D array of the data. I presume that this amount of data will "fit" into memory without problems. If the line format is complex, then perhaps a CSV module will be needed to parse each line? At each iteration of generating a new file, column 1 (the name) is reused, and then the next leftmost 3 columns of data are consumed (the @data array "shrinks"). The loop ends when only column 1 of the original data remains.

#!/usr/bin/perl
use strict;
use warnings;

my $number_of_cols_per_file = 3;
my @data;    # this is a 2-D array

while (my $line = <DATA>)
{
    chomp $line;
    my (@cols) = split (':', $line);
    push @data, \@cols;
}

my $file_num = 1;
while (@{$data[0]} > 1)    # Any columns after the name_column left?
{
    # generate the next file
    # This print would change to a "file open" statement for
    # file_num, n...
    print "File Number = ", $file_num++, "\n";

    foreach my $row_ref (@data)
    {
        my $row_name  = $row_ref->[0];
        my @data_cols = splice (@$row_ref, 1, $number_of_cols_per_file);
        print join(":", $row_name, @data_cols), "\n";
    }
}

=Prints:
File Number = 1
row_nameA:col2a:col3a:col4a
row_nameB:col2b:col3b:col4b
File Number = 2
row_nameA:col5a:col6a
row_nameB:col5b:col6b
=cut

__DATA__
row_nameA:col2a:col3a:col4a:col5a:col6a
row_nameB:col2b:col3b:col4b:col5b:col6b
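As a follow-up to the CSV-module remark above, here is a minimal sketch of that alternative using Text::CSV. The tab separator and the file name input.tsv are my assumptions, since we have not yet seen the actual data.

#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;    # uses Text::CSV_XS automatically if it is installed

# Assumption: the file is named "input.tsv" and the columns are tab separated.
my $csv = Text::CSV->new({ sep_char => "\t", binary => 1, auto_diag => 1 });

open my $fh, '<', 'input.tsv' or die "Cannot open input.tsv: $!";

my @data;    # the same 2-D array as in the script above
while (my $row = $csv->getline($fh)) {
    push @data, $row;    # getline already returns an array reference of columns
}
close $fh;

print scalar(@data), " rows read, ", scalar(@{ $data[0] }), " columns each\n";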

Re^2: Split a large text file by columns
by tc (Novice) on Apr 21, 2017 at 18:17 UTC

    Thank you Marshall. I appreciate your help.

    The file columns are tab separated. A small sample of the data file is below

    <GSOR>  vnir_1   vnir_2   vnir_3   vnir_4   vnir_5   vnir_6   vnir_6
    310015  0.37042  0.36909  0.36886  0.36698  0.36615  0.36449  0.36404
    310100  0.25889  0.25773  0.2569   0.25563  0.25565  0.25511  0.25508
    310134  0.26163  0.26149  0.26059  0.26034  0.2604   0.2598   0.26085
    310167  0.23168  0.23031  0.23045  0.22822  0.2267   0.22575  0.22453
    310196  0.26995  0.26902  0.2685   0.26689  0.26624  0.2647   0.26461
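    Since the columns are tab separated, the only change the script above needs is the split pattern: /\t/ instead of ':'. A minimal sketch, using one made-up row for illustration:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Sketch only: the row is built here with "\t" just to demonstrate the split;
    # the real script would read lines from the data file instead.
    my $line = join "\t", qw(310015 0.37042 0.36909 0.36886);
    my @cols = split /\t/, $line;    # was: split(':', $line)

    print "row name: $cols[0]\n";
    print "values:   @cols[1 .. $#cols]\n";
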
Re^2: Split a large text file by columns
by tc (Novice) on Apr 21, 2017 at 20:46 UTC

    I will try to provide as much detail as I can about the file and why I am trying to do this. The file is a set of wavelength measurements collected from about 200 individual plants. The instrument measures a response for each wavelength between 300 nm and 1000 nm at different intervals. Usually, the file contains between 100 and 200 columns, depending on the settings, and 200 rows (one per plant ID). There are about 20 files in total when measurements are completed. I need to separate each column and keep the first column (row names) with it. I will then analyze each of these files for genetic information for research.

    Memory is not an issue since I can run it on a server. I agree, my application is odd for what Perl is usually used for, but I started teaching myself to use it since I am working with large files and need an efficient way to process them. I have been able to write a couple of scripts that have made a few tedious tasks efficient and less mistake-prone. That's the gist of it.

    Thank you again, Monks.

      Here is a way to do what you described (with the help of the Data::Table and Path::Tiny CPAN modules).

      I'm assuming that there was a typo in the data you provided, and I changed the name of the last column to vnir_7.

      I put the following tab-delimited data into a file called data.tsv:

      <GSOR>  vnir_1   vnir_2   vnir_3   vnir_4   vnir_5   vnir_6   vnir_7
      310015  0.37042  0.36909  0.36886  0.36698  0.36615  0.36449  0.36404
      310100  0.25889  0.25773  0.2569   0.25563  0.25565  0.25511  0.25508
      310134  0.26163  0.26149  0.26059  0.26034  0.2604   0.2598   0.26085
      310167  0.23168  0.23031  0.23045  0.22822  0.2267   0.22575  0.22453
      310196  0.26995  0.26902  0.2685   0.26689  0.26624  0.2647   0.26461
      This script processes the data and creates the files:
      #!/usr/bin/env perl
      use strict;
      use warnings;
      use Data::Table;
      use Path::Tiny;

      # Load the tsv file with a header
      my $dt = Data::Table::fromTSV( 'data.tsv', 1 );

      # Get a Data::Table that contains only the first column
      my $names_dt = $dt->subTable( undef, [ '<GSOR>' ] );

      my $n_col        = $dt->nofCol;
      my @column_names = $dt->header;

      for ( my $i = 1; $i <= $n_col - 1; ++$i ) {
          my $col_name = $column_names[$i];
          my $col_dt   = $dt->subTable( undef, [ $col_name ] );

          my $new_dt = $names_dt->clone();
          $new_dt->colMerge($col_dt);

          my $file_name = "file_$i.tsv";
          my $fh        = path($file_name)->openw_utf8;
          print {$fh} $new_dt->tsv;
          $fh->close;
      }
      exit;
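
      For the sample data above, this should produce seven files, file_1.tsv through file_7.tsv, each holding the <GSOR> column plus one vnir column. For example, file_1.tsv should look something like this:

      <GSOR>  vnir_1
      310015  0.37042
      310100  0.25889
      310134  0.26163
      310167  0.23168
      310196  0.26995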

        Hi Kevbot, thank you so much. The code that you wrote works perfectly. I see that I still have a long way to go in learning Perl. I did not think of the different kinds of modules that make such tasks possible. My next task is to learn about the modules that are most useful.

        My humblest thank you.

Re^2: Split a large text file by columns
by tc (Novice) on Apr 22, 2017 at 19:05 UTC

    Hi Marshall, thank you, sir, for your knowledge and help. The code that you wrote does the job and it works great. I spent a week trying to write code that does this. I have a lot to learn. Your time and knowledge are most appreciated.