iangibson has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I want to process several huge text files (millions of lines each, and over 1,000 columns). For each file, I want to copy the first nine columns into each of four separate new files; then, starting from the tenth column, I want to copy each remaining column as a new column into one of the four new files, based on its column header (which file a column is copied to is determined by looking up the header in a separate ID file).
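For illustration only (the headers and assignments here are made up), the ID file maps each column header to one of the four new files, something like:

    ## hypothetical ID file: column header -> destination file (1-4)
    sampleA    1
    sampleB    3
    sampleC    2
    ...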

So I don't want to process the files by line, but rather by column. I imagined that a problem like this would be a fairly common task, but I have been searching for an appropriate method in vain. I've also looked at Tie::Handle::CSV and Text::CSV, but these modules seem to only process a file line-wise, not column-wise, which, considering the size of my files, would be quite inefficient and complex (once the column header is read, this is all the information necessary to determine where to copy the entire column to).

Any pointers as to where to look to get started, or working examples of a similar nature would be most appreciated.

Re: Processing files column-wise
by BrowserUk (Patriarch) on Feb 22, 2012 at 22:49 UTC
    I've also looked at Tie::Handle::CSV and Text::CSV, but these modules seem to only process a file line-wise, not column-wise, which, considering the size of my files, would be quite inefficient and complex (once the column header is read, this is all the information necessary to determine where to copy the entire column to).

    There is no mechanism for reading a column from a file without reading the file line by line. That's just the way files work.

    But line-by-line processing of files is perfectly efficient, provided that you do not have to re-process each line for each column. That means placing all the fields from the first line into their respective files before reading and processing the second line.

    This makes a lot of assumptions about the formatting of your ids and data files, but may serve to illustrate the technique even if you need to use one of the bastardised csv format processors.

    Update: reversing the %ids hash -- i.e. using the file numbers as the keys and pushing the field numbers into an array as the value -- would save having to grep the hash 4 times for every record (a sketch of this variant follows the code below).

    This is untested beyond basic syntax checking:

    #! perl -slw
    use strict;

    use Data::Dump qw[ pp ];

    ## Assumes that the IDs file consists of space separated lines
    ## zero-based-column-no-in-data-file zero-based-fileno-destination
    open IDS, '<', 'ids.map' or die $!;
    my %ids = map {
        my( $columnNo, $fileNo ) = split;
        $columnNo -= 9; ## adjust column numbers to index @fields after the first nine are removed
        ( $columnNo, $fileNo );
    } <IDS>;
    close IDS;

    ## for each data filename supplied on the command line
    for my $filename ( @ARGV ) {
        ## open that file for input
        open IN, '<', $filename or die $!;

        ## open 4 output files named as $filename.out.n
        my @outs;
        open $outs[ $_ ], '>', "$filename.out.$_" or die $! for 0 .. 3;

        ## read the data file line by line
        while( <IN> ) {
            chomp;
            ## split the line into fields -- assumes sane csv definition
            my @fields = split /\s*,\s*/, $_;

            ## print the first nine fields to each of the 4 files
            ## and remove them from the @fields array
            printf { $outs[ $_ ] } "%s, ", join ', ', @fields[ 0 .. 8 ] for 0 .. 3;
            splice @fields, 0, 9;

            ## for each of the output files
            for my $fileNo ( 0 .. 3 ) {
                ## print those fields ...
                print { $outs[ $fileNo ] } join ', ', @fields[
                    ## that are mapped to this file
                    grep { $ids{ $_ } == $fileNo } 0 .. $#fields
                ];
            }
        }

        ## cleanup
        close $outs[ $_ ] for 0 .. 3;
        close IN;
    }
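    A minimal sketch of the reversed mapping mentioned in the update above (untested; @fileCols is just an illustrative name, and it assumes the same ids.map format and that the first nine fields have already been printed and spliced off as in the code above):

    ## build the reverse map once: file number => [ field indexes routed to it ]
    open IDS, '<', 'ids.map' or die $!;
    my @fileCols;
    while( <IDS> ) {
        my( $columnNo, $fileNo ) = split;
        push @{ $fileCols[ $fileNo ] }, $columnNo - 9; ## same adjustment as above
    }
    close IDS;

    ## ...and inside the per-line loop the grep disappears:
    ## print { $outs[ $fileNo ] } join ', ', @fields[ @{ $fileCols[ $fileNo ] } ];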


Re: Processing files column-wise
by aaron_baugher (Curate) on Feb 22, 2012 at 21:16 UTC

    This still sounds like line-by-line processing. If not, maybe you can give some example data, input and output. But I'm picturing this:
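    Something along these lines, perhaps -- just a rough sketch, since the tab separators, the file names ('ids.txt', 'data.txt', 'out.N') and the idea that the ID file maps each header to a file number 0-3 are all guesses about your data (and it doesn't bother writing the header line out to the four files):

    use strict;
    use warnings;

    # read the ID file into a hash: column header => output file number (0..3)
    open my $ids, '<', 'ids.txt' or die $!;
    my %dest = map { split } <$ids>;
    close $ids;

    open my $in, '<', 'data.txt' or die $!;
    my @out = map { open my $fh, '>', "out.$_" or die $!; $fh } 0 .. 3;

    # work out once, from the header line, which file each extra column goes to
    chomp( my $header = <$in> );
    my @headers = split /\t/, $header;
    my @where   = map { $dest{ $headers[$_] } } 9 .. $#headers;

    while ( my $line = <$in> ) {
        chomp $line;
        my @cols  = split /\t/, $line;
        my @first = @cols[ 0 .. 8 ];                  # first nine columns go to every file
        my @rest  = @cols[ 9 .. $#cols ];
        my @buf   = ( [ @first ], [ @first ], [ @first ], [ @first ] );
        push @{ $buf[ $where[$_] ] }, $rest[$_] for 0 .. $#rest;
        print { $out[$_] } join( "\t", @{ $buf[$_] } ), "\n" for 0 .. 3;
    }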


Re: Processing files column-wise
by JavaFan (Canon) on Feb 22, 2012 at 21:35 UTC
    I cannot figure out what end result you are aiming for (you lost me after "starting from the tenth column"). But if you want to cut out the first 9 columns of a file, use the cut utility. For instance (untested):
    use autodie;
    open my $fh, '-|', "cut -d ' ' -f 1-9 $file1 $file2 $file3 $file4";
    while (<$fh>) {
        my @columns = split ' ';
        # ... do something with the 9 columns ...
    }