nanoplasmonic has asked for the wisdom of the Perl Monks concerning the following question:

I am using Perl to parse a number of tab-delimited text files stored in a directory. The code extracts a particular column of data and stores it in a matrix. After all the files in the directory have been looped through, the contents of the matrix, along with the filenames, are printed to a user-specified output file. I am fairly new to Perl; the code, which I modified from an online source, is given below. The directory with the new set of data has become very large (2.09 GB across 68,668 files) and I am now getting "out of memory" errors. So I was wondering if it is possible to read the files one at a time and print to an existing text file by appending new columns. I can already do this by appending the new data as new lines, but that layout becomes hard to read in the data analysis programs.

    #arguments with command to run must be Directory name, Column number, Output file
    $delimiter = "\t";

    #input name of the data directory from command line
    $dir = $ARGV[0];
    #input number of the column to be read from command line. 0 for the first column
    $columnNo = $ARGV[1];
    #input name of the output file from command line
    $outfile = $ARGV[2];

    #reminder for format of command statement
    if(!$dir or (!$columnNo && $columnNo ne '0') ) {
        print "No input data directory or column number.\nUsage: perl my_perl.pl dir columnNo outfile\n";
    }

    #read the directory
    opendir(DIR, $dir) or die "can not open directory $dir\n";
    while($name = readdir(DIR)) {
        #save data file names in an array, don't include . and ..
        push(@files, $name) if( !($name eq '.' || $name eq '..') );
    }
    closedir(DIR) or die "can not close directory $dir\n";

    #process data
    if($dir && ($columnNo || $columnNo eq '0') ) {
        #read files
        for($i = 0; $i < @files; $i++) {
            $infile = $dir.'/'.$files[$i];    #each individual file
            open(IN, $infile) or die "can not open $infile\n";
            #row number, 0 for first row. Reset for each file
            $rowNo = 0;
            #read the file
            while( $line = <IN> ) {
                # get rid of the newline character, otherwise data in the last column is incorrect
                chomp($line);
                # split to put data in each row into an array
                @data = split(/$delimiter/, $line);
                # remember data in a "matrix"
                $datamatrix{$i, $rowNo} = $data[$columnNo];
                # add 1 to row number
                $rowNo++;
            }
            close(IN) or die "can not close $infile\n";
        }
    }

    # print results
    if($outfile) {
        open (OUT, ">$outfile") or die "can not open $outfile\n";
        # number of columns
        print OUT 'The number of columns is: ' . scalar @files . "\n";
        # first row: file names
        for($i = 0; $i < @files; $i++) {
            print OUT $files[$i];
            print OUT "\t" if($i < @files - 1);
        }
        print OUT "\n";
        # data
        for($j = 0; $j < $rowNo; $j++) {
            for($i = 0; $i < @files; $i++) {
                print OUT $datamatrix{$i, $j};
                print OUT "\t" if($i < @files - 1);
            }
            print OUT "\n";
        }
        close(OUT) or die "can not close $outfile\n";
    } else {
        print 'The number of columns is: ' . scalar @files . "\n";
        # first row: file names
        for($i = 0; $i < @files; $i++) {
            print $files[$i];
            print "\t" if($i < @files - 1);
        }
        print "\n";
        # data
        for($j = 0; $j < $rowNo; $j++) {
            for($i = 0; $i < @files; $i++) {
                print $datamatrix{$i, $j};
                print "\t" if($i < @files - 1);
            }
            print "\n";
        }
    }
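For what the question literally asks (appending one new column at a time to an existing output file), here is a minimal sketch. It rewrites the output file line by line so only one row is ever in memory; `append_column` is a hypothetical helper name, and it assumes the new column has one value per existing row:

```perl
use strict;
use warnings;
use File::Copy qw(move);

# Append one column (one value per line) to a tab-delimited table file,
# rewriting it line by line so only one row is in memory at a time.
# Assumes the table and the new column have the same number of rows.
sub append_column {
    my ($table, $column) = @_;    # $column is an array ref of values
    my $tmp = "$table.tmp";
    open my $in,  '<', $table or die "can not open $table: $!";
    open my $out, '>', $tmp   or die "can not open $tmp: $!";
    my $row = 0;
    while (my $line = <$in>) {
        chomp $line;
        # glue the next value for this column onto the end of the row
        print $out $line, "\t", $column->[$row++], "\n";
    }
    close $in;
    close $out;
    # atomically replace the old table with the widened one
    move($tmp, $table) or die "can not replace $table: $!";
}
```

Calling this once per input file would avoid holding the whole matrix in memory, at the cost of rewriting the output file for every new column, so it is only practical when the number of files is modest.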

Replies are listed 'Best First'.
Re: Adding columns in a loop to an existing file using Perl
by aaron_baugher (Curate) on Oct 24, 2013 at 05:53 UTC

    My first question would be how many input files you have. If it's not more than the number of files you can open at once, then I'd do this (pseudocode):

    open all input files, making array of file descriptors
    open output file for writing
    while the first input file has remaining lines
        for each input file descriptor
            read one line
            get desired column
            write to output file
        write newline to output file
    close all files

    If your input files have different numbers of lines, you may need a bit more to handle that. But basically, open all files and process the first line of each input file, creating the first line of the output file. Then move on to the second lines, then the third, etc.
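The pseudocode above might look something like this in Perl. It is a sketch under the thread's assumptions (tab-delimited files, column index passed in); `merge_columns` is a hypothetical name, and files shorter than the first one are padded with empty fields:

```perl
use strict;
use warnings;

# Open every input file at once, then emit one output row per input
# line, with one field (the chosen column) per file.
sub merge_columns {
    my ($dir, $col, $outfile) = @_;

    opendir my $dh, $dir or die "can not open directory $dir: $!";
    my @files = sort grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;

    # one open filehandle per input file
    my @fhs;
    for my $name (@files) {
        open my $fh, '<', "$dir/$name" or die "can not open $dir/$name: $!";
        push @fhs, $fh;
    }

    open my $out, '>', $outfile or die "can not open $outfile: $!";
    print $out join("\t", @files), "\n";    # header row of file names

    # as long as the first file has lines left, build one output row
    while (defined(my $first = readline $fhs[0])) {
        my @row;
        for my $i (0 .. $#fhs) {
            my $line = $i == 0 ? $first : readline $fhs[$i];
            $line = '' unless defined $line;    # shorter file: pad
            chomp $line;
            my @fields = split /\t/, $line, -1;
            push @row, $fields[$col] // '';
        }
        print $out join("\t", @row), "\n";
    }
    close $_ for @fhs, $out;
}
```

Note that with 68,668 input files this will almost certainly exceed the per-process open-file limit, which is exactly the caveat raised below.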

    If there are more files than you can have open simultaneously, you'll have to do something else. Yes, you can repeatedly add columns to the end of your output file, preferably using something like Text::CSV to keep things correct, but that would be pretty inefficient, so don't do that if you don't have to. Honestly, I'd probably do this with shell tools, which are pretty handy for things like breaking lines on a delimiter (when you don't have to worry about things like quoted delimiters within fields) and handling multiple columns of text:

    #!/bin/sh
    # args: $1 - directory holding input files
    #       $2 - column to save
    #       $3 - output file
    # temporary files created in /tmp/columns
    n=0
    for i in $1/*; do
        n=`expr $n + 1`
        echo $i >/tmp/columns/$n
        cut -f $2 $i >>/tmp/columns/$n
    done
    paste /tmp/columns/* >$3
    rm /tmp/columns/*

    Aaron B.
    Available for small or large Perl jobs; see my home node.

Re: Adding columns in a loop to an existing file using Perl
by Anonymous Monk on Oct 24, 2013 at 02:36 UTC

    It might help if you provided something like this

    But no matter, here is the answer

Re: Adding columns in a loop to an existing file using Perl
by Lennotoecom (Pilgrim) on Oct 24, 2013 at 07:48 UTC
    I suppose it is better to put your found columns in some sort of a db like MySQL.
    It is easier, especially if your columns are not the same size:
    A B C
    A B C
    A B
    B

    If you put your found columns in a db, you can run through your original files once,
    and at the end do a single SELECT.
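    A rough sketch of that idea with SQLite rather than MySQL, since it needs no server (this assumes the DBI and DBD::SQLite modules are installed; the table and file names are made up for illustration). Storing one row per (file, row number, value) means columns of different lengths are no problem, and a final SELECT per file pulls each column back out in order:

    ```perl
    use strict;
    use warnings;
    use DBI;    # assumes DBD::SQLite is available

    # one database file holding every extracted column
    my $dbh = DBI->connect('dbi:SQLite:dbname=columns.db', '', '',
                           { RaiseError => 1, AutoCommit => 1 });
    $dbh->do('CREATE TABLE IF NOT EXISTS cols (file TEXT, rowno INTEGER, value TEXT)');

    # inside the per-file loop you would run, for each extracted value:
    my $ins = $dbh->prepare('INSERT INTO cols (file, rowno, value) VALUES (?, ?, ?)');
    # $ins->execute($filename, $rowNo, $value);

    # at the end, one query per file recovers that file's column in order
    my $column = $dbh->selectcol_arrayref(
        'SELECT value FROM cols WHERE file = ? ORDER BY rowno', {}, 'some_file.txt');
    ```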