ScottJohn has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am fairly new to Perl and am running into a problem that I hope the Perl Monks can help shed some light on.

I am trying to add a column to an existing CSV file with hundreds of columns and hundreds of thousands of rows. For what it's worth, the two files have the same number of rows, and the rows correspond to each other. The file from which I am appending a single column has the line number in its first column. (I am appending its second column.)

The following code works, but it takes forever, and I will need to do this daily. I am sure there are several inefficiencies, and it seems there has to be a faster way.

use warnings;
use strict;

open (OUTFILE, ">outfile.csv");
open (IN1, "<infile1.csv");
my $counter = 0;
while (<IN1>){
    chomp;
    print OUTFILE $_;
    open (IN2, "<infile2.csv");
    while (<IN2>){
        my @f = split(/,/);
        if ($counter == $f[0]){
            print OUTFILE ",$f[1]\n";
        }
    }
    close IN2;
    $counter = $counter + 1;
}
close IN1;
close OUTFILE;
Thanks in advance!

Scott

Thanks for all the help. To clarify, the numbered file was in order.

This is the code that seemed to work best (similar to jwkrahn's code).

use warnings;
use strict;

open my $in2, '<', 'infile2.csv' or die "Cannot open 'infile2.csv' because: $!";
my %data;
while ( <$in2> ) {
    chomp;    # strip the newline so the stored value is clean
    my ( $key, $value ) = split /,/;
    $data{ $key } = $value;
}
close $in2;

open my $outfile, '>', 'outfile.csv' or die "Cannot open 'outfile.csv' because: $!";
open my $in1, '<', 'infile1.csv' or die "Cannot open 'infile1.csv' because: $!";
while ( <$in1> ) {
    chomp;
    print $outfile $_;
    if ( exists $data{ $. - 1 } ) {
        print $outfile ",$data{ $. - 1 }\n";
    }
    else {
        print $outfile ",field_name\n";
    }
}
close $in1;
close $outfile;

Re: Merging large files by columns
by lidden (Curate) on Sep 16, 2011 at 21:04 UTC
    I think you want something like this (untested):
    open my $OUTFILE, ">", "outfile.csv" or die "oh noo: $!";
    open my $IN1, "<", "infile1.csv" or die "bummer1: $!";
    open my $IN2, "<", "infile2.csv" or die "bummer2: $!";
    while (<$IN1>){
        chomp;
        print $OUTFILE $_;
        my $line2 = <$IN2>;
        chomp $line2;
        my @f = split /,/, $line2;
        print $OUTFILE ",$f[1]\n";
    }
    close $IN2 or die $!;
    close $IN1 or die $!;
    close $OUTFILE or die $!;

      I was about to post my very similar code. I would point out that if you are new to Perl and not yet set in your ways, try to use lexical filehandles, as in my example, rather than bareword filehandles.

      use autodie;

      open( my $out, '>', 'outfile.csv' );
      open( my $in1, '<', 'infile1.csv' );
      open( my $in2, '<', 'infile2.csv' );

      while ( defined( my $line = <$in1> ) ) {
          chomp $line;
          print {$out} $line;
          chomp( my $in2_line = <$in2> );
          my @f = split m/,/, $in2_line;
          print {$out} ',' . $f[1] . "\n";
      }

      close $in1;
      close $in2;
      close $out;
Re: Merging large files by columns
by aaron_baugher (Curate) on Sep 16, 2011 at 21:19 UTC

    As others said, some sample data would be helpful. But looking at your working-but-slow script, I see that you're looping completely through file2 for every line of file1. That's going to be brutal if file2 is very large. You could speed it up some by at least breaking out of your loop through file2 once you find your match.
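    For example, in the inner loop of the original script, a "last" ends the rescan as soon as the match is found (a small patch only; the approach is still quadratic overall):

        while (<IN2>){
            my @f = split(/,/);
            if ($counter == $f[0]){
                print OUTFILE ",$f[1]\n";
                last;    # found the matching line; stop scanning file2
            }
        }
        close IN2;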

    Better would be to first read file2 into a hash, with the first field (the one you match your counter against) as the keys, and then check that hash for each line of file1. If file2 is so large that reading it into a hash would present memory problems, you could tie it to a DBM file, and that way the dbm library can put as much of it on disk as necessary.
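    A minimal sketch of that tied-hash variant, assuming the DB_File module and its Berkeley DB backend are available (the 'lookup.db' filename is just a placeholder):

        use strict;
        use warnings;
        use Fcntl;      # O_RDWR and O_CREAT flags
        use DB_File;    # ties a hash to an on-disk Berkeley DB file

        # Keys and values live on disk, so file2 need not fit in memory.
        tie my %data, 'DB_File', 'lookup.db', O_RDWR|O_CREAT, 0644, $DB_HASH
            or die "Cannot tie 'lookup.db': $!";

        open my $in2, '<', 'infile2.csv' or die "Cannot open 'infile2.csv': $!";
        while (<$in2>) {
            chomp;
            my ($key, $value) = split /,/, $_, 2;
            $data{$key} = $value;    # stored through the tie
        }
        close $in2;

        # ... then look up $data{ $. - 1 } for each line of file1,
        #     exactly as in the in-memory hash versions below ...
        untie %data;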

      I would suggest sorting file2 by its number column first, using the Unix sort command (something like sort -t, -k1,1n infile2.csv; a Windows port exists as well).

      The idea is to have both file1 and file2 in an order such that it is just a stepwise ladder walk. Our mythical ladder walker has one foot on ladder1 and one foot on ladder2 and if a foot moves, it moves upward.

      This takes a long time:

      foreach line in file1... {
          loop through lines in file2... {
              ...many comparisons...
              decide which one is next
          }
      }
      A step-wise walker algorithm is a merge of 2 sorted files and is much faster because all that is needed is to decide who goes next, like 2 cars at a stop sign.
      foreach line in file1... {
          # insert file2 line(s) that should go before the
          # current line from file1
          while {the current line in file2 is "next" in total order} {
              output that line from file2;
              move to next line in file2
          }
          output the file1 line
      }
      output rest of file2 lines (if any)
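      A minimal Perl sketch of that walk for this particular problem, assuming file2 is sorted numerically on its first column and that, as in the original script, those numbers are 0-based line numbers of file1:

          use strict;
          use warnings;

          open my $in1, '<', 'infile1.csv' or die "Cannot open 'infile1.csv': $!";
          open my $in2, '<', 'infile2.csv' or die "Cannot open 'infile2.csv': $!";
          open my $out, '>', 'outfile.csv' or die "Cannot open 'outfile.csv': $!";

          my ($key, $value);
          my $advance = sub {    # read the next key/value pair from file2
              if (defined(my $line = <$in2>)) {
                  chomp $line;
                  ($key, $value) = split /,/, $line, 2;
              }
              else {
                  $key = undef;    # file2 is exhausted
              }
          };
          $advance->();

          while (my $line = <$in1>) {
              chomp $line;
              my $lineno = $. - 1;    # 0-based, matching the keys in file2
              # let file2 catch up if it is behind in the total order
              $advance->() while defined $key && $key < $lineno;
              if (defined $key && $key == $lineno) {
                  print {$out} "$line,$value\n";
                  $advance->();
              }
              else {
                  print {$out} "$line,\n";    # no match; emit an empty field
              }
          }
          close $_ for $in1, $in2, $out;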

        That (like most of the answers) assumes that the lines in the second file are numbered consecutively with no numbers missing. I wasn't sure from the question whether that was the case, or if when he said, "The file which I am appending a single column has the line number in the first column," he meant it had the number of the line it needed to be attached to.

        In other words, the second file could be:

        2 red
        4 brown
        3 blue
        6 orange
        7 yellow

        If 'yellow' is supposed to be appended to the 7th line of file1, then sorting file2 numerically and stepping through both files equally won't work, because 'lines' 1 and 5 are missing.

        Using the numbers in file2 as keys in a lookup hash means it doesn't matter if file2 is missing a number somewhere. I guess it really just depends on how sure you are that the numbers truly are line numbers.

Re: Merging large files by columns
by BrowserUk (Patriarch) on Sep 16, 2011 at 21:11 UTC

    Try:

    perl -ne"chomp; print $_,',',(split ',',<STDIN>)[-1]" in1.csv < in2.cs +v >out.csv

Re: Merging large files by columns
by repellent (Priest) on Sep 17, 2011 at 06:49 UTC
    $ cat file1.csv
    a,b,c
    d,e,f
    g,h,i
    j,k,l
    $ cat file2.csv
    0,w
    1,x
    2,y
    3,z
    $ paste -d, file1.csv <(cut -d, -f2 file2.csv) > merged.csv
    $ cat merged.csv
    a,b,c,w
    d,e,f,x
    g,h,i,y
    j,k,l,z
Re: Merging large files by columns
by pvaldes (Chaplain) on Sep 16, 2011 at 21:11 UTC
    it takes forever

Why am I not surprised?

    while (<IN1>){
        chomp;
        print OUTFILE $_;

    A complicated way to copy $infile to $outfile; better to use File::Copy.
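    For reference, a plain copy with File::Copy would be just (filenames as in the original script):

        use File::Copy;
        copy('infile1.csv', 'outfile.csv') or die "Copy failed: $!";

    Though note the original loop is also appending a field to each line, which a plain copy cannot do.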

Re: Merging large files by columns
by jwkrahn (Abbot) on Sep 17, 2011 at 02:41 UTC

    This may work better (UNTESTED):

    use warnings;
    use strict;

    open my $IN2, '<', 'infile2.csv' or die "Cannot open 'infile2.csv' because: $!";
    my %data;
    while ( <$IN2> ) {
        chomp;    # strip the newline so the stored value is clean
        my ( $key, $value ) = split /,/;
        $data{ $key } = $value;
    }
    close $IN2;

    open my $OUTFILE, '>', 'outfile.csv' or die "Cannot open 'outfile.csv' because: $!";
    open my $IN1, '<', 'infile1.csv' or die "Cannot open 'infile1.csv' because: $!";
    while ( <$IN1> ) {
        chomp;
        print $OUTFILE $_;
        if ( exists $data{ $. - 1 } ) {
            print $OUTFILE ",$data{ $. - 1 }\n";
        }
    }
    close $IN1;
    close $OUTFILE;
Re: Merging large files by columns
by Kc12349 (Monk) on Sep 16, 2011 at 20:54 UTC

    Can you give us some sample data from your two input files? It's hard to get a grasp on what you are trying to do without seeing the data. Do the two input files have the same number of lines?

Re: Merging large files by columns
by Anonymous Monk on Sep 19, 2011 at 03:31 UTC

    Although it does not involve Perl, if you have a Unix-like system, perhaps one of the following commands already does what you want:

    paste
    lam
    join

    At the very least, it gives you a speed comparison for your Perl code.