in reply to Merging larges files by columns

As others have said, some sample data would be helpful. But looking at your working-but-slow script, I see that you're looping through the whole of file2 for every line of file1. That's going to be brutal if file2 is very large. You could speed it up somewhat by at least breaking out of the inner loop through file2 once you find your match.
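For example (the OP's actual loop wasn't posted in full, so the data and variable names here are stand-ins), the inner loop just needs a `last`:

```perl
use strict;
use warnings;

# Stand-in data: @file1 holds the lines to extend, @file2 holds
# "number value" records in no particular order.
my @file1 = ('alpha', 'beta', 'gamma');
my @file2 = ('3 blue', '1 red');

for my $i (0 .. $#file1) {
    for my $f2_line (@file2) {
        my ($num, $value) = split ' ', $f2_line, 2;
        if ($num == $i + 1) {
            $file1[$i] .= "\t$value";
            last;    # stop scanning file2 once this line is matched
        }
    }
}
print "$_\n" for @file1;
```

That only halves the work on average, though; the approaches below avoid the rescanning altogether.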

Better would be to first read file2 into a hash, with the first field (the one you match your counter against) as the keys, and then check that hash for each line of file1. If file2 is so large that reading it into a hash would present memory problems, you could tie it to a DBM file, and that way the dbm library can put as much of it on disk as necessary.
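A minimal sketch of the hash approach (the file layout and separators are assumptions, since the original data wasn't posted; the demo uses in-memory "files" via filehandles opened on scalar refs):

```perl
use strict;
use warnings;

# Build a lookup hash from file2: the first column is the target
# line number in file1, the rest of the line is the value to append.
sub read_lookup {
    my ($fh) = @_;
    my %extra;
    while (my $line = <$fh>) {
        chomp $line;
        my ($key, $value) = split ' ', $line, 2;
        $extra{$key} = $value;
    }
    return \%extra;
}

# Single pass over file1; append the matching column when one exists.
sub merge_files {
    my ($fh1, $extra) = @_;
    my (@out, $n);
    while (my $line = <$fh1>) {
        chomp $line;
        ++$n;
        push @out, exists $extra->{$n} ? "$line\t$extra->{$n}" : $line;
    }
    return \@out;
}

# Demo with in-memory data standing in for the real files.
my $file1 = "alpha\nbeta\ngamma\n";
my $file2 = "1 red\n3 blue\n";
open my $f2, '<', \$file2 or die $!;
open my $f1, '<', \$file1 or die $!;
my $merged = merge_files($f1, read_lookup($f2));
print "$_\n" for @$merged;
```

If file2 is too large to hold in memory, the same `%extra` hash can be tied to a DBM file (DB_File, GDBM_File, etc.) so that most of it lives on disk.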

Re^2: Merging larges files by columns
by Marshall (Canon) on Sep 17, 2011 at 02:04 UTC
    I would suggest sorting file2 by number with the Unix sort command; a Windows version is also available.

    The idea is to have both file1 and file2 in an order such that it is just a stepwise ladder walk. Our mythical ladder walker has one foot on ladder1 and one foot on ladder2 and if a foot moves, it moves upward.

    This takes a long time:

    foreach line in file1... {
        loop through lines in file2... {
            ...many comparisons...
            decide which one is next
        }
    }
    A step-wise walker algorithm is a merge of 2 sorted files and is much faster because all that is needed is to decide who goes next, like 2 cars at a stop sign.
    foreach line in file1... {
        # insert file2 line(s) that should go before the
        # current line from file1
        while {the current line in file2 is "next" in total order} {
            output that line from file2;
            move to next line in file2
        }
        output the file1 line
    }
    output rest of file2 lines (if any)
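    One way to apply that stepwise walk to the append task in Perl (a sketch; it assumes file2 has already been sorted numerically on its first column, and the names, separators, and in-memory demo data are stand-ins):

```perl
use strict;
use warnings;

# Read the next numbered line of file2: (key, value), or empty at EOF.
sub next_numbered {
    my ($fh) = @_;
    my $line = <$fh>;
    return unless defined $line;
    chomp $line;
    return split ' ', $line, 2;
}

# The ladder walk: both "feet" only ever move forward.
sub append_walk {
    my ($fh1, $fh2) = @_;
    my @out;
    my ($key, $val) = next_numbered($fh2);
    my $n = 0;
    while (my $line = <$fh1>) {
        chomp $line;
        ++$n;
        # Skip any file2 entries whose number has already gone by.
        ($key, $val) = next_numbered($fh2) while defined $key && $key < $n;
        if (defined $key && $key == $n) {
            push @out, "$line\t$val";
            ($key, $val) = next_numbered($fh2);
        }
        else {
            push @out, $line;    # no match for this line number
        }
    }
    return \@out;
}

# Demo: seven lines in file1, a sorted file2 with gaps at 1 and 5.
my $file1 = join '', map "line$_\n", 1 .. 7;
my $file2 = "2 red\n3 blue\n4 brown\n6 orange\n7 yellow\n";
open my $f1, '<', \$file1 or die $!;
open my $f2, '<', \$file2 or die $!;
my $walked = append_walk($f1, $f2);
print "$_\n" for @$walked;
```

    Note that because the walker skips forward rather than stepping in strict lockstep, gaps in file2's numbering are handled naturally.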

      That (like most of the answers) assumes that the lines in the second file are numbered consecutively with no numbers missing. I wasn't sure from the question whether that was the case, or if when he said, "The file which I am appending a single column has the line number in the first column," he meant it had the number of the line it needed to be attached to.

      In other words, the second file could be:

      2 red
      4 brown
      3 blue
      6 orange
      7 yellow

      If 'yellow' is supposed to be appended to the 7th line of file1, then sorting file2 numerically and stepping through both files in lockstep won't work, because 'lines' 1 and 5 are missing.

      By using the numbers in file2 as keys in a lookup hash, it doesn't matter if file2 is missing a number somewhere. I guess it really just depends on how sure you are that you can count on the numbers truly being line numbers.

        That (like most of the answers) assumes that the lines in the second file are numbered consecutively with no numbers missing.
        Not so. I suggested sorting file2 so that the numbers are always increasing. That way a simple walker program suffices, and it can account for missing numbers. The simple sort strategy is often very effective, as the system sort command is usually efficient enough - certainly a huge improvement over reading file2 N times!

        In this case, the original poster seems to have disappeared and we are left speculating about requirements that we can't know unless the OP tells us.