in reply to Removing redundancy
You need to use an array to preserve the order and a hash to gather the data. There is a module, Tie::IxHash, that will do this for you, but the performance penalty of tying could be a factor as these files are large, and it would only slightly simplify the solution. A possible problem is that if the files are very large, and/or the strings very long, then you will need a lot of memory for this to work.
If that's the case, say so and another solution is possible.
#! perl -slw
use strict;

my (%hash, @order);

while (<DATA>) {
    chomp;
    my @bits = split /\t/;
    push @order, $bits[0] unless exists $hash{$bits[0]};
    $hash{$bits[0]} .= ' ' . $bits[1];
}

print "$_\t$hash{$_}" for @order;

__DATA__
text1	text-a
text2	text-b
text3	text-c
text1	text-d
text3	text-e
text3	text-f
Output
D:\Perl\test>253229
text1    text-a text-d
text2    text-b
text3    text-c text-e text-f
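For comparison, the Tie::IxHash approach mentioned at the top would look roughly like this (a sketch, not benchmarked here; it hides the ordering bookkeeping inside the tie, at the cost of tied-variable overhead on every access):

#! perl -slw
use strict;
use Tie::IxHash;

tie my %hash, 'Tie::IxHash';    # keys %hash come back in insertion order

while (<DATA>) {
    chomp;
    my @bits = split /\t/;
    $hash{ $bits[0] } .= ' ' . $bits[1];
}

print "$_\t$hash{$_}" for keys %hash;

__DATA__
text1	text-a
text2	text-b
text3	text-c
text1	text-d
text3	text-e
text3	text-f

Same output as above, with one less variable to manage--but every FETCH and STORE now goes through the tie interface, which is where the performance penalty comes from.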
Updated from here on.
As I mentioned above, if your dataset is too large to fit in memory (not forgetting the memory overhead of the array and hash), then you will need a different solution. I originally had in mind a scheme of writing several output files and then having a merge phase to combine them, but there are several problems with this.
You could just choose an arbitrary number of records per output file, but unless your input records are of a consistent size, you could still run into problems unless you set the limit quite low, and then the merge phase gets more complicated.
You could use Devel::Peek to track the amount of memory being used and write out a file as you approach some maximum (a rough sketch of this idea appears below). Deciding on that maximum would probably mean resorting to empirical testing.
This requires retaining the @order array in memory, which further limits the size of the %hash.
It also means that if you need to process the input files in several runs, you would need to serialise @order to a separate file between runs--messy.
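For what it's worth, here is a rough sketch of that flush-as-you-approach-the-limit idea. It uses Devel::Size (which, unlike Devel::Peek, reports the bytes a structure occupies); the byte ceiling and chunk filenames are invented for illustration, and the resulting chunks would still need the messy merge phase:

#! perl -slw
use strict;
use Devel::Size qw[total_size];   # reports bytes used by a data structure

use constant MAX_BYTES => 100_000_000;   # hypothetical ceiling; tune empirically

my (%hash, @order);
my $chunk = 0;

sub flush_chunk {
    my $file = sprintf 'chunk%03d.dat', $chunk++;   # invented naming scheme
    open my $out, '>', $file or die "$file: $!";
    print $out "$_\t$hash{$_}" for @order;
    close $out;
    %hash = (); @order = ();
}

while (<>) {
    chomp;
    my @bits = split /\t/;
    push @order, $bits[0] unless exists $hash{$bits[0]};
    $hash{$bits[0]} .= ' ' . $bits[1];

    # Write a chunk file and start afresh as we near the ceiling.
    flush_chunk() if total_size( \%hash ) + total_size( \@order ) > MAX_BYTES;
}
flush_chunk() if %hash;   # the leftover partial chunk

In practice you would only check the size every few thousand records, as total_size is not cheap--which just underlines how awkward this scheme gets.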
Then I remembered reading about mjd's Tie::File module. Don't be fooled by the unassuming name; this module is worth its weight in latinum :)
Essentially, Tie::File allows you to use arbitrarily huge arrays from within Perl, with all the familiar features of Perl arrays--including splice!
This module allows almost the same algorithm as above to be used regardless of the size of the dataset involved (subject to OS filesize limits). The only change required is to accumulate the records in the tied array rather than in %hash, which in turn means that the hash value for each key now stores the index of the matching array element.
I've also added a command line switch that allows the input files to be processed in any number of passes. Actually, it will do this by default if the output file exists; use the switch (-NEWRUN) to empty the (hardcoded) output file if it exists. The demo currently reads __DATA__ on each run; to process real files, change the read loop to while (<>) { and supply a list of filenames (or a wildcard under *nix, or add -MG if you have jenda's G.pm module under Win32). A small sketch of that change follows the sample output below.
#! perl -slw
use strict;
use Tie::File;
use vars qw[$NEWRUN];

use constant OUTPUT_FILE => './output.dat';

# Empty the output file if this is a new run (-NEWRUN on the command line).
# If this switch isn't present, then new data will be accumulated onto the existing.
unlink OUTPUT_FILE if $NEWRUN;

tie my @accumulator, 'Tie::File', OUTPUT_FILE,
    memory => 20_000_000;   # Adjust as required. See the Tie::File pod for other options.

my %hash;

# This preloads the ordering info into the hash if this isn't a new run.
unless ($NEWRUN) {
    $hash{ (split /\t/, $accumulator[$_])[0] } = $_ for 0 .. $#accumulator;
}

while (<DATA>) {    # switching this to <> would allow a list of files to be supplied
    chomp;
    my @bits = split /\t/;

    unless (exists $hash{$bits[0]}) {       # unless we saw this key already
        push @accumulator, $bits[0] . "\t"; # Add it to the end of the array (file)
        $hash{$bits[0]} = $#accumulator;    # And remember where, in the hash
    }

    # Append the new bit to the appropriate array element (file record).
    $accumulator[ $hash{ $bits[0] } ] .= ' ' . $bits[1];
}

untie @accumulator;

__DATA__
text1	text-a
text2	text-b
text3	text-c
text1	text-d
text3	text-e
text3	text-f
Sample output
D:\Perl\test>253229-2 -NEWRUN

D:\Perl\test>type output.dat
text1    text-a text-d
text2    text-b
text3    text-c text-e text-f

D:\Perl\test>253229-2

D:\Perl\test>type output.dat
text1    text-a text-d text-a text-d
text2    text-b text-b
text3    text-c text-e text-f text-c text-e text-f
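If you make the while (<>) change mentioned above, a multi-file run might look like this minimal sketch (the filenames are invented; the @ARGV glob line is the usual workaround for Win32 shells, which don't expand wildcards themselves):

#! perl -slw
use strict;

# Hypothetical invocations (the -s switch strips -NEWRUN before @ARGV is read):
#   perl 253229-2 -NEWRUN part1.tab part2.tab
#   perl 253229-2 "*.tab"

@ARGV = map { glob } @ARGV;   # expand any wildcards ourselves under Win32

while (<>) {                  # <> reads each file named in @ARGV in turn
    chomp;
    print "read: $_";         # stand-in for the accumulator logic shown above
}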
Note: This code is really walking on the shoulders of a giant. Dominus deserves the credit and any XP you wish to contribute. Pick a random node and vote:)