in reply to Removing redundancy
You need to use an array to preserve the order and a hash to gather the data. There is a module, Tie::IxHash, that will do this for you, but the performance penalty of tying could be a factor as these files are large, and it would only slightly simplify the solution. A possible problem is that if the files are very large, and/or the strings very long, then you will need a lot of memory for this to work.
If that's the case, say so and another solution is possible.
#! perl -slw
use strict;

my (%hash, @order);

while (<DATA>) {
    chomp;
    my @bits = split /\t/;
    push @order, $bits[0] unless exists $hash{$bits[0]};
    $hash{$bits[0]} .= ' ' . $bits[1];
}

print "$_\t$hash{$_}" for @order;

__DATA__
text1	text-a
text2	text-b
text3	text-c
text1	text-d
text3	text-e
text3	text-f
Output
D:\Perl\test>253229
text1    text-a text-d
text2    text-b
text3    text-c text-e text-f
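For comparison, the Tie::IxHash approach mentioned at the top would look roughly like this (a sketch, not benchmarked here; it hides the ordering bookkeeping inside the tie, at the cost of tied-variable overhead on every access):

#! perl -slw
use strict;
use Tie::IxHash;

tie my %hash, 'Tie::IxHash';    # keys %hash come back in insertion order

while (<DATA>) {
    chomp;
    my @bits = split /\t/;
    $hash{ $bits[0] } .= ' ' . $bits[1];
}

print "$_\t$hash{$_}" for keys %hash;

__DATA__
text1	text-a
text2	text-b
text3	text-c
text1	text-d
text3	text-e
text3	text-f

Same output as above, with one less variable to manage--but every FETCH and STORE now goes through the tie interface, which is where the performance penalty comes from.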
Updated from here on.
As I mentioned above, if your dataset is too large to fit in memory (not forgetting the memory overhead of the array and hash), then you will need a different solution. I originally had in mind a scheme of writing several output files and then having a merge phase to combine them, but there are several problems with this.
You could just choose an arbitrary number of records per output file, but unless your input records are of a consistent size, you could still run into problems unless you set the limit quite low, and then the merge phase gets more complicated.
You could use Devel::Peek to track the amount of memory being used and write out a file as you approach some maximum (a rough sketch of this idea appears below). Deciding on that maximum would probably mean resorting to empirical testing.
This requires retaining the @order array in memory, which further limits the size of the %hash.
It also means that if you need to process the input files in several runs, you would need to serialise @order to a separate file between runs--messy.
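For what it's worth, here is a rough sketch of that flush-as-you-approach-the-limit idea. It uses Devel::Size (which, unlike Devel::Peek, reports the bytes a structure occupies); the byte ceiling and chunk filenames are invented for illustration, and the resulting chunks would still need the messy merge phase:

#! perl -slw
use strict;
use Devel::Size qw[total_size];   # reports bytes used by a data structure

use constant MAX_BYTES => 100_000_000;   # hypothetical ceiling; tune empirically

my (%hash, @order);
my $chunk = 0;

sub flush_chunk {
    my $file = sprintf 'chunk%03d.dat', $chunk++;   # invented naming scheme
    open my $out, '>', $file or die "$file: $!";
    print $out "$_\t$hash{$_}" for @order;
    close $out;
    %hash = (); @order = ();
}

while (<>) {
    chomp;
    my @bits = split /\t/;
    push @order, $bits[0] unless exists $hash{$bits[0]};
    $hash{$bits[0]} .= ' ' . $bits[1];

    # Write a chunk file and start afresh as we near the ceiling.
    flush_chunk() if total_size( \%hash ) + total_size( \@order ) > MAX_BYTES;
}
flush_chunk() if %hash;   # the leftover partial chunk

In practice you would only check the size every few thousand records, as total_size is not cheap--which just underlines how awkward this scheme gets.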
Then I remembered reading about mjd's Tie::File module. Don't be fooled by the unassuming name; this module is worth its weight in latinum :)
Essentially, Tie::File allows you to use arbitrarily huge arrays from within Perl, with all the familiar features of Perl arrays--including splice!
This module allows almost the same algorithm as above to be used regardless of the size of the dataset involved (subject to OS filesize limits). The only change required is to accumulate the records in the tied array rather than in %hash, which in turn means that the hash value for each key now stores the index of the matching array element.
I've also added a command line switch that allows the input files to be processed in any number of passes. Actually, it will do this by default if the output file exists; use the switch (-NEWRUN) to empty the (hardcoded) output file if it exists. The demo currently reads __DATA__ on each run; to process real files, change the read loop to while (<>) { and supply a list of filenames (or a wildcard under *nix, or add -MG if you have jenda's G.pm module under Win32). A small sketch of that change follows the sample output below.
#! perl -slw
use strict;
use Tie::File;
use vars qw[$NEWRUN];

use constant OUTPUT_FILE => './output.dat';

# Empty the output file if this is a new run (-NEWRUN on the command line).
# If this switch isn't present, then new data will be accumulated onto the existing.
unlink OUTPUT_FILE if $NEWRUN;

tie my @accumulator, 'Tie::File', OUTPUT_FILE,
    memory => 20_000_000;   # Adjust as required. See the Tie::File pod for other options.

my %hash;

# This preloads the ordering info into the hash if this isn't a new run.
unless ($NEWRUN) {
    $hash{ (split /\t/, $accumulator[$_])[0] } = $_ for 0 .. $#accumulator;
}

while (<DATA>) {    # switching this to <> would allow a list of files to be supplied
    chomp;
    my @bits = split /\t/;

    unless (exists $hash{$bits[0]}) {       # unless we saw this key already
        push @accumulator, $bits[0] . "\t"; # Add it to the end of the array (file)
        $hash{$bits[0]} = $#accumulator;    # And remember where, in the hash
    }

    # Append the new bit to the appropriate array element (file record).
    $accumulator[ $hash{ $bits[0] } ] .= ' ' . $bits[1];
}

untie @accumulator;

__DATA__
text1	text-a
text2	text-b
text3	text-c
text1	text-d
text3	text-e
text3	text-f
Sample output
D:\Perl\test>253229-2 -NEWRUN

D:\Perl\test>type output.dat
text1    text-a text-d
text2    text-b
text3    text-c text-e text-f

D:\Perl\test>253229-2

D:\Perl\test>type output.dat
text1    text-a text-d text-a text-d
text2    text-b text-b
text3    text-c text-e text-f text-c text-e text-f
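If you make the while (<>) change mentioned above, a multi-file run might look like this minimal sketch (the filenames are invented; the @ARGV glob line is the usual workaround for Win32 shells, which don't expand wildcards themselves):

#! perl -slw
use strict;

# Hypothetical invocations (the -s switch strips -NEWRUN before @ARGV is read):
#   perl 253229-2 -NEWRUN part1.tab part2.tab
#   perl 253229-2 "*.tab"

@ARGV = map { glob } @ARGV;   # expand any wildcards ourselves under Win32

while (<>) {                  # <> reads each file named in @ARGV in turn
    chomp;
    print "read: $_";         # stand-in for the accumulator logic shown above
}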
Note: This code is really walking on the shoulders of a giant. Dominus deserves the credit and any XP you wish to contribute. Pick a random node and vote:)