You need to use an array to preserve the order and a hash to gather the data. There is a module, Tie::IxHash, that will do this for you, but the performance penalty of tying could be a factor as these files are large, and it would only slightly simplify the solution. A possible problem is that if the files are very large and/or the strings very long, you will need a lot of memory for this to work.

If that's the case, say so and another solution is possible.

    #! perl -slw
    use strict;

    my (%hash, @order);

    while (<DATA>) {
        chomp;
        my @bits = split /\t/;
        # Record each key the first time it is seen, to preserve input order.
        push @order, $bits[0] unless exists $hash{$bits[0]};
        # Gather every value for the key into one space-separated string.
        $hash{$bits[0]} .= ' ' . $bits[1];
    }

    print "$_\t$hash{$_}" for @order;

    __DATA__
    text1	text-a
    text2	text-b
    text3	text-c
    text1	text-d
    text3	text-e
    text3	text-f

Output

    D:\Perl\test>253229
    text1	 text-a text-d
    text2	 text-b
    text3	 text-c text-e text-f

Updated from here on.

As I mentioned above, if your dataset is too large to fit in memory (not forgetting the memory overhead of the array and hash), then you will need a different solution. I originally had a scheme in mind of writing several output files and then having a merge phase to combine them, but there are several problems with this.

Then I remembered reading about mjd's Tie::File module. Don't be fooled by the unassuming name, this module is worth its weight in latinum:)

Essentially, Tie::File allows you to use arbitrarily huge arrays from within perl by presenting the lines of a disk file as the elements of an array, with all the familiar features of perl arrays--including splice!
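To make that concrete, here is a minimal sketch of Tie::File used on its own, separate from the solution further down. The scratch file comes from File::Temp purely for illustration, and the record contents are made up; only tie, push, splice, element assignment and untie are being shown.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Tie::File;
use File::Temp qw(tempfile);

# Hypothetical scratch file for the demo; any writable path would do.
my ($fh, $path) = tempfile(UNLINK => 1);
close $fh;

# Each element of @lines is one line (record) of the file on disk.
tie my @lines, 'Tie::File', $path
    or die "Cannot tie $path: $!";

push @lines, 'first', 'second', 'fourth';   # appends three records to the file
splice @lines, 2, 0, 'third';               # inserts a record before 'fourth'
$lines[0] .= ' (edited)';                   # rewrites record 0 in place

my $count = @lines;                         # number of records in the file
untie @lines;                               # release the file

# Read the file back the ordinary way to see what was written.
open my $in, '<', $path or die "Cannot open $path: $!";
chomp(my @on_disk = <$in>);
close $in;

print "$count records\n";
print "$_\n" for @on_disk;
```

The point is that nothing here looks different from ordinary array code; the file on disk is updated as the array is modified.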

This module allows almost the same algorithm as used above to be used regardless of the size of the dataset involved (subject to OS filesize limits). The only change required is to accumulate the records in the tied array rather than in %hash, which in turn means that the hash value for each key now stores the index of the array element holding that key's record.

I've also added a command line switch that will allow the input files to be processed in any number of passes. Actually, it will do this by default if the output file exists; use the switch (-NEWRUN) to empty the (hardcoded) output file if it exists. The demo currently reads __DATA__ on each run as is. Change this to while (<>) { and supply a list of filenames (or a wildcard under *nix, or add -MG if you have Jenda's G.pm module under Win32).

    #! perl -slw
    use strict;
    use Tie::File;
    use vars qw[$NEWRUN];

    use constant OUTPUT_FILE => './output.dat';

    # Empty the output file if this is a new run (-NEWRUN on the command line).
    # If this switch isn't present, new data will be accumulated onto the existing.
    unlink OUTPUT_FILE if $NEWRUN;

    # Adjust memory as required. See the Tie::File pod for other options.
    tie my @accumulator, 'Tie::File', OUTPUT_FILE, memory => 20_000_000
        or die "Cannot tie " . OUTPUT_FILE . ": $!";

    my %hash;

    # This line preloads the ordering info into the hash if this isn't a new run.
    unless ($NEWRUN) {
        $hash{ (split /\t/, $accumulator[$_])[0] } = $_ for 0 .. $#accumulator;
    }

    while (<DATA>) {    # switching this to <> would allow a list of files to be supplied
        chomp;
        my @bits = split /\t/;
        unless (exists $hash{$bits[0]}) {           # unless we saw this type already
            push @accumulator, $bits[0] . "\t";     # add it to the end of the array (file)
            $hash{$bits[0]} = $#accumulator;        # and remember where in the hash
        }
        # Append the new bit to the appropriate array element (file record).
        $accumulator[ $hash{ $bits[0] } ] .= ' ' . $bits[1];
    }

    untie @accumulator;

    __DATA__
    text1	text-a
    text2	text-b
    text3	text-c
    text1	text-d
    text3	text-e
    text3	text-f

sample output

    D:\Perl\test>253229-2 -NEWRUN

    D:\Perl\test>type output.dat
    text1	 text-a text-d
    text2	 text-b
    text3	 text-c text-e text-f

    D:\Perl\test>253229-2

    D:\Perl\test>type output.dat
    text1	 text-a text-d text-a text-d
    text2	 text-b text-b
    text3	 text-c text-e text-f text-c text-e text-f
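For the switch from <DATA> to <> suggested earlier, here is a rough sketch of the read loop taking its input from a list of files. It uses the simple in-memory hash and array rather than Tie::File to keep it short. The name merge_files is made up for this example, and the two guards (returning early when no files are given, and skipping lines without a tab) are my additions rather than part of the scripts above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# merge_files() is a hypothetical helper name, not part of the original scripts.
# It takes a list of filenames and returns one merged line per key,
# in the order each key was first seen.
sub merge_files {
    return unless @_;           # with no files, <> would fall back to STDIN
    local @ARGV = @_;           # <> reads each file named in @ARGV in turn
    my (%hash, @order);
    while (<>) {
        chomp;
        my ($key, $value) = split /\t/;
        next unless defined $value;     # skip blank or malformed lines
        push @order, $key unless exists $hash{$key};
        $hash{$key} .= ' ' . $value;
    }
    return map { "$_\t$hash{$_}" } @order;
}

# Usage: perl script.pl file1.dat file2.dat
print "$_\n" for merge_files(@ARGV);
```

Because <> simply walks @ARGV, keys that recur across files are merged exactly as if the files had been concatenated.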

Note: This code is really walking on the shoulders of a giant. Dominus deserves the credit and any XP you wish to contribute. Pick a random node and vote:)


Examine what is said, not who speaks.
1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
3) Any sufficiently advanced technology is indistinguishable from magic.
Arthur C. Clarke.

In reply to Re: Removing redundancy by BrowserUk
in thread Removing redundancy by dr_jgbn
