dr_jgbn has asked for the wisdom of the Perl Monks concerning the following question:

Hello Perl Monks,

I am having trouble trying to create an ultra-efficient script that removes redundancy and combines data at the same time. I have several huge files that I have been given at work, in the format:
(Two columns of data that are tab-delimited)
text1 text-a
text2 text-b
text3 text-c
text1 text-d
text3 text-e
text3 text-f
etc...
So... I need to remove the redundancy in the first column, and in addition, each time I encounter a repeated value there I need to concatenate the data found in the second column.
In this example the output would look like:
text1 text-a text-d
text2 text-b
text3 text-c text-e text-f
The order needs to be preserved in the first column.
I tried to do something with a hash but got nowhere...not even sure if this is the best method.
Thanks for any help,
Dr.J

Replies are listed 'Best First'.
Re: Removing redundancy
by BrowserUk (Patriarch) on Apr 26, 2003 at 02:50 UTC

    You need to use an array to preserve the order and a hash to gather the data. There is a module, Tie::IxHash, that will do this for you, but the performance penalty of tying could be a factor as these files are large, and it would only slightly simplify the solution. A possible problem is that if the files are very large, and/or the strings very long, then you will need a lot of memory for this to work.

    If that's the case, say so and another solution is possible.

    #! perl -slw
    use strict;

    my (%hash, @order);

    while (<DATA>) {
        chomp;
        my @bits = split /\t/;
        push @order, $bits[0] unless exists $hash{$bits[0]};
        $hash{$bits[0]} .= ' ' . $bits[1];
    }

    print "$_\t$hash{$_}" for @order;

    __DATA__
    text1	text-a
    text2	text-b
    text3	text-c
    text1	text-d
    text3	text-e
    text3	text-f

    Output

    D:\Perl\test>253229
    text1    text-a text-d
    text2    text-b
    text3    text-c text-e text-f

    Updated from here on.

    As I mentioned above, if your dataset is too large to fit in memory (not forgetting the memory overhead of the array and hash), then you will need a different solution. I originally had a scheme in mind of writing several output files and then having a merge phase to combine them, but there are several problems with this.

    • The first is deciding when to switch to a new output file

      You could just choose an arbitrary number of records, but unless your input records are of a consistent size, you could still run into problems unless you set the limit quite low, and then the merge phase gets more complicated.

      You could use Devel::Peek to track the amount of memory being used and write out as you approach some maximum. Deciding on that maximum might mean resorting to empirical testing.

    • The second problem is the need to retain the ordering between the data files.

      This requires retaining the @order array in memory, which further limits the size of the %hash.

      It also means that if you need to process the input files in several runs, you would need to serialise @order to a separate file between runs--messy (a sketch of one way to do that follows this list).
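
    Just to illustrate that serialisation step: a minimal sketch using the core Storable module might look like this (the filename order.sto is only an example):

    use strict;
    use warnings;
    use Storable qw(store retrieve);

    my $order_file = 'order.sto';   # example filename, pick whatever suits

    # At the end of one run: save the ordering array to disk.
    my @order = ('text1', 'text2', 'text3');
    store \@order, $order_file;

    # At the start of the next run: restore it if a previous run left one behind.
    my @restored = -e $order_file ? @{ retrieve($order_file) } : ();
    print "$_\n" for @restored;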

    Then I remembered reading about mjd's Tie::File module. Don't be fooled by the unassuming name; this module is worth its weight in latinum :)

    Essentially, Tie::File allows you to use arbitrarily huge arrays from within Perl, with all the familiar features of Perl arrays--including splice!
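
    A minimal sketch of the basic idiom, in case you haven't used it before (the filename records.txt is just an example):

    use strict;
    use warnings;
    use Tie::File;

    # Each element of @lines is one line of records.txt; changes are written back to disk.
    tie my @lines, 'Tie::File', 'records.txt'
        or die "Cannot tie records.txt: $!";

    push @lines, 'a new record';    # appends a line to the file
    $lines[0] .= ' (updated)';      # rewrites the first line in place
    splice @lines, 1, 1;            # deletes the second line

    untie @lines;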

    This module allows almost the same algorithm as used above to be used regardless of the size of the dataset involved (subject to OS filesize limits). The only change required is to accumulate the records in the array rather than in %hash, which in turn means that the hash value stores the index of the array element matching each key.

    I've also added a command-line switch that will allow the input files to be processed in any number of passes. Actually, it will do this by default if the output file exists; use the switch (-NEWRUN) to empty the (hardcoded) output file if it exists. The demo currently reads __DATA__ each run as is; change this to while (<>) { and supply a list of filenames (or a wildcard under *nix, or add -MG if you have jenda's G.pm module under Win32).

    #! perl -slw
    use strict;

    use Tie::File;
    use vars qw[$NEWRUN];

    use constant OUTPUT_FILE => './output.dat';

    # Empty the output file if this is a new run (-NEWRUN on the command line).
    # If this switch isn't present, then new data will be accumulated onto the existing.
    unlink OUTPUT_FILE if $NEWRUN;

    tie my @accumulator, 'Tie::File', OUTPUT_FILE,
        memory => 20_000_000;   # Adjust as required. See the Tie::File pod for other options.

    my %hash;

    # This preloads the ordering info into the hash if this isn't a new run.
    unless ($NEWRUN) {
        $hash{ (split /\t/, $accumulator[$_])[0] } = $_ for 0 .. $#accumulator;
    }

    while (<DATA>) {    # switching this to <> would allow a list of files to be supplied
        chomp;
        my @bits = split /\t/;

        unless (exists $hash{$bits[0]}) {        # unless we saw this type already
            push @accumulator, $bits[0] . "\t";  # Add it to the end of the array (file)
            $hash{$bits[0]} = $#accumulator;     # And remember where in the hash
        }

        # Append the new bit to the appropriate array element (file record).
        $accumulator[ $hash{ $bits[0] } ] .= ' ' . $bits[1];
    }

    untie @accumulator;

    __DATA__
    text1	text-a
    text2	text-b
    text3	text-c
    text1	text-d
    text3	text-e
    text3	text-f

    sample output

    Note: This code is really walking on the shoulders of a giant. Dominus deserves the credit and any XP you wish to contribute. Pick a random node and vote:)


    Examine what is said, not who speaks.
    1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
    2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
    3) Any sufficiently advanced technology is indistinguishable from magic.
    Arthur C. Clarke.
Re: Removing redundancy
by Limbic~Region (Chancellor) on Apr 26, 2003 at 02:49 UTC
    dr_jgbn,
    When I see the words "ultra efficient script", I think that this is either:
  • A case where Perl is not the right tool, or
  • A case where Perl is fine and you are being paranoid about premature optimization

    With that said and given your example data - I would:

    #!/usr/bin/perl -w
    use strict;

    my %seen;
    my @data;

    open (INPUT, "file") or die "Error accessing file : $!";

    while (<INPUT>) {
        chomp;
        my ($col1, $col2) = split /\t/;
        if (exists $seen{$col1}) {
            $data[$seen{$col1} - 1] .= " $col2";
        }
        else {
            $seen{$col1} = push @data, "$col1\t$col2";
        }
    }

    print "$_\n" foreach (@data);

    Cheers - L~R

    Update: Completely rewrote the code to be unique, as my original solution closely mirrored the others:

  • Use an array instead of a hash to store the data
  • Use a hash to determine the array element
  • Use the fact that push returns the new number of elements to store in the hash (a quick sketch follows)
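
    The push trick in isolation--push returns the new length of the array, so the element just pushed is always at index one less than the return value:

    my @data;
    my $count = push @data, "text1\ttext-a";   # push returns the new element count: 1
    print $data[$count - 1], "\n";             # prints the record just pushed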
Re: Removing redundancy
by samurai (Monk) on Apr 26, 2003 at 02:49 UTC
    I believe this will work (untested):

    my @order = ();
    my %hash  = ();

    while (<DATA>) {
        # strip the newline so it doesn't end up in the stored data
        chomp;

        # split line on tab
        my @data = split /\t/;

        # for keeping the order
        push @order, $data[0] unless $hash{$data[0]};

        # make an array ref if there isn't one
        $hash{$data[0]} = [] unless $hash{$data[0]};

        # store it away for later
        push @{$hash{$data[0]}}, $data[1];
    }

    # keeping the order here
    for my $col1 (@order) {
        # get that stored array
        my $aref = $hash{$col1};

        # print tab-delimited, grouped on col1 and in order
        print join("\t", $col1, @$aref), "\n";
    }

    From what I gather of what you asked, this is what you're looking for. If I didn't get it right, you need to be more specific ^_^

    --
    perl: code of the samurai

Re: Removing redundancy
by VSarkiss (Monsignor) on Apr 26, 2003 at 02:48 UTC

    Well, the words "huge file" make me wonder if this is the right way to go, but if you want one entry per key, that's spelled "hash" in Perl. Here's a sample (untested) that should take you in the right direction:

    my %hash;    # crappy name but easy to remember

    while (<FILE>)    # assuming FILE is open to the right place
    {
        chomp;                            # strip the trailing newline
        my ($key, $value) = split /\t/;   # only one tab per line
        push @{$hash{$key}}, $value;      # magic of autovivification
    }

    # done, print everything
    foreach my $key (keys %hash)
    {
        print $key, " => ", join(' ', @{$hash{$key}}), "\n";
    }
    The idea is to maintain a hash of the keys (your first column) and associate each with an array of values (your second column). The main problem this may face is that hashes take a lot of memory, so depending on your definition of "huge file", this may not work.

    For more information, you may want to start with the Perl data structures cookbook.

    Update
    BrowserUk is correct below: hashes don't preserve input order. For that, the easiest modification is to use Tie::IxHash if you can take the time and memory hit (a minimal sketch follows).
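
    A minimal sketch of that variant (it assumes Tie::IxHash is installed from CPAN; it is not a core module):

    use strict;
    use warnings;
    use Tie::IxHash;

    # A tied hash whose keys come back in insertion order.
    tie my %hash, 'Tie::IxHash';

    while (<DATA>) {
        chomp;
        # split on whitespace here for the demo; use /\t/ for real tab-delimited data
        my ($key, $value) = split /\s+/, $_, 2;
        $hash{$key} = [] unless exists $hash{$key};   # avoid relying on autovivification through the tie
        push @{ $hash{$key} }, $value;
    }

    # keys %hash preserves the order in which keys were first seen.
    print join("\t", $_, @{ $hash{$_} }), "\n" for keys %hash;

    __DATA__
    text1 text-a
    text2 text-b
    text1 text-d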

      This doesn't preserve the input order.


      Examine what is said, not who speaks.
      1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
      2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
      3) Any sufficiently advanced technology is indistinguishable from magic.
      Arthur C. Clarke.
Re: Removing redundancy
by The Mad Hatter (Priest) on Apr 26, 2003 at 02:46 UTC
    Update: Oops. I didn't see the "preserve order" bit. In that case, this won't do exactly what you want. Sorry. See the other (better) solutions...

    This should do what you want...

    use strict;

    my %data = ();

    while (<DATA>) {
        chomp;
        my ($col1, $col2) = split /\s+/;
        $data{$col1} .= "$col2 ";
    }

    print "$_: $data{$_}\n" foreach (sort keys %data);

    __DATA__
    text1	text-a
    text2	text-b
    text3	text-c
    text1	text-d
    text3	text-e
    text3	text-f
    Here's the explanation:
    1. The while loop goes through each line of input. In this case, each line in <DATA>.
    2. The line is then split on any whitespace into the two columns.
    3. To ensure uniqueness, the first column is used as the key in a hash and the second column is appended to that key's value.
    4. Then it prints each key and its value using a foreach statement. The keys are also sorted.
Re: Removing redundancy
by dpuu (Chaplain) on Apr 26, 2003 at 02:55 UTC
    I'm not entirely clear what you mean by "the order needs to be preserved in the first column". I assume you mean that the first occurrence of each name should appear in the same order as in the input. We can save that information in an array, while storing the col2 data in a hash:
    my @order = ();
    my %values;

    while (<IN>) {
        my ($col1, $col2) = split;
        push @order, $col1 unless exists $values{$col1};
        push @{$values{$col1}}, $col2;
    }

    # and now print it:
    foreach my $col1 (@order) {
        print join("\t", $col1, @{$values{$col1}}), "\n";
    }
    --Dave
      Thank you all for your excellent code! I have tried most of the solutions and they work really well.

      Dr.J