Thanks for the tip!

This is bizarre, but thanks to your tip, while just trying to get a GRT solution to work, I seem to have accidentally "out-marioroy-ed" marioroy. :) I would have thought this impossible, so I've probably overlooked something, but this version runs three seconds faster on my machine than mario's astonishing dualvar solution while using slightly less memory (2,657,968K v 2,824,184K).

Here is the source code. This is just a first cut at this approach, so further improvements may be available.

# llil2grt.pl. Try a GRT version. # Example run: perl llil2grt.pl tt1.txt tt2.txt tt3.txt >out.txt use strict; use warnings; use feature qw{say}; # -------------------------------------------------------------------- +-- # LLiL specification # ------------------ # A LLiL-format file is a text file. # Each line consists of a lowercase name a TAB character and a non-neg +ative integer count. # That is, each line must match : ^[a-z]+\t\d+$ # For example, reading the LLiL-format files, tt1.txt containing: # camel\t42 # pearl\t94 # dromedary\t69 # and tt2.txt containing: # camel\t8 # hello\t12345 # dromedary\t1 # returns this hashref: # $hash_ret{"camel"} = 50 # $hash_ret{"dromedary"} = 70 # $hash_ret{"hello"} = 12345 # $hash_ret{"pearl"} = 94 # That is, values are added for items with the same key. # # To get the required LLiL text, you must sort the returned hashref # descending by value and insert a TAB separator: # hello\t12345 # pearl\t94 # dromedary\t70 # camel\t50 # To make testing via diff easier, we further sort ascending by name # for lines with the same value. # -------------------------------------------------------------------- +-- # Function get_properties # Read a list of LLiL-format files # Return a reference to a hash of properties sub get_properties { my $files = shift; # in: reference to a list of LLiL-format fil +es my %hash_ret; # out: reference to a hash of properties for my $fname ( @{$files} ) { open( my $fh, '<', $fname ) or die "error: open '$fname': $!"; while (<$fh>) { chomp; my ($word, $count) = split /\t/; $hash_ret{$word} += $count; } close($fh) or die "error: close '$fname': $!"; } return \%hash_ret; } # ----------------- mainline ----------------------------------------- +-- @ARGV or die "usage: $0 file...\n"; my @llil_files = @ARGV; warn "llil2grt start\n"; my $tstart1 = time; my $href = get_properties( \@llil_files ); my $tend1 = time; my $taken1 = $tend1 - $tstart1; warn "get_properties : $taken1 secs\n"; my $tstart2 = time; my @lines; while (my ($k, $v) = each %{$href}) { push @lines, pack('NA*', -$v, "$ +k\t$v") } say substr($_, 4) for sort @lines; my $tend2 = time; my $taken2 = $tend2 - $tstart2; my $taken = $tend2 - $tstart1; warn "sort + output : $taken2 secs\n"; warn "total : $taken secs\n";

Example Run

> perl llil2grt.pl big1.txt big2.txt big3.txt >perl2grt.tmp

llil2grt start get_properties : 10 secs sort + output : 20 secs total : 30 secs Memory use (Windows Private Bytes): 2,657,968K

> diff perl2d.tmp perl2grt.tmp

Code Differences

The only substantive differences between dualvar llil2d.pl and llil2grt.pl above are:

my @data; while ( my ($k, $v) = each %{$href} ) { push @data, dualvar($v, $k) } for my $key ( sort { $b <=> $a } sort @data ) { say "$key\t" . (0 + $key); }
vs:
my @lines; while (my ($k, $v) = each %{$href}) { push @lines, pack('NA*', -$v, "$ +k\t$v") } say substr($_, 4) for sort @lines;


In reply to Re^6: Rosetta Code: Long List is Long (Updated Solutions - GRT) by eyepopslikeamosquito
in thread Rosetta Code: Long List is Long by eyepopslikeamosquito

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.