Re^6: Rosetta Code: Long List is Long (Updated Solutions

Thanks for the tip!

This is bizarre, but thanks to your tip, while just trying to get a GRT solution to work, I seem to have accidentally "out-marioroy-ed" marioroy. :) I would have thought this impossible, so I've probably overlooked something, but this version runs three seconds faster on my machine than mario's astonishing dualvar solution while using slightly less memory (2,657,968K v 2,824,184K).

Here is the source code. This is just a first cut at this approach, so further improvements may be available.

# llil2grt.pl. Try a GRT version.
# Example run: perl llil2grt.pl tt1.txt tt2.txt tt3.txt >out.txt

use strict;
use warnings;
use feature qw{say};

# --------------------------------------------------------------------
+--
# LLiL specification
# ------------------
# A LLiL-format file is a text file.
# Each line consists of a lowercase name a TAB character and a non-neg
+ative integer count.
# That is, each line must match : ^[a-z]+\t\d+$
# For example, reading the LLiL-format files, tt1.txt containing:
#   camel\t42
#   pearl\t94
#   dromedary\t69
# and tt2.txt containing:
#   camel\t8
#   hello\t12345
#   dromedary\t1
# returns this hashref:
#   $hash_ret{"camel"}     = 50
#   $hash_ret{"dromedary"} = 70
#   $hash_ret{"hello"}     = 12345
#   $hash_ret{"pearl"}     = 94
# That is, values are added for items with the same key.
#
# To get the required LLiL text, you must sort the returned hashref
# descending by value and insert a TAB separator:
#   hello\t12345
#   pearl\t94
#   dromedary\t70
#   camel\t50
# To make testing via diff easier, we further sort ascending by name
# for lines with the same value.
# --------------------------------------------------------------------
+--

# Function get_properties
# Read a list of LLiL-format files
# Return a reference to a hash of properties
sub get_properties
{
   my $files = shift;    # in:  reference to a list of LLiL-format fil
+es
   my %hash_ret;         # out: reference to a hash of properties
   for my $fname ( @{$files} ) {
      open( my $fh, '<', $fname ) or die "error: open '$fname': $!";
      while (<$fh>) {
         chomp;
         my ($word, $count) = split /\t/;
         $hash_ret{$word} += $count;
      }
      close($fh) or die "error: close '$fname': $!";
   }
   return \%hash_ret;
}

# ----------------- mainline -----------------------------------------
+--

@ARGV or die "usage: $0 file...\n";
my @llil_files = @ARGV;

warn "llil2grt start\n";
my $tstart1 = time;
my $href    = get_properties( \@llil_files );
my $tend1   = time;
my $taken1  = $tend1 - $tstart1;
warn "get_properties : $taken1 secs\n";

my $tstart2 = time;

my @lines;
while (my ($k, $v) = each %{$href}) { push @lines, pack('NA*', -$v, "$
+k\t$v") }
say substr($_, 4) for sort @lines;

my $tend2  = time;
my $taken2 = $tend2 - $tstart2;
my $taken  = $tend2 - $tstart1;

warn "sort + output  : $taken2 secs\n";
warn "total          : $taken secs\n";
[download]

Example Run

> perl llil2grt.pl big1.txt big2.txt big3.txt >perl2grt.tmp
[download]

llil2grt start
get_properties : 10 secs
sort + output  : 20 secs
total          : 30 secs
Memory use (Windows Private Bytes): 2,657,968K
[download]

> diff perl2d.tmp perl2grt.tmp
[download]

Code Differences

The only substantive differences between dualvar llil2d.pl and llil2grt.pl above are:

my @data;
while ( my ($k, $v) = each %{$href} ) { push @data, dualvar($v, $k) }

for my $key ( sort { $b <=> $a } sort @data ) {
   say "$key\t" . (0 + $key);
}
[download]

vs:

my @lines;
while (my ($k, $v) = each %{$href}) { push @lines, pack('NA*', -$v, "$
+k\t$v") }

say substr($_, 4) for sort @lines;
[download]

Comment on Re^6: Rosetta Code: Long List is Long (Updated Solutions - GRT) Select or Download Code

Replies are listed 'Best First'.

Re^7: Rosetta Code: Long List is Long (Updated Solutions - short Perl GRT and for_list)
by eyepopslikeamosquito (Archbishop) on Jan 02, 2023 at 09:58 UTC

llil2cmd.pl - abbreviated version of llil2grt.pl

For cheap thrills, I created llil2cmd.pl, a short command line version of llil2grt.pl:

#!perl -n
# llil2cmd.pl. Abbreviated version of llil2grt.pl.
chomp; ($w,$c) = split/\t/; $h{$w} += $c;
END {
   $\=$/;
   push @l, pack('NA*',-$v,"$k\t$v") while ($k,$v)=each %h;
   print substr($_,4) for sort @l;
}
[download]

Curiously, this abbreviated version runs at about the same speed on Windows, but significantly faster on my Ubuntu Linux VM:

> time perl llil2grt.pl big1.txt big2.txt big3.txt >grt1.tmp
llil2grt start
get_properties : 8 secs
sort + output  : 22 secs
total          : 30 secs

real    0m33.475s
user    0m32.180s
sys     0m1.295s
[download]

> time perl llil2cmd.pl big1.txt big2.txt big3.txt >cmd1.tmp

real    0m28.937s
user    0m27.843s
sys     0m1.093s

> diff cmd1.tmp grt1.tmp
[download]

To get more detailed timings, I hacked out a long short version:

#!perl -n
# llil2cmd-long.pl. A long short version of llil2grt.pl.
BEGIN {
   $tstart1 = time;
}
chomp; ($w,$c) = split/\t/; $h{$w} += $c;
END {
   my $tstart2 = time;
   $\=$/;
   push @l, pack('NA*',-$v,"$k\t$v") while ($k,$v)=each %h;
   print substr($_,4) for sort @l;
   my $tend2  = time;
   my $taken1 = $tstart2 - $tstart1;
   my $taken2 = $tend2   - $tstart2;
   my $taken  = $tend2   - $tstart1;
   warn "get_properties : $taken1 secs\n";
   warn "sort + output  : $taken2 secs\n";
   warn "total          : $taken secs\n";
}
[download]

$ time perl llil2cmd-long.pl big1.txt big2.txt big3.txt >long1.tmp
get_properties : 7 secs
sort + output  : 21 secs
total          : 28 secs

real    0m28.629s
user    0m27.707s
sys     0m0.917s

> diff long1.tmp grt1.tmp
[download]

As you can see from the times reported by the Linux time command, it seems that large lexical variables in Perl are significantly slower to cleanup at program exit than non-lexicals (about three seconds slower in this example: 33.475s vs 30 secs for llil2grt.pl, 28.629s vs 28 secs for llil2cmd-long.pl).

New perl 5.36 experimental for_list feature

After stumbling upon perl 5.36 and the for_list feature - a simple speed comparison I had to give the perl 5.36 for_list feature a try (update: List::Util's pairmap might be worth a try given it was mentioned in a reply).

After building perl v5.36 from source (my Ubuntu system perl is v5.34 - update see improved build perl 5.38 notes):

wget https://www.cpan.org/src/5.0/perl-5.36.0.tar.gz
(update: run sha256sum perl-5.36.0.tar.gz and check matches https://ww
+w.cpan.org/src/5.0/perl-5.36.0.tar.gz.sha256.txt)
tar -xzf perl-5.36.0.tar.gz
cd perl-5.36.0
./Configure -des -Dprefix=$HOME/localperl
make 2>&1 | tee make.tmp
make test 2>&1 | tee test.tmp
make install 2>&1 | tee install.tmp
[download]

use 5.036;
use experimental qw/for_list declared_refs/;
[download]

llil2grt.pl

while (my ($k, $v) = each %{$href}) { push @lines, pack('NA*', -$v, "$
+k\t$v") }
[download]

for my ($k, $v) (%{$href}) { push @lines, pack('NA*', -$v, "$k\t$v") }
[download]

Update: as for why it isn't much faster, see ikegami's replies at: Re^2: Why does each() always re-evaluate its argument? (Updated x2 - experimental "for_list" )

Update: Improved Ubuntu Perl Build Notes

Re^7: Meaning of XS object version (CPAN and Package Manager Security References) - example (secure) build of perl v5.38.0 from source on my Ubuntu Linux VM
Re^5: Size of Judy::HS array: where is MemUsed()? - perldelta, Perl Releases and Building Perl - notes on building Perl from source on my Linux VM

Manual install of CPAN Roman module

Later I manually installed Roman by CHORNY from CPAN into this local non-root Perl 5.36 as follows:

$ cd $HOME/localperlmodules
$ type perl
perl is hashed ($HOME/localperl/bin/perl)
$ wget https://www.cpan.org/modules/by-module/Roman/Roman-1.24.tar.gz
$ tar -xzf Roman-1.24.tar.gz
$ cd Roman-1.24
$ perl Makefile.PL 2>&1 | tee make.tmp
$ make 2>&1 | tee make.tmp
$ make test 2>&1 | tee test.tmp
$ make install 2>&1 | tee install.tmp
[download]

Update: Better to do it via: cpanm --from https://www.cpan.org/ --verify Roman 2>&1 | tee Roman.tmp

Updated: Added steps for building perl v5.36.0 from source and manual install of Roman module. Noted that large lexical variables are slower to cleanup at program exit.

[reply]
[d/l]
[select]


XP is just a number
	PerlMonks

Re^6: Rosetta Code: Long List is Long (Updated Solutions - GRT)