Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number
 
PerlMonks  

Re^6: Rosetta Code: Long List is Long (Updated Solutions - GRT)

by eyepopslikeamosquito (Archbishop)
on Dec 10, 2022 at 13:53 UTC ( [id://11148713]=note: print w/replies, xml ) Need Help??


in reply to Re^5: Rosetta Code: Long List is Long (Updated Solutions)
in thread Rosetta Code: Long List is Long

Thanks for the tip!

This is bizarre, but thanks to your tip, while just trying to get a GRT solution to work, I seem to have accidentally "out-marioroy-ed" marioroy. :) I would have thought this impossible, so I've probably overlooked something, but this version runs three seconds faster on my machine than mario's astonishing dualvar solution while using slightly less memory (2,657,968K v 2,824,184K).

Here is the source code. This is just a first cut at this approach, so further improvements may be available.

# llil2grt.pl. Try a GRT version. # Example run: perl llil2grt.pl tt1.txt tt2.txt tt3.txt >out.txt use strict; use warnings; use feature qw{say}; # -------------------------------------------------------------------- +-- # LLiL specification # ------------------ # A LLiL-format file is a text file. # Each line consists of a lowercase name a TAB character and a non-neg +ative integer count. # That is, each line must match : ^[a-z]+\t\d+$ # For example, reading the LLiL-format files, tt1.txt containing: # camel\t42 # pearl\t94 # dromedary\t69 # and tt2.txt containing: # camel\t8 # hello\t12345 # dromedary\t1 # returns this hashref: # $hash_ret{"camel"} = 50 # $hash_ret{"dromedary"} = 70 # $hash_ret{"hello"} = 12345 # $hash_ret{"pearl"} = 94 # That is, values are added for items with the same key. # # To get the required LLiL text, you must sort the returned hashref # descending by value and insert a TAB separator: # hello\t12345 # pearl\t94 # dromedary\t70 # camel\t50 # To make testing via diff easier, we further sort ascending by name # for lines with the same value. # -------------------------------------------------------------------- +-- # Function get_properties # Read a list of LLiL-format files # Return a reference to a hash of properties sub get_properties { my $files = shift; # in: reference to a list of LLiL-format fil +es my %hash_ret; # out: reference to a hash of properties for my $fname ( @{$files} ) { open( my $fh, '<', $fname ) or die "error: open '$fname': $!"; while (<$fh>) { chomp; my ($word, $count) = split /\t/; $hash_ret{$word} += $count; } close($fh) or die "error: close '$fname': $!"; } return \%hash_ret; } # ----------------- mainline ----------------------------------------- +-- @ARGV or die "usage: $0 file...\n"; my @llil_files = @ARGV; warn "llil2grt start\n"; my $tstart1 = time; my $href = get_properties( \@llil_files ); my $tend1 = time; my $taken1 = $tend1 - $tstart1; warn "get_properties : $taken1 secs\n"; my $tstart2 = time; my @lines; while (my ($k, $v) = each %{$href}) { push @lines, pack('NA*', -$v, "$ +k\t$v") } say substr($_, 4) for sort @lines; my $tend2 = time; my $taken2 = $tend2 - $tstart2; my $taken = $tend2 - $tstart1; warn "sort + output : $taken2 secs\n"; warn "total : $taken secs\n";

Example Run

> perl llil2grt.pl big1.txt big2.txt big3.txt >perl2grt.tmp

llil2grt start get_properties : 10 secs sort + output : 20 secs total : 30 secs Memory use (Windows Private Bytes): 2,657,968K

> diff perl2d.tmp perl2grt.tmp

Code Differences

The only substantive differences between dualvar llil2d.pl and llil2grt.pl above are:

my @data; while ( my ($k, $v) = each %{$href} ) { push @data, dualvar($v, $k) } for my $key ( sort { $b <=> $a } sort @data ) { say "$key\t" . (0 + $key); }
vs:
my @lines; while (my ($k, $v) = each %{$href}) { push @lines, pack('NA*', -$v, "$ +k\t$v") } say substr($_, 4) for sort @lines;

Replies are listed 'Best First'.
Re^7: Rosetta Code: Long List is Long (Updated Solutions - short Perl GRT and for_list)
by eyepopslikeamosquito (Archbishop) on Jan 02, 2023 at 09:58 UTC

    llil2cmd.pl - abbreviated version of llil2grt.pl

    For cheap thrills, I created llil2cmd.pl, a short command line version of llil2grt.pl:

    #!perl -n # llil2cmd.pl. Abbreviated version of llil2grt.pl. chomp; ($w,$c) = split/\t/; $h{$w} += $c; END { $\=$/; push @l, pack('NA*',-$v,"$k\t$v") while ($k,$v)=each %h; print substr($_,4) for sort @l; }

    Curiously, this abbreviated version runs at about the same speed on Windows, but significantly faster on my Ubuntu Linux VM:

    > time perl llil2grt.pl big1.txt big2.txt big3.txt >grt1.tmp llil2grt start get_properties : 8 secs sort + output : 22 secs total : 30 secs real 0m33.475s user 0m32.180s sys 0m1.295s

    > time perl llil2cmd.pl big1.txt big2.txt big3.txt >cmd1.tmp real 0m28.937s user 0m27.843s sys 0m1.093s > diff cmd1.tmp grt1.tmp

    To get more detailed timings, I hacked out a long short version:

    #!perl -n # llil2cmd-long.pl. A long short version of llil2grt.pl. BEGIN { $tstart1 = time; } chomp; ($w,$c) = split/\t/; $h{$w} += $c; END { my $tstart2 = time; $\=$/; push @l, pack('NA*',-$v,"$k\t$v") while ($k,$v)=each %h; print substr($_,4) for sort @l; my $tend2 = time; my $taken1 = $tstart2 - $tstart1; my $taken2 = $tend2 - $tstart2; my $taken = $tend2 - $tstart1; warn "get_properties : $taken1 secs\n"; warn "sort + output : $taken2 secs\n"; warn "total : $taken secs\n"; }

    $ time perl llil2cmd-long.pl big1.txt big2.txt big3.txt >long1.tmp get_properties : 7 secs sort + output : 21 secs total : 28 secs real 0m28.629s user 0m27.707s sys 0m0.917s > diff long1.tmp grt1.tmp

    As you can see from the times reported by the Linux time command, it seems that large lexical variables in Perl are significantly slower to cleanup at program exit than non-lexicals (about three seconds slower in this example: 33.475s vs 30 secs for llil2grt.pl, 28.629s vs 28 secs for llil2cmd-long.pl).

    New perl 5.36 experimental for_list feature

    After stumbling upon perl 5.36 and the for_list feature - a simple speed comparison I had to give the perl 5.36 for_list feature a try (update: List::Util's pairmap might be worth a try given it was mentioned in a reply).

    After building perl v5.36 from source (my Ubuntu system perl is v5.34 - update see improved build perl 5.38 notes):

    wget https://www.cpan.org/src/5.0/perl-5.36.0.tar.gz (update: run sha256sum perl-5.36.0.tar.gz and check matches https://ww +w.cpan.org/src/5.0/perl-5.36.0.tar.gz.sha256.txt) tar -xzf perl-5.36.0.tar.gz cd perl-5.36.0 ./Configure -des -Dprefix=$HOME/localperl make 2>&1 | tee make.tmp make test 2>&1 | tee test.tmp make install 2>&1 | tee install.tmp
    and adding:
    use 5.036; use experimental qw/for_list declared_refs/;
    to the top of llil2grt.pl while changing one line from:
    while (my ($k, $v) = each %{$href}) { push @lines, pack('NA*', -$v, "$ +k\t$v") }
    to:
    for my ($k, $v) (%{$href}) { push @lines, pack('NA*', -$v, "$k\t$v") }
    it produced the same result, but did not run appreciably faster.

    Update: as for why it isn't much faster, see ikegami's replies at: Re^2: Why does each() always re-evaluate its argument? (Updated x2 - experimental "for_list" )

    Update: Improved Ubuntu Perl Build Notes

    Manual install of CPAN Roman module

    Later I manually installed Roman by CHORNY from CPAN into this local non-root Perl 5.36 as follows:

    $ cd $HOME/localperlmodules $ type perl perl is hashed ($HOME/localperl/bin/perl) $ wget https://www.cpan.org/modules/by-module/Roman/Roman-1.24.tar.gz $ tar -xzf Roman-1.24.tar.gz $ cd Roman-1.24 $ perl Makefile.PL 2>&1 | tee make.tmp $ make 2>&1 | tee make.tmp $ make test 2>&1 | tee test.tmp $ make install 2>&1 | tee install.tmp

    Update: Better to do it via: cpanm --from https://www.cpan.org/ --verify Roman 2>&1 | tee Roman.tmp

    Updated: Added steps for building perl v5.36.0 from source and manual install of Roman module. Noted that large lexical variables are slower to cleanup at program exit.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://11148713]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others drinking their drinks and smoking their pipes about the Monastery: (1)
As of 2024-04-19 00:39 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found