Re^4: Rosetta Code: Long List is Long (Updated Solutions)

Replies are listed 'Best First'.
Re^5: Rosetta Code: Long List is Long (Updated Solutions) by jwkrahn (Abbot) on Dec 09, 2022 at 22:47 UTC
Perhaps add a negative sign to the numbers before sorting and remove it after?	[reply]
Re^6: Rosetta Code: Long List is Long (Updated Solutions - GRT) by eyepopslikeamosquito (Archbishop) on Dec 10, 2022 at 13:53 UTC
Thanks for the tip! This is bizarre, but thanks to your tip, while just trying to get a GRT solution to work, I seem to have accidentally "out-marioroy-ed" marioroy. :) I would have thought this impossible, so I've probably overlooked something, but this version runs three seconds faster on my machine than mario's astonishing dualvar solution while using slightly less memory (`2,657,968K` v `2,824,184K`). Here is the source code. This is just a first cut at this approach, so further improvements may be available. # llil2grt.pl. Try a GRT version. # Example run: perl llil2grt.pl tt1.txt tt2.txt tt3.txt >out.txt use strict; use warnings; use feature qw{say}; # -------------------------------------------------------------------- +-- # LLiL specification # ------------------ # A LLiL-format file is a text file. # Each line consists of a lowercase name a TAB character and a non-neg +ative integer count. # That is, each line must match : ^[a-z]+\t\d+$ # For example, reading the LLiL-format files, tt1.txt containing: # camel\t42 # pearl\t94 # dromedary\t69 # and tt2.txt containing: # camel\t8 # hello\t12345 # dromedary\t1 # returns this hashref: # $hash_ret{"camel"} = 50 # $hash_ret{"dromedary"} = 70 # $hash_ret{"hello"} = 12345 # $hash_ret{"pearl"} = 94 # That is, values are added for items with the same key. # # To get the required LLiL text, you must sort the returned hashref # descending by value and insert a TAB separator: # hello\t12345 # pearl\t94 # dromedary\t70 # camel\t50 # To make testing via diff easier, we further sort ascending by name # for lines with the same value. # -------------------------------------------------------------------- +-- # Function get_properties # Read a list of LLiL-format files # Return a reference to a hash of properties sub get_properties { my $files = shift; # in: reference to a list of LLiL-format fil +es my %hash_ret; # out: reference to a hash of properties for my $fname ( @{$files} ) { open( my $fh, '<', $fname ) or die "error: open '$fname': $!"; while (<$fh>) { chomp; my ($word, $count) = split /\t/; $hash_ret{$word} += $count; } close($fh) or die "error: close '$fname': $!"; } return \%hash_ret; } # ----------------- mainline ----------------------------------------- +-- @ARGV or die "usage: $0 file...\n"; my @llil_files = @ARGV; warn "llil2grt start\n"; my $tstart1 = time; my $href = get_properties( \@llil_files ); my $tend1 = time; my $taken1 = $tend1 - $tstart1; warn "get_properties : $taken1 secs\n"; my $tstart2 = time; my @lines; while (my ($k, $v) = each %{$href}) { push @lines, pack('NA', -$v, "$ +k\t$v") } say substr($_, 4) for sort @lines; my $tend2 = time; my $taken2 = $tend2 - $tstart2; my $taken = $tend2 - $tstart1; warn "sort + output : $taken2 secs\n"; warn "total : $taken secs\n"; [download] Example Run* `> perl llil2grt.pl big1.txt big2.txt big3.txt >perl2grt.tmp` [download] `llil2grt start get_properties : 10 secs sort + output : 20 secs total : 30 secs Memory use (Windows Private Bytes): 2,657,968K` [download] `> diff perl2d.tmp perl2grt.tmp` [download] Code Differences The only substantive differences between dualvar llil2d.pl and `llil2grt.pl` above are: `my @data; while ( my ($k, $v) = each %{$href} ) { push @data, dualvar($v, $k) } for my $key ( sort { $b <=> $a } sort @data ) { say "$key\t" . (0 + $key); }` [download] vs: `my @lines; while (my ($k, $v) = each %{$href}) { push @lines, pack('NA*', -$v, "$ +k\t$v") } say substr($_, 4) for sort @lines;` [download]	[reply] [d/l] [select]
Re^7: Rosetta Code: Long List is Long (Updated Solutions - short Perl GRT and for_list) by eyepopslikeamosquito (Archbishop) on Jan 02, 2023 at 09:58 UTC
llil2cmd.pl - abbreviated version of llil2grt.pl For cheap thrills, I created `llil2cmd.pl`, a short command line version of `llil2grt.pl`: `#!perl -n # llil2cmd.pl. Abbreviated version of llil2grt.pl. chomp; ($w,$c) = split/\t/; $h{$w} += $c; END { $\=$/; push @l, pack('NA',-$v,"$k\t$v") while ($k,$v)=each %h; print substr($_,4) for sort @l; }` [download] Curiously, this abbreviated version runs at about the same speed on Windows, but significantly faster on my Ubuntu Linux VM: `> time perl llil2grt.pl big1.txt big2.txt big3.txt >grt1.tmp llil2grt start get_properties : 8 secs sort + output : 22 secs total : 30 secs real 0m33.475s user 0m32.180s sys 0m1.295s` [download] `> time perl llil2cmd.pl big1.txt big2.txt big3.txt >cmd1.tmp real 0m28.937s user 0m27.843s sys 0m1.093s > diff cmd1.tmp grt1.tmp` [download] To get more detailed timings, I hacked out a long short version: `#!perl -n # llil2cmd-long.pl. A long short version of llil2grt.pl. BEGIN { $tstart1 = time; } chomp; ($w,$c) = split/\t/; $h{$w} += $c; END { my $tstart2 = time; $\=$/; push @l, pack('NA',-$v,"$k\t$v") while ($k,$v)=each %h; print substr($_,4) for sort @l; my $tend2 = time; my $taken1 = $tstart2 - $tstart1; my $taken2 = $tend2 - $tstart2; my $taken = $tend2 - $tstart1; warn "get_properties : $taken1 secs\n"; warn "sort + output : $taken2 secs\n"; warn "total : $taken secs\n"; }` [download] `$ time perl llil2cmd-long.pl big1.txt big2.txt big3.txt >long1.tmp get_properties : 7 secs sort + output : 21 secs total : 28 secs real 0m28.629s user 0m27.707s sys 0m0.917s > diff long1.tmp grt1.tmp` [download] As you can see from the times reported by the Linux `time` command, it seems that large lexical variables in Perl are significantly slower to cleanup at program exit than non-lexicals (about three seconds slower in this example: 33.475s vs 30 secs for `llil2grt.pl`, 28.629s vs 28 secs for `llil2cmd-long.pl`). New perl 5.36 experimental `for_list` feature After stumbling upon perl 5.36 and the for_list feature - a simple speed comparison I had to give the perl 5.36 `for_list` feature a try (update: List::Util's `pairmap` might be worth a try given it was mentioned in a reply). After building perl v5.36 from source (my Ubuntu system perl is v5.34 - update see improved build perl 5.38 notes): `wget https://www.cpan.org/src/5.0/perl-5.36.0.tar.gz (update: run sha256sum perl-5.36.0.tar.gz and check matches https://ww +w.cpan.org/src/5.0/perl-5.36.0.tar.gz.sha256.txt) tar -xzf perl-5.36.0.tar.gz cd perl-5.36.0 ./Configure -des -Dprefix=$HOME/localperl make 2>&1 \| tee make.tmp make test 2>&1 \| tee test.tmp make install 2>&1 \| tee install.tmp` [download] and adding: `use 5.036; use experimental qw/for_list declared_refs/;` [download] to the top of `llil2grt.pl` while changing one line from: `while (my ($k, $v) = each %{$href}) { push @lines, pack('NA', -$v, "$ +k\t$v") }` [download] to: `for my ($k, $v) (%{$href}) { push @lines, pack('NA', -$v, "$k\t$v") }` [download] it produced the same result, but did not run appreciably faster. Update: as for why it isn't much faster, see ikegami's replies at: Re^2: Why does each() always re-evaluate its argument? (Updated x2 - experimental "for_list" ) Update: Improved Ubuntu Perl Build Notes Re^7: Meaning of XS object version (Package Manager Security References - example building Perl securely from source) - example (secure) build of perl v5.38.0 from source on my Ubuntu Linux VM Re^5: Size of Judy::HS array: where is MemUsed()? - perldelta, Perl Releases and Building Perl - notes on building Perl from source on my Linux VM Manual install of CPAN Roman module Later I manually installed Roman by `CHORNY` from CPAN into this local non-root Perl 5.36 as follows: `$ cd $HOME/localperlmodules $ type perl perl is hashed ($HOME/localperl/bin/perl) $ wget https://www.cpan.org/modules/by-module/Roman/Roman-1.24.tar.gz $ tar -xzf Roman-1.24.tar.gz $ cd Roman-1.24 $ perl Makefile.PL 2>&1 \| tee make.tmp $ make 2>&1 \| tee make.tmp $ make test 2>&1 \| tee test.tmp $ make install 2>&1 \| tee install.tmp` [download] Update: Better to do it via: `cpanm --from https://www.cpan.org/ --verify Roman 2>&1 \| tee Roman.tmp` Updated: Added steps for building perl v5.36.0 from source and manual install of Roman module. Noted that large lexical variables are slower to cleanup at program exit.	[reply] [d/l] [select]