Re: Rosetta Code: Long List is Long -- Python

Environment: FreeBSD stable/13 (amd64) as guest in VirtualBox 6.1.30 on Windows 10 host; 2 CPUs & ~8 GB of RAM are allocated to the VM on ThinkPad X260 (Intel i5-6500U CPU, 16 GB RAM). perl is 5.32.1; python is 3.10.8.

I may attempt at a Rust, which is I am currently learning, version.

Your Perl version takes ~7-8 minute (time to collect data is same enough as my Python version). Mind that time values are of just one run each.

llil start
get_properties : 44 secs
sort + output  : 429 secs
total          : 473 secs
[download]

From choroba's attempt ...

start
get properties: 60 secs
sort + output: 63 secs
total: 123 secs
[download]

From my Python attempt (generally takes ~3-4 minute) ...

# Manually rounded off after posting.
collect time : 37.6 s
sort_via_cmp_to_key time : 115.1 s
sort+format time : 115.1 s
total processing time : 152.7 s
output time : 22.0 s
total time : 174.7 s
[download]

... most of time savings (of ~10-20 s) over my earlier Python attempts came from ...

printing the output as one string instead of printing each line in a loop;
sorting the dict via keys instead of sorting an iterator of tuples created from the "dict";
and using subtraction to compare the numbers instead of explicitly returning one of -1, 0, 1 based on [><=] comparison.

Later I realized there was one extraneous loop; removed now, but the patch keeps the history.

#!/usr/local/bin/python3.10

# This is one Python implementation based on the problem specification
+ ...
#
#   Rosetta Code: Long List is Long, 20221130,
#   by eyepopslikeamosquito
#   https://perlmonks.org/?node_id=11148465
#
# --------------------------------------------------------------------
+--
# LLiL specification
# ------------------
# A LLiL-format file is a text file.
# Each line consists of a lowercase name a TAB character and a non-neg
+ative integer count.
# That is, each line must match : ^[a-z]+\t\d+$
# For example, reading the LLiL-format files, tt1.txt containing:
#   camel\t42
#   pearl\t94
#   dromedary\t69
# and tt2.txt containing:
#   camel\t8
#   hello\t12345
#   dromedary\t1
# returns this hashref:
#   $hash_ret{"camel"}     = 50
#   $hash_ret{"dromedary"} = 70
#   $hash_ret{"hello"}     = 12345
#   $hash_ret{"pearl"}     = 94
# That is, values are added for items with the same key.
#
# To get the required LLiL text, you must sort the returned hashref
# descending by value and insert a TAB separator:
#   hello\t12345
#   pearl\t94
#   dromedary\t70
#   camel\t50
# To make testing via diff easier, we further sort ascending by name
# for lines with the same value.
# --------------------------------------------------------------------
+--


import sys
from functools import cmp_to_key
from pathlib import PosixPath
from timeit import default_timer
from typing import Generator


# Collect time.
time_stat = dict()


# Takes ~35-46 s.
def collect( data_list :list ) ->dict[ str, int ]:
   """
   Returns:
      A `dict` with category as key & total count as value.

   Args:
      data_list: list of file paths as `pathlib.PosixPath` objects.

   Side effect:
      Updates `time_stat` with the time taken to collect the data.
   """
   start = default_timer()

   cat_count = dict()

   delimiter = '\t'
   open_prop = { 'mode' : 'rt',
                  'encoding' : 'ascii',
                  'newline' : '\n'
               }
   for path in data_list:
      with path.open( **open_prop ) as fh:
         for line in fh:
            category, number = line.split( delimiter, 1 )
            if category not in cat_count.keys():
               cat_count[ category ] = 0
            cat_count[ category ] += int( number )

   time_stat['collect time'] = default_timer() - start
   return cat_count


# Time taken to create a "Generator" of formatted strings gets lost in
+ decimal
# place; largely is contributed by "sort_via_cmp_to_key" function.
def process( cat_count :dict[ str, int ] ) ->Generator[ str, None, Non
+e ]:
   """
   Returns:
      A generator of strings composed of category & total count, sorte
+d by
      total count in descending order & by category in ascending order
+.

   Args:
      cat_count: `dict` of category as key & total count as value.

   Side effect:
      Updates `time_stat` dict with the time taken to sort & format th
+e data.
   """
   start = default_timer()

   formatted = ( f'{k}\t{cat_count[ k ]}'
                  for k in sort_via_cmp_to_key( cat_count )
               )

   time_stat['sort+format time'] = default_timer() - start
   return formatted


# Takes ~103-115 s.
def sort_via_cmp_to_key( cat_count :dict[ str, int ] ) ->list[ str ]:
   """
   Returns:
      A a `list` of sorted keys; sorted by total count in descending o
+rder & by
      category in ascending order.

   Args:
      cat_count: `dict` of category as key & total count as value.

   Side effect:
      Updates `time_stat` with the time taken to sort.
   """
   start = default_timer()

   # Set up old fashioned comparison function.
   def compare_desc_count_asc_cat( a :str, b :str  ):
      # key: category;
      # value: total count.
      # Ascending category order.
      if cat_count[ a ] == cat_count[ b ]:
         if a < b:
            return -1
         if a > b:
            return 1
         return 0
      # Descending count order.
      return cat_count[ b ] - cat_count[ a ]
   #
   comparator = cmp_to_key( compare_desc_count_asc_cat )

   out = sorted( cat_count.keys(), key = comparator )

   time_stat['sort_via_cmp_to_key time'] = default_timer() - start
   return out


# Collect file paths.
if sys.argv[1:]:
   data_list = [ PosixPath( p ) for p in sys.argv[1:] ]
else:
   sys.exit( 'Give a file list with data to process' )


process_start_time = default_timer()
processed = process( collect( data_list ) )
process_end_time = default_timer()

# Faster than printing one by one.
# Takes ~14-25 s.
print( '\n'.join( processed ) )
output_end_time = default_timer()

time_stat['total processing time'] = process_end_time - process_start_
+time
time_stat['output time'] = output_end_time - process_end_time
time_stat['total time'] = output_end_time - process_start_time
#
decimal_place = 1
round_format = f'.{decimal_place}f'
time_width = max( len( f'{t:{round_format}}' ) for t in time_stat.valu
+es() ) \
            + decimal_place +1
label_width = max( len( label ) for label in time_stat.keys() )
#
for label, time_taken in time_stat.items():
   sys.stderr.write( f'{label:{label_width}}: {time_taken:{time_width}
+{round_format}} s\n' )
[download]

... said patch (this filler text is to better separate from preceding code portion) ...

@@ -98,20 +98,20 @@
    """
    start = default_timer()
 
-   formatted = ( f'{sk}\t{sv}'
-                  for sk, sv in sort_via_cmp_to_key( cat_count )
+   formatted = ( f'{k}\t{cat_count[ k ]}'
+                  for k in sort_via_cmp_to_key( cat_count )
                )
 
    time_stat['sort+format time'] = default_timer() - start
    return formatted
 
 
-# Takes ~115-120 s.
-def sort_via_cmp_to_key( cat_count :dict[ str, int ] ) ->Generator[ t
+uple[ str, int ], None, None ]:
+# Takes ~103-115 s.
+def sort_via_cmp_to_key( cat_count :dict[ str, int ] ) ->list[ str ]:
    """
    Returns:
-      A generator of 2-element `tuple`s of category & total count, so
+rted by
-      total count in descending order & by category in ascending orde
+r.
+      A a `list` of sorted keys; sorted by total count in descending 
+order & by
+      category in ascending order.
 
    Args:
       cat_count: `dict` of category as key & total count as value.
@@ -137,9 +137,7 @@
    #
    comparator = cmp_to_key( compare_desc_count_asc_cat )
 
-   out = ( ( k, cat_count[ k ] )
-            for k in sorted( cat_count.keys(), key = comparator )
-         )
+   out = sorted( cat_count.keys(), key = comparator )
 
    time_stat['sort_via_cmp_to_key time'] = default_timer() - start
    return out
@@ -165,6 +163,12 @@
 time_stat['output time'] = output_end_time - process_end_time
 time_stat['total time'] = output_end_time - process_start_time
 #
+decimal_place = 1
+round_format = f'.{decimal_place}f'
+time_width = max( len( f'{t:{round_format}}' ) for t in time_stat.val
+ues() ) \
+            + decimal_place +1
+label_width = max( len( label ) for label in time_stat.keys() )
+#
 for label, time_taken in time_stat.items():
-   sys.stderr.write( f'{label} : {time_taken:0.1f}\n' )
+   sys.stderr.write( f'{label:{label_width}}: {time_taken:{time_width
+}{round_format}} s\n' )
[download]

Comment on Re: Rosetta Code: Long List is Long -- Python Select or Download Code

Replies are listed 'Best First'.
Re^2: Rosetta Code: Long List is Long -- Python by marioroy (Prior) on Dec 09, 2022 at 12:01 UTC
This is neat, parv. I compared Python 3.10.7 against Pyston-lite and Pyston. Python: `$ python3 llil.py big1.txt big2.txt big3.txt > out1.txt collect time : 5.2 s sort_via_cmp_to_key time: 17.7 s sort+format time : 17.7 s total processing time : 22.9 s output time : 3.5 s total time : 26.4 s` [download] Python with Pyston-lite loaded automatically: `# Install pyston_lite_autoload to ~/.local/lib/python3.10/site-package +s/. $ python3 -m pip install --user pyston_lite_autoload $ python3 llil.py big1.txt big2.txt big3.txt > out2.txt collect time : 4.4 s sort_via_cmp_to_key time: 15.8 s sort+format time : 15.8 s total processing time : 20.2 s output time : 3.2 s total time : 23.4 s` [download] Pyston 2.3.5: I also tried Pyston 2.3.5. But first, I needed to modify the function declarations to resolve "TypeError: 'type' object is not subscriptable. `$ diff llil.py llil2.py 53c53 < def collect( data_list :list ) ->dict[ str, int ]: --- > def collect( data_list :list ) ->dict: 87c87 < def process( cat_count :dict[ str, int ] ) ->Generator[ str, None, N +one ]: --- > def process( cat_count :dict ) ->Generator: 110c110 < def sort_via_cmp_to_key( cat_count :dict[ str, int ] ) ->list[ str ] +: --- > def sort_via_cmp_to_key( cat_count :dict ) ->list:` [download] `$ ./pyston_2.3.5/bin/pyston3 llil2.py big1.txt big2.txt big3.txt > out +3.txt collect time : 3.7 s sort_via_cmp_to_key time: 12.2 s sort+format time : 12.2 s total processing time : 15.9 s output time : 3.0 s total time : 18.8 s` [download] Look at Pyston go; taking Python performance to a new level :) How does Python/Pyston compare to Perl? I ran the dualvar demonstration for comparison. `$ perl dualvar.pl big1.txt big2.txt big3.txt >out4.txt start get properties: 6 secs sort + output: 16 secs total: 22 secs` [download] The Perl code consumes around 2340 MB versus 1790 MB for Python. See also: Python 3.12 Goals Python is about to get faster The Pyston Blog Numba	[reply] [d/l] [select]
Re^3: Rosetta Code: Long List is Long -- Python by parv (Parson) on Dec 10, 2022 at 04:51 UTC
In short, could you do another Python, Pyston* run with sort function replaced with ... `def sort_native( cat_count ): once = sorted( cat_count.keys() ) return sorted( once, key = lambda k: cat_count[ k ], reverse = True + )` [download] ...? In long, before eyepopslikeamosquito posted ... `sort { $href->{$b} <=> $href->{$a} } sort keys %{$href}` [download] ... I had missed to notice the sorting order. I decided to do the same for Python (native) version instead of using `functools.cmp_to_key` function. I also realized that I was doing the sorting other way (sorting keys by value, followed by sorting by key), so was not getting the expected output (which made me to use `cmp_to_key` instead). Replacing `sort_via_cmp_to_key` with ... def sort_native( cat_count :dict[ str, int ] ) ->list[ str ]: """ Returns: A `list` of sorted keys by decreasing order of values & increasi +ng order of keys. Args: cat_count: A `dict` with string key & integer value. """ once = sorted( cat_count.keys() ) return sorted( once, key = lambda k: cat_count[ k ], reverse = True + ) [download] ... reduces the sort time by ~10 times (~11-16 s; also produces the expected output) in my run environment. time passes, so slowly🎶 ... Putting the complete program here (~28-35 s) ... Read more... Python version, updated with native sort (6 kB)	[reply] [d/l] [select]
Re^4: Rosetta Code: Long List is Long -- Python by marioroy (Prior) on Dec 10, 2022 at 07:05 UTC
> In short, could you do another Python, Pyston* run with sort function replaced with ... First, I made the following modification to resolve "TypeError: 'type' object is not subscriptable" using Pyston 2.3.5. `$ diff llil1.py llil2.py 53c53 < def collect( data_list :list ) ->dict[ str, int ]: --- > def collect( data_list :list ) ->dict: 87c87 < def process( cat_count :dict[ str, int ] ) ->Generator[ str, None, N +one ]: --- > def process( cat_count :dict ) ->Generator: 110c110 < def sort_native( cat_count :dict[ str, int ] ) ->list[ str ]: --- > def sort_native( cat_count :dict ) ->list:` [download] Python 3.10.7: `# I have pyston_lite_autoload installed already. It can be disabled. $ DISABLE_PYSTON=1 python3 llil2.py big1.txt big2.txt big3.txt > out1. +txt collect time : 5.1 s sort+format time : 3.6 s total processing time: 8.7 s output time : 3.6 s total time : 12.2 s` [download] Python with Pyston-lite loaded automatically: `# Install pyston_lite_autoload to ~/.local/lib/python3.10/site-package +s/. $ python3 -m pip install --user pyston_lite_autoload $ python3 llil2.py big1.txt big2.txt big3.txt > out2.txt collect time : 4.3 s sort+format time : 3.5 s total processing time: 7.9 s output time : 3.3 s total time : 11.2 s` [download] Pyston 2.3.5: `$ ./pyston_2.3.5/bin/pyston3 llil2.py big1.txt big2.txt big3.txt > out +3.txt collect time : 3.7 s sort+format time : 3.4 s total processing time: 7.1 s output time : 3.0 s total time : 10.1 s` [download] Pyston makes it all the more fun.	[reply] [d/l] [select]
Re^5: Rosetta Code: Long List is Long -- Python by parv (Parson) on Dec 10, 2022 at 07:17 UTC


Welcome to the Monastery
	PerlMonks