Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery
 
PerlMonks  

Re: Rosetta Code: Long List is Long -- Python

by parv (Parson)
on Dec 02, 2022 at 10:26 UTC ( [id://11148494]=note: print w/replies, xml ) Need Help??


in reply to Rosetta Code: Long List is Long

Environment: FreeBSD stable/13 (amd64) as guest in VirtualBox 6.1.30 on Windows 10 host; 2 CPUs & ~8 GB of RAM are allocated to the VM on ThinkPad X260 (Intel i5-6500U CPU, 16 GB RAM). perl is 5.32.1; python is 3.10.8.

I may attempt at a Rust, which is I am currently learning, version.

Your Perl version takes ~7-8 minute (time to collect data is same enough as my Python version). Mind that time values are of just one run each.

llil start get_properties : 44 secs sort + output : 429 secs total : 473 secs

From choroba's attempt ...

start get properties: 60 secs sort + output: 63 secs total: 123 secs

From my Python attempt (generally takes ~3-4 minute) ...

# Manually rounded off after posting. collect time : 37.6 s sort_via_cmp_to_key time : 115.1 s sort+format time : 115.1 s total processing time : 152.7 s output time : 22.0 s total time : 174.7 s

... most of time savings (of ~10-20 s) over my earlier Python attempts came from ...

  • printing the output as one string instead of printing each line in a loop;
  • sorting the dict via keys instead of sorting an iterator of tuples created from the "dict";
  • and using subtraction to compare the numbers instead of explicitly returning one of -1, 0, 1 based on [><=] comparison.

Later I realized there was one extraneous loop; removed now, but the patch keeps the history.

#!/usr/local/bin/python3.10 # This is one Python implementation based on the problem specification + ... # # Rosetta Code: Long List is Long, 20221130, # by eyepopslikeamosquito # https://perlmonks.org/?node_id=11148465 # # -------------------------------------------------------------------- +-- # LLiL specification # ------------------ # A LLiL-format file is a text file. # Each line consists of a lowercase name a TAB character and a non-neg +ative integer count. # That is, each line must match : ^[a-z]+\t\d+$ # For example, reading the LLiL-format files, tt1.txt containing: # camel\t42 # pearl\t94 # dromedary\t69 # and tt2.txt containing: # camel\t8 # hello\t12345 # dromedary\t1 # returns this hashref: # $hash_ret{"camel"} = 50 # $hash_ret{"dromedary"} = 70 # $hash_ret{"hello"} = 12345 # $hash_ret{"pearl"} = 94 # That is, values are added for items with the same key. # # To get the required LLiL text, you must sort the returned hashref # descending by value and insert a TAB separator: # hello\t12345 # pearl\t94 # dromedary\t70 # camel\t50 # To make testing via diff easier, we further sort ascending by name # for lines with the same value. # -------------------------------------------------------------------- +-- import sys from functools import cmp_to_key from pathlib import PosixPath from timeit import default_timer from typing import Generator # Collect time. time_stat = dict() # Takes ~35-46 s. def collect( data_list :list ) ->dict[ str, int ]: """ Returns: A `dict` with category as key & total count as value. Args: data_list: list of file paths as `pathlib.PosixPath` objects. Side effect: Updates `time_stat` with the time taken to collect the data. """ start = default_timer() cat_count = dict() delimiter = '\t' open_prop = { 'mode' : 'rt', 'encoding' : 'ascii', 'newline' : '\n' } for path in data_list: with path.open( **open_prop ) as fh: for line in fh: category, number = line.split( delimiter, 1 ) if category not in cat_count.keys(): cat_count[ category ] = 0 cat_count[ category ] += int( number ) time_stat['collect time'] = default_timer() - start return cat_count # Time taken to create a "Generator" of formatted strings gets lost in + decimal # place; largely is contributed by "sort_via_cmp_to_key" function. def process( cat_count :dict[ str, int ] ) ->Generator[ str, None, Non +e ]: """ Returns: A generator of strings composed of category & total count, sorte +d by total count in descending order & by category in ascending order +. Args: cat_count: `dict` of category as key & total count as value. Side effect: Updates `time_stat` dict with the time taken to sort & format th +e data. """ start = default_timer() formatted = ( f'{k}\t{cat_count[ k ]}' for k in sort_via_cmp_to_key( cat_count ) ) time_stat['sort+format time'] = default_timer() - start return formatted # Takes ~103-115 s. def sort_via_cmp_to_key( cat_count :dict[ str, int ] ) ->list[ str ]: """ Returns: A a `list` of sorted keys; sorted by total count in descending o +rder & by category in ascending order. Args: cat_count: `dict` of category as key & total count as value. Side effect: Updates `time_stat` with the time taken to sort. """ start = default_timer() # Set up old fashioned comparison function. def compare_desc_count_asc_cat( a :str, b :str ): # key: category; # value: total count. # Ascending category order. if cat_count[ a ] == cat_count[ b ]: if a < b: return -1 if a > b: return 1 return 0 # Descending count order. return cat_count[ b ] - cat_count[ a ] # comparator = cmp_to_key( compare_desc_count_asc_cat ) out = sorted( cat_count.keys(), key = comparator ) time_stat['sort_via_cmp_to_key time'] = default_timer() - start return out # Collect file paths. if sys.argv[1:]: data_list = [ PosixPath( p ) for p in sys.argv[1:] ] else: sys.exit( 'Give a file list with data to process' ) process_start_time = default_timer() processed = process( collect( data_list ) ) process_end_time = default_timer() # Faster than printing one by one. # Takes ~14-25 s. print( '\n'.join( processed ) ) output_end_time = default_timer() time_stat['total processing time'] = process_end_time - process_start_ +time time_stat['output time'] = output_end_time - process_end_time time_stat['total time'] = output_end_time - process_start_time # decimal_place = 1 round_format = f'.{decimal_place}f' time_width = max( len( f'{t:{round_format}}' ) for t in time_stat.valu +es() ) \ + decimal_place +1 label_width = max( len( label ) for label in time_stat.keys() ) # for label, time_taken in time_stat.items(): sys.stderr.write( f'{label:{label_width}}: {time_taken:{time_width} +{round_format}} s\n' )

... said patch (this filler text is to better separate from preceding code portion) ...

@@ -98,20 +98,20 @@ """ start = default_timer() - formatted = ( f'{sk}\t{sv}' - for sk, sv in sort_via_cmp_to_key( cat_count ) + formatted = ( f'{k}\t{cat_count[ k ]}' + for k in sort_via_cmp_to_key( cat_count ) ) time_stat['sort+format time'] = default_timer() - start return formatted -# Takes ~115-120 s. -def sort_via_cmp_to_key( cat_count :dict[ str, int ] ) ->Generator[ t +uple[ str, int ], None, None ]: +# Takes ~103-115 s. +def sort_via_cmp_to_key( cat_count :dict[ str, int ] ) ->list[ str ]: """ Returns: - A generator of 2-element `tuple`s of category & total count, so +rted by - total count in descending order & by category in ascending orde +r. + A a `list` of sorted keys; sorted by total count in descending +order & by + category in ascending order. Args: cat_count: `dict` of category as key & total count as value. @@ -137,9 +137,7 @@ # comparator = cmp_to_key( compare_desc_count_asc_cat ) - out = ( ( k, cat_count[ k ] ) - for k in sorted( cat_count.keys(), key = comparator ) - ) + out = sorted( cat_count.keys(), key = comparator ) time_stat['sort_via_cmp_to_key time'] = default_timer() - start return out @@ -165,6 +163,12 @@ time_stat['output time'] = output_end_time - process_end_time time_stat['total time'] = output_end_time - process_start_time # +decimal_place = 1 +round_format = f'.{decimal_place}f' +time_width = max( len( f'{t:{round_format}}' ) for t in time_stat.val +ues() ) \ + + decimal_place +1 +label_width = max( len( label ) for label in time_stat.keys() ) +# for label, time_taken in time_stat.items(): - sys.stderr.write( f'{label} : {time_taken:0.1f}\n' ) + sys.stderr.write( f'{label:{label_width}}: {time_taken:{time_width +}{round_format}} s\n' )

Replies are listed 'Best First'.
Re^2: Rosetta Code: Long List is Long -- Python
by marioroy (Prior) on Dec 09, 2022 at 12:01 UTC

    This is neat, parv. I compared Python 3.10.7 against Pyston-lite and Pyston.

    Python:

    $ python3 llil.py big1.txt big2.txt big3.txt > out1.txt collect time : 5.2 s sort_via_cmp_to_key time: 17.7 s sort+format time : 17.7 s total processing time : 22.9 s output time : 3.5 s total time : 26.4 s

    Python with Pyston-lite loaded automatically:

    # Install pyston_lite_autoload to ~/.local/lib/python3.10/site-package +s/. $ python3 -m pip install --user pyston_lite_autoload $ python3 llil.py big1.txt big2.txt big3.txt > out2.txt collect time : 4.4 s sort_via_cmp_to_key time: 15.8 s sort+format time : 15.8 s total processing time : 20.2 s output time : 3.2 s total time : 23.4 s

    Pyston 2.3.5:

    I also tried Pyston 2.3.5. But first, I needed to modify the function declarations to resolve "TypeError: 'type' object is not subscriptable.

    $ diff llil.py llil2.py 53c53 < def collect( data_list :list ) ->dict[ str, int ]: --- > def collect( data_list :list ) ->dict: 87c87 < def process( cat_count :dict[ str, int ] ) ->Generator[ str, None, N +one ]: --- > def process( cat_count :dict ) ->Generator: 110c110 < def sort_via_cmp_to_key( cat_count :dict[ str, int ] ) ->list[ str ] +: --- > def sort_via_cmp_to_key( cat_count :dict ) ->list:
    $ ./pyston_2.3.5/bin/pyston3 llil2.py big1.txt big2.txt big3.txt > out +3.txt collect time : 3.7 s sort_via_cmp_to_key time: 12.2 s sort+format time : 12.2 s total processing time : 15.9 s output time : 3.0 s total time : 18.8 s

    Look at Pyston go; taking Python performance to a new level :) How does Python/Pyston compare to Perl? I ran the dualvar demonstration for comparison.

    $ perl dualvar.pl big1.txt big2.txt big3.txt >out4.txt start get properties: 6 secs sort + output: 16 secs total: 22 secs

    The Perl code consumes around 2340 MB versus 1790 MB for Python.

    See also:

    Python 3.12 Goals

    Python is about to get faster

    The Pyston Blog

    Numba

      In short, could you do another Python, Pyston* run with sort function replaced with ...

      def sort_native( cat_count ): once = sorted( cat_count.keys() ) return sorted( once, key = lambda k: cat_count[ k ], reverse = True + )

      ...?

      In long, before eyepopslikeamosquito posted ...

      sort { $href->{$b} <=> $href->{$a} } sort keys %{$href}

      ... I had missed to notice the sorting order. I decided to do the same for Python (native) version instead of using functools.cmp_to_key function. I also realized that I was doing the sorting other way (sorting keys by value, followed by sorting by key), so was not getting the expected output (which made me to use cmp_to_key instead).

      Replacing sort_via_cmp_to_key with ...

      def sort_native( cat_count :dict[ str, int ] ) ->list[ str ]: """ Returns: A `list` of sorted keys by decreasing order of values & increasi +ng order of keys. Args: cat_count: A `dict` with string key & integer value. """ once = sorted( cat_count.keys() ) return sorted( once, key = lambda k: cat_count[ k ], reverse = True + )

      ... reduces the sort time by ~10 times (~11-16 s; also produces the expected output) in my run environment.

      time passes, so slowly🎶 ... Putting the complete program here (~28-35 s) ...

        > In short, could you do another Python, Pyston* run with sort function replaced with ...

        First, I made the following modification to resolve "TypeError: 'type' object is not subscriptable" using Pyston 2.3.5.

        $ diff llil1.py llil2.py 53c53 < def collect( data_list :list ) ->dict[ str, int ]: --- > def collect( data_list :list ) ->dict: 87c87 < def process( cat_count :dict[ str, int ] ) ->Generator[ str, None, N +one ]: --- > def process( cat_count :dict ) ->Generator: 110c110 < def sort_native( cat_count :dict[ str, int ] ) ->list[ str ]: --- > def sort_native( cat_count :dict ) ->list:

        Python 3.10.7:

        # I have pyston_lite_autoload installed already. It can be disabled. $ DISABLE_PYSTON=1 python3 llil2.py big1.txt big2.txt big3.txt > out1. +txt collect time : 5.1 s sort+format time : 3.6 s total processing time: 8.7 s output time : 3.6 s total time : 12.2 s

        Python with Pyston-lite loaded automatically:

        # Install pyston_lite_autoload to ~/.local/lib/python3.10/site-package +s/. $ python3 -m pip install --user pyston_lite_autoload $ python3 llil2.py big1.txt big2.txt big3.txt > out2.txt collect time : 4.3 s sort+format time : 3.5 s total processing time: 7.9 s output time : 3.3 s total time : 11.2 s

        Pyston 2.3.5:

        $ ./pyston_2.3.5/bin/pyston3 llil2.py big1.txt big2.txt big3.txt > out +3.txt collect time : 3.7 s sort+format time : 3.4 s total processing time: 7.1 s output time : 3.0 s total time : 10.1 s

        Pyston makes it all the more fun.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://11148494]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chilling in the Monastery: (2)
As of 2024-04-20 11:51 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found