in reply to Re: Python regex faster than Perl? - Chunking
in thread Python regex faster than Perl?

Here I try 10x the input size, i.e. a ~1 GB input file.

# choroba's input generator, 990 MB
# https://perlmonks.org/?node_id=11164445
# perl gen_input.pl > big

use strict;
use warnings;

for (1..100_000_000) {
    print int(rand 2)
        ? "xyzabcd"
        : ("mul(" . int(rand 5000) . "," . int(rand 5000) . ")");
    print "\n" unless int rand 10;
}

Non-chunking consumes more than 1 GB of memory (~1.1 GB), since the whole file is slurped at once. Chunking consumes significantly less (~10 MB). A minimal sketch of the chunked serial counter follows the timings below.

Non-chunking: > 1 GB memory

Perl    4.395s  Found 1999533 matches.
Python  2.262s  Found 1999533 matches.

Chunking: ~ 10 MB memory

Perl    4.422s  Found 1999533 matches.
Python  2.247s  Found 1999533 matches.
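
For reference, here is a minimal sketch of the chunked serial counter. It assumes the read_file generator from the prior example (shown again in the parallel demo below) and is not the exact benchmark script.

#!/usr/bin/env python
# Minimal sketch of a chunked serial counter; assumes the read_file
# generator from the prior example, not the exact benchmark script.
import re, sys

def read_file(file, chunk_size=65536*16):
    """ Lazily read the file in chunks, extending each chunk
        to the end of the current line. """
    while True:
        chunk = file.read(chunk_size)
        if not chunk:
            break
        if not chunk.endswith('\n'):
            chunk += file.readline()
        yield chunk

mul_re = re.compile(r"mul\(\d{1,3},\d{1,3}\)")
count = 0

with open(sys.argv[1], "r") as f:
    for chunk in read_file(f):
        count += len(mul_re.findall(chunk))

print(f"Found {count} matches.")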

Re^3: Python regex faster than Perl? - Chunking 1 GB parallel
by marioroy (Prior) on Mar 27, 2025 at 08:31 UTC

    Perl parallel demonstration

    Seeing Python win by 2x made me consider a parallel variant using MCE, but more so to find out whether this scales as the number of workers increases.

    #!/usr/bin/env perl
    # time NUM_THREADS=3 perl pcount.pl big

    use v5.36;
    use autodie;
    use MCE;

    exit 1 if not @ARGV;

    my $mul_pattern = 'mul\(\d{1,3},\d{1,3}\)';
    my $filename    = shift;
    my $count       = 0;

    sub reduce_count ($worker_count) {
        $count += $worker_count;
    }

    my $mce = MCE->new(
        max_workers => $ENV{NUM_THREADS} // MCE::Util::get_ncpu(),
        chunk_size  => 65536 * 16,
        use_slurpio => 1,
        gather      => \&reduce_count,
        user_func   => sub {
            my ($mce, $slurp_ref, $chunk_id) = @_;
            my $count = () = $$slurp_ref =~ m/$mul_pattern/g;
            $mce->gather($count);
        }
    )->spawn;

    $mce->process({ input_data => $filename });
    $mce->shutdown;

    print "Found $count matches.\n";

    This calls for slurpio for best performance: no line-by-line processing behind the scenes. The MCE gather option is set to a reduce function to tally the counts, and chunk_size is increased to reduce IPC among the workers. The input file is read serially.

    Results

    Found 1999533 matches.

    1: 4.420s
    2: 2.263s  needs 2 workers to reach Python performance
    3: 1.511s
    4: 1.154s
    5: 0.940s
    6: 0.788s
    7: 0.680s
    8: 0.600s
    9: 0.538s

    Python parallel demonstration

    Now I wonder about going parallel in Python. We can reuse the read_file chunking generator introduced in the prior example.

    #!/usr/bin/env python
    # time NUM_THREADS=3 python pcount.py big

    import os, re, sys
    from multiprocessing import Pool, cpu_count

    if len(sys.argv) < 2:
        sys.exit(1)

    def read_file(file, chunk_size=65536*16):
        """ Lazy function generator to read a file in chunks,
            including to the end of line. """
        while True:
            chunk = file.read(chunk_size)
            if not chunk:
                break
            if not chunk.endswith('\n'):
                chunk += file.readline()
            yield chunk

    def process_chunk(chunk):
        """ Worker function to process chunks in parallel. """
        count = len(mul_re.findall(chunk))
        return count

    mul_re = re.compile(r"mul\(\d{1,3},\d{1,3}\)")
    num_processes = int(os.getenv('NUM_THREADS') or cpu_count())
    p = Pool(num_processes)

    file_name = sys.argv[1]

    try:
        with open(file_name, "r") as file:
            results = p.map(process_chunk, read_file(file))
        p.close()
        p.join()
    except Exception as e:
        print(e, file=sys.stderr)
        sys.exit(1)

    print(f"Found {sum(results)} matches.")

    Results

    Found 1999533 matches.

    1: 3.131s
    2: 1.824s
    3: 1.408s
    4: 1.178s
    5: 1.187s
    6: 1.187s
    7: 1.172s
    8: 1.008s
    9: 0.995s

      Yet another attempt, this time with MCE-like chunking for a more apples-to-apples comparison. Workers seek to their offset position and slurp the whole chunk.

      Last updated on March 28, 2025.

      Python parallel demonstration

      #!/usr/bin/env python
      # time NUM_THREADS=3 python pcount2.py big

      import os, re, sys
      from concurrent.futures import ProcessPoolExecutor, as_completed
      from multiprocessing import Pool, cpu_count, Lock

      if len(sys.argv) < 2:
          sys.exit(1)

      mul_re = re.compile(r"mul\(\d{1,3},\d{1,3}\)")
      lock = Lock()

      def process_chunk(params):
          """ Worker function to process chunks in parallel.
              Read IO is serial to not cause a denial of service
              for SAN storage. Comment out locking for parallel IO. """
          file_name, chunk_start, chunk_end = params
          with open(file_name, "r") as f:
              f.seek(chunk_start)
              # lock.acquire()
              chunk = f.read(chunk_end - chunk_start)
              # lock.release()
          count = len(mul_re.findall(chunk))
          return count

      def parallel_read(file_name):
          num_processes = int(os.getenv('NUM_THREADS') or cpu_count())

          def gen_offsets():
              # Emit next offset input [ file_name, chunk_start, chunk_end ]
              try:
                  file_size = os.path.getsize(file_name)
              except Exception as e:
                  print(e, file=sys.stderr)
                  sys.exit(1)

              chunk_size = 65536 * 16
              position = 0

              with open(file_name, 'r') as f:
                  while True:
                      chunk_start = position
                      if chunk_start > file_size - 1:
                          break
                      if chunk_start + chunk_size <= file_size:
                          f.seek(chunk_start + chunk_size - 1)
                          if f.read(1) == '\n':
                              # Chunk ends with linefeed
                              chunk_end = chunk_start + chunk_size
                              position += chunk_size
                          else:
                              # Include the rest of line
                              length = len(f.readline())
                              chunk_end = chunk_start + chunk_size + length
                              position += chunk_size + length
                      else:
                          position = chunk_end = file_size
                      yield [ file_name, chunk_start, chunk_end ]

          # Run chunks in parallel and tally count
          count = 0

          # # Map possibly slower due to overhead preserving order
          # with Pool(num_processes) as p:
          #     results = p.map(process_chunk, gen_offsets())
          #     count = sum(results)

          # Try imap_unordered when ordered results unnecessary
          with Pool(num_processes) as p:
              results = list(p.imap_unordered(process_chunk, gen_offsets()))
              count = sum(results)

          # # Try also, concurrent.futures
          # with ProcessPoolExecutor(max_workers=num_processes) as executor:
          #     futures = [executor.submit(process_chunk, params) \
          #                for params in gen_offsets()]
          #     for future in as_completed(futures):
          #         count += future.result()

          return count

      count = parallel_read(sys.argv[1])
      print(f"Found {count} matches.")

      Results

      Found 1999533 matches.

      1: 2.628s
      2: 1.342s
      3: 0.914s
      4: 0.703s
      5: 0.571s
      6: 0.482s
      7: 0.423s
      8: 0.375s
      9: 0.339s