Re: Python regex faster than Perl?

Recently, I tried diffing two large files only for the UNIX diff command to choke the OS. That's because the diff utility slurps both files, requiring 2x memory consumption.

I thought to provide chunking variants and measure the time taken for a 99 MB file.

Perl

#!/usr/bin/env perl
#use v5.20;
#use feature qw(signatures);
#no warnings qw(experimental::signatures);

use v5.36;
use autodie;

exit 1 if not @ARGV;

sub read_file ($fh, $chunk_size=65536) {
    # Return the next chunk, including to the end of line.
    read($fh, my $chunk, $chunk_size);
    if (length($chunk) && substr($chunk, -1) ne "\n") {
        return $chunk if eof($fh);
        $chunk .= readline($fh);
    }
    return $chunk;
}

my $mul_pattern = 'mul\(\d{1,3},\d{1,3}\)';
my $filename = shift;
my $count = 0;

if (open(my $fh, '<', $filename)) {
    while (length(my $chunk = read_file($fh))) {
        $count += () = $chunk =~ m/$mul_pattern/g;
    }
}

print "Found $count matches.\n";
[download]

Python

#!/usr/bin/env python
import re, sys

if len(sys.argv) < 2: sys.exit(1)

def read_file (file, chunk_size=65536):
    """
    Lazy function generator to read a file in chunks,
    including to the end of line.
    """
    while True:
        chunk = file.read(chunk_size)
        if not chunk:
            break
        if not chunk.endswith('\n'):
            chunk += file.readline()
        yield chunk

mul_re = re.compile(r"mul\(\d{1,3},\d{1,3}\)")
filename = sys.argv[1]
count = 0

try:
    with open (filename, "r") as file:
        for chunk in read_file(file):
            count += len(mul_re.findall(chunk))
except Exception as e:
    print(e, file=sys.stderr)
    sys.exit(1)

print(f"Found {count} matches.")
[download]

Results

Perl   0.463s
Found 200246 matches.

Python 0.250s
Found 200246 matches.
[download]

Comment on Re: Python regex faster than Perl? - Chunking Select or Download Code

Replies are listed 'Best First'.
Re^2: Python regex faster than Perl? - Chunking 1 GB by marioroy (Prior) on Mar 27, 2025 at 07:53 UTC
Here I try 10x the input size i.e. ~ 1 GB input file. `# choroba's input generator, 990 MB # https://perlmonks.org/?node_id=11164445 # perl gen_input.pl > big use strict; use warnings; for (1..100_000_000) { print int(rand 2) ? "xyzabcd" : ("mul(" . int(rand 5000) . "," . int(rand 5000) . ")"); print "\n" unless int rand 10; }` [download] Non-chunking consumes greater than 1 GB of memory ~ 1.1 GB. Chunking consumes significantly less ~ 10 MB. Non-chunking: > 1 GB memory `Perl 4.395s Found 1999533 matches. Python 2.262s Found 1999533 matches.` [download] Chunking: ~ 10 MB memory `Perl 4.422s Found 1999533 matches. Python 2.247s Found 1999533 matches.` [download]	[reply] [d/l] [select]
Re^3: Python regex faster than Perl? - Chunking 1 GB parallel by marioroy (Prior) on Mar 27, 2025 at 08:31 UTC
Perl parallel demonstration Seeing Python win by 2x made me consider a parallel variant using MCE. But more so to find out if this will scale increasing the number of workers. #!/usr/bin/env perl # time NUM_THREADS=3 perl pcount.pl big use v5.36; use autodie; use MCE; exit 1 if not @ARGV; my $mul_pattern = 'mul$\d{1,3},\d{1,3}$'; my $filename = shift; my $count = 0; sub reduce_count ($worker_count) { $count += $worker_count; } my $mce = MCE->new( max_workers => $ENV{NUM_THREADS} // MCE::Util::get_ncpu(), chunk_size => 6553616, use_slurpio => 1, gather => \&reduce_count, user_func => sub { my ($mce, $slurp_ref, $chunk_id) = @_; my $count = () = $$slurp_ref =~ m/$mul_pattern/g; $mce->gather($count); } )->spawn; $mce->process({ input_data => $filename }); $mce->shutdown; print "Found $count matches.\n"; [download] This calls for slurpio for best performance. No line by line processing behind the scene. The MCE gather option is set to a reduce function to tally the counts. And, increased chunk_size to reduced IPC among the workers. The input file is read serially. Results* `Found 1999533 matches. 1: 4.420s 2: 2.263s needs 2 workers to reach Python performance 3: 1.511s 4: 1.154s 5: 0.940s 6: 0.788s 7: 0.680s 8: 0.600s 9: 0.538s` [download] Python parallel demonstration Now I wonder about parallel in Python. We can reuse the chunk function introduced in the prior example. #!/usr/bin/env python # time NUM_THREADS=3 python pcount.py big import os, re, sys from multiprocessing import Pool, cpu_count if len(sys.argv) < 2: sys.exit(1) def read_file (file, chunk_size=6553616): """ Lazy function generator to read a file in chunks, including to the end of line. """ while True: chunk = file.read(chunk_size) if not chunk: break if not chunk.endswith('\n'): chunk += file.readline() yield chunk def process_chunk(chunk): """ Worker function to process chunks in parallel. """ count = len(mul_re.findall(chunk)) return count mul_re = re.compile(r"mul$\d{1,3},\d{1,3}$") num_processes = int(os.getenv('NUM_THREADS') or cpu_count()) p = Pool(num_processes) file_name = sys.argv[1] try: with open (file_name, "r") as file: results = p.map(process_chunk, read_file(file)) p.close() p.join() except Exception as e: print(e, file=sys.stderr) sys.exit(1) print(f"Found {sum(results)} matches.") [download] Results* `Found 1999533 matches. 1: 3.131s 2: 1.824s 3: 1.408s 4: 1.178s 5: 1.187s 6: 1.187s 7: 1.172s 8: 1.008s 9: 0.995s` [download]	[reply] [d/l] [select]
Re^4: Python regex faster than Perl? - Chunking 1 GB parallel by marioroy (Prior) on Mar 27, 2025 at 14:19 UTC
Yet, another attempt with MCE-like chunking for more apple-to-apple comparison. Workers seek to offset position and slurp the whole chunk. Last updated on March 28, 2025. Python parallel demonstration #!/usr/bin/env python # time NUM_THREADS=3 python pcount2.py big import os, re, sys from concurrent.futures import ProcessPoolExecutor, as_completed from multiprocessing import Pool, cpu_count, Lock if len(sys.argv) < 2: sys.exit(1) mul_re = re.compile(r"mul$\d{1,3},\d{1,3}$") lock = Lock() def process_chunk(params): """ Worker function to process chunks in parallel. Read IO is serial to not cause a denial of service for SAN storage. Comment out locking for parallel IO. """ file_name, chunk_start, chunk_end = params with open (file_name, "r") as f: f.seek(chunk_start) # lock.acquire() chunk = f.read(chunk_end - chunk_start) # lock.release() count = len(mul_re.findall(chunk)) return count def parallel_read(file_name): num_processes = int(os.getenv('NUM_THREADS') or cpu_count()) def gen_offsets(): # Emit next offset input [ file_name, chunk_start, chunk_end ] try: file_size = os.path.getsize(file_name) except Exception as e: print(e, file=sys.stderr); sys.exit(1) chunk_size = 65536 * 16 position = 0 with open(file_name, 'r') as f: while True: chunk_start = position if chunk_start > file_size - 1: break if chunk_start + chunk_size <= file_size: f.seek(chunk_start + chunk_size - 1) if f.read(1) == '\n': # Chunk ends with linefeed chunk_end = chunk_start + chunk_size position += chunk_size else: # Include the rest of line length = len(f.readline()) chunk_end = chunk_start + chunk_size + length position += chunk_size + length else: position = chunk_end = file_size yield [ file_name, chunk_start, chunk_end ] # Run chunks in parallel and tally count count = 0 # # Map possibly slower due to overhead preserving order # with Pool(num_processes) as p: # results = p.map(process_chunk, gen_offsets()) # count = sum(results) # Try imap_unordered when ordered results unnecessary with Pool(num_processes) as p: results = list(p.imap_unordered(process_chunk, gen_offsets())) count = sum(results) # # Try also, concurrent.futures # with ProcessPoolExecutor(max_workers=num_processes) as executor: # futures = [executor.submit(process_chunk, params) \ # for params in gen_offsets()] # for future in as_completed(futures): # count += future.result() return count count = parallel_read(sys.argv[1]) print(f"Found {count} matches.") [download] Results `Found 1999533 matches. 1: 2.628s 2: 1.342s 3: 0.914s 4: 0.703s 5: 0.571s 6: 0.482s 7: 0.423s 8: 0.375s 9: 0.339s` [download]	[reply] [d/l] [select]