Recently, I tried diffing two large files, only for the UNIX diff command to choke the OS. That's because the diff utility slurps both files into memory, doubling memory consumption.

So I thought I'd provide chunking variants in Perl and Python and measure the time each takes on a 99 MB file.

Perl

#!/usr/bin/env perl

#use v5.20;
#use feature qw(signatures);
#no warnings qw(experimental::signatures);

use v5.36;
use autodie;

exit 1 if not @ARGV;

sub read_file ($fh, $chunk_size=65536) {
    # Return the next chunk, including to the end of line.
    read($fh, my $chunk, $chunk_size);
    if (length($chunk) && substr($chunk, -1) ne "\n") {
        return $chunk if eof($fh);
        $chunk .= readline($fh);
    }
    return $chunk;
}

my $mul_pattern = 'mul\(\d{1,3},\d{1,3}\)';
my $filename = shift;
my $count = 0;

if (open(my $fh, '<', $filename)) {
    while (length(my $chunk = read_file($fh))) {
        $count += () = $chunk =~ m/$mul_pattern/g;
    }
}

print "Found $count matches.\n";

Python

#!/usr/bin/env python

import re, sys

if len(sys.argv) < 2:
    sys.exit(1)

def read_file (file, chunk_size=65536):
    """
    Lazy function generator to read a file in chunks,
    including to the end of line.
    """
    while True:
        chunk = file.read(chunk_size)
        if not chunk:
            break
        if not chunk.endswith('\n'):
            chunk += file.readline()
        yield chunk

mul_re = re.compile(r"mul\(\d{1,3},\d{1,3}\)")
filename = sys.argv[1]
count = 0

try:
    with open (filename, "r") as file:
        for chunk in read_file(file):
            count += len(mul_re.findall(chunk))
except Exception as e:
    print(e, file=sys.stderr)
    sys.exit(1)

print(f"Found {count} matches.")
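Why read to the end of the line? A fixed-size read can cut a match in half at the chunk boundary, and the regex then misses it. Here is a minimal sketch of that effect, using an in-memory string and a deliberately tiny chunk size. The sample data and the `naive_chunks` and `count_matches` helpers are made up for illustration; `read_file` is the same generator as above.

```python
import io
import re

def read_file(file, chunk_size=65536):
    """Yield chunks of about chunk_size characters, each extended to the end of its line."""
    while True:
        chunk = file.read(chunk_size)
        if not chunk:
            break
        if not chunk.endswith("\n"):
            chunk += file.readline()
        yield chunk

def naive_chunks(file, chunk_size=65536):
    """Hypothetical comparison: fixed-size chunks with no line extension."""
    while True:
        chunk = file.read(chunk_size)
        if not chunk:
            break
        yield chunk

mul_re = re.compile(r"mul\(\d{1,3},\d{1,3}\)")

# Made-up sample: a 12-character read ends in the middle of "mul(11,8)".
data = "xmul(2,4)\nmul(11,8)junk\nmul(3,7)\n"

def count_matches(chunker, chunk_size):
    """Count regex matches chunk by chunk, as the scripts above do."""
    return sum(len(mul_re.findall(c)) for c in chunker(io.StringIO(data), chunk_size))

line_aligned = count_matches(read_file, 12)
fixed_size = count_matches(naive_chunks, 12)
print(line_aligned, fixed_size)  # prints "3 2"
```

The line-aligned reader finds all three matches; the fixed-size reader loses the one that straddles the boundary. This only holds as long as a match never spans a newline, which is true for this pattern.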

Results

Perl    0.463s  Found 200246 matches.
Python  0.250s  Found 200246 matches.

In reply to Re: Python regex faster than Perl? - Chunking by marioroy
in thread Python regex faster than Perl? by dave93
