sed can take several substitution expressions at once, so you do not have to pipe each substitution result into the next sed: sed 's/ need.* / need /gi' | sed 's/ .*meant.* / mean /gi' can become sed 's/ need.* / need /gi;s/ .*meant.* / mean /gi'. That saves one process and one pipe per substitution, which may speed up IO.
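A minimal sketch of that equivalence, assuming GNU sed (the i flag is a GNU extension) and a made-up sample line:

```shell
# Sample input; the text is made up for illustration.
input='I will needing that now and it was meant to be'

# Two sed processes connected by a pipe.
piped=$(printf '%s\n' "$input" | sed 's/ need.* / need /gi' | sed 's/ .*meant.* / mean /gi')

# One sed process applying both substitutions, in order, to each line.
combined=$(printf '%s\n' "$input" | sed 's/ need.* / need /gi;s/ .*meant.* / mean /gi')

printf 'piped:    %s\ncombined: %s\n' "$piped" "$combined"
```

Both forms apply the substitutions in the same order to each line, so the results should be identical; only the number of processes differs.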

For both Perl and bash/sed: IO can be improved by creating a ramdisk and placing the input and output files there, if you intend to process them multiple times. Better still, if the files are created by other processes, you can create them straight in the ramdisk, process them there and then transfer the results to more permanent storage. In Linux this is as easy as (as root, with the mount point already created): mount -t tmpfs -o size=2g tmpfs /mnt/ramdisk1
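A rough sketch of that workflow which needs no root: it assumes /dev/shm is available (it is a tmpfs on most Linux systems) and falls back to an ordinary temp directory otherwise. The file names and the regex-demo prefix are made up for the demo.

```shell
# Pick a RAM-backed work directory. Assumption: /dev/shm is a tmpfs;
# the fallback is a plain temp dir, in which case nothing lives in RAM.
if [ -d /dev/shm ] && [ -w /dev/shm ]; then
    work=$(mktemp -d /dev/shm/regex-demo.XXXXXX)
else
    work=$(mktemp -d)
fi
dest=$(mktemp -d)   # stands in for the permanent store

# Create the input directly in the RAM-backed directory.
printf '%s\n' 'we needing more tests' > "$work/in.txt"

# Process it there; input and output both stay in memory.
sed 's/ need.* / need /gi' "$work/in.txt" > "$work/out.txt"

# Move the result to permanent storage and release the RAM.
mv "$work/out.txt" "$dest/out.txt"
rm -rf "$work"
cat "$dest/out.txt"
```

With a dedicated mount such as /mnt/ramdisk1 you would point work at that directory instead.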

If all your files already live on a single physical hard disk, then parallelising their processing (which implies parallelising the IO) will show little improvement or, most likely, degradation. However, you may see some improvement by implementing a pipeline: one process copies files into the ramdisk sequentially, while the other processes work in parallel on any files they find there. I assume that memory IO can be parallelised better than hard disk IO (but I am a long way behind on what modern OSes and CPUs can do, or it could be that MCE can work some magic with IO, so take this advice with a grain of salt).
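The pipeline idea could be sketched roughly as follows. Assumptions: xargs supports -0 and -P (GNU and BSD versions do), two temp directories stand in for the hard disk and the ramdisk, and the sample files and the worker count of 4 are made up.

```shell
src=$(mktemp -d)    # stands in for the hard disk
ram=$(mktemp -d)    # stands in for the ramdisk mount point
for i in 1 2 3 4; do
    printf '%s\n' "file $i needing work" > "$src/f$i.txt"
done

# Stage 1 copies files one at a time (sequential disk reads) and announces
# each finished copy on stdout. Stage 2 (xargs -P) runs up to 4 sed
# processes at once on the announced files.
for f in "$src"/*.txt; do
    cp "$f" "$ram/" && printf '%s\0' "$ram/${f##*/}"
done | xargs -0 -n 1 -P 4 sh -c 'sed "s/ need.* / need /gi" "$1" > "$1.out"' _

cat "$ram"/*.out
```

A file name is only announced after its copy has finished, so a worker never picks up a half-copied file.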

Also, in a recent discussion here the issue came up that a threaded perl interpreter can be 10 to 30% slower than a non-threaded one. So, if you do not need threads, that is a possible way to speed things up (check your perl interpreter's compilation flags with: perl -V and look for useithreads=define).
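A small sketch of that check. It uses perl -V:useithreads, which prints just that one setting; the wording of the three result strings is my own.

```shell
# perl -V:NAME prints a single build setting, e.g. useithreads='define';
if ! command -v perl >/dev/null 2>&1; then
    threads=unknown            # no perl found on PATH
elif perl -V:useithreads | grep -q "'define'"; then
    threads=threaded
else
    threads=non-threaded
fi
echo "perl build: $threads"
```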

This is an interesting problem to optimise because even small optimisations can lead to huge savings over your thousands to millions of files. So, I would start by benchmarking a few options on, say, 20 files: sed, sed+ramdisk, perl+ramdisk, pipeline, etc. Then you will be more confident about where to put your programming effort, or whether it is worth investing in learning new skills like MCE.
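A bare-bones sketch of such a benchmark, comparing only the piped and the combined sed forms. Assumptions: GNU date (for %N nanoseconds) and seq are available, and the 20 sample files are made up; real timings need your real files and several runs.

```shell
dir=$(mktemp -d)
# 20 small sample files; contents are made up for the benchmark.
for i in $(seq 1 20); do
    printf '%s\n' "line $i needing edits" "it was meant to work" > "$dir/f$i.txt"
done

# Wall-clock time in nanoseconds (date +%s%N is a GNU extension).
now() { date +%s%N; }

t0=$(now)
for f in "$dir"/*.txt; do
    sed 's/ need.* / need /gi' "$f" | sed 's/ .*meant.* / mean /gi'
done > "$dir/piped.out"
t1=$(now)
for f in "$dir"/*.txt; do
    sed 's/ need.* / need /gi;s/ .*meant.* / mean /gi' "$f"
done > "$dir/combined.out"
t2=$(now)

echo "piped:    $(( (t1 - t0) / 1000000 )) ms"
echo "combined: $(( (t2 - t1) / 1000000 )) ms"
cmp -s "$dir/piped.out" "$dir/combined.out" && echo "outputs match"
```

Comparing the outputs as well as the timings guards against an option that is faster only because it does something different.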

bw, bliako


In reply to Re^3: Need to speed up many regex substitutions and somehow make them a here-doc list by bliako
in thread Need to speed up many regex substitutions and somehow make them a here-doc list by xnous
