Ken, I really like your post++.

A couple of very, very minor nits, which I show in the code below:

  1. I think the fastest way to remove leading and trailing white space is to use 2 Perl statements, as in the code below, instead of $string =~ s/^\s+|\s+$//g or your my ($trim) = /^\s*(.*?)\s*$/;. The Perl documentation talks about this somewhere in the regex docs, but a quick search didn't turn it up, otherwise I would post a link. Anyway, the explanation goes that the regex engine works best with fixed anchors, and that 2 very easy regex statements run faster than a single more complex one.
  2. I split your $re statement into two parts to simplify the syntax. Creating an intermediate variable is very "cheap". I didn't benchmark, but your code creates an anon array which is then dereferenced, while my code only creates a scalar, which in general will be faster.
  3. I see no need at all to sort the search terms, so I didn't do that. The regex is going to match any of the 3 or'd "search phrases" no matter what their order in the regex is. (Order would only matter if one phrase were a prefix of another, because Perl takes the first alternative that matches; that is not the case with these phrases.) Changing the order in the regex will not necessarily result in any performance change at all. The OP's requirement "for a sorted order" makes no sense to me at all.
  4. I see some suggestions to use threads or other parallel processing strategies. It appears to me that this will be an I/O-bound application, so such complex things won't matter at all.
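
For anyone who wants to check point (1) on their own machine, here is a minimal benchmark sketch using the core Benchmark module. The sub names and the sample string are made up for this test, and the numbers will vary with Perl version and input, so treat the result as an indication rather than proof.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Sample input with leading and trailing white space (made up for this test).
my $sample = "   \t scooped up by   \n";

# Two simple anchored substitutions.
sub trim_two_step {
    my ($s) = @_;
    $s =~ s/^\s+//;
    $s =~ s/\s+$//;
    return $s;
}

# One substitution with an alternation, applied globally.
sub trim_alternation {
    my ($s) = @_;
    $s =~ s/^\s+|\s+$//g;
    return $s;
}

# Capture everything between optional leading and trailing white space.
sub trim_capture {
    my ($s) = @_;
    my ($trim) = $s =~ /^\s*(.*?)\s*$/;
    return $trim;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    two_step    => sub { trim_two_step($sample) },
    alternation => sub { trim_alternation($sample) },
    capture     => sub { trim_capture($sample) },
});
```

All three subs return the same trimmed string for this input; only the speed differs.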

Having said the above, none of these points makes a darn bit of difference in this case. I made this post because point (1) has relevance beyond this OP's question. For performance: the "setup" won't matter much because it is done once. After that, Read Line, Run Regex, Print Line is about as fast as this usually gets without complicated heroics.

Another Monk asked about the OP's purpose. Sometimes a post is just an academic question, but it sounds like there is some real application here that we don't understand. The only reason to put these "markers" into the text is for later processing. Maybe that processing, whatever it is, can be combined into a single step? That could lead to a big speed increase, since the second processing step will have to search the entire text to find the bbb markers yet again.

#!/usr/bin/env perl
use strict;
use warnings;
use Inline::Files;

my %seq; # example: 'scooped up again' => 'scoopedbbbupbbbagain',
while (my $line = <SEQ>)
{
    $line =~ s/^\s+//;
    $line =~ s/\s+$//;
    ($seq{$line} = $line) =~ s/\h+/bbb/g;
}

my $search_phrases = join '|', keys %seq;
my $re = qr{($search_phrases)};

while (<TXT>)
{
    s/$re/$seq{$1}/g;
    print;
}

__SEQ__
scooped up by
social travesty
without proper sanitation
__TXT__
Many of them are scooped up by chambermaids, thrown into bin bags and sent off to landfill sites, which is a disaster for the environment and a social travesty given that many people around the world are going without proper sanitation.
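
A side note on point (3), as a hedged sketch rather than a correction: with the OP's three phrases the order is irrelevant, but if one phrase happened to be a prefix of another, Perl would take whichever alternative comes first in the regex. The phrase list, build_re() and mark() below are made up for illustration, and the quotemeta call is my own addition to guard against regex metacharacters in the phrases.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Build a capturing alternation from the phrases, optionally longest first.
# build_re() is a hypothetical helper for this sketch only.
sub build_re {
    my ($longest_first, @phrases) = @_;
    @phrases = sort { length($b) <=> length($a) } @phrases if $longest_first;
    my $alternation = join '|', map { quotemeta } @phrases;
    return qr{($alternation)};
}

# Replace horizontal white space inside each matched phrase with 'bbb'.
sub mark {
    my ($re, $text) = @_;
    $text =~ s{$re}{ (my $m = $1) =~ s/\h+/bbb/g; $m }ge;
    return $text;
}

my @phrases = ('scooped up', 'scooped up by');   # the second extends the first
my $text    = 'Many of them are scooped up by chambermaids.';

# Shorter phrase first: the longer phrase never gets a chance to match.
print mark(build_re(0, @phrases), $text), "\n";
# prints: Many of them are scoopedbbbup by chambermaids.

# Longest phrase first: the whole phrase is marked.
print mark(build_re(1, @phrases), $text), "\n";
# prints: Many of them are scoopedbbbupbbbby chambermaids.
```

So sorting by length only buys something when the phrase list contains such overlaps.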

In reply to Re^2: script optmization by Marshall
in thread script optmization by shoura
