hippo> I am intrigued by some of your s/// operations - perhaps you could confirm that these give your intended outputs?

Yes, you're right: the actual matches/substitutions are non-greedy. I just wanted to provide a simpler, tidier version of my ugly script, but the code structure is exactly the same.

Corion> Regardless of the performance problems, you may be interested in using a proper stemmer to create a search index. See Lingua::Stem.

I don't need a full stemming solution yet, and it might not be the ideal tool anyway, as I'd have to override numerous substitutions.

hv: Your hash lookup implementation runs about twice as fast (34" vs 1'05" for my here-doc regexes). Another difference is that it runs faster when operating on whole lines than on individual words. sed seems unbeatable at 6 seconds.
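
For anyone following along, here is a minimal sketch of the general hash-lookup idea as I understand it. It is not hv's actual code, and the `%replace` table and `substitute_line` name are made up for illustration (my real rule list is much longer). The point is that a single precompiled alternation plus one hash lookup per match replaces a long chain of separate s/// passes over the same text.

```perl
use strict;
use warnings;

# Hypothetical substitution table, for illustration only.
my %replace = (
    colour  => 'color',
    centre  => 'center',
    analyse => 'analyze',
);

# Build one alternation from the keys, longest first so that a longer
# key is tried before any shorter key it begins with.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) || $a cmp $b }
    keys %replace;

# Compile the combined pattern once, outside the per-line loop.
my $re = qr/\b($alternation)\b/;

# Apply every substitution to a line in a single pass.
sub substitute_line {
    my ($line) = @_;
    $line =~ s/$re/$replace{$1}/g;
    return $line;
}

print substitute_line("The colour of the centre"), "\n";
# prints "The color of the center"
```

This only covers fixed-string rules; substitutions that need real regex features (character classes, non-greedy quantifiers) would still have to be handled separately.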

AnomalousMonk> Here's something that may address your needs more closely. As always, the fine details of regex definition are critical. I still have no idea as to relative speed :)

I tested your solution last, but unfortunately it took 2'23" to complete. I'll do more tests over the next few days and report back with any progress. Thank you all for your wisdom.


In reply to Re^4: Need to speed up many regex substitutions and somehow make them a here-doc list by xnous
in thread Need to speed up many regex substitutions and somehow make them a here-doc list by xnous
