Suppose I have an input file that looks like this:
This is just lines of text here, and also there. Consider this human readable text; it's full of letters and punctuation.
The input file could be several hundred lines. Now also suppose I have a hash table containing entries like:
my $table = {
    "lines_of_text" => "foo.html",
    "this"          => "bar.html",
    "its_full"      => "foobar.html",
};
There could be as many as 5000 entries in this table. For each entry in the hash table, we want to find the first segment of input data that maps to the key and wrap it in the appropriate HTML link. So this:
... just lines of text ...
turns into this:
... just <a href="foo.html">lines of text</a> ...
Of course, we link the initial "This" but not the later "this" starting line 2.

I only see two ways of solving this problem, and both of them are extremely inelegant. I could either write a massive regular expression that ORs together all 5000 keys, or I could walk through the file one character at a time and check whether any key starts at that point. Neither solution seems acceptable.

The trickiness in this problem comes from the fact that each hash key has to match many possible input forms. For example, if I had the key "abc", I should match "ABC", "AB'C", and "A,B'C", but not "A B C". I could apply some massive substitution to the input data set, but because some characters are deleted (like the comma and the apostrophe), there isn't an easy translation between a character index in the original data set and a character index in the translated data set.
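One way to get around the index problem is to record, while normalizing, which original offset each normalized character came from. The following is only a minimal sketch under assumptions the post does not confirm: keys are lowercase, whitespace runs in the input correspond to "_" in a key, every other non-alphanumeric character is dropped, and overlapping matches are resolved by keeping the leftmost one. The names normalize_with_map and link_first_matches are hypothetical.

```perl
use strict;
use warnings;

# Normalize text into the assumed key format and remember, for every
# normalized character, the offset of the original character it came from.
sub normalize_with_map {
    my ($text) = @_;
    my $norm = '';
    my @map;
    while ($text =~ /\G(?:(\s+)|([A-Za-z0-9])|.)/gs) {
        my $start = $-[0];                # offset of this token in $text
        if (defined $1) {                 # whitespace run becomes one "_"
            $norm .= '_';
            push @map, $start;
        }
        elsif (defined $2) {              # letters and digits, lowercased
            $norm .= lc $2;
            push @map, $start;
        }
        # anything else (comma, apostrophe, ...) is dropped
    }
    return ($norm, \@map);
}

# Wrap the first match of each key in a link, editing the ORIGINAL text.
sub link_first_matches {
    my ($text, $table) = @_;
    my ($norm, $map) = normalize_with_map($text);

    my @hits;                             # [orig_start, orig_end, url]
    for my $key (keys %$table) {
        next unless length $key;
        # Require "_" or a string edge on both sides of the key.
        next unless $norm =~ /(?:^|_)(\Q$key\E)(?=_|\z)/;
        push @hits, [ $map->[ $-[1] ], $map->[ $+[1] - 1 ] + 1, $table->{$key} ];
    }

    # Keep the leftmost hit when two hits overlap (an assumed policy).
    my @keep;
    my $last_end = 0;
    for my $h (sort { $a->[0] <=> $b->[0] || $b->[1] <=> $a->[1] } @hits) {
        next if $h->[0] < $last_end;
        push @keep, $h;
        $last_end = $h->[1];
    }

    # Replace from right to left so earlier offsets stay valid.
    for my $h (reverse @keep) {
        my ($s, $e, $url) = @$h;
        my $segment = substr($text, $s, $e - $s);
        substr($text, $s, $e - $s, qq{<a href="$url">$segment</a>});
    }
    return $text;
}

my $table = {
    "lines_of_text" => "foo.html",
    "this"          => "bar.html",
    "its_full"      => "foobar.html",
};
print link_first_matches("This is just lines of text here; it's full of text.", $table), "\n";
```

This still runs one search per key, so 5000 keys mean 5000 passes over the normalized text. For a file of a few hundred lines that may well be tolerable, but it is a design trade-off rather than a claim about performance.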

What ideas do other Monks have for solving this problem?

-Ted

In reply to Mass Text Replacement by tedv
