Suppose I have an input file that looks like this:
This is just lines of text here, and also there. Consider
this human readable text; it's full of letters and
punctuation.
The input file could be several hundred lines. Now
also suppose I have a hash table containing entries like:
my $table = {
    "lines_of_text" => "foo.html",
    "this"          => "bar.html",
    "its_full"      => "foobar.html",
};
There could be as many as 5000 entries in this
table. For each entry in the hash table, I want to find
the first segment of input data that maps to the key
and wrap it in the appropriate HTML link. So this:
... just lines of text ...
turns into this:
... just <a href="foo.html">lines of text</a> ...
Of course, we link the initial "This" but not the "this" starting line 2.
I only see two ways of solving this problem, and both of
them are extremely inelegant. I could either write a
massive regular expression or'ing together all of the 5000
keys, or I could search through the file one letter at a
time and see if any keys start at that point. Clearly
both of these solutions are unacceptable.
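For reference, the first option might look roughly like the sketch below. It assumes that an underscore in a key stands for whitespace and that punctuation may appear between any two letters; both rules are guesses from the sample data. The names key_pattern and link_with_alternation are made up for illustration.

```perl
use strict;
use warnings;

# Turn a key such as "its_full" into a regex fragment that tolerates
# punctuation between letters and whitespace where the key has "_".
# (Assumed matching rules, inferred from the examples only.)
sub key_pattern {
    my ($key) = @_;
    my $punct = '[[:punct:]]*';
    my @words;
    for my $word (split /_/, $key) {
        push @words, join $punct, map { quotemeta($_) } split //, $word;
    }
    return join "$punct\\s+$punct", @words;
}

# Link only the first occurrence of each key, using one big alternation.
sub link_with_alternation {
    my ($text, $table) = @_;

    # Longer keys first, so they win when two keys start at the same spot.
    my $alt = join '|',
        map  { key_pattern($_) }
        sort { length($b) <=> length($a) } keys %$table;

    my %seen;
    $text =~ s{\b($alt)\b}{
        my $hit = $1;
        # Recover the key from the matched text: lowercase, drop
        # punctuation, collapse whitespace to "_".
        my $key = lc $hit;
        $key =~ s/[^a-z0-9\s]//g;
        $key =~ s/\s+/_/g;
        (exists($table->{$key}) && !$seen{$key}++)
            ? qq{<a href="$table->{$key}">$hit</a>}
            : $hit;
    }gie;
    return $text;
}

my %table = (
    lines_of_text => 'foo.html',
    this          => 'bar.html',
    its_full      => 'foobar.html',
);
print link_with_alternation("This is just lines of text here.\n", \%table);
```

With 5000 keys the alternation gets very large, which is exactly the inelegance I am worried about; the sketch only shows what that approach involves.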
The trickiness in this problem comes from the fact that each
hash table key has to match many possible input formats.
For example, if I had the key "abc", I should translate
"ABC", "AB'C", and "A,B'C", but not "A B C". I could
apply some massive substitution to the input data set, but
because some characters are deleted (like commas and apostrophes),
there isn't an easy translation between character index in the
original data set and character index in the translated data set.
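To make the index problem concrete, here is a minimal sketch of one possible workaround: while building the translated text, record the original offset of every character that is kept. The rules are assumptions inferred from the examples (letters and digits are lowercased, a run of whitespace becomes one "_", everything else is dropped), and normalize_with_map and link_first are hypothetical names.

```perl
use strict;
use warnings;

# Build the translated text plus a map from each translated character
# back to its offset in the original text.
sub normalize_with_map {
    my ($text) = @_;
    my $norm = '';
    my @map;
    for my $i (0 .. length($text) - 1) {
        my $ch = substr $text, $i, 1;
        if ($ch =~ /[A-Za-z0-9]/) {
            $norm .= lc $ch;
            push @map, $i;
        }
        elsif ($ch =~ /\s/ && $norm ne '' && substr($norm, -1) ne '_') {
            $norm .= '_';          # one "_" per run of whitespace
            push @map, $i;
        }
        # anything else (comma, apostrophe, ...) is deleted: no map entry
    }
    return ($norm, \@map);
}

# Link the first place where $key occurs in the translated text,
# using the map to find the matching span in the original.
sub link_first {
    my ($text, $key, $url) = @_;
    my ($norm, $map) = normalize_with_map($text);

    # The key must not be glued to other letters or digits.
    return $text unless $norm =~ /(?<![a-z0-9])\Q$key\E(?![a-z0-9])/;

    my $start = $map->[ $-[0] ];
    my $end   = $map->[ $+[0] - 1 ] + 1;   # one past the last original char
    my $hit   = substr $text, $start, $end - $start;
    substr($text, $start, $end - $start) = qq{<a href="$url">$hit</a>};
    return $text;
}

print link_first("x AB'C y", 'abc', 'abc.html'), "\n";
```

This version re-translates the text for every key, so for 5000 keys it would make more sense to translate once and reuse the map; it also ignores the question of overlapping links.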
What ideas do other Monks have for solving this problem?
-Ted