It's a simple start, at least.

Non-alphabetic marks in text (in English, at least) will always produce some boundary cases that are really hard or basically impossible to handle with a straightforward, procedural algorithm (and on top of that, people who create text tend to make mistakes or ignore "rules" of style).

For the current task, there's the problem of the possessive apostrophe without a following "s" (because the word ends in "s") -- and sometimes, punctuation will follow a close-quote (even though style manuals say it shouldn't). Here's a worst case for you:

'You've got to talk to Miles' brother', she said.

Easy for humans, hard for programs. There is a regex that will treat this one correctly:

s/ '(.*)'(\W)/ "$1"$2/; # note the greedy use of ".*"

but it will screw up on some other case that would need a non-greedy match, like:

When he said 'kiss the sky,' I heard 'kiss this guy.'

You just have to guess which sort of mistake will happen less often (and hope your data isn't really this bad).
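To see the trade-off side by side, here is a rough sketch that runs both flavors over the two sentences above. The sub names are made up for illustration, the non-greedy variant (".*?" plus /g) is my guess at what the alternative would look like, and each sample is wrapped in a single space on both ends so that the leading space and the trailing \W in the pattern have something to match.

```perl
use strict;
use warnings;

# Hypothetical helper: the greedy regex from above, applied once.
sub requote_greedy {
    my ($text) = @_;
    $text =~ s/ '(.*)'(\W)/ "$1"$2/;
    return $text;
}

# Hypothetical helper: an assumed non-greedy variant, applied globally
# so that several short quotes in one line can each be converted.
sub requote_nongreedy {
    my ($text) = @_;
    $text =~ s/ '(.*?)'(\W)/ "$1"$2/g;
    return $text;
}

# Both samples carry one space of padding at each end.
my @samples = (
    " 'You've got to talk to Miles' brother', she said. ",
    " When he said 'kiss the sky,' I heard 'kiss this guy.' ",
);

for my $sample (@samples) {
    print "greedy:    ", requote_greedy($sample),    "\n";
    print "nongreedy: ", requote_nongreedy($sample), "\n";
}
```

The greedy version gets the Miles sentence right but swallows everything between the first and last quote marks in the second sentence; the non-greedy version fixes the second sentence but closes the quote after "Miles" in the first. Neither one is right for both.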

One other hint: for stuff like this, where initial and final positions in the string might make things more complicated, it's okay to "cheat" a little: add a space or some other "safe" character at the beginning and end of the string before working on the quotes, so that the edge cases can be treated just like the non-edge cases. You can take the edge padding off when you're done.
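A minimal sketch of that padding trick, assuming a single space as the "safe" character and reusing the greedy regex from above (the sub name is just something I made up):

```perl
use strict;
use warnings;

# Hypothetical helper: pad, convert the quotes, then remove the padding.
sub requote_padded {
    my ($text) = @_;
    $text = " $text ";                  # add one space at each end
    $text =~ s/ '(.*)'(\W)/ "$1"$2/;    # greedy regex from the post
    $text =~ s/^ //;                    # take the leading pad off
    $text =~ s/ $//;                    # take the trailing pad off
    return $text;
}

print requote_padded("'You've got to talk to Miles' brother', she said."), "\n";
print requote_padded("'kiss this guy'"), "\n";
```

Without the padding, neither of these strings would match at all: the first has no space before its opening quote, and the second has nothing after its closing quote for \W to match.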


In reply to Re^3: using substitution and pattern matching by graff
in thread using substitution and pattern matching by Anonymous Monk
