First, try to pick a delimiter for your regex that doesn't force you to escape a lot of extra characters. More importantly, your regex has several mistakes. Trying to match "anything except this multi-character sequence" is almost always done wrong, even by top experts (at least a few times), so that isn't a big surprise. (:

For example, [^\/][^\/]*? is really just the same as [^/]+?, and so avoids matching single slashes rather than avoiding double slashes, as the construct hints at. So (?:(?:[^\/][^\/]*?|)*? boils down to ([^/]*?)*?, which is just an inefficient way of writing [^/]*?.
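A quick way to check that equivalence, as a minimal sketch: anchor both forms at each end and compare them on a few inputs. The helper name and the sample strings are invented for illustration.

```perl
use strict;
use warnings;

# Returns (long_form_matches, short_form_matches) for a whole-string match.
# The helper name is hypothetical, used only for this illustration.
sub whole_string_matches {
    my ($s) = @_;
    my $long  = $s =~ m{\A[^/][^/]*?\z} ? 1 : 0;
    my $short = $s =~ m{\A[^/]+?\z}     ? 1 : 0;
    return ($long, $short);
}

# Sample inputs are invented for illustration.
for my $s ('abc', 'a/b', 'a//b', '/', '') {
    my ($long, $short) = whole_string_matches($s);
    printf "%-8s long=%d short=%d\n", "'$s'", $long, $short;
}
```

The two columns agree on every sample, and 'a/b' is rejected just like 'a//b': a single slash is already enough to stop either form.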

("|').*?\2 will stop matching too early for "This \"string\" with quotes" but can also "backtrack" and match too much. You really want to force this construct to match only complete quoted strings, treating a backslash plus the character after it as a unit. So use '([^'\\]+|\\.)*'|"([^"\\]+|\\.)*" instead.
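Both failure modes are easy to reproduce. The following is a minimal sketch, assuming standalone patterns (so the quote group is \1 rather than \2, and the inner groups are made non-capturing); first_match is a hypothetical helper and the sample strings are invented.

```perl
use strict;
use warnings;

# Returns the text of the first match of $re in $text, or undef.
# Hypothetical helper for this illustration.
sub first_match {
    my ($re, $text) = @_;
    return $text =~ $re ? substr($text, $-[0], $+[0] - $-[0]) : undef;
}

# \1 here because the quote is the first group in this standalone pattern.
my $naive  = qr/("|').*?\1/;
my $strict = qr/'(?:[^'\\]+|\\.)*'|"(?:[^"\\]+|\\.)*"/;

my $escaped = q{say "This \"string\" with quotes";};
print first_match($naive,  $escaped), "\n";   # stops at the first escaped quote
print first_match($strict, $escaped), "\n";   # the whole quoted string

# Followed by more pattern, the naive form can stretch across two strings.
my $two = q{x = "a" + "b";};
print first_match(qr/("|').*?\1;/, $two), "\n";   # "a" + "b";
print first_match(qr/$strict;/,    $two), "\n";   # "b";
```

The strict form can only ever consume one properly terminated string, so there is nothing for backtracking to stretch.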

Your "stuff I don't care about" needs to avoid matching quotes or slashes so that you don't just skip over a starting quote as "something I don't care about". So your regex needs something like ([^'"/]+|$quotes|...)*.

And a tricky part is the "skip over / but not over //". Something like (?<!/)/(?!/) does that: it matches a slash that is neither preceded nor followed by another slash.
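To see what those two lookarounds accept, here is a small sketch that lists the offsets of every lone slash in a few made-up strings. The sub name lone_slash_offsets is hypothetical.

```perl
use strict;
use warnings;

# A slash that is neither preceded nor followed by another slash.
my $lone_slash = qr{(?<!/)/(?!/)};

# Returns the offsets of every lone slash in $s (hypothetical helper).
sub lone_slash_offsets {
    my ($s) = @_;
    my @offsets;
    push @offsets, $-[0] while $s =~ /$lone_slash/g;
    return @offsets;
}

# Sample inputs are invented for illustration.
for my $s ('a / b', 'a // b', 'a /// b', 'x/y//z') {
    printf "%-10s => [%s]\n", $s, join(',', lone_slash_offsets($s));
}
```

Only the division-style slashes are reported; every slash that is part of a run of two or more is skipped.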

Which brings us to this:

$text =~ s{ (^ (?: [^/'"]+ | '([^'\\]+|\\.)*' | "([^"\\]+|\\.)*" | (?<!/)/(?!/) )* ) //.* }{$1}xgm;

Which likely has several bugs. Note that I didn't allow for \ to cause the comment to continue on to subsequent lines because I both believe and hope that such doesn't actually work in the languages that I use //-comments in.
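For trying the substitution out, here is a sketch that wraps it in a sub and runs it over a few invented lines. It is written with brace delimiters, because with angle-bracket delimiters the < in the lookbehind is counted as a nested opening bracket. The sub name is hypothetical, and the inner groups are made non-capturing so that $1 is the only capture.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution, for experimenting.
sub strip_line_comments {
    my ($text) = @_;
    $text =~ s{
        (                           # capture: the code before the comment
            ^
            (?: [^/'"]+             #   anything that is not a slash or a quote
              | '(?:[^'\\]+|\\.)*'  #   a complete single-quoted string
              | "(?:[^"\\]+|\\.)*"  #   a complete double-quoted string
              | (?<!/)/(?!/)        #   a lone slash, such as division
            )*
        )
        //.*                        # the comment, up to the end of the line
    }{$1}xgm;
    return $text;
}

# Sample input is invented for illustration.
my $in = <<'END';
int a = b / c; // divide
char *u = "http://example.com"; // a URL inside a string
x = '/'; // a quoted slash
no comment here
END

print strip_line_comments($in);
```

Whitespace in front of a removed comment is left in place, and a line with an unbalanced quote before the // is left alone entirely, since the pattern then cannot match up to the comment.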

Note that the pseudo tokenizer needs to match any constructs that could contain quotes or slashes so, for example, /* ... */ would need to be handled if such might be encountered.
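One way to allow for /* ... */ is sketched below. Note that this is a different arrangement from the line-anchored substitution: a single left-to-right pass that matches strings and block comments so it can put them back unchanged, and drops only // comments. It is swapped in here because a match that restarts at each line start could begin inside a block comment that spans several lines. The sub name is hypothetical, and C-like quoting rules are an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: removes // comments, keeps strings and /* */ comments.
sub strip_line_comments_keeping_blocks {
    my ($text) = @_;
    $text =~ s{
        (                          # tokens to keep as they are:
            '(?:[^'\\]+|\\.)*'     #   single-quoted string
          | "(?:[^"\\]+|\\.)*"     #   double-quoted string
          | /\* .*? \*/            #   block comment, may span lines
        )
      | // [^\n]*                  # line comment to drop
    }{ defined $1 ? $1 : '' }gsex;
    return $text;
}

# Sample input is invented for illustration.
print strip_line_comments_keeping_blocks(
    "a = 1; /* not // a comment */ b = 2; // real\n"
);
```

Because each string or block comment is consumed as a whole token, any // inside it is never seen as the start of a comment.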

- tye        


In reply to Re: Removing '//' comments (tokenize) by tye
in thread Removing '//' comments by Yunus
