in reply to Re^3: A Regex for no-break space Unicode Entities
in thread A Regex for no-break space Unicode Entities

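UTF-8 sequences are completely distinguishable: the bytes \302\240 never occur as part of the encoding of any other character. The same holds for every valid UTF-8 sequence, because lead bytes and continuation bytes come from disjoint ranges, so one character's bytes cannot turn up inside another's.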
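To make that concrete, here is a minimal sketch that works on undecoded bytes. The sample string and the choice of a plain space as the replacement are my own assumptions, not something from the thread. Other characters in the sample also contain the byte \xA0, but only a real no-break space contains the pair \xC2\xA0.

```perl
use strict;
use warnings;

# Raw UTF-8 bytes, not decoded: "a", U+00A0, "b", U+00E0, U+20A0.
# U+00E0 encodes as \xC3\xA0 and U+20A0 as \xE2\x82\xA0; both contain
# the byte \xA0, but neither contains the pair \xC2\xA0.
my $bytes = "a\xC2\xA0b\xC3\xA0\xE2\x82\xA0";

# Replace every encoded no-break space with an ordinary space,
# leaving all other bytes untouched.
(my $fixed = $bytes) =~ s/\xC2\xA0/ /g;

# Show the result as hex so the untouched multi-byte characters are visible.
print join(' ', map { sprintf '%02X', ord } split //, $fixed), "\n";
```

This only holds as long as the string really is raw bytes; if it has already been decoded, the pattern to match is \x{A0} instead.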

The possibilities for surprise I saw were perl making other changes if the file contained invalid UTF-8 or characters not encoded in the shortest possible sequence of bytes (overlong encodings), or perl giving warnings.
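One way to rule those cases out beforehand is to check that the input is strictly valid UTF-8, sketched here with the core Encode module. The helper name is_strict_utf8 is made up for illustration, and the three test strings are my own examples.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns 1 if $bytes is strictly valid UTF-8, else 0.
sub is_strict_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode die on malformed input; LEAVE_SRC stops
    # decode from modifying the string it is given.
    my $ok = eval {
        Encode::decode('UTF-8', $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

# A well-formed no-break space.
print is_strict_utf8("a\xC2\xA0b") ? "ok\n" : "malformed\n";
# An overlong (two-byte) encoding of U+0020.
print is_strict_utf8("a\xC0\xA0b") ? "ok\n" : "malformed\n";
# A continuation byte with no lead byte.
print is_strict_utf8("a\xA0b") ? "ok\n" : "malformed\n";
```

Note the encoding name 'UTF-8' with the hyphen, which asks Encode for the strict variant; 'utf8' is the laxer one.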
