in reply to regex on XML

I have been down that road a couple of times, and ended up with ugly regexps that tried to identify whether a < or & was part of the markup or needed to be escaped. Basically a < that's not followed by a letter (or by /, ! or ?, which start closing tags, comments and processing instructions) should probably be escaped, and a & that's not followed by (#\d+|#x[0-9a-fA-F]+|\w+); should be escaped. Be sure to keep a log of what you replace so you can spot problems.
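For what it's worth, here is a rough sketch of that heuristic in Python. The function name escape_stray and the exact patterns are mine, not something from the original thread: I assume that a < followed by /, ! or ? is also markup, that hex references contain hex digits, and that every reference ends with a semicolon. Treat it as an illustration of the idea, not a robust fixer.

```python
import re

# "&" that does not start a character reference or an entity reference
STRAY_AMP = re.compile(r"&(?!(?:#\d+|#x[0-9a-fA-F]+|\w+);)")
# "<" that does not look like the start of a tag, comment, PI or declaration
STRAY_LT = re.compile(r"<(?![A-Za-z_/!?])")


def escape_stray(text):
    """Escape stray & and < characters; return (fixed_text, log).

    Each log entry is (offset, character, surrounding context), so the
    replacements can be reviewed afterwards. Offsets refer to the text
    as it was at the start of the pass that made the replacement.
    """
    log = []

    def fix(replacement):
        def _sub(match):
            start = match.start()
            context = match.string[max(0, start - 10):start + 10]
            log.append((start, match.group(0), context))
            return replacement
        return _sub

    # Ampersands first, then angle brackets.
    text = STRAY_AMP.sub(fix("&amp;"), text)
    text = STRAY_LT.sub(fix("&lt;"), text)
    return text, log


if __name__ == "__main__":
    fixed, log = escape_stray("<a>1 < 2 & 3 &amp; AT&T</a>")
    print(fixed)
    for entry in log:
        print(entry)
```

Note that this will happily leave something like &foo; alone even if no such entity is declared, and it knows nothing about CDATA sections or comments, which is exactly why you want to read the log.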

Down this path lies madness though. If the provider of the data claims it's XML, then you usually have good leverage to force them to fix it at the source. That's the sanest way to go. A little work on their part (maybe you can help them) will save you, and eventually them, lots of headaches down the road.

Just for fun, I have actually used another (wrong) option: provided the XML is close enough to SGML, and has a DTD (or you can write its DTD easily), you can try using sx (called osx in some Linux distributions) to convert the SGML to XML. SGML is actually much more lax about what needs to be escaped: the parser will try to figure out whether a < or & stands on its own or is part of the markup. But once again that's just a stopgap (and probably quite a hard one to set up). Try to get the "quasi-XML" to be XML, and spend your time doing useful things instead of fixing other people's mistakes.