Re: character encoding ambiguities when performing regexps with html entities

I do something a bit different. When working in vim with text pasted from an MS Word document, I translate the few oddball characters I frequently encounter into HTML entities.

It took a good bit of experimentation to work this out, but it works well and consistently in translating on the fly, per line or per selection.

From my .vimrc.web:

let myentity = "–—“”‘’姣…蝁賙曌蝨膩"
nmap <buffer> <silent> <localleader>utf :.!perl -MHTML::Entities -Mutf8 -lne 'utf8::decode($_); print encode_entities($_, qq{<C-R>=g:myentity<CR>} );'<CR>
vmap <buffer> <silent> <localleader>utf :!perl -MHTML::Entities -Mutf8 -lne 'utf8::decode($_); print encode_entities($_, qq{<C-R>=g:myentity<CR>});'<CR>

Translated out of its vim environment, the command would look something like:

perl -MHTML::Entities -Mutf8 -lne 'utf8::decode($_); print encode_entities($_, qq{–—“”‘’姣…蝁賙曌蝨膩});' <yourfile>

As always, your mileage may vary, but you should find this useful and consistent. :-)

Update: The characters above are supposed to be the actual UTF-8 literals, not their escaped forms. In other words, you should see more of these: "蝁賙曌蝨膩" and NONE of these: "–—“”‘’"

I did not mark the above as <code> because that escaped all the literals, which obscured the whole point of this reply.
