in reply to character encoding ambiguities when performing regexps with html entities

I do something a bit different: when working in vim, I take text pasted from an MS Word document and translate the few oddball characters I frequently encounter into HTML entities.

It took a good bit of experimentation to work this out, but it works well and consistently in translating on the fly, per line or per selection.

From my .vimrc.web:

let myentity = "–—“”‘’«»…ãáçêé¼½¾¿°"
nmap <buffer> <silent> <localleader>utf :.!perl -MHTML::Entities -Mutf8 -lne 'utf8::decode($_); print encode_entities($_, qq{<C-R>=g:myentity<CR>} );'<CR>
vmap <buffer> <silent> <localleader>utf :!perl -MHTML::Entities -Mutf8 -lne 'utf8::decode($_); print encode_entities($_, qq{<C-R>=g:myentity<CR>});'<CR>

Translated and removed from its vim environment, the command would look something like:

perl -MHTML::Entities -Mutf8 -lne 'utf8::decode($_); print encode_entities($_, qq{–—“”‘’«»…ãáçêé¼½¾¿°});' <yourfile>

As always, your mileage may vary, but you should find this useful and consistent. :-)

update: The above are supposed to be the actual utf-8 literals. In other words, you should see more of these: "ãáçêé¼½¾¿°" and NONE of these: "&#8211;&#8212;&#8220;&#8221;&#8216;&#8217;"

I did not mark the above as <code> because that escaped all the literals, which obscured the whole point of the reply.
