in reply to tokenize a string

XML::TokeParser returns XML as a stream of tokens. If you need exactly the format you listed, you could write a wrapper around it.

Update: Oops, sorry, I missed your last paragraph.