in reply to parsing XMLish data

In such a case I would use CDATA sections. First wrap the content of the dodgy elements in <![CDATA[ ... ]]>, then you can parse them without problems:

#!/bin/perl -w
use strict;
use XML::Twig;
use Data::Dumper;

# generate a file where the content of rec1 and something is
# stuck in CDATA sections
my $tmp="tmp";
open( TMP, ">$tmp") or die "$0 cannot open $tmp: $!";
while( <DATA>)
  { s{<(rec1|something)>}{<$1><![CDATA[}g;
    s{</(rec1|something)>}{]]></$1>}g;
    print TMP $_;
  }
close TMP;

# sorry, I could not help but use XML::Twig for this
my %data;
my $t= XML::Twig->new( twig_handlers =>
         { r => sub { $data{$_->field( 'key')}=
                        { rec1      => $_->field( 'rec1'),
                          something => $_->field( 'something')
                        };
                      $_[0]->purge; # I like to save memory
                    }
         },
       );
$t->parsefile( $tmp);

# pass a reference so Dumper prints one nested structure
print Dumper( \%data);

__DATA__
<data>
  <r>
    <key>k1</key>
    <rec1>data</rec1>
    <something>else</something>
  </r>
  <r>
    <key>k2</key>
    <rec1>includes <br> and non UTF-8 chars like é, or nasties like <</rec1>
    <something>else, <p>ugly <i>too<b>isn't</i> it</b> & all</p></something>
  </r>
</data>

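There are only two restrictions: if the "embedded" data contains ]]> then you must split the CDATA section at that point (otherwise the section ends early), and the data must of course not contain </rec1> or </something>, since those are the strings the substitutions look for.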
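For the first restriction, the usual trick is to close the section after the two brackets and reopen it before the >. Here is a minimal sketch of that, assuming you have the raw element content in a string; cdata_wrap is a hypothetical helper of mine, not something XML::Twig provides and not part of the script above:

```perl
use strict;
use warnings;

# Hypothetical helper: wrap raw text in a CDATA section, splitting every
# "]]>" across two sections so the terminator never appears inside one.
sub cdata_wrap {
    my( $text)= @_;
    $text=~ s/\]\]>/]]]]><![CDATA[>/g;      # "]]" ends one section, ">" starts the next
    return '<![CDATA[' . $text . ']]>';
}

print cdata_wrap( 'a ]]> b'), "\n";
```

A parser concatenates adjacent CDATA sections, so the element text comes back unchanged. To use it in the script you would have to capture the element content as a whole instead of substituting the open and close tags separately.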