In such a case I would use CDATA sections. First wrap the content of the dodgy elements in <![CDATA[ ... ]]>, then you can parse them without problems:
#!/bin/perl -w
use strict;
use XML::Twig;
use Data::Dumper;

# generate a file where the content of rec1 and something is
# stuck in CDATA sections
my $tmp="tmp";
open( TMP, ">$tmp") or die "$0 cannot open $tmp: $!";
while( <DATA>)
  { s{<(rec1|something)>}{<$1><![CDATA[}g;
    s{</(rec1|something)>}{]]></$1>}g;
    print TMP $_;
  }
close TMP;

# sorry, I could not help but use XML::Twig for this
my %data;
my $t= XML::Twig->new( twig_handlers =>
    { r => sub { $data{$_->field( 'key')}=
                   { rec1      => $_->field( 'rec1'),
                     something => $_->field( 'something')
                   };
                 $_[0]->purge; # I like to save memory
               }
    },
  );
$t->parsefile( $tmp);
print Dumper( \%data);  # pass a reference so the hash is dumped as one structure

__DATA__
<data>
<r>
  <key>k1</key>
  <rec1>data</rec1>
  <something>else</something>
</r>
<r>
  <key>k2</key>
  <rec1>includes <br> and non UTF-8 chars like é, or nasties like <</rec1>
  <something>else, <p>ugly <i>too<b>isn't</i> it</b> & all</p></something>
</r>
</data>
The only restrictions are: if the "embedded" data contains ]]> then you must break the CDATA section at that point (close it and open a new one), and of course the data must not contain </rec1> or </something>.
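Breaking the CDATA section could look something like the sketch below. The helper name cdata_wrap is made up for illustration and is not part of the script above; it assumes you have the raw element content in a string and just want to be sure no ]]> survives inside a single section.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original script: wrap arbitrary text
# in CDATA, splitting every "]]>" across two sections so that the sequence
# never appears inside a single one.
sub cdata_wrap {
    my ($text) = @_;
    # Keep "]]" in the current section, close it, then reopen before ">".
    $text =~ s{\]\]>}{]]]]><![CDATA[>}g;
    return '<![CDATA[' . $text . ']]>';
}

print cdata_wrap('a]]>b'), "\n";   # prints <![CDATA[a]]]]><![CDATA[>b]]>
```

A parser sees two adjacent CDATA sections here and hands back their concatenation, so the text comes out as a]]>b again. In the script above you would have to apply this to the content between the opening and closing tags rather than rely on the two line-by-line substitutions.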
In reply to Re: parsing XMLish data
by mirod
in thread parsing XMLish data
by gav^