
In such a case I would use CDATA sections. First wrap the content of the dodgy elements in <![CDATA[ ... ]]>, then you can parse them without problems:

#!/bin/perl -w

use strict;
use XML::Twig;
use Data::Dumper;

# generate a file where the content of rec1 and something is
# stuck in CDATA sections
my $tmp="tmp";
open( my $tmp_fh, '>', $tmp) or die "$0 cannot open $tmp: $!";
while( <DATA>)
  { s{<(rec1|something)>}{<$1><![CDATA[}g;
    s{</(rec1|something)>}{]]></$1>}g;
    print {$tmp_fh} $_;
  }
close $tmp_fh or die "$0 cannot close $tmp: $!";

# sorry, I could not help but use XML::Twig for this
my %data;

my $t= XML::Twig->new( 
         twig_handlers => { r => sub { $data{$_->field( 'key')}= 
                                        { rec1      => $_->field( 'rec1'), 
                                          something => $_->field( 'something')
                                        };
                                       $_[0]->purge; # I like to save memory 
                                      }
                           },
                      );
$t->parsefile( $tmp);
print Dumper( \%data); # pass a reference so the hash is dumped as one structure


__DATA__
<data>
  <r>
    <key>k1</key>
    <rec1>data</rec1>
    <something>else</something>
  </r>
  <r>
    <key>k2</key>
    <rec1>includes <br> and non UTF-8 chars like é, or nasties like < </rec1>
    <something>else, <p>ugly <i>too<b>isn't</i> it</b> & all</p></something>
  </r>
</data>

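The only restrictions are: if the "embedded" data contains ]]>, you must break the CDATA section at that point (close it and open a new one, so that the sequence is split across the two sections), and of course the data must not contain </rec1> or </something>, since the substitutions key on those closing tags.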
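Breaking the CDATA section could look something like the sketch below. The helper name cdata_wrap is made up for illustration and is not part of XML::Twig or of the script above; it assumes you already have the element content in a scalar, rather than processing the file line by line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from XML::Twig): wrap text in a CDATA section,
# splitting every embedded "]]>" so that "]]" ends one section and ">"
# starts the next. The terminator then never appears inside a section.
sub cdata_wrap {
    my ($text) = @_;
    $text =~ s{\]\]>}{]]]]><![CDATA[>}g;
    return "<![CDATA[$text]]>";
}

print cdata_wrap('a ]]> b'), "\n";
```

A parser concatenates the adjacent sections, so the element text should come back as the original string, ]]> included.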

In reply to Re: parsing XMLish data by mirod
in thread parsing XMLish data by gav^
