In such a case I would use CDATA sections. First wrap the content of the dodgy elements in <![CDATA[ ... ]]>, then you can parse them without problems:

#!/bin/perl -w
use strict;
use XML::Twig;
use Data::Dumper;

# generate a file where the content of rec1 and something is
# stuck in CDATA sections
my $tmp="tmp";
open( TMP, ">$tmp") or die "$0 cannot open $tmp: $!";
while( <DATA>) {
    s{<(rec1|something)>}{<$1><![CDATA[}g;
    s{</(rec1|something)>}{]]></$1>}g;
    print TMP $_;
}
close TMP;

# sorry, I could not help but use XML::Twig for this
my %data;
my $t= XML::Twig->new(
    twig_handlers => {
        r => sub {
            # store each record under its key
            $data{$_->field( 'key')}= {
                rec1      => $_->field( 'rec1'),
                something => $_->field( 'something'),
            };
            $_[0]->purge;    # I like to save memory
        },
    },
);
$t->parsefile( $tmp);
print Dumper( \%data);    # pass a reference so the hash is dumped as one structure

__DATA__
<data>
  <r>
    <key>k1</key>
    <rec1>data</rec1>
    <something>else</something>
  </r>
  <r>
    <key>k2</key>
    <rec1>includes <br> and non UTF-8 chars like é, or nasties like <</rec1>
    <something>else, <p>ugly <i>too<b>isn't</i> it</b> & all</p></something>
  </r>
</data>

The only restrictions are: if the "embedded" data contains ]]> then you must break the CDATA section (close it after the ]] and open a new one before the >), and of course the data must not contain </rec1> or </something>.
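
The first restriction can be handled mechanically. Here is a minimal sketch, assuming you wrap the text yourself rather than with the two substitutions above; cdata_wrap is a hypothetical helper name, not something from XML::Twig. It ends the current section right after the ]] and starts a new one before the >, so the terminator never appears whole inside a single section:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post or from XML::Twig):
# wrap arbitrary text in CDATA, splitting every embedded "]]>" across
# two sections so that no single section contains the terminator.
sub cdata_wrap {
    my ($text) = @_;
    # "]]>" becomes "]]" + end of section + start of new section + ">"
    $text =~ s/\]\]>/]]]]><![CDATA[>/g;
    return '<![CDATA[' . $text . ']]>';
}

print cdata_wrap('if (a[b[0]]> 1)'), "\n";
```

A parser concatenates adjacent CDATA sections, so the element text comes back unchanged. The second restriction (no </rec1> or </something> inside the data) is not covered by this sketch.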


In reply to Re: parsing XMLish data by mirod
in thread parsing XMLish data by gav^
