in reply to parsing reserved chars with xml::simple

thanks for all your help.

I agree: in an ideal situation &, ', ", <, > etc. should not appear in the incoming XML. However, I have run into some situations where this "bad" data has to be sent in and needs to be preserved after the incoming XML is parsed. But I cannot parse it with the reserved characters in it. I am having trouble running through the XML and converting the reserved characters (< to !40 etc.), because a global substitution converts all the < characters to !40, including the ones that open tags. I need to break my XML apart so that I only call the convertchar function on the data elements of the XML and not on the tags themselves.
$incoming_xml = ' <?xml version="1.0" encoding="UTF-8"?> <TRANSACTION> <FIELDS> <FIELD KEY="user">name</FIELD> <FIELD KEY="password">pass&word</FIELD> <FIELD KEY="operation_type"><do_what></FIELD> </FIELDS> </TRANSACTION>';
Idea: split the incoming XML at each >DATA TO BE CONVERTED or WHITESPACE< boundary and call the convertchar method only on the data between the > and the <.
I will need to be sure that I keep the XML syntax exactly as it should be. So, based on my example $incoming_xml, once the regex is run the converted XML should look like:
$converted_xml = ' <?xml version="1.0" encoding="UTF-8"?> <TRANSACTION> <FIELDS> <FIELD KEY="user">name</FIELD> <FIELD KEY="password">pass!38word</FIELD> <FIELD KEY="operation_type">!40do_what!41</FIELD> </FIELDS> </TRANSACTION>';
Once converted, I will send it through XMLin; then I can loop through the resulting hash and convert the !# codes back to their original values, thereby preserving the necessary bad values.

I am having trouble creating a regex to do this, though. Once again, I appreciate all your help.
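The decode step mentioned above (walking the parsed hash and turning the !NN codes back into characters) might look roughly like the sketch below. The helper name restore_chars and the sample structure are hypothetical; the structure only stands in for whatever XMLin returns. It decodes values only, not hash keys, and it assumes the original data never contains a literal sequence such as !38.

```perl
use strict;
use warnings;

# Reverse of the ad hoc encoding used in this thread (!38, !34, !39, !40, !41).
my %char = (38 => '&', 34 => '"', 39 => "'", 40 => '<', 41 => '>');

# Hypothetical helper: walk a nested structure of hashes and arrays and
# restore the original characters in every plain string value.
sub restore_chars {
    my ($node) = @_;
    if (ref $node eq 'HASH') {
        # values() aliases the hash values, so assigning to $_ updates them.
        $_ = restore_chars($_) for values %$node;
    }
    elsif (ref $node eq 'ARRAY') {
        $_ = restore_chars($_) for @$node;
    }
    elsif (defined $node && !ref $node) {
        $node =~ s/!(38|34|39|40|41)/$char{$1}/g;
    }
    return $node;
}

# Stand-in for a parsed structure; the real shape depends on the parser options.
my $data = { password => 'pass!38word', operation_type => '!40do_what!41' };
restore_chars($data);
print "$data->{password} $data->{operation_type}\n";
# prints: pass&word <do_what>
```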

Re: Re: parsing reserved chars with xml::simple
by diotalevi (Canon) on Feb 17, 2004 at 19:12 UTC

    s/&(?!amp|quot|apos|lt|gt)/&amp;/g; may work for you in some simple cases. This replaces ampersands that are not immediately followed by something that might be a valid entity name with an escaped version.
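A slightly stricter variant of that substitution is sketched below. It is an assumption, not the one-liner quoted above: it also requires the trailing semicolon and lets numeric character references through. The function name escape_bare_ampersands is made up for illustration.

```perl
use strict;
use warnings;

# Escape only "bare" ampersands: those that do not already start one of the
# five predefined entity references or a numeric character reference.
sub escape_bare_ampersands {
    my ($text) = @_;
    $text =~ s/&(?!(?:amp|quot|apos|lt|gt);|#\d+;|#x[0-9A-Fa-f]+;)/&amp;/g;
    return $text;
}

print escape_bare_ampersands('pass&word &amp; more &#38; done'), "\n";
# prints: pass&amp;word &amp; more &#38; done
```

Like the original, this only handles &; it does nothing about <, >, ' or " in the data.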

      That works for &, but it won't work for <, >, ', ".
      The regexes are:
        $text =~ s/&(?!amp|quot|apos|lt|gt)/!38/g;
        $text =~ s/"(?!amp|quot|apos|lt|gt)/!34/g;
        $text =~ s/<(?!amp|quot|apos|lt|gt)/!40/g;
        $text =~ s/>(?!amp|quot|apos|lt|gt)/!41/g;
        $text =~ s/'(?!amp|quot|apos|lt|gt)/!39/g;
      When I run that on my XML I get:
      !40?xml version=!341.0!34 encoding=!34UTF-8!34?!41 !40TRANSACTION!41 !40FIELDS!41 !40FIELD KEY=!34user!34!41name!40/FIELD!41 !40FIELD KEY=!34password!34!41pass!38word!40/FIELD!41 !40FIELD KEY=!34operation_type!34!41!40do_what!41!40/FIELD!41 !40/FIELDS!41 !40/TRANSACTION!41
      So I am still in the same spot: I need to be able to manipulate only the data in between > and <.

      thanks again for your help
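One way to restrict the substitution to the data between tags is sketched below. It rests on a strong assumption: the only real markup is the XML declaration and the three element names from the example (TRANSACTION, FIELDS, FIELD), and attribute values never contain >. Anything else, including <do_what>, is treated as data. The helper name convert_data_only is hypothetical, and the !NN codes are the ones used in this thread.

```perl
use strict;
use warnings;

# Assumed complete list of real markup: the XML declaration plus three
# known element names. Everything else counts as data.
my $tag = qr{<\?xml\b[^>]*\?>|</?(?:TRANSACTION|FIELDS|FIELD)\b[^>]*>};

# The ad hoc codes used in this thread.
my %code = ('&' => '!38', '"' => '!34', "'" => '!39', '<' => '!40', '>' => '!41');

sub convert_data_only {
    my ($xml) = @_;
    # The capturing group makes split return the tags as well as the text
    # between them: even indexes hold data, odd indexes hold tags.
    my @parts = split /($tag)/, $xml;
    for my $i (grep { $_ % 2 == 0 } 0 .. $#parts) {
        $parts[$i] =~ s/([&"'<>])/$code{$1}/g;
    }
    return join '', @parts;
}

my $sample = '<FIELD KEY="password">pass&word</FIELD>'
           . '<FIELD KEY="operation_type"><do_what></FIELD>';
print convert_data_only($sample), "\n";
# prints: <FIELD KEY="password">pass!38word</FIELD><FIELD KEY="operation_type">!40do_what!41</FIELD>
```

If the data itself can contain text that looks like one of the known tags, this approach cannot tell the two apart, which is part of why escaping at the source is the more robust fix.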
        What is !41? If you are going to escape the characters, you should use the right escape values. You can either use the entities: &amp;, &gt;, &lt;, &quot;, or the numeric character references: &#38;, &#62;, &#60;, &#34;.
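For comparison, escaping a data value with the standard predefined entities could be sketched like this. The function name escape_xml is hypothetical; the lookup table covers the five predefined XML entities.

```perl
use strict;
use warnings;

# The five predefined XML entities.
my %entity = (
    '&' => '&amp;',
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
    "'" => '&apos;',
);

# Escape a data value (not a whole document) in a single pass, so the
# ampersands introduced by earlier replacements are never escaped again.
sub escape_xml {
    my ($value) = @_;
    $value =~ s/([&<>"'])/$entity{$1}/g;
    return $value;
}

print escape_xml('pass&word'), "\n";   # prints: pass&amp;word
print escape_xml('<do_what>'), "\n";   # prints: &lt;do_what&gt;
```

A conforming parser turns these back into the original characters, so no separate decode pass is needed afterwards.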
Re: Re: parsing reserved chars with xml::simple
by iburrell (Chaplain) on Feb 17, 2004 at 21:20 UTC
    What do you mean by "where this 'bad' data has to be sent in and needs to be preserved from the incoming xml after it is parsed"? The reserved characters should always be escaped in the XML, or it isn't well-formed XML. If the source of the XML is not escaping the reserved characters, it is a bug in that program and should be fixed.

    Most parsers, including XML::Simple, will decode the entities into the characters. Most generators perform the reverse transformation, escaping the characters on output. Most parsers have options to control whether they expand the entities or not.