Well, & is a reserved character. You need to escape it as an entity, i.e. &amp;. If you do not like escaping entities, you can use the following paradigm:
<![CDATA[DATA GOES HERE]]>
Using CDATA tags, you can put whatever you want within the [], with the exception of the end tag, ]]>. Follow this link to safari for a bigger example.
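As a rough sketch of the CDATA approach in Perl: the helper below (cdata_wrap is a made-up name, not part of any module) wraps a string in a CDATA section. Because ]]> is the one sequence a CDATA section cannot contain, it splits any embedded ]]> across two adjacent sections, which is the usual workaround.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap arbitrary text in a CDATA section.
# An embedded "]]>" would end the section early, so it is split
# across two adjacent CDATA sections.
sub cdata_wrap {
    my ($text) = @_;
    $text =~ s/\]\]>/]]]]><![CDATA[>/g;
    return '<![CDATA[' . $text . ']]>';
}

print cdata_wrap('pass&word'), "\n";
print cdata_wrap('a]]>b'), "\n";
```

A parser reads the two adjacent sections back as one run of text, so the original string survives intact.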
Why would you want an XML parser to parse stuff that isn't XML?
Down that path lies the madness of the variously differing error-correction schemes used by the "HTML" "parsers".
And rightfully so, the XML creators said "no way" to that.
If it's an XML parser, it's supposed to throw a fit when you feed it something other than XML. You should be feeding it XML, not non-XML.
Well, your input is not well-formed XML, because it contains a bare ampersand. According to the XML spec it has to be written as the string "&amp;". I think that's the cleanest way to create well-formed XML input, but you could also write some kind of preprocessor that converts the not-quite-valid XML into XML.
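A minimal sketch of such a preprocessor, assuming the only problem is stray ampersands (escape_bare_amps is a hypothetical name). It escapes an ampersand only when it does not already begin one of the five predefined entities or a numeric character reference, so input that is already correct passes through unchanged.

```perl
use strict;
use warnings;

# Hypothetical preprocessor: escape only ampersands that do not already
# start a predefined entity or a numeric character reference.
sub escape_bare_amps {
    my ($xml) = @_;
    $xml =~ s/&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)/&amp;/g;
    return $xml;
}

print escape_bare_amps('<FIELD KEY="password">pass&word</FIELD>'), "\n";
print escape_bare_amps('<FIELD KEY="password">pass&amp;word</FIELD>'), "\n";
```

This is a heuristic: it cannot tell whether a value such as a&lt;b was meant literally, and it does nothing about stray < or > characters.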
Doesn't really apply to your question, but I got some great feedback on xml::simple that might be helpful way back -- here is that thread.
-CSUhockey3
thanks for all your help.
I agree... in an ideal situation &, ', ", <, > etc. should not be in the incoming XML. However, I have run into some situations where this "bad" data has to be sent in and needs to be preserved after the incoming XML is parsed, and I cannot parse it with the reserved characters in it. I am having trouble running through the XML and converting the reserved characters (< to !40 etc.), because that converts all the < characters to !40, including the ones that open tags. I need to break my XML apart so that I only call the convertchar function on the data elements of the XML and not on the tags themselves.
$incoming_xml = '
<?xml version="1.0" encoding="UTF-8"?>
<TRANSACTION>
<FIELDS>
<FIELD KEY="user">name</FIELD>
<FIELD KEY="password">pass&word</FIELD>
<FIELD KEY="operation_type"><do_what></FIELD>
</FIELDS>
</TRANSACTION>';
idea:
split the incoming xml at the >DATA TO BE CONVERTED or WHITESPACE< and just call the convertchar method on the data between the ><
I will need to be sure that i keep the xml syntax exactly how it should be. so based on my example $incoming_xml once the regex is run the converted xml should look like:
$converted_xml = '
<?xml version="1.0" encoding="UTF-8"?>
<TRANSACTION>
<FIELDS>
<FIELD KEY="user">name</FIELD>
<FIELD KEY="password">pass!38word</FIELD>
<FIELD KEY="operation_type">!40do_what!41</FIELD>
</FIELDS>
</TRANSACTION>';
once converted i will send it through XMLin then i can loop through the resulting hash and convert the !# back to their original values therefore preserving the necessary bad values.
i am having trouble creating a regex to do this though.
Once again i appreciate all your help.
that works for & but it won't work for <,>,',"
regex are:
$text =~ s/&(?!amp|quot|apos|lt|gt)/!38/g;
$text =~ s/"(?!amp|quot|apos|lt|gt)/!34/g;
$text =~ s/<(?!amp|quot|apos|lt|gt)/!40/g;
$text =~ s/>(?!amp|quot|apos|lt|gt)/!41/g;
$text =~ s/'(?!amp|quot|apos|lt|gt)/!39/g;
when i run that on my xml i get:
!40?xml version=!341.0!34 encoding=!34UTF-8!34?!41
!40TRANSACTION!41
!40FIELDS!41
!40FIELD KEY=!34user!34!41name!40/FIELD!41
!40FIELD KEY=!34password!34!41pass!38word!40/FIELD!41
!40FIELD KEY=!34operation_type!34!41!40do_what!41!40/FIELD!41
!40/FIELDS!41
!40/TRANSACTION!41
so i am still in the same spot i need to be able to only manipulate the data in between > <
thanks again for your help
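The regexes above hit every <, > and " in the document because nothing limits them to the data. One possible way to restrict the substitution, sketched here on the assumption that the data always sits between <FIELD ...> and the next </FIELD> and that attribute values never contain >, is to match each FIELD element and run the conversion only on its body. The names convert_fields and restorechar are made up for this sketch; the !NN codes are the ones used in the post.

```perl
use strict;
use warnings;

# Placeholder codes taken from the post above.
my %code = ('&' => '!38', '"' => '!34', "'" => '!39', '<' => '!40', '>' => '!41');
my %char = reverse %code;

# Replace each reserved character with its placeholder.
sub convertchar {
    my ($text) = @_;
    $text =~ s/([&"'<>])/$code{$1}/g;
    return $text;
}

# Reverse step, for values pulled out of the parsed hash.
sub restorechar {
    my ($text) = @_;
    $text =~ s/(!(?:34|38|39|40|41))/$char{$1}/g;
    return $text;
}

# Convert only what sits between <FIELD ...> and the next </FIELD>.
sub convert_fields {
    my ($xml) = @_;
    $xml =~ s{(<FIELD\b[^>]*>)(.*?)(</FIELD>)}{
        my ($open, $body, $close) = ($1, $2, $3);
        $open . convertchar($body) . $close;
    }gse;
    return $xml;
}

print convert_fields('<FIELD KEY="operation_type"><do_what></FIELD>'), "\n";
print restorechar('!40do_what!41'), "\n";
```

Two caveats: a value that already contains a literal string like !40 would be restored incorrectly, and a value containing </FIELD> still defeats the match.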
XML::Simple will change all the entities for special characters into the characters when it parses text and attribute values.
You can't easily escape the reserved characters in the not-quite XML without a heuristic parser. For example, &amp; is the entity for encoding &. You wouldn't want to turn that into &amp;amp;. Also, < and > are reserved characters with &lt; and &gt; for entities. How would you distinguish between bad characters in text and the real tags?
The only place that knows what is text and what is elements is the source of the XML. You need to fix the source of the XML that is not encoding special characters to entities in text and attribute values.
Your problem, as several people have pointed out, is that the thing that's handing you data isn't handing you XML. Instead, it's almost XML. Well, if I pass my machine binary code that is almost a valid program, but not quite, I get a core dump (as I should). If someone is contractually required to be handing you XML, you should bring this up with your manager.
For the record, what the people on the other end should be sending you is:
<?xml version="1.0" encoding="UTF-8"?>
<TRANSACTION>
<FIELDS>
<FIELD KEY="user">name</FIELD>
<FIELD KEY="password">pass&amp;word</FIELD>
<FIELD KEY="operation_type">do_what</FIELD>
</FIELDS>
</TRANSACTION>
Instead of &amp;, they could also say &#38; or <![CDATA[&]]>. The point is, their process isn't generating XML.
For another example, consider that print XMLout({password => ['pass&word']}); produces:
<opt>
<password>pass&amp;word</password>
</opt>
Assuming that the other end doesn't fix this, there are at least two ways to try to fix this in your code. One, as another poster alluded to, is to wrap the offending section in CDATA. If you happen to know that the process generating this almost-XML-but-not-quite always puts its garbage inside FIELD tags, and that the open and close FIELD is always on the same line, you could do this before sending your output to the xml processor:
$xmltext =~ s{(<FIELD .*?>)(.*)(</FIELD)}{$1<![CDATA[$2]]>$3}g;
Of course, this has the disadvantage that if your source ever starts to send XML documents that are in fact XML documents, with everything escaped as it should be, then your code will now read it incorrectly.
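One way to soften that disadvantage, offered only as a heuristic sketch: wrap a FIELD body in CDATA only when it contains a < or an ampersand that does not begin a recognised entity, and leave bodies that already look correctly escaped alone. The helper name wrap_bad_fields is invented here, and the sketch assumes bodies never contain ]]>.

```perl
use strict;
use warnings;

# An ampersand that does not start a predefined entity or numeric reference.
my $bare_amp = qr/&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)/;

# Hypothetical helper: wrap a FIELD body in CDATA only when it holds
# something an XML parser would reject or misread as markup.
sub wrap_bad_fields {
    my ($xml) = @_;
    $xml =~ s{(<FIELD\b[^>]*>)(.*?)(</FIELD>)}{
        my ($open, $body, $close) = ($1, $2, $3);
        my $needs_wrap = $body !~ /^<!\[CDATA\[/ && $body =~ /<|$bare_amp/;
        $open . ($needs_wrap ? '<![CDATA[' . $body . ']]>' : $body) . $close;
    }gse;
    return $xml;
}

print wrap_bad_fields('<FIELD KEY="password">pass&word</FIELD>'), "\n";
print wrap_bad_fields('<FIELD KEY="password">pass&amp;word</FIELD>'), "\n";
```

It still guesses: a sender who really means the literal characters &amp; in a password, but forgets to escape them, is indistinguishable from one who escaped correctly.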
Since your input isn't XML, you could also try parsing it by something that's not an XML parser, as suggested earlier by the reference to the wonderful article The Wrong Parser for the Right Reasons. For example, stealing copiously from that article, and using the HTML::Stream class for output, here's something that turns that particular bad non-XML into proper XML. This code simply assumes that anything between <FIELD> and </FIELD> was meant to be plain text, and not pieces of tags, or whatever. I've taken the liberty of adding several messed-up password examples:
#!perl
use HTML::Parser;
use HTML::Stream;
# see HTML::Stream documentation for how to output to a scalar
$output = new HTML::Stream \*STDOUT;
# inside FIELDs, ignore stuff that looks like tags
$insidefield = 0;
my $p = HTML::Parser->new
(
    xml_mode => 1,
    # start tags: emit as text while inside a FIELD, otherwise as a tag
    start_h =>
        [sub {
            my ($tagname, $attr,$text) = @_;
            if ($insidefield) {$output->text($text);}
            else {$output->tag($tagname, %$attr);}
            if ($tagname eq 'FIELD') {$insidefield = 1;}
        }, "tagname, attr,text"],
    # text: always emitted through the escaping text() method
    text_h =>
        [sub {
            my ($text) = @_;
            $output->text($text);
        }, "dtext"],
    # end tags: </FIELD> closes the field; anything else inside is text
    end_h =>
        [sub {
            my ($tagname, $text) = @_;
            if ($tagname eq 'FIELD') {$insidefield = 0;}
            if ($insidefield) {$output->text($text);}
            else {$output->tag("_$tagname");}
        }, "tagname, text"],
);
$p->parse_file(\*DATA);
__END__
<TRANSACTION>
<FIELDS>
<FIELD KEY="user">name</FIELD>
<FIELD KEY="passwordlt">pass<word</FIELD>
<FIELD KEY="passwordamp">pass&word</FIELD>
<FIELD KEY="passwordgt">pass>word</FIELD>
<FIELD KEY="passwordalmost">pass&ampword</FIELD>
<FIELD KEY="passwordampcorrect">pass&amp;word</FIELD>
<FIELD KEY="passwordtag">pa<ssw>ord</FIELD>
<FIELD KEY="operation_type">do_what</FIELD>
</FIELDS>
</TRANSACTION>
This code produces:
<TRANSACTION>
<FIELDS>
<FIELD KEY="user">name</FIELD>
<FIELD KEY="passwordlt">pass&lt;word</FIELD>
<FIELD KEY="passwordamp">pass&amp;word</FIELD>
<FIELD KEY="passwordgt">pass&gt;word</FIELD>
<FIELD KEY="passwordalmost">pass&amp;ampword</FIELD>
<FIELD KEY="passwordampcorrect">pass&amp;word</FIELD>
<FIELD KEY="passwordtag">pa&lt;ssw&gt;ord</FIELD>
<FIELD KEY="operation_type">do_what</FIELD>
</FIELDS>
</TRANSACTION>
Note that passwordampcorrect is passed through unchanged. This is because that text was already valid. (It had everything already properly escaped - so that if your data supplier fixes their stuff, you'll still be ok) If you want to massage that content too, just change the "dtext" to "text". Also, if you're not worried about downright pathological text such as the passwordtag example, the program above can be simplified substantially by removing the bits that change the $insidefield variable and replacing the if ($insidefield) bits with the else block.
Note that this program still fails to make heads or tails of things if someone wants to make "</FIELD>" their password. When that happens, you need to really start beating on the source of your data to send you valid XML.