OK, where should I start?
From the most generic, and probably the most useful, to the most specific:
- you will greatly increase the number of answers you get, and their accuracy, if you give us more data to work with: people usually post a sample of the data and an extract of the code they wrote, plus the expected output and the exact error messages they get,
- then you should use the preview to make sure the message you posted displays properly: as it is, I read (For eg. XML file has "&" for "&") where I guess you wanted (For eg. XML file has "&amp;" for "&"),
- then you don't seem to know much about XML: you can't randomly change the format of the input data and expect it to still be valid XML. You can get more information on XML at zvon.org, for example, or read a book or two about it. Learning XML and Perl & XML are 2 O'Reilly books that you might find useful too. Those &...; constructs are called entities, and they could be character entities (like &#160; or &#352;), default entities like &amp; or &lt;, internal entities, external entities... there is more to it than just "weird characters that don't look good",
- also, XML::Parser is probably not the module you should be using: XML::Simple and XML::LibXML are 2 examples of higher-level modules that will make your life much easier. Look at the Module Reviews section for reviews of XML modules,
- finally, I believe that if you let XML::Parser, or any other XML module, read your initial XML file, provided it really is well-formed XML, you will be pleasantly surprised by the data you get: the character entities should have been replaced and you should get what you wanted. Watch out for the encoding of the data though.
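To illustrate that last point, here is a minimal sketch of XML::Parser handing back text with the entities already replaced. The sample document and the `%text` / `@open` bookkeeping are made up for the example; only the `Handlers` option and the `parse` method come from the module itself.

```perl
use strict;
use warnings;
use XML::Parser;

# Made-up sample: one predefined entity and one numeric character reference.
my $xml = '<doc><title>Fish &amp; Chips</title><name>&#352;koda</name></doc>';

my %text;   # element name => accumulated character data
my @open;   # stack of currently open element names

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub { my ($expat, $element) = @_; push @open, $element },
        End   => sub { pop @open },
        # Char may be called several times for one run of text, so append.
        Char  => sub { my ($expat, $string) = @_; $text{ $open[-1] } .= $string },
    },
);
$parser->parse($xml);

# The parser returns Perl character strings, so pick an output encoding.
binmode STDOUT, ':encoding(UTF-8)';
print "$_: $text{$_}\n" for sort keys %text;
```

The title comes back with a plain "&" and the name with a real "Š", without any decoding step on your side.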
But XML::Parser expects "&" to look like "&amp;", too! And "<" like "&lt;". The rest is up to you...
So, after you have decoded the entities, make sure to re-encode at least these two again.
If your character set doesn't agree with the XML file's encoding (which is UTF-8 by default), it's best to replace the special characters with numeric entities, too. For example, it's very safe to replace "à" with "&#224;". The character's ordinal code is its Unicode code point, and ISO-Latin-1 coincides with Unicode for the range 0-255.
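A minimal sketch of that re-encoding step in plain Perl, assuming the text is already a decoded character string. `xml_escape` is a hypothetical helper name, not something from a module.

```perl
use strict;
use warnings;

# Hypothetical helper: make a text string safe to drop into XML content.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # ampersands first, so later "&#...;" are left alone
    $text =~ s/</&lt;/g;
    # Every non-ASCII character becomes a numeric reference to its code point.
    $text =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $text;
}

print xml_escape("voil\x{E0}: R&D < 5"), "\n";   # prints: voil&#224;: R&amp;D &lt; 5
```

Because the output is pure ASCII, it reads the same whether the file is declared as UTF-8 or ISO-Latin-1.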
You shouldn't be passing XML through decode_entities. The result will not be a well-formed XML document: any &amp; entities will be turned into &, and a bare & must be escaped as &amp; in XML.
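A small sketch of the problem, assuming HTML::Entities and XML::Parser are installed. The sample string and the `parses_ok` helper are made up for the illustration.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);
use XML::Parser;

# Hypothetical helper: true if XML::Parser accepts the string, false if it dies.
sub parses_ok {
    my ($candidate) = @_;
    return eval { XML::Parser->new->parse($candidate); 1 } ? 1 : 0;
}

my $xml     = '<doc>R&amp;D</doc>';    # well-formed sample
my $decoded = decode_entities($xml);   # now '<doc>R&D</doc>'

for my $candidate ($xml, $decoded) {
    printf "%-20s %s\n", $candidate,
        parses_ok($candidate) ? 'parses' : 'not well-formed';
}
```

The original string parses; the decoded one is rejected because of the bare "&".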
XML::Parser will decode entities in the XML document. The NoExpand option controls this. The standard entities will be handled automatically. If this is all you have in your document, then you don't need to worry about them.
If the XML document contains any HTML entities other than the standard ones (XML predefines five: &amp;, &lt;, &gt;, &apos; and &quot;), you will need to handle those specially. The standard XML way is to declare them in a DTD, either in an internal DTD subset or in an external file. Declaring entities is one of the last remaining uses for DTDs. Unfortunately, using a DTD, whether internal or external, adds some complexity.
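As a rough sketch of the internal-subset approach: the document below (made up for the example) declares the two HTML names it uses, and XML::Parser then expands them like any other internal entity. The choice of nbsp and eacute is only an illustration.

```perl
use strict;
use warnings;
use XML::Parser;

# Made-up document: the internal DTD subset declares the HTML names it uses.
my $xml = <<'XML';
<?xml version="1.0"?>
<!DOCTYPE doc [
  <!ENTITY nbsp   "&#160;">
  <!ENTITY eacute "&#233;">
]>
<doc>caf&eacute;&nbsp;au lait</doc>
XML

my $text = '';
my $parser = XML::Parser->new(
    # Append each chunk of character data; entity values arrive already expanded.
    Handlers => { Char => sub { my ($expat, $string) = @_; $text .= $string } },
);
$parser->parse($xml);

binmode STDOUT, ':encoding(UTF-8)';
print "$text\n";
```

Without the DOCTYPE block, the same document would be rejected, since neither name is defined in plain XML.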
There have been some proposals to handle character entities without DTDs but none of them have been accepted.