Re: Forcing XML to validate
by merlyn (Sage) on Jul 02, 2004 at 11:01 UTC
|
Anything that isn't well-formed isn't XML, and won't be parsed by any compliant XML parser. This is good. This is by design. Get the thing that is generating something-that-kinda-looks-like-XML to spit out the right stuff, and all will be better.
| [reply] |
|
|
As I said, I have no control over what is generating the data. I would love to just pass this by and not deal with the people who do, but unfortunately that is not an option.
I am not trying to force XML::Simple to parse invalid XML. I am trying to throw away the invalid parts, so that there is only valid XML left.
Then everything will be good.
I was just wondering if there was already something out there to do this.
| [reply] |
|
|
I'm not talking about control. I'm talking about cooperation. If someone is spitting out stuff that they think is XML, or advertise as XML, then it should be XML. Inform the folks generating the data that they need to do some work. That way, you're not trying to error-correct for them, which every consumer would otherwise have to do separately. Down That Path Lie Monsters and Madness, and it is the very reason that nothing is XML until it is XML, unlike the HTML tag soup we had before it.
| [reply] |
|
|
Jon Udell likes to use HTML Tidy to clean up XHTML/XML:
http://www.infoworld.com/article/04/05/28/22OPstrategic_1.html
http://tidy.sourceforge.net/
| [reply] |
Re: Forcing XML to validate
by exussum0 (Vicar) on Jul 02, 2004 at 12:26 UTC
|
It sounds like you have an encoding problem. My first suggestion is to find out what your encoding is, and regexp/tr away the invalid crap.
E.g. if you are using 7-bit ASCII only, then killing off anything over 127 would at least clean things up a bit. Just make sure not to kill off the XML structuring symbols, such as #, (space), ", ', >, < and the other chars, by accident. Try egress filtering: leave in only what you want in.
I feel your pain. tye's scratchpad, while it may work with some XML parsers, has char 0x001, or whatever char isn't liked by mine, per the options of the spec, so I have to clean it up when parsing on my end.
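A minimal sketch of the "leave in only what you want" idea, assuming the data really is meant to be 7-bit ASCII. The helper name keep_ascii_xml_chars is made up for illustration. It keeps tab, LF, CR and printable ASCII (the only control characters XML 1.0 allows are those three) and deletes everything else, so the structuring symbols survive untouched:

```perl
use strict;
use warnings;

# Hypothetical helper: keep only tab, LF, CR and printable 7-bit ASCII.
sub keep_ascii_xml_chars {
    my ($text) = @_;
    # /c complements the listed set and /d deletes, so every character
    # NOT listed here is removed.
    $text =~ tr/\x09\x0A\x0D\x20-\x7E//cd;
    return $text;
}

# Sample input with one high-bit byte (0xE9) and one control char (0x01).
my $dirty = "<item id=\"1\">caf\xE9\x01 &#233;</item>\n";
my $clean = keep_ascii_xml_chars($dirty);
print $clean;
```

Note that this throws away any non-ASCII text outright. If the input is really UTF-8 or Latin-1, every accented character disappears, so it is only a fit when losing that data is acceptable.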
| [reply] |
Re: Forcing XML to validate
by thor (Priest) on Jul 02, 2004 at 12:27 UTC
|
It seems that you have a pattern to what is wrong with your XML. I'm not good enough at XML to be able to look at your error and visualize the bustedness, but I'd imagine that you could modify your data coming in so that it is compliant before passing it to the parser. For instance, if you have something like this...
< foo>stuff< /foo>
...and that turns out to be invalid because of the space after the '<', you could do something like:
$xml =~ s/<\s+/</g;
However, as I said, this is just a guess since I haven't seen what kind of invalid XML you actually have.
| [reply] [d/l] |
Re: Forcing XML to validate
by iburrell (Chaplain) on Jul 02, 2004 at 16:13 UTC
|
What does the XML that isn't parsing look like? That isn't a very helpful error and could be caused by all kinds of different things. The limited fix depends on what the error is.
For example, it could be a < that wasn't encoded as &lt; where it should have been. The parser thinks it starts an element, but it is followed by a space and surrounded by normal text.
Or it could be a non-ASCII character in a name. This is allowed in XML 1.1, but that is uncommon. Stripping out the high-bit characters would help if you don't mind losing all non-ASCII characters.
Your regex is wrong, although I am assuming PerlMonks is mangling the content and inserting a weird character. s/\W// removes all non-word characters. Word characters are alphanumerics plus '_', basically identifiers. That regex will remove all markup and whitespace. Not good.
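To see what that does to a document, here is a small sketch. The sample string is made up, and I am assuming the substitution is applied globally with /g:

```perl
use strict;
use warnings;

# Made-up sample document.
my $xml = '<foo id="1">some text</foo>';

# Copy the string, then delete every non-word character from the copy.
(my $stripped = $xml) =~ s/\W//g;

print "$stripped\n";   # prints: fooid1sometextfoo
```

The angle brackets, quotes, '=' and spaces are all gone, so what is left is no longer markup at all.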
| [reply] [d/l] |
Re: Forcing XML to validate
by drfrog (Deacon) on Jul 02, 2004 at 17:03 UTC
|
| [reply] |
Re: Forcing XML to validate
by graff (Chancellor) on Jul 05, 2004 at 20:30 UTC
|
If you have various Perl XML modules installed from CPAN, then you are bound to have the "expat" package installed as well, because a lot of the CPAN XML modules depend on this package.
The "expat" distribution includes a utility called "xmlwf", which, according to its man page, "determines if an XML document is well-formed". (The description also says "It is non-validating." But there's a chance that your input data is actually not well-formed, and if that's true, then xmlwf might give you a better idea where the problems are in the data.)
There are lots of command-line options for controlling what xmlwf does with your input data -- read the man page and give it a try.
Once you figure out what's wrong with the data, you can either complain to the data provider(s) with specific issues, or else cook up a perl script (not using an XML module) that will do surgical edits of the data to make it XML-parsable. | [reply] |
Re: Forcing XML to validate
by Grundle (Scribe) on Feb 02, 2005 at 18:16 UTC
|
I have recently encountered this same problem, but the issue for me is a little more esoteric. I believe that yes, it is important for the XML parser to quit if there is malformed XML data, but the problem for me is that execution stops altogether.
If I am running through a list of URLs that are XML data, and it happens that one of the XML files is outdated and there is an HTML file in its place indicating this change, of course the program is going to die. What happens next? You run your program again and it will die at the same place... so it must ignore any non-XML and/or malformed XML files.
For non-XML files the test is easy. You can do something like
if ($xml_data =~ m/<\?xml version/) { .. }
For actual XML integrity itself there needs to be some measure, so that you can skip parsing if it would in fact fail. It would be nice to modify expat itself so that instead of dying it returns a failed status. This would give more control to the programmer and is stylistically better. | [reply] [d/l] |
|
|
There's no need to modify expat to stop your program dying on the first error it encounters - that's what eval is for. This is covered in the Perl-XML FAQ.
Also, the '<?xml ...' declaration at the top of an XML file is optional and frequently omitted. Don't rely on it being there.
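A rough sketch of the eval approach, not the FAQ's own code. The helper try_parse and the stand-in parser below are invented for illustration so the example runs without any XML module installed. The stand-in dies on bad input the way XML::Simple and XML::Parser do; in real use you would pass a callback that calls your actual parser:

```perl
use strict;
use warnings;

# Hypothetical helper: run a parser callback inside eval so a parse
# failure is returned to the caller instead of ending the program.
sub try_parse {
    my ($parser, $xml) = @_;
    my $doc;
    my $ok = eval { $doc = $parser->($xml); 1 };
    return (undef, $@ || 'unknown parse error') unless $ok;
    return ($doc, undef);
}

# Stand-in parser for this demo only (NOT a real XML parser).
# With XML::Simple loaded you would pass
#   sub { XML::Simple::XMLin($_[0]) }
my $parser = sub {
    my ($xml) = @_;
    die "not well-formed\n" unless $xml =~ m{\A\s*<(\w+)>.*</\1>\s*\z}s;
    return { root => $1 };
};

# One good document and one truncated HTML page, as in the URL-list case.
my @inputs = ('<feed><item>ok</item></feed>', '<html><body>Feed moved');
my @parsed;
for my $xml (@inputs) {
    my ($doc, $err) = try_parse($parser, $xml);
    if (defined $err) {
        warn "skipping document: $err";
        next;
    }
    push @parsed, $doc;
}
print scalar(@parsed), " of ", scalar(@inputs), " documents parsed\n";
```

The loop carries on past the bad document and only warns about it, which gives you the "failed status" behaviour without touching expat.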
| [reply] |