in reply to Re^3: XML::Fling begone? (ctrl, utf-8)
in thread XML::Fling begone?

Ah. No, CDATA apparently doesn't help, and it would indeed be useless with attributes.

And indeed, Genx refuses to put control characters in the output stream.

However, I found that an AddText call with an empty string seems to consistently coerce Genx into flushing whatever it had in its buffer. At that point I can sneak anything I want into the stream before I resume business as usual:

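use XML::Genx;

my $str = '';
my $w = XML::Genx->new;
eval {
    # collect everything Genx emits into $str
    $w->StartDocSender( sub { $str .= shift; } );
    $w->StartElementLiteral( 'foo' );
    $w->AddText( 'bar' );
    $w->AddText( '' );      # empty text seems to make Genx flush its buffer
    $str .= chr 1;          # sneak the control character in behind Genx's back
    $w->AddText( 'baz' );
    $w->EndElement;
    $w->EndDocument;
};
die "Writing XML failed: $@" if $@;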

This works as expected and allows sending control characters in node content, though not in attributes.

Another alternative, whose viability I can't judge: Genx provides a genxScrubText call which simply brushes out anything illegal. Would that be acceptable? (I do wonder why we need to make it possible to send control characters in the tickers.) The problem here is that XML::Genx currently doesn't bind that function.
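Until that function is bound, the same effect can be approximated in plain Perl. This is only a sketch of the idea, not Genx's implementation: `scrub_text` is a made-up name, and the character class is my transcription of the XML 1.0 `Char` production.

```perl
use strict;
use warnings;

# Anything outside the XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
my $NOT_XML10_CHAR =
    qr/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/;

# Hypothetical helper: returns a copy of $text with every character
# that XML 1.0 forbids removed. Escaping of & and < is left to the writer.
sub scrub_text {
    my ($text) = @_;
    $text =~ s/$NOT_XML10_CHAR//g;
    return $text;
}
```

The scrubbed string could then be handed to AddText as usual, since nothing Genx objects to should be left in it.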

As for XML style, I agree that attributes should be avoided. I didn't understand this when I first learned of XML, but I've come to appreciate why it is common wisdom among more insightful people. Mixed content is also a pain when you're dealing with structured rather than “document-ish” data.

Makeshifts last the longest.

Re^5: XML::Fling begone? (ctrl, utf-8)
by tye (Sage) on Dec 22, 2004 at 16:40 UTC

    That's a lot of thrash. I don't remember, but was there a problem you were trying to solve?

    I want control characters in XML because there is nothing preventing people from submitting control characters in their HTTP. When they do that, they are likely to not get what they wanted, so it can be helpful to see what they actually submitted instead of the limited subset that XML deigns to support.

    The XML version of nodes (etc.) is supposed to give one the raw data. Having it throw away any of the raw data without a really good reason just leads to it not being trustworthy and other means having to be invented and used and forgetting to use them and yuck.

    - tye        

      There is no outright problem, but better performance for the ticker generation won't hurt.

      Additionally, while you might not think so much of guaranteed compliance, there are parsers around which will complain about control characters as they should. I use one of those: XML::LibXML (which I cannot praise highly enough). I've had trouble with my NN client because of control characters in the ticker once or twice.

      The JavaScript chatterbox client I wrote necessarily relies on the browser's parser, which means breakage on control characters in the stream, at least in the case of Mozilla-based browsers. I don't know how non-compliant MS' parser is in this instance.

      Here are my thoughts.

      Readers of the site are not interested in debugging faulty posts. In the common case, scrubbing the text should be just fine. XML::Genx does not currently bind the appropriate function; I am considering writing a patch.

      For debugging purposes, there might be a textscrub=0 parameter. It still wouldn't produce illegal characters; instead it would wrap them in <char ord="##"/> elements. It's likely a human is going to be looking at the XML source directly in those cases anyway, so interpretation shouldn't be an issue.

      The code to do that efficiently would be something like

      eval { $w->AddText( $text ) };
      if( $@ ) {
          for( map ord, split //, $text ) {
              eval { $w->AddCharacter( $_ ) };
              if( $@ ) {
                  $w->StartElementLiteral( 'char' );
                  $w->AddAttributeLiteral( '', ord => $_ );
                  $w->EndElement();
              }
          }
      }

      It's a mouthful, but because it relies on exceptions, it will only rarely need to fall through to the hard parts.
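To make the intended output concrete, here is a pure-Perl sketch that builds the same text as a string, without going through Genx. The name `wrap_illegal_chars` is invented for the illustration, and the character class is again my reading of the XML 1.0 `Char` production, so treat it as an assumption rather than what Genx checks internally.

```perl
use strict;
use warnings;

# Characters XML 1.0 does not allow in a document (assumed from the spec).
my $ILLEGAL =
    qr/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/;

my %ENTITY = ( '&' => '&amp;', '<' => '&lt;', '>' => '&gt;' );

# Hypothetical helper: escapes markup characters and replaces each
# illegal character with an empty <char ord="N"/> element, N in decimal.
sub wrap_illegal_chars {
    my ($text) = @_;
    $text =~ s{([&<>])|($ILLEGAL)}
              { defined $1 ? $ENTITY{$1} : sprintf('<char ord="%d"/>', ord $2) }ge;
    return $text;
}

print wrap_illegal_chars("a\x{1}<b"), "\n";   # a<char ord="1"/>&lt;b
```

A single substitution pass handles both jobs here, whereas the Genx version only pays for the per-character loop when AddText has already rejected the string.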

      It might also be an option to forgo the textscrub=0 business altogether and do this for everyone, though old clients might be more confused by these newfangled char elements popping up occasionally than they would have been with invalid XML.

      Obviously this scheme is no help for attribute values, but as discussed before, we will eventually be using (nearly) attribute-free markup anyway. In particular, no user data would appear in attribute values. In that case, the consideration is what to do about old clients which do not understand the new markup format; I believe there's good reason to continue supporting them for a while, but I don't think it's a good idea to submit to the boundaries created by old mistakes forever. Deprecating the old-style ticker markup and giving people due notice of maybe six months before discontinuing support should be sufficient. (There's precedent with the private message ticker, too.)

      Makeshifts last the longest.

        Additionally, while you might not think so much of guaranteed compliance,

        I'll guess that the "guarantee" is that no matter how stupidly you try to use Genx, it refuses to produce invalid XML (in so far as the code is bug free in specification as well as implementation).

        Other than control characters, in what ways are we currently non-compliant? I'm not aware of any gain to be had there (I'll get to control characters shortly).

        So if someone decides to do something really stupid, then Genx will guarantee the output will be either empty or compliant, really stupid XML.

        Standards are great because of the benefits they provide. So standards compliance is a secondary goal, one that you shoot for because it facilitates many primary goals (mostly flavors of interoperability). Getting most or all of the benefits that are supposed to come with compliance is the primary goal. Putting a secondary goal ahead of your primary goals is a common mistake I see and that I try to avoid.

        there are parsers around which will complain about control characters as they should.

        I said as much. The last time this came up I proposed that we default to stripping control characters and have an option to request XML 1.1 which would preserve control characters.

        Part of the reason I think the default should strip control characters is that being XML compliant is important. I do consider compliance important; I just don't blindly put it ahead of reaping real benefits.

        Well, XML 1.1 probably hasn't received final approval yet so Genx surely refuses to produce it.
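For reference, the XML 1.1 option could look roughly like the sketch below. It assumes my reading of the XML 1.1 rules: the restricted characters (most C0 and C1 controls) may appear only as numeric character references, and NUL is not allowed at all. The function name `encode_xml11_text` is hypothetical, and the document would also have to declare version="1.1" in its XML declaration.

```perl
use strict;
use warnings;

my %ESC = ( '&' => '&amp;', '<' => '&lt;', '>' => '&gt;' );

# XML 1.1 RestrictedChar (assumed from the spec):
#   [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]
my $RESTRICTED = qr/[\x{1}-\x{8}\x{B}\x{C}\x{E}-\x{1F}\x{7F}-\x{84}\x{86}-\x{9F}]/;

# Hypothetical helper: keeps control characters by writing them as
# hexadecimal character references; NUL has no legal form and is dropped.
sub encode_xml11_text {
    my ($text) = @_;
    $text =~ s/\x{0}//g;
    $text =~ s{([&<>])|($RESTRICTED)}
              { defined $1 ? $ESC{$1} : sprintf('&#x%X;', ord $2) }ge;
    return $text;
}

print encode_xml11_text("a\x{1}b&"), "\n";   # a&#x1;b&amp;
```

A conforming XML 1.1 parser should hand the original control character back to the client, which is the property the `<char>` element approach lacks.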

        <char ord="##"/>

        Genx will probably allow such to be output since it is a good example of compliant, stupid XML. (: I'm certainly not aware of any XML parsers that will translate that back into the proper characters. You've broken single fields into multiple pieces such that they are a pain to put back together. You lost the primary goal by concentrating on a secondary goal.

        So I'm still not sure what goals you have here. In some ways, your goals appear to be "use Genx" and "ditch XML::Fling". Useless goals. If one of your goals is compliance, then patch things to strip control characters by default (if that still isn't the case).

        If your goal is performance gain then please at least demonstrate one instead of guessing that there is one. But currently I doubt that would be enough to overcome the drawbacks of Genx feature-wise.

        I'd like to offer UTF-8 XML but Latin-1 has advantages in some cases so I don't want to stop offering it, especially since it is what we've produced for so long.

        If a future version of Genx supports Latin-1 and XML 1.1, then it might make a good replacement.

        As things stand, if you have some strong desire to use Genx, then you'd need to make Genx just an option, while still supporting our current methods.

        - tye