http://qs1969.pair.com?node_id=243835


in reply to Are you looking at XML processing the right way?
in thread is XML too hard?

Try to do a merge sort on two streams that give you records via callbacks. It is just plain impossible. Callbacks are superficially/psychologically different from iterators, true. But they are also fundamentally less flexible. So your comparison only applies in the simple cases. In the complex cases, you find your hands tied and your work much more difficult.

But there is no reason why XML modules can't be cast as iterators instead of forcing you to use callbacks. That is just too-lazy/not-smart-enough module design (it just takes a bit more work and knowing enough to realize that it is possible and can be important). With an iterator-based XML module, things will be better.
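
To make "iterator-based" concrete: XML::LibXML::Reader is one CPAN module with a pull interface, where your loop asks the parser for the next node instead of handing control away. A minimal sketch, assuming that module is installed (the file name, element name, and process_message are invented):

    use XML::LibXML::Reader;

    my $reader = XML::LibXML::Reader->new( location => 'chatter.xml' )
        or die "cannot read chatter.xml\n";

    # Pull records one at a time; *we* decide when to ask for the next.
    while( $reader->nextElement('message') ) {
        my $record = $reader->readOuterXml();   # current <message> as a string
        process_message( $record );             # process_message: ours to write
    }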

But that would still force a linear approach which isn't as flexible as the (often inefficient) data structure approach. Providing ways to 'seek' your XML iterators can help with some cases but not others.
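
For illustration, here is one hypothetical shape a seekable record iterator could take, using byte offsets on a line-oriented file (all names invented; real XML records would need a real parser behind next):

    sub record_iter {
        my( $fh )= @_;
        return {
            next => sub { scalar <$fh> },           # read the next record
            tell => sub { tell $fh },               # remember where we are
            seek => sub { seek $fh, $_[0], 0 },     # jump back to a saved spot
        };
    }

    my $it   = record_iter( $fh );
    my $mark = $it->{tell}->();
    my $rec  = $it->{next}->();
    $it->{seek}->( $mark );    # rewind; the same record can be read again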

You can also come from the other end of the spectrum and make the data structure versions more efficient by having them compute things only as they are needed, which is also one step on the road to being able to not keep everything in RAM at once. On-demand data structures for XML are probably about the closest you can come to "the best of both worlds". Of course, the complexity involved means that it will never be as efficient/convenient as a much simpler approach that fits what needs to be done. But that difference in efficiency is not something that I find worth worrying about (though I do prefer having choices for the sake of convenience).
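
A toy sketch of the on-demand idea (all names invented): a node that does not build its child list until the first time somebody asks for it, so subtrees you never walk cost nothing.

    package LazyNode;
    sub new {
        my( $class, $build_children )= @_;   # coderef that parses the kids
        return bless { build => $build_children }, $class;
    }
    sub children {
        my( $self )= @_;
        $self->{kids} ||= $self->{build}->();    # compute once, then cache
        return @{ $self->{kids} };
    }
    1;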

It appears that XML::Twig tries to cover several of these points on the spectrum. I parse XML with regular expressions[1] so I've never used it, but I hear good things about it. (:
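
For readers who haven't seen XML::Twig: judging from its documentation, the usual pattern is a handler per interesting element plus purge to keep memory bounded, roughly like this (the element and file names are invented):

    use XML::Twig;

    my $twig = XML::Twig->new(
        twig_handlers => {
            # called once per <message>, as soon as it is fully parsed
            message => sub {
                my( $t, $elt )= @_;
                print $elt->text, "\n";
                $t->purge;    # drop what we've handled; RAM stays bounded
            },
        },
    );
    $twig->parsefile( 'chatter.xml' );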

                - tye

[1] And I suspect that many who say you shouldn't parse XML with a regex don't know how to do it right. For example, "ways to rome" just does it wrong. Of course you shouldn't do it that way!

Replies are listed 'Best First'.
Re: Re: Are you looking at XML processing the right way? (merge)
by mirod (Canon) on Mar 18, 2003 at 08:22 UTC
    For example, "ways to rome" just does it wrong. Of course you shouldn't do it that way!

    By all means, send an alternate way. No kidding. If the article can show a safer way to do it with regexps, then it should.

    As a matter of fact, I use regexps to process XML in very specific cases. First, when I am processing "not-quite-XML" that would make a parser choke, I use regexps to turn it into real XML. Second, when I need to do things that XML modules (even XML::Twig!) can't do AND when I have created the XML myself: that is, it uses no entities/comments/weird stuff at all, it has no recursive tags, and I know the order of attributes in tags. Then I use regexps (or I upgrade XML::Twig; see the new wrap_children method ;--)

    The problem is people who don't know XML beyond "it's HTML with my own tags" (and very few people actually know HTML that well), who use regexps for recurring processes, where the likelihood of the XML changing in the future in ways that will break their code is quite high (<pet_peeve>mind you, the DOM has very similar issues</pet_peeve>). It is not a problem of regexps=bad; it's just a problem of knowing your tools, knowing the problem space (and there are indications that Tim Bray knows the problem space ;--), and knowing the limitations of the tools in the context in which you are using them. When you don't quite know the environment, it is better to play it safe and use a parser rather than regexps.
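
    As a sketch of what the safe case looks like (element and attribute names invented, and only defensible under exactly the constraints above: my own XML, no entities or comments, no recursive tags, known attribute order):

        while( $xml =~ m{<item\s+id="([^"]+)"\s+date="([^"]+)">(.*?)</item>}gs ) {
            my( $id, $date, $body )= ( $1, $2, $3 );
            print "$id ($date): $body\n";
        }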

Re: Re: Are you looking at XML processing the right way? (merge)
by herveus (Prior) on Mar 18, 2003 at 14:52 UTC
    Howdy!

    From a database perspective, XML is a species of hierarchical database. If you need to look at it from any angle other than the tree structure expressed in the nesting of the tags, you have a tedious problem. I'm not making a value judgement, per se, but you do have to keep this in mind. If you are trying to force a relational model into an XML format, you should expect "interesting" times.

    I'm curious how one does a merge sort on two XML streams. If the structure can be parsed as a linear series of records, you have something to merge. It all depends on the trees. If you have a bunch of sticks without branches, you may have something. If you have a bunch of heavily branched shrubbery, you have a hard problem, unless you are taking paths from the root to each leaf as a record.
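
    To make the root-to-leaf idea concrete, here is a small sketch (the nested hash stands in for a parsed document) that flattens a tree into sortable path records, which can then be merged like any flat stream:

        sub leaf_paths {
            my( $node, @path )= @_;
            return [ join( '/', @path ) . " = $node" ] unless ref $node;
            return [ map { @{ leaf_paths( $node->{$_}, @path, $_ ) } }
                     sort keys %$node ];
        }

        my $tree = { a => { x => 1, y => 2 }, b => 3 };
        print "$_\n" for @{ leaf_paths( $tree ) };
        # a/x = 1
        # a/y = 2
        # b = 3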

    The "looks like an iterator" approach sounds interesting.

    Now, I'm speaking in the general case. If the particular XML you work with is more tightly constrained, you can take advantage of that to better structure your code.

    I've used XML::Twig for some parsing I do. I only need a subset of the data; it gives it to me without too much hassle. On the other hand, it did force me to syntax my invert a bit. :)

    yours,
    Michael

Re2: Are you looking at XML processing the right way? (merge)
by dragonchild (Archbishop) on Mar 18, 2003 at 15:28 UTC
    Can you give me an example of two streams that would have a merge sort needed that would be impossible? (I'm still relatively new to XML processing, but it would seem that merge-sorting would be very simple with N streams using callbacks, at least theoretically...)

    ------
    We are the carpenters and bricklayers of the Information Age.

    Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

    Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

      The point about "merge sort" is about callbacks versus iterators, not about XML. Compare:

      # Code using an iterator:
      while( <INPUT> ) {
          process_line( $_ );
      }

      # Code using a callback:
      File::ProcessLines( \*INPUT, sub { process_line($_) } );
      and you see that the differences appear rather superficial and psychological.

      but it would seem that merge-sorting would be very simple with N streams using callbacks

      No, it is impossible. Using callbacks means that you have to completely process one stream before you get control back to process another stream. You can't process 2 streams at once with callbacks, much less N streams.

      Consider this code:

      # A merge sort:
      my $r1 = <$i1>;
      my $r2 = <$i2>;
      while( ! eof($i1)  &&  ! eof($i2) ) {
          my $cmp = $r1 cmp $r2;          # -1, 0, or 1
          print $cmp <= 0 ? $r1 : $r2;    # emit whichever record sorts first
          $r1 = <$i1>  if $cmp <= 0;      # advance only the stream(s) consumed
          $r2 = <$i2>  if 0 <= $cmp;
      }
      # ... (then print the records left in whichever stream has not run dry)
      Now rewrite the above using File::ProcessLines and callbacks. You can't. It is impossible. To do it would require continuations, which Perl doesn't have. Let's try:
      File::ProcessLines( $i1, sub {            # sub1
          my $r1 = shift(@_);
          File::ProcessLines( $i2, sub {        # sub2
              my $r2 = shift(@_);
              if( $r1 lt $r2 ) {
                  return_from_sub1_but_not_from_sub2;   # no such thing in Perl
              }
              # ...
          } );
      } );
      So we can get as far as the first record of each stream. But to get the second record of the first stream requires us to return from the first callback, which won't happen until the entire second stream has been processed.

      The point is that callbacks are not just harder to use, they are also fundamentally less flexible. They require processing be done in an extremely restricted linear order and make it very unnatural to even share state between the callbacks.

      Iterators are more flexible. Iterators that can seek are even more flexible. A random access data structure is still more flexible.

      Now, for an XML example. Assume there is some web site that discusses Perl. Assume also that this site has a chatterbox and you can get the last 10 lines of chatter from an XML ticker. You fetch chatter, wait a while, and fetch it again. Now you want to combine those two. You could certainly use a merge sort for that. But that is impossible using callbacks.

      There are other ways you could merge such data. In this case, the data is only 10 lines so not being able to use merge sort isn't a huge problem. Also, the data should only overlap in one chunk, so you could even use callbacks to do this merging but it would be much more difficult than if you used a more flexible interface (and it would be impossible to make it deal well with some exceptions).

      Let's also assume that there are several people who have written their own chatter archiving systems. But each system has periods of down time for various reasons. Now you want to combine these archives to get as complete an archive as possible. They've each stored their data in different formats, of course. The obvious solution is to have each site send you their data in XML; it is nearly the canonical example for what XML is useful for. Now you have a case where a merge sort is important. But your XML parser only supports callbacks. So you are forced to convert each stream into something other than XML and then merge the new streams. What a waste.
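
      With an iterator (pull) interface, that merge becomes ordinary code. A sketch assuming XML::LibXML::Reader, an invented <entry time="..."> record format, and a list of archive files in @archive_files:

          use XML::LibXML::Reader;

          my @readers = map {
              XML::LibXML::Reader->new( location => $_ )
                  or die "cannot read $_\n";
          } @archive_files;

          # prime each stream with its first record
          my @current = map {
              $_->nextElement('entry') ? $_->readOuterXml() : undef;
          } @readers;

          # invented helper: pull the numeric time="..." attribute of a record
          sub time_of { $_[0] =~ /time="(\d+)"/ ? $1 : 0 }

          while( my @live = grep { defined $current[$_] } 0 .. $#current ) {
              # emit the earliest current record, then refill that stream
              my( $i )= sort { time_of($current[$a]) <=> time_of($current[$b]) }
                        @live;
              print $current[$i], "\n";
              $current[$i] = $readers[$i]->nextElement('entry')
                           ? $readers[$i]->readOuterXml()
                           : undef;
          }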

      Note that callbacks can also be used where the first call does not process the entire stream before returning. This gives you a bit of a combination between callbacks and iterators (you 'iterate' to the next chunk which causes one or more of your callbacks to be called). So you can iterate whatever the "chunks" are but are forced to process each chunk using callbacks.
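
      XML::Parser can be driven this way: parse_start returns a push parser whose parse_more method runs your callbacks only for the bytes you feed it, so control returns to you between chunks (read_next_chunk below is invented):

          use XML::Parser;

          my $nb = XML::Parser->new(
              Handlers => { Start => sub { my( $p, $tag )= @_; print "<$tag>\n" } },
          )->parse_start;

          while( my $chunk = read_next_chunk() ) {
              $nb->parse_more( $chunk );   # callbacks fire for this chunk only
              # ... between chunks we are free to service another stream ...
          }
          $nb->parse_done;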

      In summary: yes, callbacks are fundamentally one of the least flexible interfaces you can provide. They make it easy for the module writer to provide the interface and hard for the module user to use it. And it is not just a matter of "getting used to" using callbacks.

                      - tye
        Now that makes more sense. But it seems that it's not that you're missing continuations (though those would certainly help) ... it's more that you're missing a layer of control over the stream itself. My naive thought was that I could tell the stream "Don't process another chunk ... I'm still working on the one you just gave me, but I need to be able to do other stuff, too." Maybe naive is the word for it.

        It sounds like there's a need for an event listener. Each stream would issue events and the listener would reap them, as appropriate. Every time an event from a stream is reaped, the stream is informed and it can then send up another event to be reaped.

        The callback part of this would be that you register with the listener that you're interested in XYZ event from streams A, B, and C. The listener would discard any event from the streams that is not being listened for by anybody.

        Does this already exist? Can this exist? (Well, I'm pretty sure it can, because that's how modern OSes work ...)
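
        As a toy sketch of that listener (all names invented): streams post events, callbacks registered for a (stream, event) pair get them, and unwanted events are discarded:

            package Listener;
            sub new { bless { want => {} }, shift }
            sub register {    # "I want XYZ events from stream A"
                my( $self, $stream, $event, $cb )= @_;
                push @{ $self->{want}{$stream}{$event} }, $cb;
            }
            sub post {        # a stream offers an event up to be reaped
                my( $self, $stream, $event, @args )= @_;
                my $cbs = $self->{want}{$stream}{$event}
                    or return 0;           # nobody listening: discard
                $_->( @args ) for @$cbs;
                return 1;                  # reaped: stream may send more
            }

            package main;
            my $l = Listener->new;
            $l->register( 'A', 'line', sub { print "A said: $_[0]\n" } );
            $l->post( 'A', 'line', 'hello' );     # reaped
            $l->post( 'B', 'line', 'ignored' );   # discarded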

        ------
        We are the carpenters and bricklayers of the Information Age.

        Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

        Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.