John M. Dlugosz has asked for the wisdom of the Perl Monks concerning the following question:

Suppose I'm processing a file with XML::Twig, and my code figures out something it doesn't like, such as a duplicate chapter name or other semantic rule that's not the domain of the actual XML parser.

    sub on_chapter {
        my ($t, $e) = @_;
        my $name = $e->att('name');
        if ( ... ) {                 # i don't like the name
            my $errorpoint = ...;    # what do I call here?
            print "Warning: crummy chapter name in file $myfile line $errorpoint\n";
        }
    }

How do I get the information to tell the user where in the input file I found this? Ideally the line number, since that's what we're used to and it's easy to find in a text editor.

—John

Replies are listed 'Best First'.
Re: XML::Twig error reporting
by mirod (Canon) on Nov 06, 2001 at 11:52 UTC

    Good question! This is not (yet!) documented, but the expat object, which gives you access to all of the Expat methods, including the current line and column numbers, can be accessed through the twig: it is in $t->{twig_parser}. So $t->{twig_parser}->current_line will give you the current line number. There is one caveat though: twig_handlers are called when an element is completely parsed (so you can process its content), so you will get the position of the closing tag. That is of course enough to locate the element, but it might not be the most convenient position when you then edit the document. So you might want to "annotate" the document with the position of each tag, or at least of each tag in which you are interested.

    By the way, XML::Parser::Expat, which calls Expat to actually read the XML, does not set $. so you can't use it.

    So here is a version that properly outputs the line/column number for the opening element. If the line/column for the closing element is OK then you don't need the start_tag_handler and you can get the position in the elt handler, and if you are concerned about size you might want to limit calls to the start_tag_handler to those elements that you check later.

        #!/bin/perl -w
        use strict;
        use XML::Twig;

        my $t = XML::Twig->new(
            # called for all opening tags
            start_tag_handlers => { _all_ => \&store_position },
            # called for each closing elt tag
            twig_handlers      => { elt => \&elt },
        );
        $t->parse( \*DATA );

        sub store_position {
            my ( $t, $elt ) = @_;
            # $t->{twig_parser} is the expat object
            my $line   = $t->{twig_parser}->current_line;
            my $column = $t->{twig_parser}->current_column;
            $elt->{my_atts} = { line => $line, column => $column };    # crude but works
        }

        sub elt {
            my ( $t, $elt ) = @_;
            if ( my $error = $elt->att('error') ) {
                my $line   = $elt->{my_atts}->{line};
                my $column = $elt->{my_atts}->{column};
                print STDERR "error $error at $line:$column\n";
            }
        }

        __DATA__
        <doc>
          <elt>this one is OK</elt>
          <elt error="foo">not this one though</elt>
          <elt>OK</elt>
          <elt error="bar">here is a bar error</elt>
        </doc>

    By the way, I have a question on this last piece of code: in order to store the position information I simply use a new field in the hash (my_atts). This is convenient but hardly robust: what if the object implementation changes to a blessed scalar or a closure? Or if it uses a "my_atts" field? What would be a better way? Inheritance seems difficult, as the elements are created and processed by XML::Twig. Should XML::Twig document a field that can be used for this, both for twigs and for elements?
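For the simpler case mentioned above, where the position of the closing tag is good enough, a minimal sketch might look like the following. It relies only on the $t->{twig_parser} access described in this reply; the sample XML and the @report array are made up for illustration.

```perl
use strict;
use warnings;
use XML::Twig;

my @report;    # messages are collected here (hypothetical helper variable)

my $twig = XML::Twig->new(
    twig_handlers => {
        # called once the whole elt element has been parsed,
        # so the parser is positioned at the closing tag
        elt => sub {
            my ( $t, $elt ) = @_;
            my $error = $elt->att('error') or return;
            my $line  = $t->{twig_parser}->current_line;
            push @report, "error $error near line $line (closing tag)";
        },
    },
);

# made-up input: the flagged element spans lines 2 to 4
my $xml = join "\n",
    '<doc>',
    '<elt error="foo">',
    'spread over several lines',
    '</elt>',
    '</doc>';

$twig->parse($xml);
print "$_\n" for @report;
```

With an element that spans several lines, the reported line should be that of the closing tag rather than the opening one, which is the caveat discussed above.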

      Thanks, that's exactly what I need.

      For recording the information in the element, I would use the regular att() feature. E.g. $elt->set_att('#line', $line);

      That follows the example of #PCDATA which uses # for a "special" name used like an identifier.

      A more direct answer to the last question is yes, document an extension mechanism rather than relying on the object's implementation. Simply providing a hashref where users can store their stuff is sufficient. A fancier way would be to provide a way to manage it so different users don't clobber each other, but convention can do just as well: tell them to use their fully-qualified module name as the start of the key.

      —John
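The naming convention suggested in this reply can be sketched in plain Perl. Everything here is hypothetical: My::Element stands in for a library class that exposes one documented hashref, and ext_key, Book::Checker and Other::Tool are invented names. It is not XML::Twig's actual interface.

```perl
use strict;
use warnings;

package My::Element;    # hypothetical stand-in for a library's element class

sub new {
    my ($class) = @_;
    return bless { ext => {} }, $class;
}

# the single documented place where users may store their own data
sub ext { return $_[0]{ext} }

package main;

# build a key that starts with the caller's fully-qualified module name
sub ext_key {
    my ( $module, $field ) = @_;
    return "${module}::${field}";
}

my $elt = My::Element->new;

# two independent users store a "line" value without clobbering each other
$elt->ext->{ ext_key( 'Book::Checker', 'line' ) } = 12;
$elt->ext->{ ext_key( 'Other::Tool',   'line' ) } = 99;

print "$_ => ", $elt->ext->{$_}, "\n" for sort keys %{ $elt->ext };
```

Because the keys carry the module name, the library never needs to know who its users are, and it can change its internal representation as long as ext() keeps returning a hashref.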

      A thought: making special things like "#line" stored with attributes, as opposed to some other type of mechanism, means that it will work with all the selection and filtering mechanisms.

      All you need is a switch so printing will skip these "special" attributes, denoted by having illegal names.

      —John
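A rough sketch of the '#line' idea applied to the original duplicate-chapter question, assuming chapter elements with a name attribute as in the question. The sample input and the @warnings array are invented for illustration. As far as I know, later XML::Twig releases document attributes whose names start with # as private and leave them out of the output, but that is worth checking against the installed version.

```perl
use strict;
use warnings;
use XML::Twig;

my %seen;        # chapter names encountered so far
my @warnings;    # collected messages (hypothetical helper variable)

my $twig = XML::Twig->new(
    # start tag: the parser is still positioned at the opening tag
    start_tag_handlers => { chapter => \&store_line },
    # complete element: the semantic check itself
    twig_handlers      => { chapter => \&check_chapter },
);

sub store_line {
    my ( $t, $elt ) = @_;
    # $t->{twig_parser} is the expat object described earlier in the thread
    $elt->set_att( '#line', $t->{twig_parser}->current_line );
}

sub check_chapter {
    my ( $t, $elt ) = @_;
    my $name = $elt->att('name');
    return unless defined $name;
    if ( $seen{$name}++ ) {
        push @warnings,
            sprintf( "Warning: duplicate chapter name '%s' at line %d",
                $name, $elt->att('#line') );
    }
}

# made-up input, one chapter per line; "intro" is repeated on line 4
my $xml = join "\n",
    '<book>',
    '<chapter name="intro">a</chapter>',
    '<chapter name="setup">b</chapter>',
    '<chapter name="intro">c</chapter>',
    '</book>';

$twig->parse($xml);
print "$_\n" for @warnings;
```

Since the position is an ordinary attribute, it is read back with att() like any other, which is the point made above about it working with the existing selection mechanisms.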

        I have to think about it. The problem I see is that, although this is a useful and clever trick, it would probably be used quite infrequently, while slowing down every print or sprint... Though as I would limit it to attributes starting with #, it would only cost one substr() per attribute. I think silently removing all illegal attributes is too dangerous for the user, and checking them might get me into Unicode trouble. Now what about element names starting with #?

        BTW if I go this route I might as well add an option to generate the line/column attributes ;--)

Re: XML::Twig error reporting
by Fastolfe (Vicar) on Nov 06, 2001 at 02:57 UTC
    I think this information is lost when parsing a structured data format like XML, but I wonder what's in $.? If XML::Twig is parsing data from an input file, and these handlers are called as it's reading data, $. should be set with the current line number (where the closing tag lives). If it reads the data in first, though, you're out of luck.