in reply to tgrep - A grep for HTML tags

Man oh Man! How many times will I have to explain that if you want to do XML processing you have to use XML tools... NOT REGULAR EXPRESSIONS!

And if you want to do XML processing you have to start by learning what XML is.

Specifically:

Overall, it's not that this does a terribly bad job of pseudo-parsing XML; it's just that there is no reason to write your own broken code when you could re-use existing, correct code. Especially in this case, where a pretty simple SAX filter (or XML::Twig script of course, see below ;--) would work.

Try perl tgrep -attr show elt tgrep.xml on this (well-formed) XML file:

<?xml version="1.0"?>
<doc>
  <!-- <elt show="NOK1"> -->                    <!-- comment -->
  <elt>text</elt>                               <!-- should not be found -->
  <elt show="ok1"/>                             <!-- regular -->
  <elt><![CDATA[<elt show="NOK2">text]]></elt>  <!-- CDATA section -->
  <elt show = "ok2" />                          <!-- spaces around = -->
  <elt  show = "ok3" />                         <!-- 2 spaces before att -->
  <elt show='ok4' />                            <!-- use ' instead of " -->
  <elt SHOW='NOK3' />                           <!-- upper case attribute name -->
  <ELT show='NOK4' />                           <!-- upper case element name -->
  <elt att=" show NOK5"/>                       <!-- use attribute name in value -->
  <elt odd=">" att="ok5"/>                      <!-- use > in attribute value -->
</doc>
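To make the point concrete, here is a minimal sketch of how a pattern-based matcher goes wrong on a few of those lines. The regex is a hypothetical naive one made up for illustration, not tgrep's actual code, so take it as an example of the general failure mode rather than a description of what tgrep does.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical naive matcher (an assumption, NOT tgrep's real regex):
# an <elt ...> start tag with show="..." before the first '>'
my $naive = qr/<elt\b[^>]*\bshow="[^"]*"[^>]*>/;

# a few lines taken from the test file above
my @cases = (
    [ '<elt show="ok1"/>',                            'regular'       ],
    [ '<!-- <elt show="NOK1"> -->',                   'commented out' ],
    [ '<elt><![CDATA[<elt show="NOK2">text]]></elt>', 'CDATA section' ],
    [ q{<elt show='ok4' />},                          'single quotes' ],
);

for my $case (@cases) {
    my ($xml, $desc) = @$case;
    # report whether the regex fires on this line
    my $hit = $xml =~ $naive ? 'match' : 'no match';
    printf "%-9s %-14s %s\n", $hit, $desc, $xml;
}
```

The regex fires on the commented-out tag and on the text inside the CDATA section, neither of which is an element, and it misses the perfectly valid single-quoted attribute. You can keep patching the pattern for each case, but a real parser already handles all of them.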

So here is a very simple XML::Twig script that does just the same as tgrep, except it works on the test file:

#!/usr/bin/perl -w
use strict;
use XML::Twig;

my $tag= shift @ARGV;
my $att= shift @ARGV;

# without the sprintf the expression looks real ugly
# because of the interferences between XPath and Perl
# syntaxes: "$tag\[\@$att]"
my $path= sprintf( "%s[@%s]", $tag, $att);

my $t= XML::Twig->new( start_tag_handlers =>
         { $path => sub { print $_[0]->original_string, "\n"; } });
$t->parsefile( shift @ARGV);

Re^2: tgrep - A grep for XML/HTML tags
by adrianh (Chancellor) on Dec 02, 2002 at 14:42 UTC

    :-)

    You're quite correct. I'll add appropriate disclaimers to the code.

    It doesn't use the XML modules for two reasons. One reasonable, one not:

    1. The script had its origins in some Perl 4 code I wrote in the mid-nineties. I've only ever changed it when it hasn't done what I wanted (I know, I should refactor more :-)

    2. I use it mostly for dealing with HTML. Broken HTML at that.

    It does what I need it to do. You're right about it not handling XML properly. I guess I've internalised the limitations of tgrep and know when to use it and when to write some XSLT (or whatever).

    There's probably a lesson in that for me.