Man oh Man! How many times will I have to explain that if you want to do XML processing you have to use XML tools... NOT REGULAR EXPRESSIONS!

And if you want to do XML processing you have to start by learning what XML is.

Specifically:

the code doesn't process comments and CDATA sections properly,
XML (and thus XHTML) is case sensitive, so ELT is not the same as elt,
the code doesn't deal with entities,
it ignores encodings,
(<$TAG[^>]*>?) does not reliably match a tag, nor does <$TAG.*?>: '>' is perfectly legal in an attribute value, so <tag att=">"> is a valid XML tag,
finally the line $has_attribute = 1 if $tag =~ m/\s+$name[^\w]/si; is just plain wrong, as it catches any occurrence of the attribute name in the tag, even in a value.
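
To make the last two points concrete, here is a minimal sketch that runs the two patterns quoted above against small inputs. The sub names match_tag and has_attribute and the sample strings are only illustrative; they are not taken from tgrep itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $TAG  = 'elt';     # element being searched for
my $name = 'show';    # attribute being searched for

# The tag pattern quoted above: returns whatever it captures.
sub match_tag {
    my ($xml) = @_;
    my ($matched) = $xml =~ m/(<$TAG[^>]*>?)/;
    return $matched;
}

# The attribute test quoted above: 1 if the regex believes
# $name is an attribute of the tag, 0 otherwise.
sub has_attribute {
    my ($tag) = @_;
    return $tag =~ m/\s+$name[^\w]/si ? 1 : 0;
}

# The match stops at the '>' inside the attribute value,
# so the show attribute is never even seen.
print match_tag(q{<elt odd=">" show="ok"/>}), "\n";      # prints <elt odd=">

# The name occurs only inside a value, yet it is reported.
print has_attribute(q{<elt att=" show NOK5"/>}), "\n";   # prints 1

# /i ignores case, XML does not: SHOW is not show.
print has_attribute(q{<elt SHOW='NOK3' />}), "\n";       # prints 1
```

All three results are wrong from an XML point of view, which is the whole point: the patterns look at characters, not at tags and attributes.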

Overall it's not that this does a terribly bad job of pseudo-parsing XML, but why bother writing your own broken code when you could re-use existing, correct code? Especially in this case, where a pretty simple SAX filter (or an XML::Twig script of course, see below ;--) would work.
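
For the SAX route, a rough sketch could look like the following. It assumes XML::SAX and XML::SAX::Base are installed; the class name AttrGrep, the sub grep_attribute and the inline sample document are made up for illustration. Unlike a grep it collects attribute values rather than printing the original tag text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical handler class, the name is not from any module.
package AttrGrep;
use base 'XML::SAX::Base';

# Called for every start tag. Comments, CDATA sections and
# entities have already been sorted out by the parser.
sub start_element {
    my ($self, $element) = @_;
    return unless $element->{Name} eq $self->{tag};    # case sensitive, as XML is
    for my $attr (values %{ $element->{Attributes} }) {
        next unless $attr->{Name} eq $self->{att};
        push @{ $self->{found} }, $attr->{Value};
    }
}

package main;
use XML::SAX::ParserFactory;

# Returns the values of attribute $att on every $tag element in $xml.
sub grep_attribute {
    my ($tag, $att, $xml) = @_;
    my $handler = AttrGrep->new( tag => $tag, att => $att, found => [] );
    my $parser  = XML::SAX::ParserFactory->parser( Handler => $handler );
    $parser->parse_string($xml);
    return @{ $handler->{found} };
}

my $xml = <<'XML';
<doc>
  <!-- <elt show="NOK1"> -->
  <elt show="ok1"/>
  <elt><![CDATA[<elt show="NOK2">]]></elt>
  <ELT show="NOK4"/>
  <elt att=" show NOK5"/>
  <elt odd=">" show="ok2"/>
</doc>
XML

print "$_\n" for grep_attribute( 'elt', 'show', $xml );
```

With a conforming parser only the two ok values should come out, since the comment, the CDATA section, the upper case element and the attribute value never reach start_element as elt elements carrying a show attribute.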

Try perl tgrep -attr show elt tgrep.xml on this (well-formed) XML file:

<?xml version="1.0"?>
<doc>
    <!-- <elt show="NOK1"> -->                   <!-- comment                     -->
    <elt>text</elt>                              <!-- should not be found         -->
    <elt show="ok1"/>                            <!-- regular                     -->
    <elt><![CDATA[<elt show="NOK2">text]]></elt> <!-- CDATA section               -->
    <elt show = "ok2" />                         <!-- spaces around =             -->
    <elt  show = "ok3" />                        <!-- 2 spaces before att         -->
    <elt show='ok4' />                           <!-- use ' instead of "          -->
    <elt SHOW='NOK3' />                          <!-- upper case attribute name   -->
    <ELT show='NOK4' />                          <!-- upper case element name     -->
    <elt att=" show NOK5"/>                      <!-- use attribute name in value -->
    <elt odd=">" att="ok5"/>                     <!-- use > in attribute value    -->
</doc>

So here is a very simple XML::Twig script that does the same as tgrep, except that it works on the test file:

#!/usr/bin/perl -w
use strict;

use XML::Twig;

my $tag= shift @ARGV;
my $att= shift @ARGV;

# without the sprintf the expression looks real ugly
# because of the interferences between XPath and Perl
# syntaxes: "$tag\[\@$att]"
my $path= sprintf( "%s[@%s]", $tag, $att);

my $t= XML::Twig->new(
    start_tag_handlers => { $path => sub { print $_[0]->original_string, "\n"; } },
);
$t->parsefile( shift @ARGV);

In reply to Re: tgrep - A grep for XML/HTML tags by mirod
in thread tgrep - A grep for HTML tags by adrianh
