Re: Pair Tag missing
by graff (Chancellor) on May 31, 2005 at 03:56 UTC
Contrary to jpeg's suggestion, I'd suggest that you use a stack for keeping track of tags. Each time you see an open-tag, push it on the stack; each time you see a close tag, check the last thing on the stack and see if it's the same sort of tag.
If it's a match, you're fine, and you can pop that last element off the stack and move on. If it's not a match, the next issue is: for this unmatched end-tag, check further along the stack to see if you do find a matching open tag; if so, then all open tags from that point to the end are probably lacking their end-tags. If the current end tag has no match at all in the stack, then you know you're missing an open tag for it.
And to contradict jpeg yet again, here's an example of the technique:
#!/usr/bin/perl
use strict;

$/ = undef;     # slurp mode: read the whole input at once
$_ = <>;
my @stack  = ();
my $offset = 0;
while ( ( my $i = index( $_, "<" ) ) >= 0 ) {
    $offset += $i;
    $_ = substr( $_, $i );
    if ( s{^(<(\w+).*?>)}{} ) {         # open-tag: remember its name
        $offset += length( $1 );
        push @stack, $2;
    }
    elsif ( s{^(</(\w+)>)}{} ) {        # close-tag: compare with top of stack
        my ( $tag, $et ) = ( $1, $2 );
        if ( @stack and $stack[-1] eq $et ) {
            pop @stack;
        }
        elsif ( grep { $_ eq $et } @stack ) {
            print "missing end-tags for:";
            while ( @stack and $stack[-1] ne $et ) {
                print " " . pop @stack;
            }
            pop @stack;                 # drop the open-tag that this end-tag closes
            print " at </$et> (offset: $offset)\n";
        }
        else {
            print "missing open-tag for $et (offset: $offset)\n";
        }
        $offset += length( $tag );
    }
    else {                              # a "<" that starts no tag: step past it
        $_ = substr( $_, 1 );
        $offset++;
    }
}
### updated: added condition on inner while loop to check for empty stack
Now, the results printed by that approach can be inaccurate or misleading under certain circumstances, but you will at least get a reasonable look at where the problems start.
And of course, if you have data with lots of elaborate stuff in the tags (e.g. a close-angle-bracket inside a quoted string that is part of an attribute value in an open tag), then this approach will be thrown off totally, and you'll need to parse the input more carefully. Good luck with that.
(One more update: the text might contain open-angle brackets that are not intended as the beginning of a tag -- this isn't supposed to happen, the text is supposed to use "&lt;" instead of a bare "<", but hey, it happens, and it will also cause this script to fail, or at least create a lot of false-alarm error reports. Perhaps that's just as well...)
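For the quoted-attribute and stray-bracket cases, one possible refinement is to let the tag regex step over quoted strings and to insist on a tag name right after the "<". What follows is only a rough sketch, not a real parser: find_tags is a made-up helper name, and comments, CDATA sections and the like are still ignored.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch): returns a list of
# [kind, name, offset] triples, where kind is 'open' or 'close'.
sub find_tags {
    my ($text) = @_;
    my @tags;
    while ( $text =~ m{
            <(/?)                # an optional slash marks an end-tag
            ([A-Za-z][\w:-]*)    # the tag name must follow directly
            (?: "[^"]*"          # a double-quoted attribute value
              | '[^']*'          # a single-quoted attribute value
              | [^'">]           # any other character inside the tag
            )*
            >
        }gx ) {
        push @tags, [ ( $1 ? 'close' : 'open' ), $2, $-[0] ];
    }
    return @tags;
}

# The ">" inside the quoted value and the bare "<" in the text are both
# passed over; only the two real tags are reported.
my $sample = q{<a title="x > y">1 < 2</a>};
printf "%-5s %s at offset %d\n", @$_ for find_tags($sample);
```

The triples could then be fed to the same push/pop logic as above in place of the index/substr scanning.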
Re: Pair Tag missing
by jpeg (Chaplain) on May 31, 2005 at 02:45 UTC
I doubt anyone here is going to do your homework for you (or write your code for you)....
The task sounds easy enough. You can do it. Where are you stuck? Post some code! Show us what you've tried so far, and someone will guide you!
Try starting with a hash, using the tags as keys.
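To make the hash idea concrete, here is one way it might look. This is only a sketch: count_tags is an invented name, the regex is deliberately naive, and the counts can reveal a missing tag but say nothing about nesting order.

```perl
use strict;
use warnings;

# Hypothetical helper: tally open- and close-tags, keyed by tag name.
sub count_tags {
    my ($text) = @_;
    my %count;
    while ( $text =~ m{<(/?)(\w+)[^>]*>}g ) {
        my ( $slash, $name ) = ( $1, uc $2 );
        $count{$name}{ $slash ? 'close' : 'open' }++;
    }
    return \%count;
}

# Report every tag name whose open and close counts differ.
my $count = count_tags('<p><b>bold</b> text <i>italic');
for my $name ( sort keys %$count ) {
    my $open  = $count->{$name}{open}  || 0;
    my $close = $count->{$name}{close} || 0;
    print "$name: $open open, $close close\n" if $open != $close;
}
```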
Re: Pair Tag missing
by Adrade (Pilgrim) on May 31, 2005 at 04:41 UTC
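my $data; $data .= $_ while (<>);
my @opened;
# For each open-tag name, remove one matching end-tag from the data;
# if none is left to remove, the tag was opened but never closed.
for ( ($data =~ m/<([^\/>]+)/sg) ) {
    push(@opened, $_) if ! ($data =~ s/<\/\Q$_\E>//s);
}
print "Opened: ", join(' - ', @opened), "\n";
# Whatever end-tags remain had no open-tag to pair with.
print "Closed: ", join(' - ', ($data =~ m/<\/([^>]+)/sg) ), "\n";
Obviously not for complex tags - :-) Best, -Adam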
Re: Pair Tag missing
by tlm (Prior) on May 31, 2005 at 02:42 UTC
I am not sure what your question is, but maybe HTML::Parser or HTML::TokeParser will be of use to you. Actually, come to think of it, I've used these modules to parse HTML that I know to be valid, not to check the validity of potentially invalid HTML; maybe you should look into something like HTML::Validator, though I don't have experience with this module.
Re: Pair Tag missing
by GrandFather (Saint) on May 31, 2005 at 05:10 UTC
Re: Pair Tag missing
by ambrus (Abbot) on May 31, 2005 at 09:55 UTC
#!perl
use warnings;
use strict;

# Return the index (within the list) of the last element for which the
# block is true, or undef if there is no such element.
sub rfindi (&@) {
    my $th = $_[0];
    for my $i (1 .. @_ - 1) {
        &$th($_[@_ - $i]) and
            return @_ - $i - 1;
    }
    return;
}

my @t;          # stack of currently open tags
my $e = 0;      # exit status: 1 if any problem was found
while (local $_ = <>) {
    while (m(<(/)?(\w+))g) {
        my $t = uc $2;
        if (!defined($1)) {
            push @t, $t;
        } else {
            if (@t and $t eq $t[-1]) {
                pop @t;
            } elsif (defined(my $i = rfindi { $_[0] eq $t } @t)) {
                $e = 1;
                warn "unclosed ", join(",", @t[$i + 1 .. @t - 1]), " tags at ", $t,
                    " end tag";
                splice @t, $i;  # drop the matched tag and everything above it
            } else {
                $e = 1;
                warn "closing unopened ", $t, " tag";
            }
        }
        #warn "[$.:$+[0]: @t]\n";
    }
}
@t and do {
    $e = 1;
    warn "unclosed ", join(",", @t), " tags at eof";
};
exit $e;
__END__
Re: Pair Tag missing
by Tomtom (Scribe) on May 31, 2005 at 09:35 UTC
Couldn't you just have a recursive sub calling itself when it finds an opening tag, and returning when it finds the corresponding closing tag?
That way, you could report errors when you get a closing tag that doesn't correspond to the current opening tag, for example.
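A rough sketch of what that recursion might look like, under a few assumptions: check_element and check_tags are invented names, the tokenizing regex is deliberately simple, and a mismatched closing tag is only reported, with no attempt to guess which open tags it was meant to close.

```perl
use strict;
use warnings;

# Hypothetical helper: $tokens is an array ref of [kind, name] pairs
# (kind is 'open' or 'close'); $errors collects messages. Each call
# handles the content of one open element and returns as soon as that
# element's closing tag is seen.
sub check_element {
    my ( $tokens, $expected, $errors ) = @_;
    while ( my $tok = shift @$tokens ) {
        my ( $kind, $name ) = @$tok;
        if ( $kind eq 'open' ) {
            check_element( $tokens, $name, $errors );    # descend one level
        }
        elsif ( defined $expected and $name eq $expected ) {
            return;                                      # found our closing tag
        }
        else {
            push @$errors, "unexpected </$name>"
              . ( defined $expected ? " inside <$expected>" : " at top level" );
        }
    }
    # Ran out of input while an element was still open.
    push @$errors, "missing </$expected>" if defined $expected;
    return;
}

# Hypothetical wrapper: tokenize the text, then start at the top level.
sub check_tags {
    my ($text) = @_;
    my @tokens;
    push @tokens, [ ( $1 ? 'close' : 'open' ), $2 ]
      while $text =~ m{<(/?)(\w+)[^>]*>}g;
    my @errors;
    check_element( \@tokens, undef, \@errors );
    return @errors;
}

print "$_\n" for check_tags('<foo><bar> blah blah </foo></bar>');
```

The call stack plays the role of an explicit tag stack here, so very deeply nested input means equally deep recursion.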
Re: Pair Tag missing
by Elijah (Hermit) on May 31, 2005 at 03:51 UTC
Something like the following would probably work for you. Just handle the error any way you need to for your implementation.
#!/usr/bin/perl -w
use strict;

my ($tagOpen, $tagClosed) = (0, 0);
my $parseData = '';
open(FILE, "<", 'htmlFile.html') ||
    die "Error reading file: ($!)\n";
$parseData .= $_ while (<FILE>);
close(FILE);

# Only these tag names are checked.
my @tags = qw(TITLE AUTHOR H1 H2 P IT);
for (@tags) {
    $tagOpen++   while ($parseData =~ /<$_>/g);
    $tagClosed++ while ($parseData =~ /<\/$_>/g);
    error($_, $tagOpen, $tagClosed) unless ($tagOpen == $tagClosed);
    $tagOpen = $tagClosed = 0;
}

sub error {
    my ($tag, $open, $closed) = @_;
    print "Error found in tag: <$tag> (Open: $open -- Closed: $closed)\n";
    return;
}
Apart from the problem of having to specify all the possible tag names up front in an array, this approach will fail to pick up on certain problems that are quite common, such as:
<foo><bar> blah blah </foo></bar>
In fact, by this approach, a file where open and close tags are positioned randomly will pass just fine, so long as the number of close tags matches the number of open tags for each tag name.
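To illustrate with that same example: a small sketch that runs a count-based check and a stack-based check side by side. Both subs (counts_balanced, properly_nested) are made up for this comparison and only recognize bare tags without attributes.

```perl
use strict;
use warnings;

# Count-based check: true if every tag name has as many
# close-tags as open-tags, regardless of order.
sub counts_balanced {
    my ($text) = @_;
    my %diff;
    $diff{$2} += $1 ? -1 : 1 while $text =~ m{<(/?)(\w+)>}g;
    return !grep { $_ != 0 } values %diff;
}

# Stack-based check: true only if every close-tag matches the most
# recently opened tag and nothing is left open at the end.
sub properly_nested {
    my ($text) = @_;
    my @stack;
    while ( $text =~ m{<(/?)(\w+)>}g ) {
        if    ( !$1 )                         { push @stack, $2 }
        elsif ( @stack and $stack[-1] eq $2 ) { pop @stack }
        else                                  { return 0 }
    }
    return !@stack;
}

my $text = '<foo><bar> blah blah </foo></bar>';
printf "counts balanced: %s, properly nested: %s\n",
    ( counts_balanced($text) ? 'yes' : 'no' ),
    ( properly_nested($text) ? 'yes' : 'no' );
```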
> Apart from the problem of having to specify all the possible tag names up front in an array, this approach
> will fail to pick up on certain problems that are quite common, such as:
>
> <foo><bar> blah blah </foo></bar>
True, but again, all he said he was looking for was to see whether every open tag had a close tag, not whether the tags were correctly ordered. He also stated that he was only searching for a few tags and NOT the entire range of possible HTML tags, so the array approach would be sufficient for his implementation, though it would be a bad idea in a large-scale parser.
None of the HTML parser modules I am aware of would do what he is looking for, since they basically parse the text inside the tags and do nothing with the actual tags themselves except delimit on them.