parsing HTML

eod has asked for the wisdom of the Perl Monks concerning the following question:

Re: parsing HTML by davorg (Chancellor) on Sep 14, 2001 at 14:10 UTC
`tr` returns the number of characters it has translated, so you can use code like `$count = $text =~ tr/"/"/;` to get the number of double quotes in `$text`. I can't help wondering, however, why you're trying to reinvent HTML::Parser. What does that module not do that you need?

-- <http://www.dave.org.uk> Perl Training in the UK <http://www.iterative-software.com>
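As a minimal sketch of the `tr` counting idiom described above: the helper name `count_double_quotes` and the sample string are illustrative assumptions, not part of the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (name chosen for illustration only).
sub count_double_quotes {
    my ($text) = @_;
    # With identical search and replacement lists, tr/// leaves the string
    # unchanged and returns the number of characters it matched.
    return ( $text =~ tr/"/"/ );
}

my $sample = q{<tag var1=".." var2="..>..">};
printf "%d double quotes\n", count_double_quotes($sample);
```

Because the search and replacement lists are the same, the string is only scanned, which is why this idiom is commonly used purely for counting.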
Re: Re: parsing HTML by eod (Acolyte) on Sep 14, 2001 at 15:30 UTC
I am reinventing the wheel (that's true :) ), but this way I learn how to make one. Perhaps it is a stupid idea, but I'm learning a lot about parsers and their problems, and it's interesting. Anyway, the real reason is that I'm working on Windows with Savant Web Server, and my version doesn't have any HTML libraries. I would like to run my application on any web server, and the parser I'm writing is a very simple one (not a complete HTML parser), so I can allow myself this "re-invention". I knew the question was simple, but I really didn't know how to solve it. And you are great, Perl Monks. Thanks very much!!

--edu
Re: Re: parsing HTML by Rhandom (Curate) on Sep 14, 2001 at 20:00 UTC
On a trivial note about counting occurrences of a character, you can do it in all of these ways:

    my $txt = "1,2,3,4,5,6";
    my $n1 = scalar(split(/,/, $txt)) - 1;        # number of fields minus one
    my $n2 = scalar(@{[ split(/,/, $txt) ]}) - 1; # same, via an anonymous array
    my $n3 = $#{[ split(/,/, $txt) ]};            # last index is already count - 1
    my $n4 = ($txt =~ y/,/,/);                    # y/// (tr///) returns the match count
    my $n5 = () = $txt =~ m/,/g;                  # list assignment in scalar context counts matches

But the last is by far the fastest because it doesn't do any string modification. It's also the best for golf.

my @a=qw(random brilliant braindead); print $a[rand(@a)];
(crazyinsomniac) Re: parsing HTML by crazyinsomniac (Prior) on Sep 14, 2001 at 14:17 UTC
*So, the problem is how can I count the number of times a character appears in a string.*

There are plenty of ways, one of which is using `y`, aka `tr`: `printf "\" appears %s times in this string\n", $string =~ y/"/"/;`. For more info see the Categorized Questions and Answers. This, however, seems to be the least of your troubles. Who cares what language HTML::Parser is implemented in? Are you writing a parser? Here's some code (I think you have a question beyond "how to count the number of times a character appears in a string", but I dunno what it is):

    #!/usr/bin/perl -w
    use strict;
    use HTML::TokeParser;

    my $string = qq, clear text <tag var1=".." var2="..>.." > clear text ,;

    printf "\" appears %s times in this string\n", $string =~ y/"/"/;

    my $p = HTML::TokeParser->new(\$string);
    die $! unless $p;

    while (my $token = $p->get_token) {
        # A start-tag token looks like ["S", $tag, $attr, $attrseq, $text]
        if ($token->[0] eq 'S') {    # is it a start tag?
            my ($typeotag, $tag, $attr, $attrseq, $origtext) = @$token;
            print "this start tag is ($tag)\n";
            print "its attribues are (in original sequence):\n";
            printf "(%s)=(%s)\n", $_, $attr->{$_} for (@$attrseq);
            print "\n\n-- a start tag no more --\n\n";
        }
        else {
            printf "something else (%s)\n\t\t(%s)\n", @{$token};
        }
    }
    __END__
    F:\dev\HTML_Tokeparser_Tutorial>perl liar.pl
    " appears 4 times in this string
    something else (T)
                    ( clear text )
    this start tag is (tag)
    its attribues are (in original sequence):
    (var1)=(..)
    (var2)=(..>..)

    -- a start tag no more --

    something else (T)
                    ( clear text)
    something else (T)
                    ( )
    F:\dev\HTML_Tokeparser_Tutorial>

___crazyinsomniac_______________________________________
`Disclaimer: Don't blame. It came from inside the void`
`perl -e "$q=$_;map({chr unpack qq;H;,$_}split(q;;,qH*));print;$q/$q;"`
(tye)Re: parsing HTML by tye (Sage) on Sep 14, 2001 at 20:22 UTC
`m#<([^"'>]+|"[^"]*"|'[^']*')*>#` will match (if I'm not missing something) properly formed HTML/XML tags (as well as other similar things, which is okay because such other similar things shouldn't* be found in proper HTML/XML). To return the matched tag in $1, use the following (or put the parens inside the `<>` as I do below, since that part of the tag is constant): `m#(<(?:[^"'>]+|"[^"]*"|'[^']*')*>)#`

Note that this doesn't handle HTML comments. To properly handle both HTML comments and HTML tags using regexes, you have to march along the text like a real parser:

    while( $html !~ m#\G$#gc ) {
        if( $html =~ m#\G([^&<]+)#gc ) {
            # $1 is plain text
        } elsif( $html =~ m#\G<!--(.*?)-->#gc ) {
            # $1 is a comment
            # I think HTML comments are defined by the standard
            # to actually be more complex than that, but the
            # practical definition appears to match the above.
        } elsif( $html =~ m#\G<((?:[^"'>]+|"[^"]*"|'[^']*')*)>#gc ) {
            # $1 is the inside of a tag
        } elsif( $html =~ m#\G&(\w+);#gc ) {
            # $1 is the name of an entity
        } elsif( $html =~ m!\G&#(\w+);!gc ) {
            # $1 is the number of an entity
        } else {
            # We have hit invalid HTML.
            # You can try to be lenient here if you like:
            if( $html =~ m#\G([&<])#gc ) {
                # Treat like a & or < if you like
            } else {
                die "Impossible??";
            }
        }
    }

I almost find it hard to believe that someone writing a module to parse HTML using just Perl didn't get that much correct... or maybe I can. (:

Updated*, but not much.

- tye (but my friends call me "Tye")
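The "march along the text" loop above can be wrapped into a runnable sketch that collects tokens instead of leaving the branches empty. The sub name `tokenize_html`, the token type labels, and the sample string are assumptions made for illustration; the `/s` flag on the comment pattern is also an addition so that comments may span lines.

```perl
use strict;
use warnings;

# Hypothetical wrapper (not from the thread): returns a list of
# [type, value] pairs produced by \G-anchored matches.
sub tokenize_html {
    my ($html) = @_;
    my @tokens;
    pos($html) = 0;    # start scanning at the beginning of the string
    while ( pos($html) < length($html) ) {
        if ( $html =~ m#\G([^&<]+)#gc ) {
            push @tokens, [ text => $1 ];
        }
        elsif ( $html =~ m#\G<!--(.*?)-->#gcs ) {
            push @tokens, [ comment => $1 ];
        }
        elsif ( $html =~ m#\G<((?:[^"'>]+|"[^"]*"|'[^']*')*)>#gc ) {
            push @tokens, [ tag => $1 ];    # inside of the tag, quotes respected
        }
        elsif ( $html =~ m#\G&(\w+);#gc ) {
            push @tokens, [ entity => $1 ];
        }
        elsif ( $html =~ m!\G&#(\w+);!gc ) {
            push @tokens, [ charref => $1 ];
        }
        elsif ( $html =~ m#\G([&<])#gc ) {
            push @tokens, [ text => $1 ];   # lenient: stray & or < becomes text
        }
        else {
            die "No token matched at offset " . pos($html);
        }
    }
    return @tokens;
}

my $sample = q{Tom &amp; <a href="x>y">link</a><!-- note -->};
printf "%-8s %s\n", @$_ for tokenize_html($sample);
```

The `/c` modifier matters here: it keeps `pos` where it was when a branch fails, so the next `elsif` tries its pattern at the same offset.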
Re: parsing HTML by Rhandom (Curate) on Sep 14, 2001 at 19:54 UTC
If all you want to do is find a specific tag in HTML, well formed or not, you can try the following. I'm not saying you shouldn't use one of the standard parsers; it's just that if you are looking for a simple tag, this can do it without the overhead. It should match tags like:

    <img src="my\"stuff">
    <img src="" width="" height="" >
    <img bareword>
    <img src="duh>">

Notice that it doesn't account for bare tags like `<img>` or bare XML tags such as `<br/>`. That could be worked in. We've been using it for well over a year on HTML from over 1,000,000 different people on live parsed documents. I've changed it slightly to make it more relevant to the situation. You can do whatever you want to the text in the tag_handler routine. The variable $matches will hold the total number of substitutions that occurred.

    my $txt = "Some long html document";
    my $tag = "img";
    my $matches = ( $txt =~ s%
      (<\Q$tag\E\s+        # begin with tag and space
        (?:
          \w+              # key/bareword
          (?:
            =(["']?)       # begin with quote or not
            (?:|.*?[^\\])  # some value that doesn't end with \
            \2             # close quote maybe
          )?               # possible bareword
          (?:\s+|>)        # something trailing (force match)
        )+                 # multiple groups
      >?)                  # trailing > (handles <$tag word=val >)
      %&tag_handler($1)%gexis );

After this it is up to you to parse the tag itself.

my @a=qw(random brilliant braindead); print $a[rand(@a)];
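Since parsing the matched tag is left to the reader, here is one possible sketch of a helper that a `tag_handler` routine could call. The name `parse_tag_attributes`, the lowercasing of attribute names, and the decision to keep backslash escapes verbatim in values are all assumptions, not something the post specifies.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one matched tag string into a hash of
# attribute name => value. Barewords get an undef value.
sub parse_tag_attributes {
    my ($tag_text) = @_;
    my %attr;

    # Strip the leading "<name" and the trailing ">" (or "/>").
    ( my $body = $tag_text ) =~ s{\A<\w+}{};
    $body =~ s{/?>\z}{};

    while ( $body =~ m{
            ([\w-]+)                        # attribute name or bareword
            (?: \s* = \s*
                (?: "((?:[^"\\]|\\.)*)"     # double-quoted value, \" allowed
                  | '((?:[^'\\]|\\.)*)'     # single-quoted value, \' allowed
                  | ([^\s>]+)               # unquoted value
                )
            )?
        }gxs )
    {
        my $value = defined $2 ? $2
                  : defined $3 ? $3
                  :              $4;        # stays undef for a bareword
        $attr{ lc $1 } = $value;
    }
    return \%attr;
}

my $attrs = parse_tag_attributes(q{<img src="duh>" width='10' alt=x bare>});
for my $name ( sort keys %$attrs ) {
    my $value = defined $attrs->{$name} ? $attrs->{$name} : '(no value)';
    print "$name = $value\n";
}
```

A `tag_handler` built on this would typically inspect or rewrite the attributes and return the replacement text for the substitution.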