in reply to Ignoring specific html tags before parsing

but could not find a way to specifically ignore certain tags. I

Well, XPath lets you select only what you want, so ignoring something is simple: don't select it to begin with.
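A minimal sketch of the "don't select it" idea. It assumes HTML::TreeBuilder::XPath is the XPath module in use (the thread does not say which one), and the sample markup and variable names are made up for illustration. The predicate keeps only paragraphs that have no table ancestor:

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample document: one paragraph sits inside a table.
my $html = '<div><p>Intro text</p>'
         . '<table><tr><td><p>cell text</p></td></tr></table>'
         . '<p>Closing text</p></div>';

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Select only <p> elements that are NOT nested inside a <table>.
# findvalues returns the text value of each matching node.
my @wanted = $tree->findvalues('//p[not(ancestor::table)]');

print "$_\n" for @wanted;

# Free the tree explicitly once the values have been copied out.
$tree->delete;
```

The same pattern extends to other containers by adding more conditions to the predicate, so the list of wanted tags does not have to grow.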


Replies are listed 'Best First'.
Re^2: Ignoring specific html tags before parsing
by ganeshPerlStarter (Novice) on Oct 07, 2013 at 06:55 UTC
    >>its simple to do, simply don't select it to begin with
    In that case we would need to list ALL the tags we're interested in. Won't that end up as a long list? HTML::Parser has a method, ignore_tags(), which can be used to ignore tags. I used it as below and tried to get the text, but it returned many nested arrays. I could not figure out how to access the final extracted text in this "@array".
    my @array;
    my $p = HTML::Parser->new(
        api_version => 3,
        handlers    => { text => [\@array, "text"] },
    );
    $p->ignore_tags(qw(table img));
    $p->parse($page);
    print "Size of array=$#array\n";
    foreach my $aline (@array) {
        print $aline;
    }
    print "\n";
    Meanwhile, I found an alternative, but it seems to be quite a bit slower than what we could have achieved with HTML::Parser.
    use LWP::Simple;
    use HTML::TokeParser;

    my $link = 'somelinek';
    my $page = get($link) or die "Could not fetch $link\n";
    my $stream  = HTML::TokeParser->new(\$page);
    my $doparse = 1;    ## 0 means don't parse

    while (my $token = $stream->get_token) {
        if ($token->[0] eq 'S') {
            if ($token->[1] eq 'table') {
                $doparse = 0;    # stop collecting text inside a table
            }
            elsif ($token->[1] eq 'img') {
                # nothing to do for img start tags
            }
        }
        elsif ($token->[0] eq 'E' and $token->[1] eq 'table') {
            $doparse = 1;        # resume after the table closes
        }
        elsif ($token->[0] eq 'C') {
            # skip comments
        }
        elsif ($token->[0] eq 'T' and $doparse == 1) {    # text
            # process the text in $token->[1]
            # skip: empty lines, "&nbsp;"
            if (defined($token->[1])) {
                $token->[1] =~ s/&nbsp;/ /ig;
                $token->[1] =~ s/’/'/ig;
                $token->[1] =~ s/&#14[7-8];/"/ig;
                $token->[1] =~ s/—//ig;
                $token->[1] =~ s/&amp;/&/ig;
                $token->[1] =~ s/-{2,}//ig;
                print "$token->[1]";
            }
        }
    }
    The above use of TokeParser gives a lot of broken text. What would be a better way? Thanks
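On the nested arrays: when an HTML::Parser handler is an array reference, each event is pushed onto it as its own array reference holding the requested argspec values, so the text is in element 0 of every entry. Also, ignore_tags() only suppresses the start and end tag events, while ignore_elements() suppresses everything between them as well. A minimal sketch under those documented behaviours; the sample markup and the names @events and $text are hypothetical, and "dtext" is used instead of "text" so entities come back decoded:

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical sample page with a table and an image to skip.
my $page = '<p>Before</p>'
         . '<table><tr><td>inside table</td></tr></table>'
         . '<p>After <img src="x.png" alt="pic"> image</p>';

my @events;
my $p = HTML::Parser->new(
    api_version => 3,
    handlers    => { text => [ \@events, 'dtext' ] },
);

$p->ignore_elements(qw(table));   # drop tables together with their content
$p->ignore_tags(qw(img));         # drop only the img tags themselves
$p->parse($page);
$p->eof;                          # flush any buffered text

# Each entry in @events is an array ref: [ $decoded_text ]
my $text = join '', map { $_->[0] } @events;
print "$text\n";
```

Because the parser may deliver one run of text in several chunks, joining the pieces before further processing avoids the broken-text effect.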
      What is your actual goal?
        >>What is your actual goal?
        I want to ignore table and img tags when parsing HTML files and extracting their embedded text.

      then in that case, we need to list ALL those tags we're interested in.

      Or you could select the ones you want and [id://1052072|remove them from the tree]

        Select the ones you don't want and remove them from the tree with delete.
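A rough sketch of the remove-with-delete approach, assuming HTML::TreeBuilder is used to build the tree (the thread does not name the module). The sample markup and variable names are hypothetical:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample page.
my $html = '<p>Keep me</p>'
         . '<table><tr><td>drop me</td></tr></table>'
         . '<p>and me<img src="x.png"></p>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# In scalar context look_down returns the first match, so each pass
# removes one unwanted element and an element that was already removed
# along with its parent is never visited.
while (my $unwanted = $tree->look_down(_tag => qr/^(?:table|img)$/)) {
    $unwanted->delete;
}

# Whatever text is left in the tree is the text outside tables.
my $remaining = $tree->as_text;
print "$remaining\n";

$tree->delete;
```

With this approach only the unwanted tags have to be listed, not every tag of interest.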

        That's it, I'm done for tonight.