Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to parse a webpage using TokeParser. This is what I have to check for some data in table tags:
elsif ($tag eq "td") {
    my $text = $tp->get_text;
    print "$text\n" if $text =~ /\d{1,2}/;
}
The tag contents I am looking for look like this: <td>&nbsp;##&nbsp;</td> where ## is either a single or two digit number. However, there are lots of td tags in the page that contain numbers with decimal points but without the nbsp tag, i.e. <td>5.4</td>. I tried to screen the tags using the nbsp tag but that didn't work.
if($text =~ /nbsp/);
Any ideas why not? Is there any way to write the if statement above to screen for just a single or two digit number or to somehow exclude text that contains decimal points? Thanks.

Re: TokeParser
by LTjake (Prior) on Oct 26, 2002 at 19:58 UTC
    It seems as though you're having the same problem with get_text that I was having. When you call the get_text method, it "massages" the contents of the text. For instance, I had an item with a link in it, and it removed the link and just gave me the plain text. In your case it's converting the &nbsp; to something else (I got things like á2á (WinXP, ActivePerl 5.6)). Here's a pseudo work-around:
    use strict;
    use HTML::TokeParser;

    local $/;
    my $lines = <DATA>;
    my $p = HTML::TokeParser->new(\$lines);

    while (my $token = $p->get_token) {
        # Text tokens have type 'T'; element 1 holds the text as it appears in the source.
        print "$1\n"
            if $token->[0] eq 'T'
            && $token->[1] =~ /^&nbsp;(\d{1,2})&nbsp;$/;
    }

    __END__
    <td>1</td>
    <td>&nbsp;2&nbsp;</td>
    <td>10</td>
    <td>&nbsp;20&nbsp;</td>

    Output:

    2
    20
    Note: it will only work if you can guarantee that the data comes directly after the <td> tag (i.e., no <div>, <p>, etc.)

    HTH

    Update: Better code, now. What I had was this: if you find a td tag, get the next token, which should be text, and see if it matches the pattern. Now, instead, I check whether the token is a text token and whether it matches our pattern. That should be more reliable.

    Update 2: added use strict; :P

    Update 2.5: code change thanks to Aristotle

    --
    Rock is dead. Long live paper and scissors!
      Quick note: local($/) = undef; is the same as local $/; :-)
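
For anyone unfamiliar with the idiom: localizing $/ leaves it undefined for the rest of the enclosing block, so a single read returns the whole input. A minimal sketch, assuming an in-memory filehandle (the $html string is made up for illustration) in place of the DATA section used above:

```perl
use strict;
use warnings;

# Hypothetical sample input standing in for the DATA section.
my $html = "<td>1</td>\n<td>&nbsp;2&nbsp;</td>\n";
open my $fh, '<', \$html or die "cannot open in-memory handle: $!";

my $lines = do {
    local $/;    # $/ is undef inside this block only
    <$fh>;       # so this single read returns everything
};
close $fh;

# Outside the block the separator is back to its previous value.
my $separator_restored = defined $/ ? 1 : 0;
print length($lines), " characters read in one go\n";
```

Wrapping the local in a do block keeps the change from leaking into later reads in the same file.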

      Makeshifts last the longest.

Re: TokeParser
by BrowserUk (Patriarch) on Oct 26, 2002 at 19:33 UTC

    I've never used TokeParser, but is there any chance that it is de-entitying the content? By which I mean, if you ask for the content of a tag that contains &nbsp; as part of the markup, does it return a space or the entity?

    If I asked for the contents of <h1>Hello&nbsp;world!</h1> I'd expect to get "Hello world!" not "Hello&nbsp;world!".
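
One detail worth checking here: when &nbsp; is decoded, the result is a no-break space (code point 160), not an ordinary ASCII space. A small sketch using decode_entities from HTML::Entities, on the assumption that it mirrors the decoding get_text performs:

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

my $raw     = 'Hello&nbsp;world!';
my $decoded = decode_entities($raw);

# "Hello" is five characters, so index 5 is the decoded entity.
my $code_point = ord substr($decoded, 5, 1);
printf "code point after 'Hello': %d\n", $code_point;

# After decoding, the literal letters "nbsp" are gone, so /nbsp/ cannot match.
my $has_literal = $decoded =~ /nbsp/ ? 1 : 0;
my $has_nbsp    = $decoded =~ /\xA0/ ? 1 : 0;
print "literal 'nbsp': $has_literal, \\xA0 character: $has_nbsp\n";
```

That would also account for output like á2á on a Windows console, since byte 0xA0 is displayed as á in code page 437.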

    Just speculation.


    Nah! Your thinking of Simon Templar, originally played by Roger Moore and later by Ian Ogilvy
Re: TokeParser
by graff (Chancellor) on Oct 27, 2002 at 04:55 UTC
    Is there any way to write the if statement above to screen for just a single or two digit number or to somehow exclude text that contains decimal points?

    You can exclude numbers with decimal points using something like this, which will work no matter what else is going on around the numbers in a particular text string:

    print "$text\n" if ( $text =~ /[^.\d]?\d{1,2}[^.\d]?/ );
    This way, if there is any character before and/or after a one- or two-digit number, the character must not be a period. (update: added "\d" inside each of the bounding character class specs, to make sure we don't match 3 or more digits.)
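
One caveat with the pattern above: both bounding classes are optional, so the regex can still succeed on 5.4 by matching the 5 alone, and on 123 by matching 12. A sketch of a stricter variant using lookarounds, which assert the context without consuming it (the sample strings are invented for illustration):

```perl
use strict;
use warnings;

# Pattern from the post above, for comparison.
my $optional = qr/[^.\d]?\d{1,2}[^.\d]?/;

# (?<![.\d])  not preceded by a period or a digit
# (?![.\d])   not followed by a period or a digit
my $short_int = qr/(?<![.\d])\d{1,2}(?![.\d])/;

my @samples = ("\xA02\xA0", "\xA020\xA0", "5.4", "123", "x7y");
for my $text (@samples) {
    (my $shown = $text) =~ s/\xA0/<nbsp>/g;    # make the no-break space visible
    printf "%-16s optional: %-3s lookaround: %s\n",
        $shown,
        ($text =~ $optional  ? "yes" : "no"),
        ($text =~ $short_int ? "yes" : "no");
}
```

Note that the lookaround version also rejects a number followed by a sentence-ending period, which may or may not be what you want for free text.
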
Re: TokeParser
by PodMaster (Abbot) on Oct 27, 2002 at 08:41 UTC
    {
        local $^RANT=1;
    I don't know of any module called TokeParser.
    I am however, very aware of one called HTML::TokeParser.
    If that is indeed the module you are referring to, please refer to it by its name, so people know what you're talking about.
    If you try and say use TokeParser;, perl won't understand, so why should we?
    }

    I will now quote from the HTML::TokeParser documentation:

    
     $p->get_text( [$endtag] )
         This method returns all text found at the current position. It will
         return a zero length string if the next token is not text. The
         optional $endtag argument specifies that any text occurring before
         the given tag is to be returned. Any entities will be converted to
         their corresponding character.
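
Given that last sentence, one way to keep using get_text is to match the decoded character rather than the entity name. A minimal sketch, assuming &nbsp; decodes to the no-break space \xA0 and that the cell holds nothing but the padded number (the sample markup is made up to resemble the question):

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample markup modelled on the cells described in the question.
my $html = join "\n",
    '<table><tr>',
    '<td>5.4</td>',
    '<td>&nbsp;2&nbsp;</td>',
    '<td>10</td>',
    '<td>&nbsp;20&nbsp;</td>',
    '</tr></table>';

my $p = HTML::TokeParser->new(\$html) or die "cannot create parser";

my @found;
while (my $tag = $p->get_tag('td')) {
    my $text = $p->get_text;    # entities are already decoded at this point
    push @found, $1 if $text =~ /^\xA0(\d{1,2})\xA0$/;
}
print "$_\n" for @found;
```

Anchoring with ^ and $ is what excludes cells such as 5.4 and the unpadded 10; loosen it if the real cells carry extra whitespace.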
    
    You can't be shooting in the dark ~ life ain't a mystery, you just have to read the manual ;D

    I don't like to speculate about what's happening. I like to know for sure.

    Gee, I wonder what'll happen if I execute rm -rf / whilst logged in as root on my linux machine? Hmmmm, should I try it out and then guess what it does? Anybody, anybody? *sigh*

    I wonder why people write documentation anymore ....

    ____________________________________________________
    ** The Third rule of perl club is a statement of fact: pod is sexy.

      perl won't understand, so why should we?

      That's the difference between computers and the human brain - free association of ideas and concepts.

      I don't like to speculate about what's happening. I like to know for sure

      Thanks for confirming my speculation.

      I did read the documentation, which is how I knew the answer. I just felt that putting the onus upon the OP to confirm the 'speculation' by reading it himself might prove beneficial.


      Nah! Your thinking of Simon Templar, originally played by Roger Moore and later by Ian Ogilvy
      Telling somebody to read the docs can be done with kindness and finesse. Remember that you haven't always been as experienced as you are today. Some people are coding without even knowing where the docs are, and are doing the best they can at it.

      One aim of this site (as I understand it) is to be welcoming to everyone; that ought to include people who haven't read the documentation. Should we make someone who's learning Perl for the first time hesitant to post here for fear of being lashed because they forgot, or didn't know where to look? Even if this person had the docs available, someone reading and considering asking a question might not. We need to be considerate to them, as well.
      --

      Love justice; desire mercy.