Re: HTML parsing OR capturing text from a string within tags

Might I suggest a different tack than the one you're taking now?

Long ago, I wrote a newspaper headline grabber for a Perl class using LWP::Simple's get function to grab web pages. I found that easier to use since it can return the whole page to a scalar. Then I used HTML::TokeParser to actually divide up the information and based my collection on only the tokens I actually wanted to save.
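
A minimal sketch of that approach, not the original headline grabber: it assumes the headlines are simply the text of `<a>` links, and the helper name `link_texts` is made up for illustration. The page is fetched only when a URL is given on the command line.

```perl
use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::TokeParser;

# Return the trimmed text of every <a>...</a> element in an HTML string.
# Treating link text as "headlines" is an assumption for this sketch.
sub link_texts {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot create parser\n";
    my @texts;
    while ($parser->get_tag('a')) {
        # Collect everything up to the matching </a>, whitespace collapsed.
        my $text = $parser->get_trimmed_text('/a');
        push @texts, $text if length $text;
    }
    return @texts;
}

# Usage: perl headlines.pl http://www.example.com/
if (my $url = shift @ARGV) {
    my $html = get($url);    # whole page in one scalar, undef on failure
    die "Could not fetch $url\n" unless defined $html;
    print "$_\n" for link_texts($html);
}
```

On a real news page you would likely narrow the match, for example by checking the attributes that `get_tag` returns, once you know how the site marks up its headlines.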

If you look at Re: HTML::TokeParser help - parsing headlines there's a quick and dirty token parser that I wrote so that you can see how it splits up an HTML file.

Hope that helps!

Revolution. Today, 3 O'Clock. Meet behind the monkey bars.

If quizzes are quizzical, what are tests?

Re^2: HTML parsing OR capturing text from a string within tags by kevyt (Scribe) on Dec 24, 2006 at 07:09 UTC
Popcorn Dave, thanks. I will try that. I just added a lot of prints to Element.pm to see what is going on, and I will try your method tomorrow :) This is what I have done so far. The format of Element.pm looks similar to code I used to work with at a former job.

    sub find_by_ktag_name {
      my(@pile) = shift(@_); # start out the to-do stack for the traverser
      Carp::croak "find_by_created_tag_name can be called only as an object method"
        unless ref $pile[0];
      return() unless @_;
      print "pile is @pile\n";
      my(@tags) = $pile[0]->_fold_case(@_);
      print "tags are @tags\n";
      my(@matching, $this, $this_tag);
      while(@pile) {
        $this_tag = ($this = shift @pile)->{'_tag'};
        print "In while loop. this_tag is $this_tag\n";
        foreach my $t (@tags) {
          print "foreach going through elements of tag. Elements are t and t is $t\n";
          print "next step will check to see if t is eq to this_tag. this_tag is $this_tag\n";
          if($t eq $this_tag) {
            print "inside of if... t and this_tag are equal.\n";
            if(wantarray) {
              print "I am here if wantarray is true. Now push this onto array matching\n";
              push @matching, $this;
              print "matching is @matching\n";
              last;
            } else {
              print "wantarray not true, returning this $this\n";
              return $this;
            }
          }
        }
        unshift @pile, grep ref($_), @{$this->{'_content'} || next};
      }
      print "returning @matching if wantarray\n";
      return @matching if wantarray;
      return;
    }

My print statements showed me that there is a library of predefined tags. If I can add my own tags, I think it will work :) I will also try your method. Tackling this is sort of fun.

Some output:

    next step will check to see if t is eq to this_tag. this_tag is a
    In while loop. this_tag is a
    next step will check to see if t is eq to this_tag. this_tag is font
    next step will check to see if t is eq to this_tag. this_tag is br
Re^2: HTML parsing OR capturing text from a string within tags by kevyt (Scribe) on Dec 24, 2006 at 07:31 UTC
Popcorn Dave, I looked at your code. I don't know how it works yet. Will it allow me to add my own string and pull out the text right after it? For example, given `<div\042\... > Person <b> Ran <\div>`, will it allow me to capture Person Ran? I think this is the file where I can add my own tags :) `HTML-Tree-3.23/lib/HTML/AsSubs.pm`
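
For what it's worth, here is a sketch of how that capture might look with HTML::TokeParser. The `class="name"` attribute and the helper name `div_text` are invented for illustration, since the real markup is elided above. `get_trimmed_text('/div')` gathers all text up to the closing `</div>`, passing over the `<b>` tag in between.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Return the text inside the first <div>...</div> of an HTML string,
# with inner tags dropped and whitespace collapsed.
sub div_text {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot create parser\n";
    $parser->get_tag('div') or return;          # no <div> in the input
    return $parser->get_trimmed_text('/div');   # text up to </div>
}

# Hypothetical markup standing in for the real page.
print div_text('<div class="name"> Person <b> Ran </div>'), "\n";
# prints "Person Ran"
```

No new tag definitions are needed for this; the parser reports whatever tags it finds in the input.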
Re^3: HTML parsing OR capturing text from a string within tags by Popcorn Dave (Abbot) on Dec 24, 2006 at 09:12 UTC
All that code does is get an HTML page and parse it into tokens. It will spit the whole mess out, so I ran it at the command line, e.g. `perl tokeparser.pl > output.txt`. That way you can scan through the file and see how it's tokenizing the information you fed it.
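
A rough sketch of such a token dump, not the original tokeparser.pl: it reads a saved HTML file named on the command line and prints one line per token. The helper name `dump_tokens` and the output format are assumptions made for this example.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Turn an HTML string into a list of "TYPE: value" lines, one per token.
# get_token returns an array ref whose first element is the token type
# (S = start tag, E = end tag, T = text, C = comment, D = declaration).
sub dump_tokens {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot create parser\n";
    my @lines;
    while (my $token = $parser->get_token) {
        my ($type, $value) = @$token;   # tag name for S/E, text for T
        next if $type eq 'T' && $value !~ /\S/;   # skip whitespace-only text
        push @lines, "$type: $value";
    }
    return @lines;
}

# Usage: perl dump_tokens.pl page.html > output.txt
if (my $file = shift @ARGV) {
    open my $fh, '<', $file or die "Cannot open $file: $!\n";
    my $html = do { local $/; <$fh> };   # slurp the whole file
    print "$_\n" for dump_tokens($html);
}
```

Scanning the output shows which tags surround the text you care about, which is what you need to know before writing the `get_tag` calls.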
Re^4: HTML parsing OR capturing text from a string within tags by kevyt (Scribe) on Jan 02, 2007 at 17:44 UTC
Yahoo offers something that I can use. I can send Yahoo a request and Yahoo will send me an XML file, BUT I am getting errors because Yahoo has URLs with &'s in the file. I can either replace all of the & with %26, save the file, and then let XML::Parser do the work, or I can look at the parser code, determine where it parses the file, and make the change there. I found where it parses the file in Expat.pm (sub parse). It then calls ParseString(), but I can't find the sub ParseString.
`http://local.yahooapis.com/LocalSearchService/V2/localSearch?appid=YahooDemo&query=plumbing&zip=22222&format=php&results=10`
Kevin
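
One option that avoids editing Expat.pm is to escape the bare ampersands before the text reaches XML::Parser. In XML a literal ampersand is written `&amp;`, whereas `%26` is URL percent-encoding and would stay in the data as-is. A minimal sketch using only core Perl; the helper name `escape_bare_ampersands` and the sample data are made up for illustration:

```perl
use strict;
use warnings;

# Replace every "&" that does not already start an entity or character
# reference (&name; &#123; &#x1F;) with "&amp;". Anything that merely
# looks like a reference, such as "&b;", is left untouched.
sub escape_bare_ampersands {
    my ($xml) = @_;
    $xml =~ s/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)/&amp;/g;
    return $xml;
}

# Hypothetical fragment of a response containing a raw URL.
my $raw   = '<Url>http://www.example.com/search?query=plumbing&zip=22222</Url>';
my $clean = escape_bare_ampersands($raw);
print "$clean\n";
```

The cleaned string can then be handed to the parser in memory, e.g. `XML::Parser->new->parse($clean)`, without saving it to a file first.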
Re^5: HTML parsing OR capturing text from a string within tags by Popcorn Dave (Abbot) on Jan 02, 2007 at 18:43 UTC
Re^6: HTML parsing OR capturing text from a string within tags by kevyt (Scribe) on Jan 04, 2007 at 18:04 UTC