Pulling the content of HTML tags

markguy has asked for the wisdom of the Perl Monks concerning the following question:

To clarify something that seems to be used differently in some of the HTML::* modules I looked at, by "content" I mean whatever falls between the start and end tags.

I thought something like HTML::TreeBuilder was the way to go, but of course that just displays the tags and their attributes. Which is a fine thing, but not what I need. This is more along the lines of parsing XML files, with the added difficulty that the files I'm running this on are pretty darn far from valid XML (although I'm working with some XML-like tags), which means XML::Parser chokes verrrry early on.

I've also tried a regexp approach, but I ran into difficulties with nested tags and with my need to decide what to do with the content as soon as I find it.

Simple example of this:

<research_type>Report</research_type>
<research_title>Something <i>Inane</i></research_title>

I'm looking to get something along these lines back:

research_type : Report
research_title : Something <i>Inane</i>
i : Inane

All ideas and comments appreciated.


Replies are listed 'Best First'.
Re: Pulling the content of HTML tags by maverick (Curate) on Jun 28, 2000 at 20:00 UTC
There's a module called HTML::TokeParser that will break an HTML file into separate tokens (tags and text). Then you can use a simple while loop with a few if tests to do whatever you want with the text elements. /\/\averick
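A minimal sketch of that while-and-if loop, assuming HTML::TokeParser is installed and that the tags in the real files are properly closed. The sub name tag_contents and the stack-of-open-elements bookkeeping are illustrative choices, not something the module provides. Each piece of text or nested markup is appended to every element that is currently open, so the outer tag keeps the inner tags in its content.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: returns [tag, content] pairs in the order the
# start tags appear in the document.
sub tag_contents {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";
    my (@open, @found);    # elements currently open / every element seen

    while (my $token = $parser->get_token) {
        my $type = $token->[0];
        if ($type eq 'S') {
            # A start token is ["S", $tag, $attr, $attrseq, $text].
            $_->{content} .= $token->[4] for @open;
            my $rec = { tag => $token->[1], content => '' };
            push @found, $rec;
            push @open,  $rec;
        }
        elsif ($type eq 'E') {
            # An end token is ["E", $tag, $text]. Close the innermost
            # open element with this name, plus anything left open in it.
            my ($idx) = grep { $open[$_]{tag} eq $token->[1] }
                        reverse 0 .. $#open;
            splice @open, $idx if defined $idx;
            $_->{content} .= $token->[2] for @open;
        }
        elsif ($type eq 'T') {
            # A text token is ["T", $text, $is_data].
            $_->{content} .= $token->[1] for @open;
        }
    }
    return map { [ $_->{tag}, $_->{content} ] } @found;
}

my $sample = <<'END';
<research_type>Report</research_type>
<research_title>Something <i>Inane</i></research_title>
END

print "$_->[0] : $_->[1]\n" for tag_contents($sample);
```

Tag names come back lowercased. An element that is never closed stays open to the end of the input and keeps collecting everything after it, so badly broken files would need extra handling.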
Re: Pulling the content of HTML tags by swiftone (Curate) on Jun 28, 2000 at 19:35 UTC
Can't you do this with HTML::Parser, and a flag stating when you are and aren't "inside" a tag? Have your start() routine set/increment the flag, the end() routine unset/decrement it, and the text() routine trap the text or not based on the flag. You'd have to watch for and deal with nested tags, but I don't see why it wouldn't work.
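One way the flag idea could look, as a sketch only. It assumes the version 3 handler interface of HTML::Parser (start_h, end_h, text_h) rather than subclassing with start(), end() and text() methods. The sub name contents_of and the $depth counter are made up for illustration. The counter plays the role of the flag: it goes up on each start tag of the wanted name and down on each matching end tag, and text is captured whenever it is above zero.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: returns the raw content of every <$want> element,
# nested markup included. $want must be lowercase, because the parser
# reports tag names in lowercase.
sub contents_of {
    my ($html, $want) = @_;
    my $depth = 0;    # how many <$want> elements we are inside right now
    my @found;

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $raw) = @_;
            $found[-1] .= $raw if $depth;    # nested tag: keep its markup
            if ($tag eq $want) {
                push @found, '' unless $depth;
                $depth++;
            }
        }, 'tagname, text' ],
        end_h => [ sub {
            my ($tag, $raw) = @_;
            $depth-- if $depth && $tag eq $want;
            $found[-1] .= $raw if $depth;    # still inside: keep the end tag
        }, 'tagname, text' ],
        text_h => [ sub { $found[-1] .= $_[0] if $depth }, 'text' ],
    );
    $parser->parse($html);
    $parser->eof;
    return @found;
}

my $sample = '<research_title>Something <i>Inane</i></research_title>';
print "research_title : $_\n" for contents_of($sample, 'research_title');
print "i : $_\n"              for contents_of($sample, 'i');
```

This makes one pass per tag name, which is simple but wasteful if many different tags are wanted. Keeping one counter per tag name in a hash would allow a single pass.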
Re: Pulling the content of HTML tags by jlistf (Monk) on Jun 28, 2000 at 20:15 UTC
something to try if none of the built-in methods work out - use recursion with regexes. something like: `sub find_tags { my $text = shift; while ($text =~ m{<(\w+)[^>]*>(.*?)</\1>}gs) { my ($tag, $content) = ($1, $2); print "$tag : $content\n"; find_tags($content) if $content =~ /</; } }` this might need some debugging... you also probably have to worry about quotes within the tags as well, and about a tag nested inside another tag of the same name... but this is the basic idea.
Re: Pulling the content of HTML tags by ZZamboni (Curate) on Jun 28, 2000 at 22:39 UTC
I think HTML::Parser may be just what you need. I haven't used it, but there is a pretty good article about it in the current issue of The Perl Journal. It looked very easy to use. --ZZamboni