Extracting a substring from HTML

richill has asked for the wisdom of the Perl Monks concerning the following question:

Re: Extracting a substring from HTML by GrandFather (Saint) on Sep 10, 2006 at 09:49 UTC
The right way of doing almost anything with HTML is to use the appropriate module. The appropriate module depends somewhat on the task. In this case I'd guess HTML::TreeBuilder is what you want. Life is too short to reinvent complicated wheels, and regexen for parsing HTML are complicated wheels indeed. If you need any help using TreeBuilder, show us what you have tried, with a very small (but complete) code sample and a very small data sample, just enough to show the issue.
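As a rough illustration of the module approach (not code from this thread), the sketch below assumes the wanted text sits inside an element with a hypothetical id of `content`. The sample markup and the helper name `text_of_element` are made up for the example; `new_from_content`, `look_down`, `as_text` and `delete` are HTML::TreeBuilder / HTML::Element methods.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;    # from CPAN, not part of core Perl

# Hypothetical helper: return the visible text of the first element
# with the given id, or undef if there is no such element.
sub text_of_element {
    my ( $html, $id ) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $node = $tree->look_down( id => $id );
    my $text = defined $node ? $node->as_text : undef;
    $tree->delete;    # free the parse tree explicitly
    return $text;
}

# Made-up sample data, for illustration only.
my $sample = '<html><body><p>intro</p>'
           . '<div id="content">Hello <b>world</b></div></body></html>';

my $found = text_of_element( $sample, 'content' );
print defined $found ? "$found\n" : "no such element\n";
```

Because the parser works on the element tree, nested tags inside the target element are handled for you: `as_text` returns only the text, with the `<b>` markup dropped.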
Re^2: Extracting a substring from HTML by richill (Monk) on Sep 10, 2006 at 10:17 UTC
Thank you. I'll look at the HTML::Treebuilder now. I know it was a basic question, but with so many ways of doing things in Perl, the benefit of the experience found here is high. I could spend days on a clumsy solution.
Re^3: Extracting a substring from HTML by ait (Hermit) on Sep 10, 2006 at 18:39 UTC
The benefit of experience is actually in CPAN. Always look there first before coding anything yourself. Get the Perl Cookbook (ISBN 0-596-00313-7) to get _productive_ right away with Perl; it's going to be your best-spent $50 if you are going to work with Perl, and it has many cool ideas, not only on the use of Perl but on many modules for specific tasks. Then get what I call the trilogy: Learning Perl, Intermediate Perl, and Advanced Perl. And then, of course, the Camel Book. But that's just to say you have it and have read it.
Re^3: Extracting a substring from HTML by Anonymous Monk on Sep 10, 2006 at 18:18 UTC
Please be careful. Package names are case-sensitive in Perl. That's HTML::TreeBuilder.
Re: Extracting a substring from HTML by graff (Chancellor) on Sep 10, 2006 at 19:20 UTC
This is an odd sort of post... there seems to be more detail in the title than there is in the text. What Perl code have you tried so far? What are these "Markers"? Are they particular HTML tags? Particular patterns of visible text when the HTML is displayed in a browser? Chunks of JavaScript? Do you have lots of different HTML pages/files from which to extract stuff? If so, are the "Markers" different from one page/file to the next? Any or all of these things would affect one's choice of a solution.

On another topic:

"My first though is reqular expression but is there a way to treat a string like an object in like in java"

This struck me as intriguing, because one of the problems I had on the few occasions when I've tried to do something using Java (or Python) was adjusting to the notion of applying a regex match or substitution on a "string object". It just seems bizarre and unnatural (maybe even inefficient or suboptimal in some way) that regex operations are methods built into string objects, rather than simply being operations on strings (the perlish way). I guess my primitive non-OO orientation is glaringly obvious here...

In any case, figuring out how to use HTML modules will be time well spent, assuming you have a lot of work to do on HTML data. In the meantime, if you have an immediate task that simply involves capturing whatever comes between "Marker 1" and "Marker 2" in an HTML stream, here are some reasonable first attempts to do what you want:

    use strict;
    use warnings;

    my $html;
    # let's suppose the data is in this file
    open( my $fh, "<", "some.html" ) or die "can't open some.html: $!";
    {
        local $/;         # undefine the input record separator ("slurp" mode)
        $html = <$fh>;    # read all the html data into one string
    }
    close $fh;

    # if you expect just one match (or only want the first one):
    my ( $match ) = ( $html =~ /Marker 1(.*?)Marker 2/s );

    # alternatively, if there are two or more and you want them all:
    my @matches = ( $html =~ /Marker 1(.*?)Marker 2/gs );

(update: in both cases the "s" option following the regex can be important, so that the "." (wildcard) will match newlines as well as any other character).

In the first case, the parens around $match provide a "list context", which will cause the regex match to return whatever string was "captured" by the match (in this case, parens within the regex say what part will be captured).

In the second case, the "g" option on the regex says "find and return all captured matches"; the result is being assigned to an array, which again provides a list context for the operation. (In a scalar context, such as `$found = ( /$pattern/ )`, the returned value would simply be 0 or 1, i.e. "false/failure" or "true/success".)

So the main caveats with this approach (since I don't know what "Marker 1" and "Marker 2" represent) are:

- you might match something you didn't want, e.g. if "Marker 1" and/or "Marker 2" show up in places like HTML comments, the HTML header or JavaScript, whereas you might just want the match to succeed on the displayable text part;
- when you capture a region that you do want, the text might contain stuff you can't use, e.g. incomplete pieces of nested tag structure or extra content you'd rather ignore, and "fixing" it could get dicey.

(updated wording of 2nd bullet for clarity).

Those are a few of the reasons why HTML parsing modules are the preferred tool in many cases -- but for a range of limited applications, simple regex matches can suffice.
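To make the context rules above concrete, here is a small self-contained sketch that runs the same pattern in list context, in list context with "g", in scalar context, and without the "s" option. The sample string is invented for the example, and "Marker 1" / "Marker 2" are still just placeholders.

```perl
use strict;
use warnings;

# Made-up sample data: two marked regions, the first spanning a newline.
my $html = "x Marker 1 first\nchunk Marker 2 y Marker 1 second Marker 2 z";

# List context, no /g: returns the capture from the first match only.
my ($first) = $html =~ /Marker 1(.*?)Marker 2/s;

# List context with /g: returns the captures from every match.
my @all = $html =~ /Marker 1(.*?)Marker 2/gs;

# Scalar context: only reports success (true) or failure (false).
my $ok = $html =~ /Marker 1(.*?)Marker 2/s;

# Without /s the "." cannot cross the newline, so the first region is
# skipped and the match lands on the second one instead.
my ($no_s) = $html =~ /Marker 1(.*?)Marker 2/;

print "first: [$first]\n";
print "count: ", scalar(@all), "\n";
print "ok:    ", ( $ok ? "true" : "false" ), "\n";
print "no /s: [$no_s]\n";
```

The last case is the easy one to miss: dropping "s" does not make the match fail here, it quietly returns a different region.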
Re: Extracting a substring from HTML by mugwumpjism (Hermit) on Sep 11, 2006 at 04:13 UTC
Check out XML::LibXML for the nice ways to do this, using standards such as XPath and DOM. Regular expressions are not a very good way to parse structured input like XML, unless you can limit the input to a known subset of XML forms.
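A minimal sketch of the XPath/DOM route, assuming (hypothetically) that the text you want lives in a `div` with id `content`; the sample markup is made up. `load_html`, `findvalue`, `findnodes`, `childNodes` and `toString` are XML::LibXML methods, and the module needs the libxml2 C library installed.

```perl
use strict;
use warnings;
use XML::LibXML;    # from CPAN; wraps the libxml2 C library

# Made-up sample document, for illustration only.
my $html = '<html><body><div id="content">Hello <b>world</b></div></body></html>';

# load_html uses libxml2's HTML parser; recover => 2 asks it to carry on
# quietly past markup errors, which real-world HTML usually has.
my $doc = XML::LibXML->load_html( string => $html, recover => 2 );

# XPath: the string value of the element, with nested tags flattened.
my $text = $doc->findvalue('//div[@id="content"]');

# DOM: fetch the node itself and serialize its children, keeping markup.
my ($div) = $doc->findnodes('//div[@id="content"]');
my $inner = join '', map { $_->toString } $div->childNodes;

print "text:  $text\n";
print "inner: $inner\n";
```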
Re: Extracting a substring from HTML by Anonymous Monk on Sep 12, 2006 at 03:48 UTC
Check out this node: Efficiently Extracting a Range of Lines. It's not exactly HTML-specific, but definitely worth a looksie for extracting text that lies between two known 'markers'.
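The contents of that node are not reproduced here, so the following is only an assumption about the kind of technique it covers: one common Perl idiom for line-oriented extraction between two markers is the range operator used as a flip-flop. The helper name `lines_between` and the sample lines are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines from the first one matching
# $start through the next one matching $end, markers included.
sub lines_between {
    my ( $start, $end, @lines ) = @_;
    my @kept;
    for my $line (@lines) {
        # In scalar context ".." is a flip-flop: false until the left test
        # succeeds, then true until the right test succeeds.
        # Note: the operator keeps its state between calls, so a range that
        # never closes would carry over into the next call.
        push @kept, $line if $line =~ $start .. $line =~ $end;
    }
    return @kept;
}

# Made-up sample input.
my @input = ( "a", "Marker 1", "b", "c", "Marker 2", "d" );
my @range = lines_between( qr/Marker 1/, qr/Marker 2/, @input );

print "$_\n" for @range;
```

This only works when each marker sits on its own line or you are happy to keep whole lines; for markers in the middle of a line, a capture with a regex is the closer fit.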