Trying to parse html file

annunaki10 has asked for the wisdom of the Perl Monks concerning the following question:

Re: Trying to parse html file by GrandFather (Saint) on Jun 18, 2009 at 19:43 UTC
Don't do that! Unless you enjoy poking yourself in the eye with a sharp stick and other such delightful activities, don't parse HTML using regexen. Instead, use HTML::TreeBuilder or one of the other HTML-related modules; there are many. I'd also strongly recommend that you always use strictures (use strict; use warnings;). The my missing from the start of your sample lines indicates that either you don't, or that you are declaring your variables in too large a scope. It's that sharp stick thing again. ;)
True laziness is hard work
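A minimal sketch of the strictures point, assuming a misspelled variable name (`$title` and `$titel` are made up for illustration). The snippet is compiled through a string eval only so that the compile-time error can be captured and printed instead of aborting the script; in a real program you would simply put the two `use` lines at the top of the file.

```perl
use strict;
use warnings;

# Hypothetical snippet with a typo: $titel instead of $title.
my $snippet = <<'EOS';
use strict;
my $title = 'TITLE';
$titel = 'oops';    # undeclared variable: strict refuses to compile this
1;
EOS

# String eval is used here only to trap the compile-time error.
my $ok    = eval $snippet;
my $error = $@;

print $ok ? "compiled\n" : "strict caught the typo: $error";
```

Without `use strict`, the misspelled name would silently create a new package variable and the bug would surface much later, if at all.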
Re: Trying to parse html file by mirod (Canon) on Jun 18, 2009 at 19:45 UTC
How about using the proper tool for the job, like an HTML parser: HTML::Parser. Actually, I like HTML::TreeBuilder myself; it's a layer above HTML::Parser that makes it even easier to process the HTML. This way you don't have to bother about multi-line content, comments, weird HTML markup and the like.
Re: Trying to parse html file by graff (Chancellor) on Jun 19, 2009 at 06:54 UTC
`<p..><some more tags> --- TITLE --- WORDS WORDS WORDS WORDS WORDS WORDS<BR> <object...><a lot more tags></object> </p>`

WORDS section runs onto multiple lines. The code within the object tags run multiple lines as well. I'd like to grab the title, words, and object code of each instance on a page. The html is being stored in a scalar.

That's enough info for working out a basic solution with HTML::Parser. (There's still the matter of what you need to do with these pieces once you have them, but that's the easy part, right?)

It really helps to have valid html data as input for trying this out, so I made a few changes to the example you gave. Okay, this code is a lot longer than a couple regexes, but it will work, reliably, for any valid html data that resembles your example.

    use strict;
    use HTML::Parser;

    my $html = <<EOH;
    <p foo="bar"><some more tags>
    --- TITLE ---
    WORDS WORDS WORDS
    WORDS WORDS WORDS<BR>
    <object att="val">
     <and>
      <more>
       <tags/>
      </more>
     </and></object>
    </p>
    EOH

    my ( $partext, $objtext );
    my ( @titlewords, @objects );
    my $inpar = my $inobj = 0;

    my $parser = new HTML::Parser( api_version => 3,
                                   start_h => [ \&handle_starttag, "tagname,text" ],
                                   text_h  => [ \&handle_text, "dtext" ],
                                   end_h   => [ \&handle_endtag, "tagname,text" ] );
    $parser->parse( $html );

    for my $t ( @titlewords ) {
        print "=== Found title and words: ===\n$t\n======\n";
    }
    for my $o ( @objects ) {
        print "=== Found object: ===\n$o\n======\n";
    }

    sub handle_starttag
    {
        my ( $tag, $text ) = @_;
        if ( $tag eq 'p' ) {
            $inpar = 1;
            $partext = '';
        }
        elsif ( $tag eq 'object' ) {
            $inobj = 1;
            $objtext = '';
        }
        elsif ( $tag eq 'br' and $inpar ) {
            push @titlewords, $partext if ( $partext =~ /-+ TITLE -+/ );
            $inpar = 0;
        }
        elsif ( $inobj ) {
            $objtext .= $text;
        }
    }

    sub handle_text
    {
        my ( $text ) = @_;
        if ( $inpar ) {
            $partext .= $text;
        }
        elsif ( $inobj ) {
            $objtext .= $text;
        }
    }

    sub handle_endtag
    {
        my ( $tag, $text ) = @_;
        if ( $tag eq 'object' ) {
            push @objects, $objtext;
            $inobj = 0;
        }
        elsif ( $inobj ) {
            $objtext .= $text;
        }
    }

Look at the man page for HTML::Parser to understand how the "new" call works. The rest is just a matter of figuring out how to handle the data contents, as the parser encounters the relevant tag and text events.

(Updated the code to check for "TITLE" in the text when there's a "br" tag, and added line-breaks in the html to show how those would be handled.)
Re: Trying to parse html file by wfsp (Abbot) on Jun 19, 2009 at 07:08 UTC
    #!/usr/bin/perl
    use strict;
    use warnings;

    use HTML::TreeBuilder;
    use Data::Dumper;
    $Data::Dumper::Indent = 2;

    my $html = do{local $/;<DATA>};

    my $t = HTML::TreeBuilder->new_from_content($html)
      or die qq{cant parse html: $!\n};

    # get a list of p tags
    my @paras = $t->look_down(_tag => q{p});

    my @blocks;
    for my $para (@paras){
      my $txt;
      # skip any of the tags at the start of the p tag,
      # collect the first text found
      # stop if a br tag found
      for my $item ($para->content_refs_list){
        if (ref $$item){
          # we have a tag
          my $tag = $$item->tag;
          last if $tag eq q{br};
          next;
        }
        # we have text
        $txt = $$item;
      }

      # is it the p tag we need?
      my ($title, $words);
      next unless ($title, $words) = $txt =~ /\s---\s(.*?)\s---\s(.*)/;

      # look down the p tag for the object
      my $object = $para->look_down(_tag => q{object})
        or die qq{look down didnt find object};

      # stuff what we've found into a table
      push @blocks, {
        title  => $title,
        words  => $words,
        object => $object->as_HTML(undef, q{ }, {}),
      }
    }

    print Dumper \@blocks;

    __DATA__
    <html><head><title>six blocks</title></head><body>
    <p>text we don't want</p>
    <p>text we don't want</p>
    <p id="block1"><img src="pic.jpg"><a href="link.html">link</a> --- TITLE1 --- WORDS1 WORDS1 WORDS1 WORDS1 WORDS1 WORDS1<BR>
    <object id="object1"><param>If you can read this you are too close.</object>
    </p>
    <p id="block2"><img src="pic.jpg"><a href="link.html">link</a> --- TITLE2 --- WORDS2 WORDS2 WORDS2 WORDS2 WORDS2 WORDS2<BR>
    <object id="object2"><param>If you can read this you are too close.</object>
    </p>
    <p id="block3"><img src="pic.jpg"><a href="link.html">link</a> --- TITLE3 --- WORDS3 WORDS3 WORDS3 WORDS3 WORDS3 WORDS3<BR>
    <object id="object3"><param>If you can read this you are too close.</object>
    </p>
    <p>text we don't want</p>
    <p id="block4"><img src="pic.jpg"><a href="link.html">link</a> --- TITLE4 --- WORDS4 WORDS4 WORDS4 WORDS4 WORDS4 WORDS4<BR>
    <object id="object4"><param>If you can read this you are too close.</object>
    </p>
    <p id="block5"><img src="pic.jpg"><a href="link.html">link</a> --- TITLE5 --- WORDS5 WORDS5 WORDS5 WORDS5 WORDS5 WORDS5<BR>
    <object id="object5"><param>If you can read this you are too close.</object>
    </p>
    <p id="block6"><img src="pic.jpg"><a href="link.html">link</a> --- TITLE6 --- WORDS6 WORDS6 WORDS6 WORDS6 WORDS6 WORDS6<BR>
    <object id="object6"><param>If you can read this you are too close.</object>
    </p>
    <p>text we don't want</p>
    <p>text we don't want</p>
    </body></html>

    $VAR1 = [
              {
                'object' => '<object id="object1">
     <param />If you can read this you are too close.</object>
    ',
                'title' => 'TITLE1',
                'words' => 'WORDS1 WORDS1 WORDS1 WORDS1 WORDS1 WORDS1'
              },
              {
                'object' => '<object id="object2">
     <param />If you can read this you are too close.</object>
    ',
                'title' => 'TITLE2',
                'words' => 'WORDS2 WORDS2 WORDS2 WORDS2 WORDS2 WORDS2'
              },
              {
                'object' => '<object id="object3">
     <param />If you can read this you are too close.</object>
    ',
                'title' => 'TITLE3',
                'words' => 'WORDS3 WORDS3 WORDS3 WORDS3 WORDS3 WORDS3'
              },
              {
                'object' => '<object id="object4">
     <param />If you can read this you are too close.</object>
    ',
                'title' => 'TITLE4',
                'words' => 'WORDS4 WORDS4 WORDS4 WORDS4 WORDS4 WORDS4'
              },
              {
                'object' => '<object id="object5">
     <param />If you can read this you are too close.</object>
    ',
                'title' => 'TITLE5',
                'words' => 'WORDS5 WORDS5 WORDS5 WORDS5 WORDS5 WORDS5'
              },
              {
                'object' => '<object id="object6">
     <param />If you can read this you are too close.</object>
    ',
                'title' => 'TITLE6',
                'words' => 'WORDS6 WORDS6 WORDS6 WORDS6 WORDS6 WORDS6'
              }
            ];

btw `<object.*>` is greedy but that is only the start of your problems. :-)

update: tweaked the html
update2: removed module that isn't used
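To illustrate the greediness remark, here is a small sketch using a made-up one-line sample string (not the poster's data). It compares the greedy pattern with a non-greedy quantifier and with a negated character class.

```perl
use strict;
use warnings;

# Made-up sample: two object elements on a single line.
my $html = '<object id="a"><param></object> text <object id="b"></object>';

# Greedy: .* runs on to the LAST '>' it can reach.
my ($greedy) = $html =~ /(<object.*>)/;

# Non-greedy: .*? stops at the first '>'.
my ($lazy) = $html =~ /(<object.*?>)/;

# Negated character class: match anything that is not '>'.
my ($class) = $html =~ /(<object[^>]*>)/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
print "class:  $class\n";
```

Even the last two forms only find the opening tag, and they still break on things like a `>` inside an attribute value or a tag split across lines without the right modifiers, which is the "start of your problems" that a real parser avoids.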
Re: Trying to parse html file by annunaki10 (Initiate) on Jun 18, 2009 at 20:07 UTC
I've already installed it and am trying to figure it out... looking for some decent documentation.
Re^2: Trying to parse html file by Anonymous Monk on Jun 18, 2009 at 20:36 UTC
HTML::Tree(Builder) in 6 minutes
Re^2: Trying to parse html file by GrandFather (Saint) on Jun 18, 2009 at 21:26 UTC
If you post a small sample of the data you want to parse and describe what you want to extract from it, we can give you a hand up.
Re: Trying to parse html file by stevemayes (Scribe) on Jun 19, 2009 at 09:23 UTC
See perlfaq6, "How can I pull out lines between two patterns that are themselves on different lines?": use the .. operator, as in `/START/ .. /END/`. I used it to pull out data in numerous places across multiple lines between two patterns, in a similar way to what you are indicating. Caveat: I'm very new at Perl, so there may be much better ways of doing it, as outlined above; I'm just very simplistic in my approach and this was a simple solution. I also could have misunderstood what you are asking.
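A minimal sketch of the `..` (range, or "flip-flop") operator mentioned above, assuming line-oriented input. The sample lines and the START/END markers are placeholders. perlfaq6 shows the shorter `print if /START/ .. /END/` form on `$_`; this version uses a named variable and also drops the marker lines.

```perl
use strict;
use warnings;

# Made-up input; START and END are placeholder markers.
my @lines = (
    "header\n",
    "START\n",
    "wanted 1\n",
    "wanted 2\n",
    "END\n",
    "footer\n",
);

my @between;
for my $line (@lines) {
    # In scalar context the range operator is true from the line that
    # matches /START/ through the line that matches /END/, inclusive.
    # Its value is a sequence number; the final one has "E0" appended.
    if ( my $seq = $line =~ /START/ .. $line =~ /END/ ) {
        next if $seq == 1 or $seq =~ /E0$/;    # skip the marker lines
        push @between, $line;
    }
}

print @between;
```

This works on lines, not on markup, so it suits input where the start and end patterns sit reliably on lines of their own; for HTML in general the parser-based replies above are the more robust route.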