annunaki10 has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to parse an HTML file and it's driving me crazy. In one page there are six blocks that follow this format.

<p..><some more tags> --- TITLE --- WORDS WORDS WORDS WORDS WORDS WORDS<BR>
<object...><a lot more tags></object>
</p>

The WORDS section runs onto multiple lines, and the code within the object tags runs over multiple lines as well. I'd like to grab the title, words, and object code of each instance on a page. The HTML is stored in a scalar.

# This correctly gets all of the titles
@titles = $page =~ /--- (.*) ---/gi;


# This only grabs the last set of WORDS
@words = $page =~ /--- .* --- (.+)<BR>/gis;


# This grabs everything after the first <object and doesn't stop at the close object tag
@objects= $page =~ /(<object.*>.*<\/object>)/gis;

I'm pretty sure it has something to do with the /s modifier. I read that /s ignores newlines, and without it I get nothing. However, I'm still clearly doing something wrong. Any help would be greatly appreciated.

Re: Trying to parse html file
by GrandFather (Saint) on Jun 18, 2009 at 19:43 UTC

    Don't do that! Unless you enjoy poking yourself in the eye with a sharp stick and other such delightful activities, don't parse HTML using regexen. Instead, use HTML::TreeBuilder or one of the other HTML-related modules; there are many.

    I'd also strongly recommend that you always use strictures (use strict; use warnings;). The lack of my at the start of your sample lines indicates that either you don't, or that you are declaring your variables in too large a scope. It's that sharp stick thing again. ;)
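    A minimal sketch of what that looks like, assuming a made-up one-line $page in place of the real HTML. Each variable is declared with my where it is first used, so a mistyped name is caught at compile time:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the fetched page; the real HTML is much longer.
my $page = '<p> --- TITLE --- WORDS<BR></p>';

# Same title regex as in the question, with the array declared via my.
my @titles = $page =~ /--- (.*) ---/gi;

print "title: $_\n" for @titles;
```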


    True laziness is hard work
Re: Trying to parse html file
by mirod (Canon) on Jun 18, 2009 at 19:45 UTC

    How about using the proper tool for the job, an HTML parser such as HTML::Parser? Actually, I like HTML::TreeBuilder myself; it's a layer above HTML::Parser that makes it even easier to process the HTML. This way you don't have to bother about multi-line content, comments, weird HTML markup and the like.

Re: Trying to parse html file
by graff (Chancellor) on Jun 19, 2009 at 06:54 UTC
    <p..><some more tags> --- TITLE --- WORDS WORDS WORDS WORDS WORDS WORDS<BR>
    <object...><a lot more tags></object>
    </p>

    WORDS section runs onto multiple lines. The code within the object tags run multiple lines as well. I'd like to grab the title, words, and object code of each instance on a page. The html is being stored in a scalar.

    That's enough info for working out a basic solution with HTML::Parser. (There's still the matter of what you need to do with these pieces once you have them, but that's the easy part, right?)

    It really helps to have valid html data as input for trying this out, so I made a few changes to the example you gave. Okay, this code is a lot longer than a couple regexes, but it will work, reliably, for any valid html data that resembles your example.

    use strict;
    use warnings;
    use HTML::Parser;

    my $html = <<EOH;
    <p foo="bar"><some more tags> --- TITLE --- WORDS WORDS WORDS
    WORDS WORDS WORDS<BR>
    <object att="val">
    <and> <more> <tags/> </more> </and></object>
    </p>
    EOH

    my ( $partext, $objtext );
    my ( @titlewords, @objects );
    my $inpar = my $inobj = 0;

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ \&handle_starttag, "tagname,text" ],
        text_h  => [ \&handle_text,     "dtext" ],
        end_h   => [ \&handle_endtag,   "tagname,text" ]
    );
    $parser->parse( $html );
    $parser->eof;    # flush any text the parser is still buffering

    for my $t ( @titlewords ) {
        print "=== Found title and words: ===\n$t\n======\n";
    }
    for my $o ( @objects ) {
        print "=== Found object: ===\n$o\n======\n";
    }

    # Start tags: track whether we are inside a <p> or an <object>.
    sub handle_starttag
    {
        my ( $tag, $text ) = @_;
        if ( $tag eq 'p' ) {
            $inpar = 1;
            $partext = '';
        }
        elsif ( $tag eq 'object' ) {
            $inobj = 1;
            $objtext = '';
        }
        elsif ( $tag eq 'br' and $inpar ) {
            # The title and words end at the first <br> in the paragraph.
            push @titlewords, $partext if ( $partext =~ /-+ TITLE -+/ );
            $inpar = 0;
        }
        elsif ( $inobj ) {
            $objtext .= $text;
        }
    }

    # Text: append to whichever buffer is currently active.
    sub handle_text
    {
        my ( $text ) = @_;
        if ( $inpar ) {
            $partext .= $text;
        }
        elsif ( $inobj ) {
            $objtext .= $text;
        }
    }

    # End tags: </object> closes the current object buffer.
    sub handle_endtag
    {
        my ( $tag, $text ) = @_;
        if ( $tag eq 'object' ) {
            push @objects, $objtext;
            $inobj = 0;
        }
        elsif ( $inobj ) {
            $objtext .= $text;
        }
    }

    Look at the man page for HTML::Parser to understand how the "new" call works. The rest is just a matter of figuring out how to handle the data contents, as the parser encounters the relevant tag and text events.

    (Updated the code to check for "TITLE" in the text when there's a "br" tag, and added line-breaks in the html to show how those would be handled.)

Re: Trying to parse html file
by wfsp (Abbot) on Jun 19, 2009 at 07:08 UTC
    #! /usr/bin/perl
    use strict;
    use warnings;
    use HTML::TreeBuilder;
    use Data::Dumper;
    $Data::Dumper::Indent = 2;

    my $html = do{local $/;<DATA>};

    my $t = HTML::TreeBuilder->new_from_content($html)
      or die qq{cant parse html: $!\n};

    # get a list of p tags
    my @paras = $t->look_down(_tag => q{p});

    my @blocks;
    for my $para (@paras){
      my $txt;
      # skip any of the tags at the start of the p tag,
      # collect the first text found
      # stop if a br tag found
      for my $item ($para->content_refs_list){
        if (ref $$item){
          # we have a tag
          my $tag = $$item->tag;
          last if $tag eq q{br};
          next;
        }
        # we have text
        $txt = $$item;
      }

      # is it the p tag we need?
      my ($title, $words);
      next unless ($title, $words) = $txt =~ /\s---\s(.*?)\s---\s(.*)/;

      # look down the p tag for the object
      my $object = $para->look_down(_tag => q{object})
        or die qq{look down didnt find object};

      # stuff what we've found into a table
      push @blocks, {
        title  => $title,
        words  => $words,
        object => $object->as_HTML(undef, q{ }, {}),
      }
    }
    print Dumper \@blocks;

    __DATA__
    <html><head><title>six blocks</title></head><body>
    <p>text we don&lsquo;t want</p>
    <p>text we don&lsquo;t want</p>
    <p id="block1"><img src="pic.jpg"><a href="link.html">link</a> --- TITLE1 --- WORDS1 WORDS1 WORDS1 WORDS1 WORDS1 WORDS1<BR>
    <object id="object1"><param>If you can read this you are too close.</object>
    </p>
    <p id="block2"><img src="pic.jpg"><a href="link.html">link</a> --- TITLE2 --- WORDS2 WORDS2 WORDS2 WORDS2 WORDS2 WORDS2<BR>
    <object id="object2"><param>If you can read this you are too close.</object>
    </p>
    <p id="block3"><img src="pic.jpg"><a href="link.html">link</a> --- TITLE3 --- WORDS3 WORDS3 WORDS3 WORDS3 WORDS3 WORDS3<BR>
    <object id="object3"><param>If you can read this you are too close.</object>
    </p>
    <p>text we don&lsquo;t want</p>
    <p id="block4"><img src="pic.jpg"><a href="link.html">link</a> --- TITLE4 --- WORDS4 WORDS4 WORDS4 WORDS4 WORDS4 WORDS4<BR>
    <object id="object4"><param>If you can read this you are too close.</object>
    </p>
    <p id="block5"><img src="pic.jpg"><a href="link.html">link</a> --- TITLE5 --- WORDS5 WORDS5 WORDS5 WORDS5 WORDS5 WORDS5<BR>
    <object id="object5"><param>If you can read this you are too close.</object>
    </p>
    <p id="block6"><img src="pic.jpg"><a href="link.html">link</a> --- TITLE6 --- WORDS6 WORDS6 WORDS6 WORDS6 WORDS6 WORDS6<BR>
    <object id="object6"><param>If you can read this you are too close.</object>
    </p>
    <p>text we don&lsquo;t want</p>
    <p>text we don&lsquo;t want</p>
    </body></html>

    $VAR1 = [
              {
                'object' => '<object id="object1">
     <param />If you can read this you are too close.</object>
    ',
                'title' => 'TITLE1',
                'words' => 'WORDS1 WORDS1 WORDS1 WORDS1 WORDS1 WORDS1'
              },
              {
                'object' => '<object id="object2">
     <param />If you can read this you are too close.</object>
    ',
                'title' => 'TITLE2',
                'words' => 'WORDS2 WORDS2 WORDS2 WORDS2 WORDS2 WORDS2'
              },
              {
                'object' => '<object id="object3">
     <param />If you can read this you are too close.</object>
    ',
                'title' => 'TITLE3',
                'words' => 'WORDS3 WORDS3 WORDS3 WORDS3 WORDS3 WORDS3'
              },
              {
                'object' => '<object id="object4">
     <param />If you can read this you are too close.</object>
    ',
                'title' => 'TITLE4',
                'words' => 'WORDS4 WORDS4 WORDS4 WORDS4 WORDS4 WORDS4'
              },
              {
                'object' => '<object id="object5">
     <param />If you can read this you are too close.</object>
    ',
                'title' => 'TITLE5',
                'words' => 'WORDS5 WORDS5 WORDS5 WORDS5 WORDS5 WORDS5'
              },
              {
                'object' => '<object id="object6">
     <param />If you can read this you are too close.</object>
    ',
                'title' => 'TITLE6',
                'words' => 'WORDS6 WORDS6 WORDS6 WORDS6 WORDS6 WORDS6'
              }
            ];

    btw <object.*> is greedy but that is only the start of your problems. :-)
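    To illustrate the greediness point only (this is not an endorsement of the regex approach), here is a small sketch. The two-block $page is a made-up sample, not the poster's real data. With /s, a greedy .* runs to the last possible match on the page, which would explain both the single oversized object match and getting only the last set of WORDS. A non-greedy .*? stops at the first one.

```perl
use strict;
use warnings;

# Hypothetical two-block sample; the real page has six such blocks.
my $page = <<'EOH';
<p> --- ONE --- first words
continue here<BR>
<object id="a"><param>
</object>
</p>
<p> --- TWO --- second words<BR>
<object id="b"><param>
</object>
</p>
EOH

# Greedy: .* runs to the LAST </object> on the page, giving one big match.
my @greedy = $page =~ /(<object.*>.*<\/object>)/gis;

# Non-greedy: .*? stops at the first </object> after each <object.
my @lazy = $page =~ /(<object\b.*?<\/object>)/gis;

# Same idea for the text between the title and <BR>.
my @words = $page =~ /--- .*? --- (.+?)<BR>/gis;

printf "greedy: %d match(es), non-greedy: %d match(es)\n",
    scalar @greedy, scalar @lazy;
print "words: $_\n" for @words;
```

    Even with .*? this still breaks on nested or malformed markup, comments, and attributes containing >, which is why the parser modules are the safer route.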

    update: tweaked the html
    update2: removed module that isn't used

Re: Trying to parse html file
by annunaki10 (Initiate) on Jun 18, 2009 at 20:07 UTC
    I've already installed it and am trying to figure it out... looking for some decent documentation.

      If you post a small sample of the data you want to parse and describe what you want to extract from it we can give you a hand up.


      True laziness is hard work
Re: Trying to parse html file
by stevemayes (Scribe) on Jun 19, 2009 at 09:23 UTC

    perlfaq6 "How can I pull out lines between two patterns that are themselves on different lines?"

    Use the .. operator: /START/ .. /END/

    I used it to pull out data in numerous places across multiple lines between two patterns in a similar way to what you are indicating.

    Caveat: I'm very new to Perl, and there may be much better ways of doing it, as outlined above. I'm just very simplistic in my approach and this was a simple solution. I also could have misunderstood what you are asking.
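    A rough sketch of the range-operator idea, assuming the input has been split into lines and that the opening and closing object tags sit on separate lines. The sample data and the patterns standing in for START and END are invented for illustration:

```perl
use strict;
use warnings;

# Hypothetical input, split into lines.
my @lines = split /\n/, <<'EOT';
noise
<object id="a">
<param>
</object>
more noise
<object id="b">
</object>
EOT

my @kept;
for my $line (@lines) {
    # In scalar context, .. is true from a line matching the left pattern
    # through the next line matching the right pattern, inclusive.
    push @kept, $line if $line =~ /<object/i .. $line =~ /<\/object>/i;
}

print "$_\n" for @kept;
```

    This works line by line, so it suits text where the markers are reliably on their own lines. It does not understand HTML structure, so the parser-based replies above remain the more robust choice.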
