coldfingertips has asked for the wisdom of the Perl Monks concerning the following question:

If there were any hair left on my 20-year-old head, I'd still be pulling it out... good thing for me I pulled the last hairs out last night over this!

I'm parsing a page that looks like this:

Chat at Allpoetry.com 400 MB hosting with 5 GB traffic for $4.95 per month. Domain registration for $9.95. Hello. Login or Register? PoemsPoetsContestsColumnsClassAdd linesFunBulletinStoreHelp Chatter Archives< Previous Chatter | Next Chatter >malevolent angel: if you look closer, there are more than mods not doing anything in this chatterbox.....hell, i use to not even come near to it lol (11 minutes ago) demonwithin: ~coulda sworn he was drinking from goddess (11 minutes ago) sprkls926: *hits him in the side* get off (11 minutes ago) JurneesRainbow: check out contest: the enviroment!!!!!!!!!!! (11 minutes ago) Foretold-Events: Well I wonder about those things hehe (12 minutes ago) sprkls926: hey (12 minutes ago) < Previous Chatter | Next Chatter > Featured Thank Youby RubeeIf Not For Youby CinaraSweet Rain (for Rubee) by WolfbaneStarbuck colonicby Barbara DavidsonBorrowed Bracelet (hope you don't mind) by mystysaintmanage featured Chatterbox sprkls926: *kicks him again*demonwithin: ~pulls out the blade and flings it to the floor splattering sparkls with his blood~ take it up and do your worstmalevolent angel: *enjoys the show from the shadows*demonwithin: ~spreads his arms like hes cruxified~sprkls926: *trys to back away from him*demonwithin: strike if you may sprkls Online [101] ForgottenAn.. mystysaint symitar Zez 216 visitorsshow all A network of sharing: All Poetry, Story Write, All Philosophy, Old Poetry.
It took a day, but I was finally able to add a line break where I needed one (after each "( .. )" group). But as you can see, there is still text before and after the nicely broken lines (the "Chat at Allpoetry.com 400 MB hosting with 5 GB traffic for $4.95 per month" lines and "< Previous Chatter | Next "). I tried using:
    for my $lines (@lines) {
        if ($lines =~ s/\)/\)<br>/g) {
            push @good, $lines;
        }
    }
    @lines = @good;
    print "@good";
I was hoping this would remove ANYTHING my line-break substitution didn't touch, but it didn't work: everything still prints, junk ads included. The $lines =~ s/// substitution is what I use to add the line breaks. Does anyone know a way to get rid of everything else?
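
A minimal sketch of why that loop keeps everything, assuming (as the output suggests) that the whole page arrives as one long string. s///g returns the number of replacements it made, so the if is true for any element containing at least one ")", and the single element holding the entire page passes with the ads attached. The sample text and the names $page, @pieces and @chat are made up for illustration.

```perl
use strict;
use warnings;

# Made-up stand-in for the page text: one long string with no line breaks.
my $page = 'Cheap hosting ad. alice: hi there (11 minutes ago) '
         . 'bob: hey (12 minutes ago) Featured poems, more ads';

# The loop from the post: s///g returns the number of replacements made,
# so the test is true for any element that contains at least one ")".
my @good;
for my $line ($page) {
    my $copy = $line;
    push @good, $copy if $copy =~ s/\)/)<br>/g;
}
# @good now holds ONE element: the whole page, ads included.

# Splitting on the inserted markers gives separate pieces to filter.
my @pieces = split /<br>\s*/, $good[0];
my @chat   = grep { /\(\d+ minutes? ago\)$/ } @pieces;

print scalar(@good), " element kept by the loop\n";
print "$_\n" for @chat;
```

Filtering the pieces drops the trailing junk, but the first piece still carries whatever preceded the first entry, so the text still has to be cut down to the region between the two navigation bars.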

Replies are listed 'Best First'.
Re: html parsing/regex
by artist (Parson) on Jul 30, 2003 at 19:54 UTC
    Hope this helps:

    (Updated as per message of the op)

    local $/;                 # slurp mode: read all of DATA in one go
    my $data = <DATA>;
    my @good_lines;

    # collect the text between each pair of navigation bars
    while ($data =~ m/Next Chatter \>(.*?)\< Previous Chatter/gs) {
        my $good_lines = $1;
        push @good_lines, $good_lines;
    }

    foreach (@good_lines) {
        my @lines = split /\n/;
        foreach (@lines) {
            next unless $_;
            /([^:]+): (.+) \((\d+) minutes ago\)/;
            my( $name, $text, $delay ) = ( $1, $2, $3 );
            print "NAME:$name\nText:$text\nDelay:$delay\n\n";
        }
    }

    artist
      This script has proven to be more difficult than originally planned and more difficult than it's worth :(

      I get:

          NAME:
          Text:
          Delay:

      repeated dozens of times, with all three fields empty every time. So I think I misused part of your script, because I'm not getting errors or anything. Did I mess up anywhere with

      use strict;
      use CGI qw/:standard/;
      use HTML::Tree;
      use LWP::Simple;

      print header, start_html('test printing');

      #my $count;
      #until ($count eq "5") {
      #$count++;
      my $funky = "http://www.allpoetry.com/chat//page=1";
      my $content = get($funky);
      my $tree = HTML::Tree->new();
      $tree->parse($content);

      # retrieve the text and split into lines
      my @lines = split "<br>", $tree->as_text;

      local $/;
      my @good_lines;
      my $good_lines;
      for my $lines (@lines) {
          $lines =~ s/\)/\)<br>/g;
          while ($lines =~ m/Next Chatter \>(.*?)\< Previous Chatter/gs) {
              $good_lines = $1;
              push @good_lines, $good_lines;
          }
          foreach (@good_lines) {
              my @lines = split /<br>/;
              foreach (@lines) {
                  next unless $_;
                  /([^:]+): (.+) \((\d+) minutes ago\)/;
                  my( $name, $text, $delay ) = ( $1, $2, $3 );
                  print "NAME:$name\nText:$text\nDelay:$delay\n\n";
              }
          }
      }
      Sorry I keep bugging you, I promise this'll be the last time (I think I'll give up for a while if nothing else works ((note to self: this is why you stopped using HTML:: modules in the first place)) ). Thanks for your help!
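
One thing the script above does show for certain: the result of the name/text/delay match is never tested, so when the pattern fails on a line, $1, $2 and $3 do not hold data from that line and the record prints blank. An all-blank run therefore suggests the pattern matched none of the pieces. Note also that as_text returns the text with tags already stripped, so the initial split on "<br>" has nothing to split on. What follows is only a sketch of the same steps with every match checked, run on a made-up sample string rather than the live page. The sub name parse_chatter and the sample text are hypothetical, and the real page text may differ from the sample in ways this does not handle.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (\@records, \@unmatched) for a page's text.
sub parse_chatter {
    my ($text) = @_;
    my (@records, @unmatched);

    # Keep only what sits between a "Next Chatter >" and the following
    # "< Previous Chatter" navigation bar.
    while ($text =~ /Next Chatter >(.*?)< Previous Chatter/gs) {
        my $block = $1;    # copy now: the matches below overwrite $1

        # Every entry ends in "(N minutes ago)", so split right after it.
        for my $line (split /(?<=ago\))\s*/, $block) {
            next unless $line =~ /\S/;
            if ($line =~ /^\s*([^:]+): (.+) \((\d+) minutes? ago\)$/) {
                push @records, { name => $1, text => $2, delay => $3 };
            }
            else {
                push @unmatched, $line;    # inspect these when debugging
            }
        }
    }
    return (\@records, \@unmatched);
}

# Made-up sample in the same shape as the page text quoted in the question.
my $sample = 'ads < Previous Chatter | Next Chatter >alice: hi (11 minutes ago) '
           . 'bob: *waves* back (12 minutes ago) < Previous Chatter | Next Chatter > more ads';

my ($records, $unmatched) = parse_chatter($sample);
printf "NAME:%s\nText:%s\nDelay:%s\n\n", @{$_}{qw(name text delay)} for @$records;
warn "no match: $_\n" for @$unmatched;
```

Collecting the unmatched pieces instead of silently printing blanks makes it possible to see exactly which text the pattern rejected and adjust it from there.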