donfreenut has asked for the wisdom of the Perl Monks concerning the following question:


I'm getting a webpage with LWP::UserAgent and using split(/\n/, $response->content()) to get a list of the lines in the webpage.

I want to go through the data, ignoring it until I hit a line with a specific phrase. After seeing that phrase, I want to begin processing the data.

How can I do this?
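Here's roughly what I have so far (the URL is just a placeholder):

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;
    my $response = $ua->get('http://www.example.com/somepage.html');
    die "Couldn't fetch page: ", $response->status_line
        unless $response->is_success;

    my @lines = split /\n/, $response->content();

    foreach my $line (@lines) {
        # ??? skip everything until a line containing the phrase,
        # then start processing from there
    }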

Re: Sucking down a file
by Fastolfe (Vicar) on Jan 31, 2001 at 03:04 UTC
    You might try something like this:
        foreach (@lines) {
            next unless /start-phrase-here/ .. /stop-phrase-here/;
            # process
        }
    If you have no 'stop' phrase, you might use 'undef' on that side of the .. operator, or go with a more logical-looking flag approach:
        my $flag;
        foreach (@lines) {
            $flag++ if /start-phrase/;
            next unless $flag;
            # process
        }
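    To see how the .. (flip-flop) operator behaves line by line, here is a tiny self-contained example; the marker phrases and data are made up:

        use strict;
        use warnings;

        my @lines = (
            'header junk',
            'BEGIN DATA',
            'interesting line 1',
            'interesting line 2',
            'END DATA',
            'trailer junk',
        );

        foreach (@lines) {
            next unless /BEGIN DATA/ .. /END DATA/;
            print "$_\n";    # prints the lines from BEGIN DATA through END DATA, inclusive
        }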
    Update: I can't believe I didn't mention this originally, but you should be aware that parsing HTML on your own is going to be extremely hard and unreliable unless you have precise control over the formatting of the page. It's better to use an HTML::Parser-derived module to pull the HTML data into Perl, and then work with the resulting Perl data structure to get what you need.
Re: Sucking down a file
by ichimunki (Priest) on Jan 31, 2001 at 03:14 UTC
    Unreliably {grin}. I realize you are probably extracting from a consistent source, but webpages don't necessarily have \n delimiters in logical places, and some sites may have \r characters in them as well. I'd suggest using something like HTML::TokeParser. It will break the page up into tokens consisting of start tags, end tags, and text (and a few other things you probably don't care about). You can easily grab and discard tokens until you get to the one that matches your criterion, then start processing from there.
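    A rough sketch of that approach (the URL and phrase are placeholders, and the processing is left as a stub):

        use strict;
        use warnings;
        use LWP::UserAgent;
        use HTML::TokeParser;

        my $response = LWP::UserAgent->new->get('http://www.example.com/somepage.html');
        die "Fetch failed: ", $response->status_line unless $response->is_success;

        my $html = $response->content;
        my $p    = HTML::TokeParser->new(\$html) or die "Couldn't create parser";

        # discard tokens until we hit a text token containing the phrase
        while (my $token = $p->get_token) {
            last if $token->[0] eq 'T' and $token->[1] =~ /specific phrase/;
        }

        # everything from here on comes after the phrase
        while (my $token = $p->get_token) {
            # process each remaining token ($token->[0] is 'S', 'E', 'T', ...)
        }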
Re: Sucking down a file
by Vane (Novice) on Jan 31, 2001 at 03:57 UTC
    Yet another approach: don't break the content into lines at all.

        while ($html =~ m/.../gis) {
            push @pos, pos $html;    # see the pos doc
        }
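    For instance, assuming $html holds $response->content() and the phrase is a placeholder, you could then process everything from the first match onward:

        my @pos;
        while ($html =~ m/specific phrase/gis) {
            push @pos, pos $html;    # offset just past each match; see perldoc -f pos
        }

        if (@pos) {
            my $rest = substr($html, $pos[0]);
            # process $rest, which starts right after the first occurrence of the phrase
        }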
Re: Sucking down a file
by YaRness (Initiate) on Jan 31, 2001 at 22:07 UTC
    I dunno if this quite applies, but I work with logs that I usually pump into arrays. These logs have a lot of garbage, so I wrote a quick subroutine to basically shift off lines I don't care about until I find what I'm looking for.

    So, assuming you have no problem extracting the text you wanna look at, you can use something like this:
        sub shift_until($\@) {
            # usage: shift_until($somepattern, @list);
            # this takes a @list and throws away lines until it hits $pattern
            # $pattern can be a regexp.
            # if $pattern isn't found, the list is emptied completely.
            # it's good for parsing through garbage
            # (hey, i never said it would be useful to everybody)
            my $pattern   = shift;
            my $array_ref = $_[0];
            while (@$array_ref and not $$array_ref[0] =~ /$pattern/) {
                shift @$array_ref;
            }
            (@$array_ref and return 1) or return 0;
        }
    This could easily be modified to non-destructively seek to a position in the array, or to use a file instead of an array, ad nauseam.
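    For example, with a made-up pattern and list (just to show the calling convention):

        my @lines = ('junk', 'more junk', 'START HERE', 'good stuff', 'more good stuff');

        if (shift_until('START HERE', @lines)) {
            # @lines now begins at the 'START HERE' line
            print "$_\n" for @lines;
        }
        else {
            print "pattern not found, the list is now empty\n";
        }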