PerlMonks  

Re: Split web page, first 30 lines only -- :content_cb trick

by Discipulus (Canon)
on Feb 28, 2017 at 09:08 UTC


in reply to Split file, first 30 lines only

Hello wrkrbeee,

You already have answers that solve your problem but, hoping not to confuse you, I propose another solution (in Perl there is always one!).

You are not forced to read the entire web page, which can be expensive for a large number of pages.

In fact, the get method of LWP::UserAgent fetches the whole content unless you instruct it to behave differently. You can specify a :content_cb, i.e. a callback that is invoked for every chunk the agent retrieves from the remote server.

This removes the need to apply the 30-line logic to each whole page after it has been downloaded.

Look at the docs of LWP::UserAgent, at this post by master zentara, and at the following working example to get an idea of what I mean:

use strict;
use warnings;
use LWP::UserAgent;

my @pages = ('http://www.perlmonks.org','http://perldoc.org');
my $ua = LWP::UserAgent->new;
# the line count is global
my $read_lines=1;

foreach my $url (@pages){
    my $response = $ua->get($url, ':content_cb'=>\&head_only);
}

sub head_only{
    my ($data,$resp,$protocol) = @_;
    my @lines = split "\n", $data;
    foreach my $line (@lines){
        if ($read_lines == 31){
            # reset the line count
            $read_lines = 1;
            print +("=" x 70),"\n";
            # die inside this callback interrupt the request, not the program!!
            # see LWP::UserAgent docs
            die;
        }
        else{
            print "line $read_lines: $line\n"
        }
        $read_lines++;
    }
}
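One caveat with a global counter: a page shorter than 30 lines never reaches the reset, so its count carries over into the next URL. A minimal sketch of one way around that, assuming a hypothetical helper named make_head_cb (it is not part of LWP) that builds a fresh callback with its own counter for each request. The pages below are simulated strings, so the sketch runs without a network:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of LWP): returns a callback with a private counter.
sub make_head_cb {
    my ($limit, $out) = @_;      # $out: array ref that receives the lines
    my $seen = 0;
    return sub {
        my ($data) = @_;         # LWP also passes the response and protocol objects
        for my $line (split /\n/, $data) {
            die "limit reached\n" if $seen >= $limit;   # aborts the request only
            push @$out, $line;
            $seen++;
        }
    };
}

# Simulated pages: the first one is shorter than the limit.
my @pages = ("a\nb\n", "1\n2\n3\n4\n5\n");
my @results;
for my $page (@pages) {
    my @lines;
    my $cb = make_head_cb(3, \@lines);
    eval { $cb->($page) };       # stands in for LWP catching the die
    push @results, [@lines];
}
print scalar(@$_), " line(s): @$_\n" for @results;
```

With LWP, the same closure would be passed as ':content_cb' => make_head_cb(30, \@lines) in the call to $ua->get, once per URL.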

HtH

L*

There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

Replies are listed 'Best First'.
Re^2: Split web page, first 30 lines only -- :content_cb trick
by wrkrbeee (Scribe) on Feb 28, 2017 at 14:19 UTC
    Thank you Discipulus!
Re^2: Split web page, first 30 lines only -- :content_cb trick
by wrkrbeee (Scribe) on Feb 28, 2017 at 21:18 UTC
    Hi Discipulus, Used your code suggestion (see below). I'm guessing that the $response variable will contain the 30 lines I'm looking for. As is, $response is empty. Any ideas?? Thanks so much! Rick
    use strict;
    use warnings;
    use Tie::File;
    use Fcntl;
    use LWP::UserAgent;
    use File::Slurp;

    my @lines;

    #Transfer URLS to a string variable;
    my $file = "G:/Research/SEC filings 10K and 10Q/Data/sizefiles1.txt";

    #Now fill @pages array with contents of sizefile1.txt ... how?
    open (FH, "< $file") or die "Can't open $file for read: $!";
    my @pages = <FH>;
    close FH or die "Cannot close $file: $!";

    #connect variable used with GET??
    my $ua = LWP::UserAgent-> new;

    #Initialize line counter;
    my $read_lines=1;

    #Primary loop through URLs ;
    foreach my $url (@pages) {
        my $response = $ua->get($url,':content_cb'=>\&head_only);
        print $response->content;
    }

    #Subroutine for primary loop;
    sub head_only {
        my ($data,$response,$protocol) = @_;
        my @lines = split "\n", $data;
        foreach my $line (@lines) {
            if ($read_lines ==31) {
                #reset line count'
                $read_lines = 1;
                print +("=" x 70), "\n"; #what is this?
                #die inside callback interrupt;
                die;
            }
            else {
                #print "line $read_lines: $line\n";
            }
        }
    }

      Hello wrkrbeee,

      I think Discipulus provided this sample code to demonstrate a useful approach which you can adapt to your particular needs. If you want to process the read-in lines in the calling code (your “Primary loop”) rather than in the callback function, then you need to store the lines in a shared variable rather than print them in sub head_only. There is an additional complication: the last line read from the current chunk of data may not be complete, so you need to check for a trailing newline and handle its absence appropriately:
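A minimal sketch of that approach (an illustration under stated assumptions, not Athanasius's original code). The names @head, $pending and $limit are made up for the example, and the chunks are simulated strings rather than data from a real request:

```perl
use strict;
use warnings;

my @head;          # shared: filled by the callback, read by the calling code
my $pending = '';  # incomplete last line carried over from the previous chunk
my $limit   = 30;

sub head_only {
    my ($data) = @_;
    $pending .= $data;
    # Consume only complete lines; anything after the last newline stays pending.
    while ($pending =~ s/\A([^\n]*)\n//) {
        push @head, $1;
        die "got $limit lines\n" if @head >= $limit;
    }
}

# Simulated chunks; the first two end in the middle of a line.
$limit = 3;
eval { head_only($_) for "first li", "ne\nsecond\nthi", "rd\nfourth\n" };
print "$_\n" for @head;
```

In the primary loop, @head and $pending would need to be emptied before each new URL. If a page ends without a final newline, its last fragment is still sitting in $pending after the request returns.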

      print +("=" x 70), "\n"; #what is this?

      The x operator creates a string of 70 equals characters concatenated together:

      ======================================================================

      — see perlop#Multiplicative-Operators. The plus sign is there to prevent the Perl parser from thinking that the parentheses contain the entire argument list to the print function — see print.
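A short sketch of both points; the variable name $rule is used only for illustration:

```perl
use strict;
use warnings;

my $rule = "=" x 70;            # repetition operator: seventy "=" characters
print length($rule), "\n";      # 70

# With the leading plus, the parenthesised expression is just the first
# argument, so the newline is printed as well.
print +("=" x 70), "\n";

# Without the plus, this parses as (print("=" x 70)), "\n": the newline is
# never printed, and perl warns about it under "use warnings".
# print ("=" x 70), "\n";
```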

      Hope that helps,

      Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

        Thank you Athanasius! Appreciate your time and effort!
      Well, you got a good answer from esteemed brother Athanasius, and you are right: in my code, my $response = $ua->get($url, ... could have been simply $ua->get($url, ... because the 30 lines are printed in the callback.

      Anyway, $response is not empty: if you dump it (I use the dd function from Data::Dump) you'll see it is completely full of stuff, except for the _content field.

      So it is $response->content that is empty, not $response itself.

      The docs say that the callback receives three arguments: a chunk of data, a reference to the response object, and a reference to the protocol object.

      So you get a handy reference to the response object, and I guess you can use it to populate its _content field. If you modify the else part of the head_only sub like this:

      else{
          $$resp{_content}.="$line\n"
          # print "line $read_lines: $line\n"
      }

      You can now print $response->content; and get the 30 lines only. Fun, no? Thanks for letting me investigate such a useful feature.
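The _content key is an internal detail of the response object. A sketch of the same idea through the public add_content method of HTTP::Message, assuming the HTTP::Response class that LWP returns is installed. The response here is built by hand and the helper, hypothetically named keep_line, is called directly, so no request is made:

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: what the else branch of head_only could call.
sub keep_line {
    my ($resp, $line) = @_;
    $resp->add_content("$line\n");   # public counterpart of $$resp{_content} .= ...
}

my $resp = HTTP::Response->new(200, 'OK');
keep_line($resp, $_) for 'first line', 'second line';
print $resp->content;
```

Inside the real callback, $resp would be the second argument that LWP passes in.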

      L*

        Thank you Discipulus! I appreciate your help very much!
