PerlMonks  

Split file, first 30 lines only

by wrkrbeee (Scribe)
on Feb 27, 2017 at 22:44 UTC ( [id://1183027]=perlquestion: print w/replies, xml ) Need Help??

wrkrbeee has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perl Monks, I am scraping data from web pages, where I only need say, the first 30 lines. I've used Perl's "split" function to "attempt" to read the file line-by-line, although I'm not overly successful. As is, I am able to obtain the desired output, albeit at the expense of reading the entire file. Hence, I need your assistance to tweak the code below (relevant loop only) such that I read only the first 30 lines. I am grateful for any insight you may have, including tips/suggestions for improving the code. Thank you! Rick
    $file_count = 0;
    foreach $filetoget (@aonly) {
        $fullfile = "$base_url/$filetoget";
        my $line_count = 0;
        for my $line (split qr/\'\n'/, get($fullfile)) {
            if ($line =~ m/^\s*CENTRAL\s*INDEX\s*KEY:\s*(\d*)/m)            { $cik         = $1; }
            if ($line =~ m/^\s*FORM\s*TYPE:\s*(.*$)/m)                      { $form_type   = $1; }
            if ($line =~ m/^\s*CONFORMED\s*PERIOD\s*OF\s*REPORT:\s*(\d*)/m) { $report_date = $1; }
            if ($line =~ m/^\s*FILED\s*AS\s*OF\s*DATE:\s*(\d*)/m)           { $file_date   = $1; }
            if ($line =~ m/^\s*COMPANY\s*CONFORMED\s*NAME:\s*(.*$)/m)       { $name        = $1; }
            $line_count++;
            print "$cik, $form_type, $report_date, $file_date, $name\n";
            print "$line_count,' ', $line,' '\n";
        }

        ### Now write the results to file!
        # Open the output file;
        open my $FH_OUT, '>>', $write_dir or die "Can't open file $write_dir";

        # Save/write results/output;
        $, = '|';
        print $FH_OUT "$cik$,$form_type$,$report_date$,$file_date$,$name$,\n";
        #close $FH_IN or die "unable to close $filename";

        # Update file counter;
        ++$file_count;
        print "$file_count\n";
        print "$line_count lines read from $fullfile\n";
        #closedir($dir_handle);
        close($FH_OUT);
    }

Replies are listed 'Best First'.
Re: Split web page, first 30 lines only -- :content_cb trick
by Discipulus (Canon) on Feb 28, 2017 at 09:08 UTC
    Hello wrkrbeee,

    you have already received answers that solve your problem, but, hoping not to confuse you, I propose another solution (in Perl there is always one!).

    You are not forced to read the entire web page (which can be an expensive task for a large number of pages).

    In fact, get from LWP::UserAgent retrieves the whole content unless you instruct it to behave differently. You can specify a :content_cb, i.e. a callback to invoke for every chunk the agent retrieves from the remote server.

    This bypasses the need to apply the 30-line logic to each whole page you get.

    Look at the docs of LWP::UserAgent, at this post by master zentara, and at the following working example to get an idea of what I mean:

    use strict;
    use warnings;
    use LWP::UserAgent;

    my @pages = ('http://www.perlmonks.org', 'http://perldoc.org');
    my $ua = LWP::UserAgent->new;

    # the line count is global
    my $read_lines = 1;

    foreach my $url (@pages) {
        my $response = $ua->get($url, ':content_cb' => \&head_only);
    }

    sub head_only {
        my ($data, $resp, $protocol) = @_;
        my @lines = split "\n", $data;
        foreach my $line (@lines) {
            if ($read_lines == 31) {
                # reset the line count
                $read_lines = 1;
                print +("=" x 70), "\n";
                # die inside this callback interrupts the request, not the program!!
                # see LWP::UserAgent docs
                die;
            }
            else {
                print "line $read_lines: $line\n";
            }
            $read_lines++;
        }
    }

    HtH

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
      Thank you Discipulus!
      Hi Discipulus, Used your code suggestion (see below). I'm guessing that the $response variable will contain the 30 lines I'm looking for. As is, $response is empty. Any ideas?? Thanks so much! Rick
      use strict;
      use warnings;
      use Tie::File;
      use Fcntl;
      use LWP::UserAgent;
      use File::Slurp;

      my @lines;

      # Transfer URLs to a string variable;
      my $file = "G:/Research/SEC filings 10K and 10Q/Data/sizefiles1.txt";

      # Now fill @pages array with contents of sizefiles1.txt ... how?
      open(FH, "< $file") or die "Can't open $file for read: $!";
      my @pages = <FH>;
      close FH or die "Cannot close $file: $!";

      # connect variable used with GET??
      my $ua = LWP::UserAgent->new;

      # Initialize line counter;
      my $read_lines = 1;

      # Primary loop through URLs;
      foreach my $url (@pages) {
          my $response = $ua->get($url, ':content_cb' => \&head_only);
          print $response->content;
      }

      # Subroutine for primary loop;
      sub head_only {
          my ($data, $response, $protocol) = @_;
          my @lines = split "\n", $data;
          foreach my $line (@lines) {
              if ($read_lines == 31) {
                  # reset line count
                  $read_lines = 1;
                  print +("=" x 70), "\n";    # what is this?
                  # die inside callback interrupts the request
                  die;
              }
              else {
                  # print "line $read_lines: $line\n";
              }
          }
      }

        Hello wrkrbeee,

        I think Discipulus provided this sample code to demonstrate a useful approach which you can adapt to your particular needs. If you want to process the read-in lines in the calling code (your “Primary loop”) rather than in the callback function, then you need to store the lines in a shared variable rather than print them in sub head_only. There is an additional complication: the last line read from the current chunk of data may not be complete, so you need to check for a trailing newline and handle its absence appropriately:
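        One way to handle that complication, sketched here with hypothetical names (collect_lines, @collected, $partial) and simulated string chunks standing in for real network data:

```perl
use strict;
use warnings;

# A sketch of buffering an incomplete trailing line between chunks,
# as a :content_cb receives them. All names here are illustrative.
my @collected;        # finished lines, shared with the calling code
my $partial = '';     # unfinished line carried over between chunks

sub collect_lines {
    my ($data) = @_;
    # split with LIMIT -1 keeps a trailing empty field, so $partial
    # becomes '' when the chunk ends exactly on a newline
    my @pieces = split /\n/, $partial . $data, -1;
    $partial = pop @pieces;
    push @collected, @pieces;
}

# Simulated chunks: "line two" is split across the chunk boundary.
collect_lines("line one\nline tw");
collect_lines("o\nline three\n");
push @collected, $partial if length $partial;
# @collected now holds: "line one", "line two", "line three"
```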

        print +("=" x 70), "\n"; #what is this?

        The x operator creates a string of 70 '=' characters:

        ======================================================================

        — see perlop#Multiplicative-Operators. The plus sign is there to prevent the Perl parser from thinking that the parentheses contain the entire argument list to the print function — see print.
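        The parse difference can be seen in a short sketch:

```perl
use strict;
use warnings;

# Without the unary '+', Perl takes the parentheses as print's
# complete argument list, so the trailing "\n" is evaluated and
# discarded:
#   print ("=" x 5), "\n";   # prints "=====" with no newline
# The '+' forces the parentheses to be read as part of an expression:
print +("=" x 5), "\n";      # prints "=====" followed by a newline
```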

        Hope that helps,

        Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

        Well, you got a good answer from esteemed brother Athanasius, and you are right: in my code, my $response = $ua->get($url, ... could simply have been $ua->get($url, ... because the 30 lines are printed in the callback.

        Anyway, $response is not empty: if you dump it (I use Data::Dump's dd function) you'll see it is completely full of stuff except for the _content field.

        So it is $response->content that is empty, not $response itself.

        The docs say that the callback receives three arguments: a chunk of data, a reference to the response object, and a reference to the protocol object.

        So you get a handy reference to the response object, and I guess you can use it to populate its _content field. If you modify the else part of the head_only sub like this:

        else {
            $$resp{_content} .= "$line\n";
            # print "line $read_lines: $line\n";
        }

        You can now print $response->content; and get only the 30 lines. Fun, no? Thanks for letting me investigate such a useful feature!

        L*

        There are no rules, there are no thumbs..
        Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re: Split file, first 30 lines only
by hippo (Bishop) on Feb 27, 2017 at 22:55 UTC
    last if $line > 29;

    See last.

      Thanks hippo, I understand the last function, but the key is where to insert it. I tried immediately after updating the line counter, but it still reads the whole file. Sorry to be so inept.

        but still reads whole file.
        That's what get($fullfile) does: it reads the whole file at once.

        And I think hippo meant

        last if $line_count > 29;
        Put it right after $line_count++;. You will still read the whole file via get, but then only process the beginning.
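        That placement can be sketched as follows; a literal string stands in for get($fullfile) so the example runs without a network connection, and the variable names follow the original post:

```perl
use strict;
use warnings;

# A sketch of huck's suggestion: stop processing after 30 lines.
# A literal 100-line string stands in for get($fullfile).
my $content = join "\n", map { "line $_" } 1 .. 100;

my $line_count = 0;
for my $line (split /\n/, $content) {
    # ... the pattern matches on $line go here ...
    $line_count++;
    last if $line_count > 29;    # leave the loop after the 30th line
}
print "$line_count lines processed\n";    # prints "30 lines processed"
```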

        Apologies. huck is correct: it's obviously the $line_count variable which should be tested rather than $line in this instance.

        This will only stop it processing the whole file. If you don't want to download the whole file then that's a different matter entirely and would require use of a technique such as HTTP Ranges.
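        A minimal sketch of that HTTP Range technique, built on LWP's request objects; the URL and the byte count here are placeholders, and note that a server which does not honour byte ranges will simply send the full body:

```perl
use strict;
use warnings;
use HTTP::Request;

# A sketch of an HTTP Range request: ask the server for only the
# first 8 KiB instead of the whole page. Servers that honour byte
# ranges reply with status 206 (Partial Content); others ignore the
# header, so the receiving code must cope with either outcome.
my $req = HTTP::Request->new(GET => 'http://www.example.com/');
$req->header('Range' => 'bytes=0-8191');

# Passing $req to LWP::UserAgent->request would then fetch at most
# 8192 bytes from a cooperating server:
#   my $response = $ua->request($req);
#   # $response->code == 206 when the range was honoured
```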

Re: Split file, first 30 lines only
by 13gurpreetsingh (Novice) on Feb 28, 2017 at 06:54 UTC
    If on unix, you may go for :
    for my $line (`head -30 $fullfile`)
    instead of
    for my $line (split qr/\'\n'/, get($fullfile))
      for my $line (`head -30 $fullfile`)

      Consider that not every OS has the head command; this suggested route of using backticks to run a system command isn't portable.

        That's correct. That is why I said "If on unix".
Re: Split file, first 30 lines only
by wrkrbeee (Scribe) on Feb 28, 2017 at 20:44 UTC
    Hi everyone, Discipulus provided the code below, adapted for my scenario. The program runs, but reads the entire file rather than the first 30 lines. Does anyone have any ideas why? Thank you!
    use strict;
    use warnings;
    use Tie::File;
    use Fcntl;
    use LWP::UserAgent;
    use File::Slurp;

    my @lines;

    # Transfer URLs to a string variable;
    my $file = "G:/Research/SEC filings 10K and 10Q/Data/sizefiles1.txt";

    # Now fill @pages array with contents of sizefiles1.txt ... how?
    open(FH, "< $file") or die "Can't open $file for read: $!";
    my @pages = <FH>;
    close FH or die "Cannot close $file: $!";

    # connect variable used with GET??
    my $ua = LWP::UserAgent->new;

    # Initialize line counter;
    my $read_lines = 1;

    # Primary loop through URLs;
    foreach my $url (@pages) {
        my $response = $ua->get($url, ':content_cb' => \&head_only);
        print "$response\n";    # should only be 30 lines, right? it's a lot more
    }

    # Subroutine for primary loop;
    sub head_only {
        my ($data, $response, $protocol) = @_;
        my @lines = split "\n", $data;
        foreach my $line (@lines) {
            if ($read_lines == 31) {
                # reset line count
                $read_lines = 1;
                print +("=" x 70), "\n";    # what is this?
                # die inside callback interrupts the request
                die;
            }
            else {
                print "line $read_lines: $line\n Success Success";
            }
        }
    }
