PerlMonks  

Split file, first 30 lines only

by wrkrbeee (Scribe)
on Feb 27, 2017 at 22:44 UTC ( [id://1183027]=perlquestion: print w/replies, xml ) Need Help??

wrkrbeee has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perl Monks, I am scraping data from web pages, where I only need say, the first 30 lines. I've used Perl's "split" function to "attempt" to read the file line-by-line, although I'm not overly successful. As is, I am able to obtain the desired output, albeit at the expense of reading the entire file. Hence, I need your assistance to tweak the code below (relevant loop only) such that I read only the first 30 lines. I am grateful for any insight you may have, including tips/suggestions for improving the code. Thank you! Rick
    $file_count = 0;
    foreach $filetoget (@aonly) {
        $fullfile = "$base_url/$filetoget";
        my $line_count = 0;
        for my $line (split qr/\'\n'/, get($fullfile)) {
            if ($line =~ m/^\s*CENTRAL\s*INDEX\s*KEY:\s*(\d*)/m)            { $cik         = $1; }
            if ($line =~ m/^\s*FORM\s*TYPE:\s*(.*$)/m)                      { $form_type   = $1; }
            if ($line =~ m/^\s*CONFORMED\s*PERIOD\s*OF\s*REPORT:\s*(\d*)/m) { $report_date = $1; }
            if ($line =~ m/^\s*FILED\s*AS\s*OF\s*DATE:\s*(\d*)/m)           { $file_date   = $1; }
            if ($line =~ m/^\s*COMPANY\s*CONFORMED\s*NAME:\s*(.*$)/m)       { $name        = $1; }
            $line_count++;
            print "$cik, $form_type, $report_date, $file_date, $name\n";
            print "$line_count,' ', $line,' '\n";
        }

        ### Now write the results to file!
        # Open the output file;
        open my $FH_OUT, '>>', $write_dir or die "Can't open file $write_dir";

        # Save/write results/output;
        $, = '|';
        print $FH_OUT "$cik$,$form_type$,$report_date$,$file_date$,$name$,\n";
        #close $FH_IN or die "unable to close $filename";

        # Update file counter;
        ++$file_count;
        print "$file_count\n";
        print "$line_count lines read from $fullfile\n";
        #closedir($dir_handle);
        close($FH_OUT);
    }

Replies are listed 'Best First'.
Re: Split web page, first 30 lines only -- :content_cb trick
by Discipulus (Canon) on Feb 28, 2017 at 09:08 UTC
    Hello wrkrbeee,

    you have already received answers that solve your problem, but, hoping not to confuse you, I propose another solution (in Perl there is always one!).

    You are not forced to read the entire web page (which can be an expensive task for a large number of pages).

    In fact, get from LWP::UserAgent retrieves the whole content unless you instruct it to behave differently. You can specify a :content_cb, i.e. a callback to invoke for every chunk the agent retrieves from the remote server.

    This bypasses the need to apply the 30-line logic to each whole page you get.

    Look at the docs of LWP::UserAgent, at this post by master zentara, and at the following working example to get an idea of what I mean:

    use strict;
    use warnings;
    use LWP::UserAgent;

    my @pages = ('http://www.perlmonks.org', 'http://perldoc.org');
    my $ua = LWP::UserAgent->new;

    # the line count is global
    my $read_lines = 1;

    foreach my $url (@pages) {
        my $response = $ua->get($url, ':content_cb' => \&head_only);
    }

    sub head_only {
        my ($data, $resp, $protocol) = @_;
        my @lines = split "\n", $data;
        foreach my $line (@lines) {
            if ($read_lines == 31) {
                # reset the line count
                $read_lines = 1;
                print +("=" x 70), "\n";
                # die inside this callback interrupts the request, not the program!!
                # see LWP::UserAgent docs
                die;
            }
            else {
                print "line $read_lines: $line\n";
            }
            $read_lines++;
        }
    }

    HtH

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
      Thank you Discipulus!
      Hi Discipulus, Used your code suggestion (see below). I'm guessing that the $response variable will contain the 30 lines I'm looking for. As is, $response is empty. Any ideas?? Thanks so much! Rick
      use strict;
      use warnings;
      use Tie::File;
      use Fcntl;
      use LWP::UserAgent;
      use File::Slurp;

      my @lines;

      # Transfer URLs to a string variable;
      my $file = "G:/Research/SEC filings 10K and 10Q/Data/sizefiles1.txt";

      # Now fill @pages array with contents of sizefiles1.txt ... how?
      open(FH, "< $file") or die "Can't open $file for read: $!";
      my @pages = <FH>;
      close FH or die "Cannot close $file: $!";

      # connect variable used with GET??
      my $ua = LWP::UserAgent->new;

      # Initialize line counter;
      my $read_lines = 1;

      # Primary loop through URLs;
      foreach my $url (@pages) {
          my $response = $ua->get($url, ':content_cb' => \&head_only);
          print $response->content;
      }

      # Subroutine for primary loop;
      sub head_only {
          my ($data, $response, $protocol) = @_;
          my @lines = split "\n", $data;
          foreach my $line (@lines) {
              if ($read_lines == 31) {
                  # reset line count
                  $read_lines = 1;
                  print +("=" x 70), "\n";    # what is this?
                  # die inside callback interrupts the request
                  die;
              }
              else {
                  # print "line $read_lines: $line\n";
              }
          }
      }

        Hello wrkrbeee,

        I think Discipulus provided this sample code to demonstrate a useful approach which you can adapt to your particular needs. If you want to process the read-in lines in the calling code (your “Primary loop”) rather than in the callback function, then you need to store the lines in a shared variable rather than print them in sub head_only. There is an additional complication: the last line read from the current chunk of data may not be complete, so you need to check for a trailing newline and handle its absence appropriately:
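        One way to handle that complication, sketched here with hypothetical names (collect_lines, @collected, $partial) and simulated string chunks standing in for real network data:

```perl
use strict;
use warnings;

# A sketch of buffering an incomplete trailing line between chunks,
# as a :content_cb receives them. All names here are illustrative.
my @collected;        # finished lines, shared with the calling code
my $partial = '';     # unfinished line carried over between chunks

sub collect_lines {
    my ($data) = @_;
    # split with LIMIT -1 keeps a trailing empty field, so $partial
    # becomes '' when the chunk ends exactly on a newline
    my @pieces = split /\n/, $partial . $data, -1;
    $partial = pop @pieces;
    push @collected, @pieces;
}

# Simulated chunks: "line two" is split across the chunk boundary.
collect_lines("line one\nline tw");
collect_lines("o\nline three\n");
push @collected, $partial if length $partial;
# @collected now holds: "line one", "line two", "line three"
```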

        print +("=" x 70), "\n"; #what is this?

        The x operator creates a string of 70 '=' characters:

        ======================================================================

        — see perlop#Multiplicative-Operators. The plus sign is there to prevent the Perl parser from thinking that the parentheses contain the entire argument list to the print function — see print.
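        The parse difference can be seen in a short sketch:

```perl
use strict;
use warnings;

# Without the unary '+', Perl takes the parentheses as print's
# complete argument list, so the trailing "\n" is evaluated and
# discarded:
#   print ("=" x 5), "\n";   # prints "=====" with no newline
# The '+' forces the parentheses to be read as part of an expression:
print +("=" x 5), "\n";      # prints "=====" followed by a newline
```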

        Hope that helps,

        Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

        Well, you got a good answer from esteemed brother Athanasius, and you are right: in my code, my $response = $ua->get($url, ... could simply have been $ua->get($url, ... because the 30 lines are printed in the callback.

        Anyway, $response is not empty: if you dump it (I use Data::Dump's dd function) you'll see it is completely full of stuff except for the _content field.

        So it is $response->content that is empty, not $response itself.

        The docs say that the callback receives three arguments: a chunk of data, a reference to the response object, and a reference to the protocol object.

        So you get a handy reference to the response object, and I guess you can use it to populate its _content field. If you modify the else part of the head_only sub like this:

        else {
            $$resp{_content} .= "$line\n";
            # print "line $read_lines: $line\n";
        }

        You can now print $response->content; and get only the 30 lines. Fun, no? Thanks for letting me investigate such a useful feature!

        L*

        There are no rules, there are no thumbs..
        Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re: Split file, first 30 lines only
by hippo (Bishop) on Feb 27, 2017 at 22:55 UTC
    last if $line > 29;

    See last.

      Thanks hippo, I understand the last function, but the key is where to insert it. I tried immediately after updating the line counter, but it still reads the whole file. Sorry to be so inept.

        but still reads whole file.
        That's what get($fullfile) does: it reads the whole file at once.

        And I think hippo meant

        last if $line_count > 29;
        Put it right after $line_count++;. You will still read the whole file via get, but then only process the beginning.
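        That placement can be sketched as follows; a literal string stands in for get($fullfile) so the example runs without a network connection, and the variable names follow the original post:

```perl
use strict;
use warnings;

# A sketch of huck's suggestion: stop processing after 30 lines.
# A literal 100-line string stands in for get($fullfile).
my $content = join "\n", map { "line $_" } 1 .. 100;

my $line_count = 0;
for my $line (split /\n/, $content) {
    # ... the pattern matches on $line go here ...
    $line_count++;
    last if $line_count > 29;    # leave the loop after the 30th line
}
print "$line_count lines processed\n";    # prints "30 lines processed"
```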

        Apologies. huck is correct: it's obviously the $line_count variable which should be tested rather than $line in this instance.

        This will only stop it processing the whole file. If you don't want to download the whole file then that's a different matter entirely and would require use of a technique such as HTTP Ranges.
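        A minimal sketch of that HTTP Range technique, built on LWP's request objects; the URL and the byte count here are placeholders, and note that a server which does not honour byte ranges will simply send the full body:

```perl
use strict;
use warnings;
use HTTP::Request;

# A sketch of an HTTP Range request: ask the server for only the
# first 8 KiB instead of the whole page. Servers that honour byte
# ranges reply with status 206 (Partial Content); others ignore the
# header, so the receiving code must cope with either outcome.
my $req = HTTP::Request->new(GET => 'http://www.example.com/');
$req->header('Range' => 'bytes=0-8191');

# Passing $req to LWP::UserAgent->request would then fetch at most
# 8192 bytes from a cooperating server:
#   my $response = $ua->request($req);
#   # $response->code == 206 when the range was honoured
```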

Re: Split file, first 30 lines only
by 13gurpreetsingh (Novice) on Feb 28, 2017 at 06:54 UTC
    If on unix, you may go for :
    for my $line (`head -30 $fullfile`)
    instead of
    for my $line (split qr/\'\n'/, get($fullfile))
      for my $line (`head -30 $fullfile`)

      Consider that not every OS has the head command; this suggested route of using backticks to run a system command isn't portable.

        That's correct. That is why I said "If on unix".
Re: Split file, first 30 lines only
by wrkrbeee (Scribe) on Feb 28, 2017 at 20:44 UTC
    Hi everyone, Discipulus provided the code below, adapted for my scenario. The program runs, but reads the entire file rather than the first 30 lines. Does anyone have any ideas why? Thank you!
    use strict;
    use warnings;
    use Tie::File;
    use Fcntl;
    use LWP::UserAgent;
    use File::Slurp;

    my @lines;

    # Transfer URLs to a string variable;
    my $file = "G:/Research/SEC filings 10K and 10Q/Data/sizefiles1.txt";

    # Now fill @pages array with contents of sizefiles1.txt ... how?
    open(FH, "< $file") or die "Can't open $file for read: $!";
    my @pages = <FH>;
    close FH or die "Cannot close $file: $!";

    # connect variable used with GET??
    my $ua = LWP::UserAgent->new;

    # Initialize line counter;
    my $read_lines = 1;

    # Primary loop through URLs;
    foreach my $url (@pages) {
        my $response = $ua->get($url, ':content_cb' => \&head_only);
        print "$response\n";    # should only be 30 lines, right? it's a lot more
    }

    # Subroutine for primary loop;
    sub head_only {
        my ($data, $response, $protocol) = @_;
        my @lines = split "\n", $data;
        foreach my $line (@lines) {
            if ($read_lines == 31) {
                # reset line count
                $read_lines = 1;
                print +("=" x 70), "\n";    # what is this?
                # die inside callback interrupts the request
                die;
            }
            else {
                print "line $read_lines: $line\n Success Success";
            }
        }
    }
