learning.moose has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks

I am currently parsing a 50,000,000-line file, and I want to do it line by line. The most obvious way is to use <>:

while(my $line = <HANDLER>){}

But I want to do it via object environment, the goal is to do something like:

my $file = MyParser->new( 'file.csv' );
while ( my $line = $file->nextLine() ) { }

I am now wondering what the most efficient way to do it is. I don't want to read the whole file into memory, and I would also like to avoid too many I/O operations on my HDD. I tried Tie::File, but it is too slow: accessing $array[3500000] took 22 seconds. I also want to avoid using threads.

Re: Moose reading file line by line
by Corion (Patriarch) on Apr 27, 2015 at 12:46 UTC

    Why not store the filehandle in your object and read one line from it in the nextLine sub?

    Where does Moose come into play in your problem?

      I also was trying to do it this way:

      has 'handler' => ( isa => 'FileHandle', is => 'rw' );

      sub openHandler {
          my $self = shift;
          open( $self->handler, '<', $self->file ) or croak "Can't open file";
      }

      sub closeHandler {
          my $self = shift;
          close $self->handler;
      }

      sub readLine {
          my $self = shift;
          return <$self->handler>;
      }

      but I'm getting a Bareword syntax error

        It would be easier if you told me where Perl reports a syntax error.

        The following is not really valid Perl:

        return <$self->handler>;

        See readline or I/O Operators in perlop for how to use a more complex expression as a filehandle. You could also simplify your code to the following construct:

        sub readLine {
            my $self = shift;
            my $fh = $self->handler;
            return <$fh>;
        }

        I'm not sure whether your open $self->handler actually does the right thing, but then, I don't know how Moose proposes you store filehandles or what the 'FileHandle' type is supposed to do. I presume that $self->handler does not return a reference to the actual filehandle, so your open might be useless because you never store the opened filehandle back into the object.
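The suggestion above (open into a lexical, store that handle in the object, and read through a plain scalar) can be sketched as follows. This is a minimal illustration using a plain blessed hash rather than Moose, so the hash keys and the temporary test file are assumptions made for the example; the class and method names are borrowed from the question.

```perl
package MyParser;
use strict;
use warnings;
use Carp qw(croak);

sub new {
    my ($class, $file) = @_;
    return bless { file => $file, handler => undef }, $class;
}

sub openHandler {
    my $self = shift;
    # Open into a lexical first, then store the opened handle in the object.
    open(my $fh, '<', $self->{file})
        or croak "Can't open file '$self->{file}': $!";
    $self->{handler} = $fh;
    return $self;
}

sub nextLine {
    my $self = shift;
    my $fh = $self->{handler} or croak "File is not open";
    # <...> only accepts a bareword or a simple scalar, hence the copy into $fh.
    return scalar <$fh>;
}

sub closeHandler {
    my $self = shift;
    close $self->{handler} if $self->{handler};
    $self->{handler} = undef;
    return;
}

package main;
use File::Temp qw(tempfile);

# Hypothetical sample data so the sketch runs on its own.
my ($demo_fh, $demo_name) = tempfile(UNLINK => 1);
print {$demo_fh} "a,1\nb,2\n";
close $demo_fh;

my $file = MyParser->new($demo_name)->openHandler;
while (defined(my $line = $file->nextLine)) {
    print $line;
}
$file->closeHandler;
```

The loop tests with defined so that a final line consisting only of "0" does not end the iteration early.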

Re: Moose reading big file line by line
by BillKSmith (Monsignor) on Apr 27, 2015 at 13:36 UTC
    You want the nextLine method to buffer your input. The buffer should contain a block of lines from the file. The method returns the next line from the buffer. If the buffer is empty, refill it with the next block from the file. This is fairly straightforward if all lines are exactly the same length. If not, I doubt that the benefit is worth the effort to get it right.
    Bill
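For variable-length lines, the block-buffering idea described above might look roughly like this. It is only a sketch: the class name BufferedLineReader, the default block size of 65536 bytes, and the temporary test file are all assumptions made for illustration, and no claim is made that it beats Perl's built-in buffering.

```perl
package BufferedLineReader;
use strict;
use warnings;

sub new {
    my ($class, $path, $block_size) = @_;
    open(my $fh, '<', $path) or die "Can't open '$path': $!";
    return bless {
        fh    => $fh,
        size  => $block_size || 65536,   # assumed default block size
        lines => [],                     # complete lines not yet handed out
        tail  => '',                     # partial line left from the last block
        eof   => 0,
    }, $class;
}

# Read blocks until at least one complete line is buffered or the file ends.
sub _refill {
    my $self = shift;
    while (!@{ $self->{lines} } && !$self->{eof}) {
        my $got = read($self->{fh}, my $chunk, $self->{size});
        die "Read error: $!" unless defined $got;
        if ($got == 0) {
            $self->{eof} = 1;
            # A last line without a trailing newline is still a line.
            push @{ $self->{lines} }, $self->{tail} if length $self->{tail};
            $self->{tail} = '';
            last;
        }
        # Split after each newline so every piece keeps its line ending.
        my @parts = split /(?<=\n)/, $self->{tail} . $chunk;
        $self->{tail} = ($parts[-1] =~ /\n\z/) ? '' : pop @parts;
        push @{ $self->{lines} }, @parts;
    }
    return;
}

sub nextLine {
    my $self = shift;
    $self->_refill unless @{ $self->{lines} };
    return shift @{ $self->{lines} };    # undef once the file is exhausted
}

package main;
use File::Temp qw(tempfile);

# Hypothetical sample data so the sketch runs on its own.
my ($buf_fh, $buf_name) = tempfile(UNLINK => 1);
print {$buf_fh} "first\nsecond\nthird\n";
close $buf_fh;

my $reader = BufferedLineReader->new($buf_name, 8);
while (defined(my $line = $reader->nextLine)) {
    print $line;
}
```

The bookkeeping for the partial line at the end of each block is the part that is easy to get wrong, which is the effort the post above warns about.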
      Why not simply leave the buffering to the OS? Not every <readline> will necessarily cause a physical disk access. Unless the lines are very long, there is a good chance that many lines will be read at once and made available to you without again going to the disk.

      CountZero

      "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      My blog: Imperial Deltronics
Re: Moose reading big file line by line
by Laurent_R (Canon) on Apr 27, 2015 at 19:44 UTC
    Reading the file line by line will not damage your HDD, because Perl and the OS are buffering input by default.

    Now, what you really need to read a file line by line is an iterator, and Perl provides a very convenient file iterator, the readline function or the <$FH> operator.

    If you really want to do it OO, you can create an open_file method, a next_line method that will just return the next line of input to the caller, and a close_file method, but that seems to me to be overkill: these three functions already exist in core Perl. Why would you want to wrap them into OO abstraction? I appreciate that you are trying to learn Moose, but, IMHO, using Moose just to read a file line by line is just plain over-engineering.
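If a thin wrapper is still wanted, a closure is one lightweight way to package readline as an iterator without any object system. The function name make_line_iterator and the temporary test file below are hypothetical, chosen only for this sketch.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Return a closure that yields one line per call and undef at end of file.
sub make_line_iterator {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Can't open '$path': $!";
    return sub {
        return unless $fh;               # already exhausted and closed
        my $line = readline($fh);
        unless (defined $line) {
            close $fh;
            undef $fh;
            return;
        }
        return $line;
    };
}

# Hypothetical sample data so the sketch runs on its own.
my ($it_fh, $it_name) = tempfile(UNLINK => 1);
print {$it_fh} "one\ntwo\nthree\n";
close $it_fh;

my $next_line = make_line_iterator($it_name);
while (defined(my $line = $next_line->())) {
    print $line;
}
```

The closure keeps the filehandle private and closes it once the input is exhausted, so the caller only ever deals with a single code reference.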

    Je suis Charlie.
Re: Moose reading big file line by line
by CountZero (Bishop) on Apr 28, 2015 at 14:48 UTC
    I'm not sure it is such a good idea to use Moose (or any other object system) for this task, but nothing tried, nothing gained!

    In FileReader.pm

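    package FileReader;
    use Moose;
    use Moose::Util::TypeConstraints;
    use IO::File;

    # A FileHandle subtype that can be coerced from a file name.
    subtype 'FileHandleFromStr', as 'FileHandle';

    # Open the named file for reading when a string is passed in.
    coerce 'FileHandleFromStr',
        from 'Str',
        via { IO::File->new("< $_") };

    has file => (
        is     => 'ro',
        isa    => 'FileHandleFromStr',
        coerce => 1,
    );

    # Return the next line, or undef at end of file.
    sub read_next_line {
        my $self = shift;
        return $self->file->getline;
    }

    1;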
    In your script:
    use Modern::Perl qw/2014/;
    use lib 'C:/Data/strawberry/script-chrome';
    use FileReader;
    use Data::Dump qw /dump/;

    my $file = FileReader->new( file => 'C:/Data/strawberry/script-chrome/test.txt' );

    while ( my $line = $file->read_next_line ) {
        print $line;
    }
    Moose will take care of opening the file and attaching the filehandle to your object. You then call the method read_next_line on this object to ... read the next line.

    CountZero

      Who said anything about Strawberry Perl?
        I just happen to use Strawberry Perl, but I fail to see the relevance of your comment.

        CountZero
