Reading (the same) data in different ways & memory usage

Neighbour has asked for the wisdom of the Perl Monks concerning the following question:

Having finally found a working Devel::-module from CPAN (Devel::Size) I'm trying to figure out why my memory usage goes through the roof and into the swapfile when reading large chunks of data.

The data concerned can vary a lot, but in this testcase it's a recordset with 119 fields per record and 47039 records in the set.

Performing a simple `my $ar_data = $db->selectall_arrayref("SELECT * FROM testtable", { Slice => {} });` yields a recordset that, according to `Devel::Size::total_size`, is 449337164 bytes. This is about 9552 bytes/record. I can live with that.

However, after writing the same data to a fixed-length file and reading it back into a new variable, the size turns out to be 773605490 bytes, or 16446 bytes/record. The code used to read the data:

# ReadData ($filename) returns ar_data
sub ReadData ($$) {
    my ($self, $filename) = @_;
    my $ar_returnvalue = [];
    if (!-e "$filename") {
        Carp::carp("File [$filename] does not exist");
        return undef;
    }
    open (FLATFILE, '<', $filename) or Carp::croak("Cannot open file [$filename]");
    while (<FLATFILE>) {
        chomp;
        push (@{$ar_returnvalue}, Interfaces::FlatFile::ReadRecord($self, $_));
    }
    close (FLATFILE);
    return $ar_returnvalue;
} ## end sub ReadData ($$)

sub ReadRecord ($$) {
    my ($self, $textinput) = @_;
    my $hr_returnvalue = {};
    my $CurrentColumnName;
    for (0 .. $#{$self->columns}) {
        $CurrentColumnName = $self->columns->[$_];
        if (!(defined $self->flatfield_start->[$_] and defined $self->flatfield_length->[$_])) {
            # Field is missing interface_start, interface_length or both, skip it.
            next;
        }
        $hr_returnvalue->{$CurrentColumnName} = substr ($textinput, $self->flatfield_start->[$_], $self->flatfield_length->[$_]);
        $hr_returnvalue->{$CurrentColumnName} =~ s/^\s*(.*?)\s*$/$1/;    # Trim whitespace
        # Fill empty fields with that field's default value, if such a value is defined.
        if ($hr_returnvalue->{$CurrentColumnName} eq "") {
            if (defined $self->standaard->[$_]) {
                if ($self->datatype->[$_] =~ /^(?:CHAR|VARCHAR|DATE|TIME|DATETIME)$/) {
                    $hr_returnvalue->{$CurrentColumnName} = sprintf ("%s", $self->standaard->[$_]);
                } else {
                    $hr_returnvalue->{$CurrentColumnName} = $self->standaard->[$_];
                }
            } else {
                # Remove empty field
                delete $hr_returnvalue->{$CurrentColumnName};
            }
        }
        if ($self->datatype->[$_] =~ /^(?:TINYINT|MEDIUMINT|SMALLINT|INT|INTEGER|BIGINT|FLOAT|DOUBLE)$/) {
            $hr_returnvalue->{$CurrentColumnName} *= 1;# Multiply by 1 to create a numeric value.
        }
        # Decimal-correction
        if ($self->decimals->[$_] > 0 and defined $hr_returnvalue->{$CurrentColumnName}) {
            $hr_returnvalue->{$CurrentColumnName} /= 10**$self->decimals->[$_];
        }
    } ## end for (0 .. $#{$self->columns...
    return $hr_returnvalue;
} ## end sub ReadRecord ($$)

The above code is from a custom-made Interfaces-object, with an Interfaces::FlatFile role (yes, Moose) that provides fixed-length file interfacing. The object contains the following attributes (only the ones used here are shown):

has 'columns'          => (is => 'rw', isa => 'ArrayRef[Str]',          lazy_build => 1,);
has 'datatype'         => (is => 'rw', isa => 'ArrayRef[Str]',          lazy_build => 1,);
has 'decimals'         => (is => 'rw', isa => 'ArrayRef[Maybe[Int]]',   lazy_build => 1,);
has 'default'          => (is => 'rw', isa => 'ArrayRef[Maybe[Value]]', lazy_build => 1,);
has 'flatfield_start'  => (is => 'rw', isa => 'ArrayRef[Maybe[Int]]',   lazy_build => 1,);
has 'flatfield_length' => (is => 'rw', isa => 'ArrayRef[Maybe[Int]]',   lazy_build => 1,);

These attributes are filled by index, so all the above attributes with index n refer to the same field n.

The question is thus: Why does reading from a fixed-length file need much more memory, and what can I do to fix that? :)
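
To put the two totals quoted above into per-field terms, here is a small sketch that only redoes the arithmetic. The variable names are made up for this example, and the averages are taken over all 119 columns even though the flat-file reader deletes empty fields, so treat the per-field figures as rough.

```perl
use strict;
use warnings;

# Figures quoted in the post above.
my $records    = 47039;
my $fields     = 119;
my $bytes_db   = 449337164;    # total_size of the selectall_arrayref result
my $bytes_flat = 773605490;    # total_size of the data read from the flat file

# Average footprint per record and per field for both data sets.
for my $case ( [ 'DBI' => $bytes_db ], [ 'flat file' => $bytes_flat ] ) {
    my ( $label, $bytes ) = @$case;
    printf "%-9s %8.1f bytes/record %6.1f bytes/field\n",
        $label, $bytes / $records, $bytes / ( $records * $fields );
}

# How much extra memory each field costs on average in the flat-file case.
my $extra_per_field = ( $bytes_flat - $bytes_db ) / ( $records * $fields );
printf "extra per field: %.1f bytes\n", $extra_per_field;
```

An average difference of several dozen bytes per field is far more than hash overhead alone would suggest, which points at something being stored per value rather than per record.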

Re: Reading (the same) data in different ways & memory usage by moritz (Cardinal) on Apr 19, 2011 at 12:28 UTC
The first difference I see is that `selectall_arrayref` returns an array ref of array refs, whereas your homemade code seems to work with hash references. So maybe it's not the same size because the data structures are quite different?

    $ perl -MDevel::Size=total_size -wE 'say total_size [1, 2, 3, 4]'
    200
    $ perl -MDevel::Size=total_size -wE 'say total_size { foo => 1, bar => 2, baz => 3, blubb => 4}'
    382

Perl 6 - second systems done right
Re^2: Reading (the same) data in different ways & memory usage by Neighbour (Friar) on Apr 19, 2011 at 12:44 UTC
You can persuade `selectall_arrayref` to return an arrayref of hashrefs using the `{ Slice => {} }` trick as described in the DBI manual. (edit) selectall_arrayref returns an arrayref, not an array :)
Re: Reading (the same) data in different ways & memory usage by BrowserUk (Patriarch) on Apr 19, 2011 at 13:41 UTC
The probable reason is that you are storing numeric values as PVs--their string representation as read from the file--in addition to IVs--their numeric representation--the generation of which you are deliberately forcing with this code:

    if ($self->datatype->[$_] =~ /^(?:TINYINT|MEDIUMINT|SMALLINT|INT|INTEGER|BIGINT|FLOAT|DOUBLE)$/) {
        $hr_returnvalue->{$CurrentColumnName} *= 1; # Multiply by 1 to create a numeric value.
    }

Having initially loaded the value as a string (PV), when you force it to be converted to a numeric value (IV), the string value will be retained so that if you later decide to use it in a string context, the (inverse) conversion does not have to be repeated.

E.g. after the `*= 1;`, the PV is still there, but you gained an IV. Essentially you've increased the size of the SV rather than reducing it (as I assume you intended):

    C:\test>perl -MDevel::Peek -E"my $x = '12345'; Dump $x; $x *= 1; Dump $x"
    SV = PV(0x6cf50) at 0xc74e8
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK)
      PV = 0x67758 "12345"\0
      CUR = 5
      LEN = 8
    SV = PVIV(0xaf018) at 0xc74e8
      REFCNT = 1
      FLAGS = (PADMY,IOK,pIOK)
      IV = 12345
      PV = 0x67758 "12345"\0
      CUR = 5
      LEN = 8

A fix would be to perform the string->numeric conversion before storing the value:

    for( 0 .. $#{$self->columns} ) {
        $CurrentColumnName = $self->columns->[$_];
        if( !(defined $self->flatfield_start->[$_] and defined $self->flatfield_length->[$_] ) ) {
            # Field is missing interface_start, interface_length or both, skip it.
            next;
        }
        if ($self->datatype->[$_] =~ /^(?:TINYINT|MEDIUMINT|SMALLINT|INT|INTEGER|BIGINT|FLOAT|DOUBLE)$/ ) {
            $hr_returnvalue->{$CurrentColumnName} = 0 + substr( $textinput, $self->flatfield_start->[$_], $self->flatfield_length->[$_] );
        }
        else {
            $hr_returnvalue->{$CurrentColumnName} = substr( $textinput, $self->flatfield_start->[$_], $self->flatfield_length->[$_] );
            $hr_returnvalue->{ $CurrentColumnName } =~ s/^\s*(.*?)\s*$/$1/; # Trim whitespace
        }
        # Fill empty fields with that field's default value, if such a value is defined

That should reduce the size of the final data structure significantly if there are many numeric fields.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.
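
The difference between numifying in place and numifying before storing can also be checked without Devel::Peek. The following is a minimal sketch using only the core `B` module; `sv_class` is a helper made up for this example, and the exact class names reported may vary between perl versions. The in-place variant should report a PV-derived class such as `B::PVIV`, matching the dump above, while the converted variant should be a plain `B::IV`.

```perl
use strict;
use warnings;
use B ();

# Return the internal SV class of a scalar, e.g. "B::IV" or "B::PVIV".
# sv_class() is a helper invented for this illustration.
sub sv_class {
    my ($ref) = @_;
    return ref B::svref_2object($ref);
}

my $text = '12345';    # stands in for a field cut out of the line with substr

# Variant 1: store the string first, then numify it in place.
my $in_place = $text;
$in_place *= 1;

# Variant 2: numify first, then store only the resulting number.
my $converted = 0 + $text;

printf "in place : %s\n", sv_class( \$in_place );
printf "converted: %s\n", sv_class( \$converted );
```

Both variables compare equal numerically; only the internal representation, and therefore the size Devel::Size reports, differs.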
Re^2: Reading (the same) data in different ways & memory usage by Neighbour (Friar) on Apr 19, 2011 at 14:52 UTC
I implemented your idea with a slight variation (moved the decimal-correction into the numeric data branch and put the check for empty fields in the non-numeric branch), but the idea is sound:

    if ($self->datatype->[$_] =~ /^(?:TINYINT|MEDIUMINT|SMALLINT|INT|INTEGER|BIGINT|FLOAT|DOUBLE)$/) {
        $hr_returnvalue->{$CurrentColumnName} = 0 + substr ($textinput, $self->flatfield_start->[$_], $self->flatfield_length->[$_]);# create a numeric value.
        # Decimal-correction
        if ($self->decimals->[$_] > 0 and defined $hr_returnvalue->{$CurrentColumnName}) {
            $hr_returnvalue->{$CurrentColumnName} /= 10**$self->decimals->[$_];
        }
    } else {
        $hr_returnvalue->{$CurrentColumnName} = substr ($textinput, $self->flatfield_start->[$_], $self->flatfield_length->[$_]);
        $hr_returnvalue->{$CurrentColumnName} =~ s/^\s*(.*?)\s*$/$1/;    # Trim whitespace
        # Fill empty fields with that field's default value, if such a value is defined
        if ($hr_returnvalue->{$CurrentColumnName} eq "") {
            if (defined $self->default->[$_]) {
                if ($self->datatype->[$_] =~ /^(?:CHAR|VARCHAR|DATE|TIME|DATETIME)$/) {
                    $hr_returnvalue->{$CurrentColumnName} = sprintf ("%s", $self->default->[$_]);
                } else {
                    $hr_returnvalue->{$CurrentColumnName} = $self->default->[$_];
                }
            } else {
                # Remove empty field
                delete $hr_returnvalue->{$CurrentColumnName};
            }
        }
    }

Devel::Size now reports the returned data-structure to be 385251506 bytes, which, for some reason, is smaller than the data-structure retrieved from the db... I'll have to look at things more closely to figure out why that is.
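
For readers who want to try the restructured branch outside the Moose object, here is a self-contained sketch of the same idea. The `@layout` table, the column names and `parse_record` are all hypothetical stand-ins for the parallel attribute arrays and `ReadRecord`; default-value handling is left out. Note one assumption that carries over from the code above: an all-blank numeric field would numify to 0 (with a warning under `use warnings`) rather than being dropped.

```perl
use strict;
use warnings;

# Hypothetical column layout; the real object keeps this in parallel
# attributes (columns, datatype, decimals, flatfield_start, flatfield_length).
my @layout = (
    { name => 'id',    type => 'INT',     start => 0,  length => 6,  decimals => 0 },
    { name => 'name',  type => 'VARCHAR', start => 6,  length => 10, decimals => 0 },
    { name => 'price', type => 'INT',     start => 16, length => 8,  decimals => 2 },
);

my %is_numeric = map { $_ => 1 }
    qw(TINYINT MEDIUMINT SMALLINT INT INTEGER BIGINT FLOAT DOUBLE);

# parse_record() is an illustrative stand-in for ReadRecord.
sub parse_record {
    my ($line) = @_;
    my %record;
    for my $col (@layout) {
        my $raw = substr $line, $col->{start}, $col->{length};
        if ( $is_numeric{ $col->{type} } ) {
            # Numify before storing, so the hash only ever holds a number.
            my $value = 0 + $raw;
            $value /= 10**$col->{decimals} if $col->{decimals} > 0;
            $record{ $col->{name} } = $value;
        }
        else {
            s/^\s+//, s/\s+$// for $raw;    # trim whitespace
            $record{ $col->{name} } = $raw if length $raw;    # skip empty fields
        }
    }
    return \%record;
}

# One made-up record: a 6-wide id, a 10-wide name, an 8-wide price in cents.
my $record = parse_record( '    42' . 'Widget    ' . '  001999' );
printf "%s => %s\n", $_, $record->{$_} for sort keys %$record;
```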
Re^3: Reading (the same) data in different ways & memory usage by BrowserUk (Patriarch) on Apr 19, 2011 at 15:30 UTC
"the returned data-structure to be 385251506 bytes, which, for some reason is smaller than the data-structure retrieved from the db"

Perhaps the DBI code doesn't trim leading/trailing spaces on string fields?
Re^4: Reading (the same) data in different ways & memory usage by Tux (Canon) on Apr 19, 2011 at 18:42 UTC
Re^5: Reading (the same) data in different ways & memory usage by BrowserUk (Patriarch) on Apr 19, 2011 at 19:27 UTC
Re^4: Reading (the same) data in different ways & memory usage by Neighbour (Friar) on Apr 21, 2011 at 06:49 UTC
Re^5: Reading (the same) data in different ways & memory usage by BrowserUk (Patriarch) on Apr 21, 2011 at 08:14 UTC
Re: Reading (the same) data in different ways & memory usage by jwkrahn (Abbot) on Apr 20, 2011 at 03:59 UTC
This doesn't answer your question but:

    sub ReadData ($$) {
        my ($self, $filename) = @_;
        my $ar_returnvalue = [];
        if (!-e "$filename") {
            Carp::carp("File [$filename] does not exist");
            return undef;
        }
        open (FLATFILE, '<', $filename) or Carp::croak("Cannot open file [$filename]");
        while (<FLATFILE>) {
            chomp;
            push (@{$ar_returnvalue}, Interfaces::FlatFile::ReadRecord($self, $_));
        }
        close (FLATFILE);
        return $ar_returnvalue;
    } ## end sub ReadData ($$)

You are using prototypes but prototypes were introduced to allow programmers to imitate Perl's built-in functions, not for user code per se. See: FMTEYEWTK on Prototypes in Perl

You are testing for the existence of a file twice, first with stat and then with open.

In the stat test you are unnecessarily copying the file name to a string before testing it. See: What's wrong with always quoting "$vars"?

You should include the $! variable in your error messages so you know why they failed.

    $hr_returnvalue->{$CurrentColumnName} =~ s/^\s*(.*?)\s*$/$1/;    # Trim whitespace

That is usually written as:

    s/^\s+//, s/\s+$// for $hr_returnvalue->{$CurrentColumnName};    # Trim whitespace

Which avoids unnecessary substitution.

    if ($self->datatype->[$_] =~ /^(?:CHAR|VARCHAR|DATE|TIME|DATETIME)$/) {
        $hr_returnvalue->{$CurrentColumnName} = sprintf ("%s", $self->standaard->[$_]);
    } else {
        $hr_returnvalue->{$CurrentColumnName} = $self->standaard->[$_];
    }

What is the sprintf doing that the simple assignment is not doing? It looks like you don't need this test at all.
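
Taken together, the points about prototypes, the double existence check and `$!` could look roughly like the sketch below. `read_lines` is an invented name, not the poster's method, and it takes the record parser as a code reference so the example stays self-contained. Unlike the original, it croaks on a missing file instead of carping and returning undef, which is an assumption about the desired behaviour.

```perl
use strict;
use warnings;
use Carp ();

# read_lines() is an illustrative stand-in for ReadData: no prototype,
# no separate existence check, a lexical filehandle, and $! in the error.
sub read_lines {
    my ( $filename, $parse ) = @_;
    open my $fh, '<', $filename
        or Carp::croak("Cannot open file [$filename]: $!");
    my @records;
    while ( my $line = <$fh> ) {
        chomp $line;
        push @records, $parse->($line);
    }
    close $fh or Carp::croak("Cannot close file [$filename]: $!");
    return \@records;
}
```

A call might then read `my $ar_data = read_lines( $filename, sub { parse_one_record( $_[0] ) } );`, where `parse_one_record` is whatever per-line parser is in use.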
Re^2: Reading (the same) data in different ways & memory usage by Neighbour (Friar) on Apr 21, 2011 at 08:07 UTC
And so I'm learning new things every day :)

I have experienced difficulties with the @ and % prototypes, like those described in the link you provided, and stopped using those. The scalar prototypes are now used purely to enforce the right number of arguments. As a side effect, perl implicitly coerces any arguments supplied to scalars. This will definitely mess things up, but if you were supplying non-scalars to this function, that was bound to happen anyway, so I'm not too worried about it.

It seems that open accepts one thing besides strings as the EXPR argument (2nd, or 3rd if a MODE is supplied), and that is a reference to a scalar to be used as an in-memory file. Even though that is not the case here, I've removed the stat check, since open will give an error anyway if the file doesn't exist. Also, $! has been included in the error message, should open fail.

How does one substitution with capture compare to doing 2 substitutions without capture? I would have to benchmark this to figure out which is faster.

The sprintf does seem out of place. However, as with all user-supplied data, it's not certain that the default values (looks like I missed one when translating "standaard" to "default") for (VAR)CHAR fields actually contain a string value. Still, this can also be done just using ""'s.
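
The in-memory file behaviour of `open` mentioned above is handy for testing a fixed-length reader without touching the disk. A minimal sketch, with made-up record contents:

```perl
use strict;
use warnings;

# Two fixed-length records held in a plain string instead of a file on disk.
my $buffer = "0001alpha\n0002beta \n";

# Passing a reference to the scalar makes open() treat it as an in-memory file.
open my $fh, '<', \$buffer or die "Cannot open in-memory file: $!";
my @lines = <$fh>;
close $fh;
chomp @lines;

print "[$_]\n" for @lines;
```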
Re^3: Reading (the same) data in different ways & memory usage by jwkrahn (Abbot) on Apr 21, 2011 at 10:59 UTC
"How does one substitution with capture compare to doing 2 substitutions without capture?"

In your example:

    $hr_returnvalue->{$CurrentColumnName} =~ s/^\s*(.*?)\s*$/$1/;    # Trim whitespace

The regular expression `/^\s*(.*?)\s*$/` will always match, regardless of whether whitespace is present, so the substitution will always be done.

In my example:

    s/^\s+//, s/\s+$// for $hr_returnvalue->{$CurrentColumnName};    # Trim whitespace

The regular expressions `/^\s+/` and `/\s+$/` will only match if there is whitespace present and so the substitution will only be done on the occurrence of whitespace.

Running a benchmark would be a good start, and you should try to use data similar to, or the actual data that your program uses.
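
A starting point for such a benchmark, using the core `Benchmark` module, might look like this. The sample strings and the helper names `trim_capture` and `trim_two_step` are invented for the example; as noted above, real field data would give more meaningful numbers, so no timings are claimed here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up sample fields; real data from the flat file would be more telling.
my @samples = ( 'no_padding', '  leading', 'trailing   ', '  both  ', '' );

# Single substitution with a capture; matches (and substitutes) every time.
sub trim_capture {
    my ($s) = @_;
    $s =~ s/^\s*(.*?)\s*$/$1/;
    return $s;
}

# Two anchored substitutions; each only fires when whitespace is present.
sub trim_two_step {
    my ($s) = @_;
    s/^\s+//, s/\s+$// for $s;
    return $s;
}

# Run each variant a fixed number of times and print a comparison table.
cmpthese( 50_000, {
    capture  => sub { trim_capture($_)  for @samples },
    two_step => sub { trim_two_step($_) for @samples },
} );
```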