PerlMonks  

Reading (the same) data in different ways & memory usage

by Neighbour (Friar)
on Apr 19, 2011 at 12:17 UTC ( [id://900094]=perlquestion )

Neighbour has asked for the wisdom of the Perl Monks concerning the following question:

Having finally found a working Devel::-module from CPAN (Devel::Size) I'm trying to figure out why my memory usage goes through the roof and into the swapfile when reading large chunks of data.

The data concerned can vary a lot, but in this testcase it's a recordset with 119 fields per record and 47039 records in the set.

Performing a simple my $ar_data = $db->selectall_arrayref("SELECT * FROM testtable", { Slice => {} }); yields a recordset that, according to Devel::Size::total_size, is 449337164 bytes. This is about 9552 bytes/record. I can live with that.

However, when writing the same data to a fixed-length file and subsequently reading it back into a new variable, the size turns out to be 773605490 bytes, or about 16446 bytes/record. The code used to read the data:

# ReadData ($filename) returns ar_data
sub ReadData ($$) {
    my ($self, $filename) = @_;
    my $ar_returnvalue = [];
    if (!-e "$filename") {
        Carp::carp("File [$filename] does not exist");
        return undef;
    }
    open (FLATFILE, '<', $filename) or Carp::croak("Cannot open file [$filename]");
    while (<FLATFILE>) {
        chomp;
        push (@{$ar_returnvalue}, Interfaces::FlatFile::ReadRecord($self, $_));
    }
    close (FLATFILE);
    return $ar_returnvalue;
} ## end sub ReadData ($$)

sub ReadRecord ($$) {
    my ($self, $textinput) = @_;
    my $hr_returnvalue = {};
    my $CurrentColumnName;
    for (0 .. $#{$self->columns}) {
        $CurrentColumnName = $self->columns->[$_];
        if (!(defined $self->flatfield_start->[$_] and defined $self->flatfield_length->[$_])) {
            # Field is missing interface_start, interface_length or both, skip it.
            next;
        }
        $hr_returnvalue->{$CurrentColumnName} = substr ($textinput, $self->flatfield_start->[$_], $self->flatfield_length->[$_]);
        $hr_returnvalue->{$CurrentColumnName} =~ s/^\s*(.*?)\s*$/$1/; # Trim whitespace
        # Fill empty fields with that field's default value, if such a value is defined.
        if ($hr_returnvalue->{$CurrentColumnName} eq "") {
            if (defined $self->standaard->[$_]) {
                if ($self->datatype->[$_] =~ /^(?:CHAR|VARCHAR|DATE|TIME|DATETIME)$/) {
                    $hr_returnvalue->{$CurrentColumnName} = sprintf ("%s", $self->standaard->[$_]);
                } else {
                    $hr_returnvalue->{$CurrentColumnName} = $self->standaard->[$_];
                }
            } else {
                # Remove empty field
                delete $hr_returnvalue->{$CurrentColumnName};
            }
        }
        if ($self->datatype->[$_] =~ /^(?:TINYINT|MEDIUMINT|SMALLINT|INT|INTEGER|BIGINT|FLOAT|DOUBLE)$/) {
            $hr_returnvalue->{$CurrentColumnName} *= 1; # Multiply by 1 to create a numeric value.
        }
        # Decimal-correction
        if ($self->decimals->[$_] > 0 and defined $hr_returnvalue->{$CurrentColumnName}) {
            $hr_returnvalue->{$CurrentColumnName} /= 10**$self->decimals->[$_];
        }
    } ## end for (0 .. $#{$self->columns...
    return $hr_returnvalue;
} ## end sub ReadRecord ($$)

The above code is from a custom-made Interfaces-object, with an Interfaces::FlatFile role (yes, Moose) that provides fixed-length file interfacing. The object contains the following attributes (only the ones used here are shown):

has 'columns'          => (is => 'rw', isa => 'ArrayRef[Str]',          lazy_build => 1,);
has 'datatype'         => (is => 'rw', isa => 'ArrayRef[Str]',          lazy_build => 1,);
has 'decimals'         => (is => 'rw', isa => 'ArrayRef[Maybe[Int]]',   lazy_build => 1,);
has 'default'          => (is => 'rw', isa => 'ArrayRef[Maybe[Value]]', lazy_build => 1,);
has 'flatfield_start'  => (is => 'rw', isa => 'ArrayRef[Maybe[Int]]',   lazy_build => 1,);
has 'flatfield_length' => (is => 'rw', isa => 'ArrayRef[Maybe[Int]]',   lazy_build => 1,);

These attributes are filled by index, so all the above attributes with index n refer to the same field n.

The question is thus: Why does reading from a fixed-length file need much more memory, and what can I do to fix that? :)

Replies are listed 'Best First'.
Re: Reading (the same) data in different ways & memory usage
by moritz (Cardinal) on Apr 19, 2011 at 12:28 UTC

    The first difference I see is that selectall_arrayref returns an array ref of array refs, whereas your homemade code seems to work with hash references. So maybe it's not the same size because the data structures are quite different?

    $ perl -MDevel::Size=total_size -wE 'say total_size [1, 2, 3, 4]'
    200
    $ perl -MDevel::Size=total_size -wE 'say total_size { foo => 1, bar => 2, baz => 3, blubb => 4}'
    382

      You can persuade selectall_arrayref to return an arrayref of hashrefs using the { Slice => {} } trick as described in the DBI manual.

      (edit) selectall_arrayref returns an arrayref, not an array :)

Re: Reading (the same) data in different ways & memory usage
by BrowserUk (Patriarch) on Apr 19, 2011 at 13:41 UTC

    The probable reason is that you are storing numeric values as PVs--their string representation as read from the file--in addition to IVs--their numeric representation--the generation of which you are deliberately forcing with this code:

    if ($self->datatype->[$_] =~ /^(?:TINYINT|MEDIUMINT|SMALLINT|INT|INTEGER|BIGINT|FLOAT|DOUBLE)$/) {
        $hr_returnvalue->{$CurrentColumnName} *= 1; # Multiply by 1 to create a numeric value.
    }

    Having initially loaded the value as a string (PV), when you force it to be converted to a numeric value (IV), the string value will be retained so that if you later decide to use it in a string context, the (inverse) conversion does not have to be repeated.

    E.g., after the *= 1, the PV is still there, but you have gained an IV. Essentially, you've increased the size of the SV rather than reducing it (as I assume you intended):

    C:\test>perl -MDevel::Peek -E"my $x = '12345'; Dump $x; $x *=1; Dump $x"
    SV = PV(0x6cf50) at 0xc74e8
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK)
      PV = 0x67758 "12345"\0
      CUR = 5
      LEN = 8
    SV = PVIV(0xaf018) at 0xc74e8
      REFCNT = 1
      FLAGS = (PADMY,IOK,pIOK)
      IV = 12345
      PV = 0x67758 "12345"\0
      CUR = 5
      LEN = 8

    A fix would be to perform the string->numeric conversion before storing the value:

    for( 0 .. $#{$self->columns} ) {
        $CurrentColumnName = $self->columns->[$_];
        if( !(defined $self->flatfield_start->[$_] and defined $self->flatfield_length->[$_] ) ) {
            # Field is missing interface_start, interface_length or both, skip it.
            next;
        }
        if ($self->datatype->[$_] =~ /^(?:TINYINT|MEDIUMINT|SMALLINT|INT|INTEGER|BIGINT|FLOAT|DOUBLE)$/ ) {
            $hr_returnvalue->{$CurrentColumnName} = 0 + substr( $textinput, $self->flatfield_start->[$_], $self->flatfield_length->[$_] );
        }else {
            $hr_returnvalue->{$CurrentColumnName} = substr( $textinput, $self->flatfield_start->[$_], $self->flatfield_length->[$_] );
            $hr_returnvalue->{ $CurrentColumnName } =~ s/^\s*(.*?)\s*$/$1/; # Trim whitespace
        }
        # Fill empty fields with that field's default value, if such a value is defined

    That should reduce the size of the final data structure significantly if there are many numeric fields.
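
    The difference between converting in place and converting before storing can be checked without Devel::Size. The following is a minimal sketch using the core B module; the sample record, the field offsets and the helper name sv_class are made up for illustration and are not part of the original code.

```perl
use strict;
use warnings;
use B ();

# Return the internal SV class of a scalar, e.g. "B::PV", "B::IV" or "B::PVIV".
# (sv_class is a hypothetical helper name.)
sub sv_class {
    my ($ref) = @_;
    return ref B::svref_2object($ref);
}

# Hypothetical fixed-width record: a 5-character numeric field at offset 0.
my $record = '12345     some text';

# Variant 1: store the string first, then force it numeric in place.
my $in_place = substr($record, 0, 5);
my $before   = sv_class(\$in_place);
$in_place *= 1;
my $after    = sv_class(\$in_place);

# Variant 2: convert while assigning, so only the number is stored.
my $converted = 0 + substr($record, 0, 5);
my $fresh     = sv_class(\$converted);

print "string as read:     $before\n";
print "after '*= 1':       $after\n";
print "after '0 + substr': $fresh\n";
```

    If the explanation above holds, the in-place variant should end up as a PVIV (string buffer plus integer slot), while the converted value should be a plain IV.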


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I implemented your idea with a slight variation (moved the decimal correction into the numeric branch and put the check for empty fields in the non-numeric branch):
      if ($self->datatype->[$_] =~ /^(?:TINYINT|MEDIUMINT|SMALLINT|INT|INTEGER|BIGINT|FLOAT|DOUBLE)$/) {
          $hr_returnvalue->{$CurrentColumnName} = 0 + substr ($textinput, $self->flatfield_start->[$_], $self->flatfield_length->[$_]); # create a numeric value.
          # Decimal-correction
          if ($self->decimals->[$_] > 0 and defined $hr_returnvalue->{$CurrentColumnName}) {
              $hr_returnvalue->{$CurrentColumnName} /= 10**$self->decimals->[$_];
          }
      } else {
          $hr_returnvalue->{$CurrentColumnName} = substr ($textinput, $self->flatfield_start->[$_], $self->flatfield_length->[$_]);
          $hr_returnvalue->{$CurrentColumnName} =~ s/^\s*(.*?)\s*$/$1/; # Trim whitespace
          # Fill empty fields with that field's default value, if such a value is defined
          if ($hr_returnvalue->{$CurrentColumnName} eq "") {
              if (defined $self->default->[$_]) {
                  if ($self->datatype->[$_] =~ /^(?:CHAR|VARCHAR|DATE|TIME|DATETIME)$/) {
                      $hr_returnvalue->{$CurrentColumnName} = sprintf ("%s", $self->default->[$_]);
                  } else {
                      $hr_returnvalue->{$CurrentColumnName} = $self->default->[$_];
                  }
              } else {
                  # Remove empty field
                  delete $hr_returnvalue->{$CurrentColumnName};
              }
          }
      }
      but the idea is sound. Devel::Size now reports the returned data structure to be 385251506 bytes, which, for some reason, is smaller than the data structure retrieved from the db. I'll have to look at things more closely to figure out why that is.
        the returned data-structure to be 385251506 bytes, which, for some reason is smaller than the data-structure retrieved from the db

        Perhaps the DBI code doesn't trim leading/trailing spaces on string fields?


Re: Reading (the same) data in different ways & memory usage
by jwkrahn (Abbot) on Apr 20, 2011 at 03:59 UTC

    This doesn't answer your question but:

    sub ReadData ($$) {
        my ($self, $filename) = @_;
        my $ar_returnvalue = [];
        if (!-e "$filename") {
            Carp::carp("File [$filename] does not exist");
            return undef;
        }
        open (FLATFILE, '<', $filename) or Carp::croak("Cannot open file [$filename]");
        while (<FLATFILE>) {
            chomp;
            push (@{$ar_returnvalue}, Interfaces::FlatFile::ReadRecord($self, $_));
        }
        close (FLATFILE);
        return $ar_returnvalue;
    } ## end sub ReadData ($$)

    You are using prototypes, but prototypes were introduced to allow programmers to imitate Perl's built-in functions, not for user code per se. See "FMTEYEWTK on Prototypes in Perl".

    You are testing for the existence of a file twice, first with stat and then with open. In the stat test you are unnecessarily copying the file name to a string before testing it. See "What's wrong with always quoting "$vars"?" in perlfaq4.

    You should include the $! variable in your error messages so you know why they failed.
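
    Put together, the points above could look roughly like the sketch below. The helper name read_lines and the lexical filehandle are choices made for this illustration, not the original ReadData; the per-record parsing is left out.

```perl
use strict;
use warnings;
use Carp ();

# read_lines($filename) returns an array ref of chomped lines.
# Hypothetical helper; it stands in for the ReadData loop without ReadRecord.
sub read_lines {
    my ($filename) = @_;

    # A single open() both tests for the file and reports why it failed via $!.
    open(my $fh, '<', $filename)
        or Carp::croak("Cannot open file [$filename]: $!");

    my @lines;
    while (my $line = <$fh>) {
        chomp $line;
        push @lines, $line;
    }
    close($fh)
        or Carp::croak("Cannot close file [$filename]: $!");

    return \@lines;
}
```

    A caller that prefers a warning over an exception can wrap the call in eval and inspect $@.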



    $hr_returnvalue->{$CurrentColumnName} =~ s/^\s*(.*?)\s*$/$1/; # Trim whitespace

    That is usually written as:

    s/^\s+//, s/\s+$// for $hr_returnvalue->{$CurrentColumnName}; # Trim whitespace

    Which avoids unnecessary substitution.



    if ($self->datatype->[$_] =~ /^(?:CHAR|VARCHAR|DATE|TIME|DATETIME)$/) {
        $hr_returnvalue->{$CurrentColumnName} = sprintf ("%s", $self->standaard->[$_]);
    } else {
        $hr_returnvalue->{$CurrentColumnName} = $self->standaard->[$_];
    }

    What is the sprintf doing that the simple assignment is not doing?    It looks like you don't need this test at all.

      And so I'm learning new things every day :)

      I have experienced difficulties with the @ and % prototypes, as described in the link you provided, and stopped using those. The scalar prototypes are now used purely to enforce the right number of arguments (as an added bonus, perl implicitly coerces any supplied arguments to scalars; this will definitely mess things up, but if you were supplying non-scalars to this function, that was bound to happen anyway, so I'm not too worried about that).

      It seems that open accepts one thing besides strings as the EXPR argument (2nd, or 3rd if a MODE is supplied), and that is a reference to a scalar to be used as an in-memory file. Even though this is not the case here, I've removed the stat check, since open will give an error anyway if the file doesn't exist. Also, $! is now included in the error message, should open fail.

      How does one substitution with capture compare to doing 2 substitutions without capture? I would have to benchmark this to figure out which is faster.

      The sprintf does look out of place. Still, as with all user-supplied data, it's not certain that the default values (looks like I missed one when translating "standaard" to "default") for (VAR)CHAR fields actually contain string values. However, that can also be handled by simply wrapping the value in double quotes.
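
      To make the point about stringification concrete, here is a small sketch using the core B module. The value 42 stands in for a hypothetical numeric default on a CHAR column, and sv_class is a helper name invented for this example.

```perl
use strict;
use warnings;
use B ();

# Return the internal SV class of a scalar, e.g. "B::IV" or "B::PV".
# (sv_class is a hypothetical helper name.)
sub sv_class {
    my ($ref) = @_;
    return ref B::svref_2object($ref);
}

# Hypothetical numeric default supplied for a CHAR column.
my $default = 42;

# Plain assignment copies the value in its current (numeric) form.
my $plain = $default;

# Both of these force a string copy of the value.
my $via_sprintf = sprintf('%s', $default);
my $via_quotes  = "$default";

printf "%-12s %s\n", 'plain:',   sv_class(\$plain);
printf "%-12s %s\n", 'sprintf:', sv_class(\$via_sprintf);
printf "%-12s %s\n", 'quotes:',  sv_class(\$via_quotes);
```

      Under these assumptions the plain copy should stay an IV, while both the sprintf and the quoted copies should be PVs holding the same string.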

        How does one substitution with capture compare to doing 2 substitutions without capture?

        In your example:

        $hr_returnvalue->{$CurrentColumnName} =~ s/^\s*(.*?)\s*$/$1/; # Trim whitespace

        The regular expression /^\s*(.*?)\s*$/ will always match, whether or not whitespace is present, so the substitution will always be done.

        In my example:

        s/^\s+//, s/\s+$// for $hr_returnvalue->{$CurrentColumnName}; # Trim whitespace

        The regular expressions /^\s+/ and /\s+$/ will only match if there is whitespace present and so the substitution will only be done on the occurrence of whitespace.

        Running a benchmark would be a good start, and you should try to use data similar to, or the same as, the actual data that your program uses.
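
        Such a benchmark could be sketched with the core Benchmark module as follows. The sub names and the sample fields are invented for illustration; real measurements should use fields taken from the actual flat file.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Trim with a single capturing substitution (always matches, always substitutes).
sub trim_capture {
    my ($s) = @_;
    $s =~ s/^\s*(.*?)\s*$/$1/;
    return $s;
}

# Trim with two anchored substitutions (each substitutes only if whitespace is present).
sub trim_two_pass {
    my ($s) = @_;
    s/^\s+//, s/\s+$// for $s;
    return $s;
}

# Hypothetical sample fields; replace with values from the real flat file.
my @fields = ('  padded value   ', 'no padding', '12345     ', '          ');

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    capture  => sub { trim_capture($_)  for @fields },
    two_pass => sub { trim_two_pass($_) for @fields },
});
```

        The resulting rates depend heavily on the perl version and on how much padding the fields contain, so the table is only meaningful for data that resembles the real input.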
