http://qs321.pair.com?node_id=1000500

Uree has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

First of all, since this is my first post here: hi everyone.
Although I've definitely been relying on this site for some time now, this is the first question I have asked (which, let me say, speaks well of the community).

Now, I'm using LWP::UserAgent to download large XML files. The files download successfully and their content is as expected; the problem is the high memory usage due to their size (>100MB).
To sort this out, I tried to rely on the ":content_file" option of LWP::UserAgent's "get" method.
Here's my code:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use File::Temp;
    use LWP::UserAgent;
    #use HTTP::Request;

    do_task();

    sub do_task {
        my $ua = LWP::UserAgent->new( 'ssl_opts' => { 'verify_hostname' => 0 } );
        $ua->show_progress(1);

        my @urls = (
            "http://linkToAFat.xml",
        );

        foreach my $url ( @urls) {
            my ($fh, $path) = File::Temp::tempfile(DIR => '/tmp/my_tmp');
            $ua->get($url, ":content_file" => $path);
            #my $request = HTTP::Request->new(GET => $url);
            #my $response = $ua->request($request, $path);
        }
    }

(The commented-out lines are other ways I've tried to work around the high memory usage.)

My problem is that, apparently, although I'm using the ":content_file" option, the files' content DOES still get loaded into memory.

I am quite stuck with this one, so I'd appreciate Monks' almighty support.
Thanks in advance!

Re: LWP::UserAgent & memory problems
by BrowserUk (Patriarch) on Oct 23, 2012 at 20:59 UTC

    Take a look at the :content_cb => \&callback parameter to get().

    Essentially, you supply a subroutine that LWP will call each time a new block of the file arrives, so you can save it to a file of your choice as it arrives, rather than accumulating it in memory and giving it to you in one huge lump.
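A minimal sketch of the callback approach described above, assuming a reasonably recent LWP::UserAgent. The helper name download_to_file, the 64 KB read-size hint, and the URL and path in the usage comment are illustrative assumptions, not part of the original post.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: stream the body of $url into $path one chunk at a time.
sub download_to_file {
    my ($ua, $url, $path) = @_;

    open my $out, '>:raw', $path or die "Cannot open $path: $!";
    my $bytes = 0;

    my $response = $ua->get(
        $url,
        # LWP calls this sub for each chunk of a successful response,
        # so the whole body is never accumulated in the response object.
        ':content_cb' => sub {
            my ($chunk, $res, $protocol) = @_;
            print {$out} $chunk or die "Write to $path failed: $!";
            $bytes += length $chunk;
        },
        # Only a hint; the actual chunk size may differ.
        ':read_size_hint' => 64 * 1024,
    );

    close $out or die "Cannot close $path: $!";
    die 'Download failed: ' . $response->status_line
        unless $response->is_success;
    # If the callback died, LWP records the message in the X-Died header.
    die 'Download aborted: ' . $response->header('X-Died')
        if $response->header('X-Died');

    return $bytes;
}

# Usage (hypothetical URL and path):
#   my $ua = LWP::UserAgent->new;
#   my $n  = download_to_file($ua, 'http://example.com/big.xml', '/tmp/big.xml');
```

Whether this lowers memory use in the original poster's setup is not something the sketch can show; it only illustrates how the chunks can be written out as they arrive.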



      My thinking was that, having tested a few different ways without success, it was more likely that I was doing something wrong or missing something than that the library's option/function was not working as expected.
      For that reason, I stopped exploring different options and focused on spotting what I'm missing.

      Thanks for the suggestion, I'll give it a go.

Re: LWP::UserAgent & memory problems
by daxim (Curate) on Oct 23, 2012 at 17:08 UTC

      Tried that one too. Exactly the same thing happens.

      However, my thinking was that, having tested a few different ways without success, it was more likely that I was doing something wrong or missing something than that the library's option/function was not working as expected.
      For that reason, I stopped exploring different options and focused on spotting what I'm missing.

      It doesn't happen for me. Which version of LWP are you using?

      $ perl -MDevel::VersionDump -MLWP -e 1
      Perl version: v5.14.1 on MSWin32
      Carp               -    1.26
      Config             - Unknown
      Devel::VersionDump -    0.02
      Exporter           -    5.66
      Exporter::Heavy    -    5.66
      Fcntl              -    1.11
      HTTP::Date         -    6.02
      HTTP::Headers      -    6.00
      HTTP::Message      -    6.03
      HTTP::Request      -    6.00
      HTTP::Response     -    6.03
      HTTP::Status       -    6.03
      LWP                -    6.04
      LWP::MemberMixin   - Unknown
      LWP::Protocol      -    6.00
      LWP::UserAgent     -    6.04
      Storable           -    2.30
      Time::Local        -  1.2300
      URI                -    1.60
      URI::Escape        -    3.31
      XSLoader           -    0.15
      constant           -    1.21
      overload           -    1.13
      strict             -    1.04
      vars               -    1.02
      warnings           -    1.12
      warnings::register -    1.02
      
      
        $ perl -MDevel::VersionDump -MLWP -e 1
        Perl version: v5.14.2 on linux (BREWED)
        Carp               -    1.26
        Config             - Unknown
        Devel::VersionDump -    0.02
        Exporter           -    5.67
        Exporter::Heavy    -    5.67
        Fcntl              -    1.11
        HTTP::Date         -    6.02
        HTTP::Headers      -    6.05
        HTTP::Message      -    6.06
        HTTP::Request      -    6.00
        HTTP::Response     -    6.04
        HTTP::Status       -    6.03
        LWP                -    6.04
        LWP::MemberMixin   - Unknown
        LWP::Protocol      -    6.00
        LWP::UserAgent     -    6.04
        Storable           -    2.39
        Time::Local        -  1.2300
        URI                -    1.60
        URI::Escape        -    3.31
        XSLoader           -    0.16
        constant           -    1.21
        overload           -    1.13
        strict             -    1.04
        vars               -    1.02
        warnings           -    1.12
        warnings::register -    1.02

        So, you can actually run that piece of code to download a large file (~100MB+) and have it written directly to a file, without using an amount of memory proportional to the file's size?
        If that is the case, then I most definitely must be missing something. Can you spot anything wrong with my code?

        PS: I didn't know about 'Devel::VersionDump'; thanks for the pointer.

Re: LWP::UserAgent & memory problems
by tobyink (Canon) on Oct 23, 2012 at 19:44 UTC

    You could try using one of the callback-based HTTP modules such as AnyEvent::HTTP - you define a callback function which gets passed the downloaded data in chunks. This function can write the data to an open filehandle. The whole data never needs to be held in memory at once.
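A rough sketch of the callback style suggested above, using AnyEvent::HTTP's on_body option. The helper name fetch_to_file and the URL and path in the usage comment are assumptions made for illustration; error handling is kept deliberately small.

```perl
use strict;
use warnings;
use AnyEvent;
use AnyEvent::HTTP;

# Hypothetical helper: write the body of $url to $path as chunks arrive.
sub fetch_to_file {
    my ($url, $path) = @_;

    open my $out, '>:raw', $path or die "Cannot open $path: $!";
    my $done = AnyEvent->condvar;

    http_get $url,
        # Called for each piece of the body; returning false cancels the download.
        on_body => sub {
            my ($chunk, $headers) = @_;
            return print {$out} $chunk;
        },
        # Completion callback: with on_body set, the body argument is empty.
        sub {
            my (undef, $headers) = @_;
            $done->send($headers->{Status});
        };

    my $status = $done->recv;    # run the event loop until the request finishes
    close $out or die "Cannot close $path: $!";
    return $status;              # HTTP status, or a 59x pseudo-status on errors
}

# Usage (hypothetical URL and path):
#   my $status = fetch_to_file('http://example.com/big.xml', '/tmp/big.xml');
#   die "Download failed with status $status" unless $status =~ /^2/;
```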

Re: LWP::UserAgent & memory problems
by talexb (Chancellor) on Oct 23, 2012 at 19:58 UTC

    Could you just use curl to download the file instead? You don't always have to use Perl for stuff.


      I agree. Perl is a glue language; there is nothing wrong with having your script drive a console app through its console streams and letting that app do the heavy lifting, whether in I/O, CPU, or memory. Some tasks are better done in a C/C++ program than in Perl, and there is nothing wrong with using Perl to control that program. I've found that console HTTP downloaders always work better and faster for me than LWP, unless very complicated forms or non-standard HTTP verbs are required.
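One way the "let curl do the heavy lifting" idea could look from Perl, as a sketch only. It assumes a curl binary on the PATH; the helper name curl_download and the URL and path in the usage comment are illustrative.

```perl
use strict;
use warnings;

# Hypothetical helper: have curl write $url straight to $path on disk.
sub curl_download {
    my ($url, $path) = @_;

    # List form of system() bypasses the shell, so no quoting issues.
    my @cmd = (
        'curl',
        '--fail',          # non-zero exit status on HTTP errors
        '--silent',        # no progress meter
        '--show-error',    # but still report errors on STDERR
        '--location',      # follow redirects
        '--output', $path,
        $url,
    );

    my $rc = system(@cmd);
    die "Could not run curl: $!" if $rc == -1;
    die 'curl exited with status ' . ($rc >> 8) . " for $url" if $rc != 0;

    return $path;
}

# Usage (hypothetical URL and path):
#   curl_download('http://example.com/big.xml', '/tmp/big.xml');
```

The Perl process never sees the response body here, so its own memory footprint should not depend on the size of the download.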

      Oh, I probably should have mentioned that the files are later parsed with XML::LibXML, using a hybrid pull-parser/DOM-tree strategy.
      More concretely, I implemented a pull parser with XML::LibXML::Reader; then, for every node of interest (these nodes are dramatically smaller than the whole XML DOM tree), I load it into memory and extract the data I'm interested in with XML::LibXML::XPathContext.

      I apologize if omitting this turned out to be misleading.
      But the fact is that the parsing code works well and uses as much memory as I expect.
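For readers unfamiliar with the hybrid strategy described above, it might look roughly like the following. This is a guess at the general shape, not the poster's actual code; the helper name extract_values and the element and XPath names in the usage comment are hypothetical.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::Reader;

# Hypothetical helper: pull-parse with $reader, build a small DOM subtree for
# each element named $element, and evaluate $xpath against that subtree.
sub extract_values {
    my ($reader, $element, $xpath) = @_;
    my @values;

    # nextElement() skips ahead to the next element with the given name.
    while ($reader->nextElement($element) > 0) {
        # Deep-copy only the current element; the rest of the document
        # is never held in memory as a tree.
        my $node = $reader->copyCurrentNode(1);
        my $xpc  = XML::LibXML::XPathContext->new($node);
        push @values, $xpc->findvalue($xpath);
    }

    return @values;
}

# Usage (hypothetical file, element and XPath names):
#   my $reader = XML::LibXML::Reader->new(location => '/tmp/big.xml')
#       or die 'Cannot open /tmp/big.xml';
#   my @titles = extract_values($reader, 'item', './title');
```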

      The part that does not work as expected is the piece of code in the original post (which I isolated into that single script for testing purposes).
      The only code omitted there is an array of paths to the downloaded files, which the function returns, and a few more URLs in the @urls array.

Re: LWP::UserAgent & memory problems
by trwww (Priest) on Oct 28, 2012 at 17:05 UTC
    I seem to recall LWP::Simple::getstore doing the right thing for me for large files.
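A short sketch of the getstore route, for comparison. The wrapper name store_url and the URL and path in the usage comment are assumptions; getstore itself returns the HTTP status code of the request.

```perl
use strict;
use warnings;
use LWP::Simple qw(getstore is_success);

# Hypothetical wrapper: save $url to $path and die on a non-2xx status.
sub store_url {
    my ($url, $path) = @_;

    my $status = getstore($url, $path);
    die "Download of $url failed with HTTP status $status"
        unless is_success($status);

    return $status;
}

# Usage (hypothetical URL and path):
#   store_url('http://example.com/big.xml', '/tmp/big.xml');
```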