Re: Running out of memory...

by vek (Prior)
on Jan 29, 2005 at 19:59 UTC ( [id://426278] )


in reply to Running out of memory...

When you do the following...

    undef $/;
    $a = <E>;

...you are reading the entire file into memory in one go.

Try to break the task into smaller pieces. Without knowing what auburn_courses.txt or auburn_courses2.txt contain, it's going to be tricky to give specific advice.
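If the input is line-oriented, one simple way to do that is to leave $/ alone and process one line at a time, so only a single line is ever held in memory. A minimal sketch; the filehandle name and the per-line work are placeholders:

    #!/usr/local/bin/perl
    use strict;
    use warnings;

    open my $in, '<', $ARGV[0] or die "$ARGV[0]: $!";
    while ( my $line = <$in> ) {
        # ... work on $line here; only this line is in memory ...
    }
    close $in;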

-- vek --

Replies are listed 'Best First'.
Re^2: Running out of memory...
by knewter (Novice) on Jan 29, 2005 at 20:18 UTC
    I've changed my code to the following:
        #!/usr/local/bin/perl -w
        use strict;

        open IN,  '< :raw', $ARGV[ 0 ] or die "$ARGV[ 0 ] : $!";
        open OUT, '> :raw', $ARGV[ 1 ] or die "$ARGV[ 1 ] : $!";

        my $a = '<!-- rsecftr.htm - Course Sections Table Footer -->';
        my $b = '<!-- rsechdr.htm - Course Sections and Course Section Search Table Header -->';

        my $buffer = '';
        ## Seed the back half of the window; the offset pads the front
        ## half with "\0" bytes, which the substr below discards.
        sysread IN, $buffer, 5800, 5800;

        do {
            ## Move the second half of the buffer to the front.
            $buffer = substr( $buffer, 5800 );

            ## ...and overwrite it with a new chunk.
            sysread IN, $buffer, 5800, length( $buffer );

            ## Apply the regex across the chunk boundary.
            $buffer =~ s|$b(.*?)$a||g;

            print $buffer;    ## echo the processed chunk to STDOUT

            ## Write out the first half of the buffer.
            syswrite OUT, $buffer, 5800;
        } until eof IN;

        ## Flush whatever remains beyond the first half after the last read.
        syswrite OUT, substr( $buffer, 5800 ) if length( $buffer ) > 5800;

        close IN;
        close OUT;
    auburn_courses.txt contains a load of HTML files all bunched one after the other. I'd like to remove the bits between the footer of one section that I want to keep and the header of the next section that I want to keep. They're delimited by the $a and $b lines.
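    For comparison, the same removal can be done with a simple line-by-line state machine instead of a sliding buffer, which sidesteps the chunk-boundary problem entirely. A minimal sketch, assuming each marker sits on a line of its own:

        #!/usr/local/bin/perl
        use strict;
        use warnings;

        my $ftr = '<!-- rsecftr.htm - Course Sections Table Footer -->';
        my $hdr = '<!-- rsechdr.htm - Course Sections and Course Section Search Table Header -->';

        open my $in,  '<', $ARGV[0] or die "$ARGV[0]: $!";
        open my $out, '>', $ARGV[1] or die "$ARGV[1]: $!";

        my $skipping = 0;
        while ( my $line = <$in> ) {
            if ( !$skipping and index( $line, $hdr ) >= 0 ) {
                $skipping = 1;    # header marker: an unwanted span begins (drop this line too)
                next;
            }
            if ( $skipping and index( $line, $ftr ) >= 0 ) {
                $skipping = 0;    # footer marker: the unwanted span ends here (drop this line too)
                next;
            }
            print {$out} $line unless $skipping;
        }
        close $in;
        close $out;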

    Update! All fixed, ignore me

      I realise you have fixed your problem, but I have to say that, without seeing your actual data, 5800 seems a very strange choice of buffer size.


      Examine what is said, not who speaks.
      Silence betokens consent.
      Love the truth but pardon error.
        I don't exactly remember why I chose that now, but it had something to do with the average size of the HTML files I was slurping. The number is roughly twice that average, so I was guaranteed to have the data I needed somewhere in the middle. In any case, I'm glad to say that the scraper portion of my project is complete, and the data is happily in a database serving out useful information to people who didn't have it as easily before :)
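        As an aside, the "roughly twice" heuristic has a tidy justification: if the window holds 2 * N bytes and advances N bytes per pass, then any record no longer than N bytes is guaranteed to land intact inside some window. A toy check of that invariant, using hypothetical record offsets:

            #!/usr/local/bin/perl
            use strict;
            use warnings;

            my $n = 5800;    # advance per pass; the window holds 2 * $n bytes
            for my $start ( 0, 100, 5_799, 5_800, 11_599 ) {    # hypothetical record offsets
                my $end    = $start + $n - 1;                   # worst case: record is exactly $n bytes
                my $window = int( $start / $n ) * $n;           # window that begins at or before $start
                die "record escapes its window\n" if $end > $window + 2 * $n - 1;
            }
            print "any record of <= $n bytes fits inside one 2*$n-byte window\n";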
