Reducing Memory Usage

PerlingTheUK has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Reducing Memory Usage by knoebi (Friar) on Jul 16, 2004 at 07:26 UTC
i don't know what you need to do with your file exactly, except sorting. you could give Tie-File a try. With it you can access every single line in the file, but not the whole file is loaded into the memory. ciao knoebi	[reply]
Re^2: Reducing Memory Usage by PerlingTheUK (Hermit) on Jul 16, 2004 at 07:42 UTC
I have looked at that option briefly but as I have had problems with running the substr function, I believe it is not a practical alternative. I need to compare the 3rd to 11th character (location) of each line with every other line, if these characters are the same, I need to compare a time (12th to 15th) character and sort all of the lines according to that time. I also need to convert the time that is in a strange format every time i read it, so this data preparation is quite time consuming, and I do not want to run it ever single time I need the value. Ciao PerlingTheUK	[reply]
Re^3: Reducing Memory Usage by Gilimanjaro (Hermit) on Jul 16, 2004 at 13:43 UTC
The facts I've found so far: every line is 80 characters exactly the lines need to be grouped by the field at offsets 2..10 the groups need to be sorted by the field at offsets 11..14 this second field is a coded time that needs to be decoded A possible strategy would be to first 'index' the file, by reading it line by line; (untested code follows) `my %index; my $line=0; while (<FILE>) { my ($location,$time) = /^..(.{9})(.{4})/; push @{$index{$location}},[$time,$line]; $line++; }` [download] This would results in a hash keyed on the 'location', with the value being a reference to an array with contain the info you need to sort the lines. This seems to be the minumum amount of info needed to determine the sort order. The next step is to sort the arrays by the time values, you've stored, and fetch the lines in order from the file: (untested code again) `for my $location (keys %index) { my @sorted = sort { $a->[0] <=> $b->[0]} @{$index{$location}}; for my $entry (@sorted) { seek FILE, 81 * $entry->[1], 0; read FILE, $line, 80; print $line,"\n"; } }` [download] This method should be very memory efficient I think, and not too slow either; the biggest slowdown is probably the seeking around in the file. This method works because we know the lengths of records. If we don't we could use the tell function before we read a line, to also store the exact start position of the line in the index...	[reply] [d/l] [select]
Re^4: Reducing Memory Usage by dragonchild (Archbishop) on Jul 16, 2004 at 13:48 UTC
Re: Reducing Memory Usage ( under 10%) by BrowserUk (Patriarch) on Jul 16, 2004 at 09:52 UTC
Okay. This is just a skeleton, but this creates 50,000 Bus objects, and gives each of them 33 x 80-byte timetables. All are individually getable and setable. All fully OO (externally). Total data: 50,000 * 33 * 80 = 125 MB. Total process memory consumed: 140 MB. Adding methods to manipulate the data is just a case of each method calling the get() routine and then splitting the data into it's constituent bits to manipulate. Trading a little time for memory. Read more... (1094 Bytes) Or if you need text-key access to your buses using a hash pushes it to 150 MB. Read more... (1129 Bytes) Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail "Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon	[reply] [d/l] [select]
Re: Reducing Memory Usage by mhi (Friar) on Jul 16, 2004 at 09:21 UTC
Since you say that the file size has increased from 5 to 125MB, I'll just guess it won't stop there... So, yes, a Database would be the way to go. If that is not feasible, you might want to create a sort-file from your original data that consists of the sorting criteria in a directly (ascii-)sortable fixed-length format starting at the beginning of the line and the original data afterwards, separated by a delimiter. This file can then be sorted by any simple sort program. (if you're on a unix box or have cygwin available, 'sort' should do the job easily and you can tweak the buffer size it uses for optimum performance on your box. After all, sorting files is exactly what it was written for!) After sorting, just filter out the sorting info and the delimiter again and you have your sorted data.	[reply]
Re^2: Reducing Memory Usage by PerlingTheUK (Hermit) on Jul 16, 2004 at 09:41 UTC
That sounds interesting, but I believe before starting that I will definitely go the database way. The size is likely to come to an end at 150 to 175 MByte. Thank You all for you answers. Anyway I am aware that Perl likes to be slightly !thriftless! when it comes to memory usage. Nevertheless would I like to know if there are any techniques known in Perl to reduce memory usage, (apart from those helping to avoid memory leaks). Does anyone around know any links, documentation, books about this and closely related problems?	[reply]
Re^3: Reducing Memory Usage by husker (Chaplain) on Jul 16, 2004 at 15:38 UTC
Your selected algorithm is the best way to control Perl's memory usage. First, I might suggest that you decode the "wierd" date in your file ONE time, by going through the large file once, and rewriting it to a new file with the "proper" date. Second, if your Perl program is just a sorting thing, (or that is at least a major function of it), then if it's a big enough problem, purchasing a dedicated specialized sort program for your OS might be a better investment. Syncsort is such a product that may fit your needs. There are versions for Windows and for most important flavors of UNIX.	[reply]
Re^4: Reducing Memory Usage by BrowserUk (Patriarch) on Jul 16, 2004 at 16:41 UTC
Re^2: Reducing Memory Usage by danielcid (Scribe) on Jul 16, 2004 at 13:17 UTC
I completely agree with you (mhi). Imagine a few months later, you loading a 200,300 or 400 MB file in the memory... It's crazy! There is so many free databases, like mysql. You should think carefully about it. -DBC	[reply]
Re: Reducing Memory Usage by Jonathan (Curate) on Jul 16, 2004 at 08:43 UTC
Have you thought of using DBD:SQLite? it comes with a self contained database and is said to be rather fast. Might be what you need	[reply]
Re: Reducing Memory Usage by Anonymous Monk on Jul 16, 2004 at 07:26 UTC
Not to put everything to memory. You need database.	[reply]
Re^2: Reducing Memory Usage by PerlingTheUK (Hermit) on Jul 16, 2004 at 07:33 UTC
Yep I know would be right what I want but my company does not like that, as it is believed (and true) to take a lot of time for administration.	[reply]
Re^3: Reducing Memory Usage by DrHyde (Prior) on Jul 16, 2004 at 08:25 UTC
If your company is unwilling to use the right too for the job, then you're out of luck. As for administration - no, it doesn't take much, if the database is dedicated to this program of yours and doesn't allow remote access. And even if it does take "a lot of time" you need to weigh that up against the costs of continuing as you are. And against the costs when your dataset grows even further.	[reply]
Re^3: Reducing Memory Usage by beable (Friar) on Jul 16, 2004 at 08:31 UTC
If your company doesn't want to use a database because it's too expensive, then they should have no problem in deciding for the cheaper option of getting the machine 2GB of RAM so that it can do the job required. 2GB of RAM can't cost much, can it? Not when you compare it to the time taken and expense of administering a database.	[reply]
Re^3: Reducing Memory Usage by JanneVee (Friar) on Jul 16, 2004 at 11:23 UTC
How much time goes to administrate a 125 Mb textfile?	[reply]
Re^3: Reducing Memory Usage by Anonymous Monk on Jul 17, 2004 at 00:54 UTC
Sorry if I'm missing something, but it's EASY AND FREE to setup a mySQL database. I did it on my laptop in under half-an-hour. Then you can create indexes and sort efficiently and yada..yada..yada. Besides, SQL is a helluvalot easier to learn than Perl.	[reply]
Re: Reducing Memory Usage by BrowserUk (Patriarch) on Jul 16, 2004 at 07:42 UTC
Whats the average length of string, and how many are there in your 125 MB file? Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail "Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon	[reply] [d/l]
Re^2: Reducing Memory Usage by PerlingTheUK (Hermit) on Jul 16, 2004 at 08:01 UTC
All lines are 80 chars, adding up to about 1,6 to 1.8 million lines.	[reply]
Re^3: Reducing Memory Usage by BrowserUk (Patriarch) on Jul 16, 2004 at 08:09 UTC
And how many (and what type) of objects does that translate to? Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail "Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon	[reply] [d/l]
Re^4: Reducing Memory Usage by PerlingTheUK (Hermit) on Jul 16, 2004 at 08:22 UTC
Re^5: Reducing Memory Usage by BrowserUk (Patriarch) on Jul 16, 2004 at 08:59 UTC
Re: Reducing Memory Usage by bunnyman (Hermit) on Jul 16, 2004 at 15:19 UTC
Anything that makes me understand how memory is allocated in scalars/arrays/hashes? Devel::Size	[reply]
Re: Reducing Memory Usage by Gilimanjaro (Hermit) on Jul 16, 2004 at 14:19 UTC
Another approach: (untested code follows) `my @objects; while (<FILE>) { my ($location,$time) = /^..(.{9})(.{4})/; push @objects, bless [ $location, $time, $location.$time, tell(FILE) ], MyObject; } package MyObject; sub overload cmp => sub { $_[0]->[2] cmp $_[1]->[2] }; sub location { return shift->[0] } sub time { return shift->[1] } sub record { seek FILE,shift->[3],0; my $b; read FILE,$b,80; return $b + }` [download] The overload would allow plain old sort to work on the array, and should be pretty fast as the keys to sort on are stored already. The time conversion could possible by done by a function which stores previously converted values in a hash, so you can do a cheap hash lookup instead of an expensive conversion for values you've already seen. You'll need to make sure the filehandle stays open, possibly in the MyObject package so the records can be retreived when they're actually needed</P.	[reply] [d/l]
Re: Reducing Memory Usage by periapt (Hermit) on Jul 20, 2004 at 12:22 UTC
I admit that I don't fully know your situation. However, I would think seriously about using a database if you don't need one. There is a lot of unrelated/unexpected overhead associated with a database. The costs in time, effort and learning curve can be high. That being said, if you need a database, generally, and believe this problem is a good one to convince your management to let you install one, then go for it. On the other hand, this seems to me to be a simple text manipulation problem. You've had a couple of excellent, low footprint solutions posted already. Take another look at them. I assume that you are reading and processing one file at a time. Basically, you need to 1. Use unix sort to sort each file (maybe into a temp file) on characters 2..10 (on Windows, use GNU utils sort, they are native windows ports of unix utilities) 2. using Perl, read in each group of lines and process accordingly. Since the records are already grouped, you would only need to read in the # of lines in a group + 1 ( 80 * (# of lines + 1)). For better performance, you can read in each file in chunks to meet a specified memeory size and process each group in a loop. Another alternative is to 1. read the file using Perl and writing each line to a unique id (pos 2..10) temporary files (maybe decoding pos 11..14 on the way). 2. sort each file on pos 11..14 and if necessary, cat them together to make a single file again. If you name the temp files properly, you can join the groups in any order you desire or need. Of course, none of these options are "sexy" per se but given the file sizes you mentioned, the solutions shouldn't take more than a minute or two to run and they don't take much overhead. Hope this helps PJ unspoken but ever present -- use strict; use warnings; use diagnostics; (if needed)	[reply]


Perl-Sensitive Sunglasses
	PerlMonks

Reducing Memory Usage

unspoken but ever present -- use strict; use warnings; use diagnostics; (if needed)