Alternate for "open"

by ravi45722 (Pilgrim)
on Nov 16, 2015 at 04:52 UTC

ravi45722 has asked for the wisdom of the Perl Monks concerning the following question:

-rw-r--r--. 1 mala mala 21M Mar 23 2015 SGSN225374.1.201412260345.206040
-rw-r--r--. 1 mala mala 21M Mar 23 2015 SGSN225374.1.201412260345.206059
-rw-r--r--. 1 mala mala 21M Mar 23 2015 SGSN225374.1.201412260345.818393
-rw-r--r--. 1 mala mala 21M Mar 23 2015 SGSN225374.1.201412260345.818410
-rw-r--r--. 1 mala mala 19M Mar 23 2015 SGSN225374.1.201412260345.818411

I am dealing with big files. If I use open, it occupies my primary memory. Is there any way to do this without occupying my memory?

FAIL_IN_RAU_MME_TO_2GSGSN,0;
4,12017,184,10373,405,25,100,0,0,0,5,0,25,64000,64000,
INT_ID,88320,
PERIOD_START_TIME,20141226034500,
PERIOD_STOP_TIME,20141226040003,
PERIOD_DURATION,15,
PAPU_INDEX,4,
LAC,12017,
RAC,184,
CI,10373,
SUCC_GPRS_ATTACH,1,
FAIL_GPRS_ATTACH,0,
SUCC_COMBINED_ATTACH,0,
FAIL_COMBINED_ATTACH,0,
SUCC_IMSI_ATTACH,0,
FAIL_IMSI_ATTACH,0,
GENERAL_UNDEF_ATTACH_FAILURE,0,
SUCC_INTRA_PAPU_RA_UPDAT,0,
FAIL_INTRA_PAPU_RA_UPDAT,0,
SUCC_INTRA_PAPU_RA_LA_UPDAT,0,
FAIL_INTRA_PAPU_RA_LA_UPDAT,0,
SUCC_INTRA_PAPU_RA_UPDAT_IMSI,0,
. . . .
FAIL_IN_RAU_MME_TO_2GSGSN,0;
4,12017,184,10383,405,25,100,0,0,0,5,0,25,64000,64000,

The above is sample data from my CDR. In it I need to detect the record-terminating lines such as FAIL_IN_RAU_MME_TO_2GSGSN,0; and for that I use:

$cmp=substr(reverse($line),0,1);
if ($cmp eq ";") {

But that event occurs only about once every 400 lines. Is there any better way to do that? Thanks in advance for every answer.
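For comparison, a sketch of two simpler ways to test for the trailing ";" (illustration only; $line is assumed to be already chomped):

my $line = "FAIL_IN_RAU_MME_TO_2GSGSN,0;";
if ( substr($line, -1) eq ';' ) {     # look at the last character directly, no reverse() needed
    print "end of record\n";
}
if ( $line =~ /;\z/ ) {               # the same test with an end-anchored regex
    print "end of record (regex)\n";
}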

Update:

As you suggested, I split the code into pieces and ran just the file-opening and file-reading part of my original code against sample data (5 files). Here I am printing the size of the hashes and the memory (RAM) usage while reading the files. As you said, reading the files is not taking any noticeable memory. But the hashes are using huge amounts of memory.

%rac_val :4465260 - that is roughly 4 MB, and I got it for just 5 files. But I have nearly 3800 files to process. This is what is eating my memory. How do I manage this?

Output:

perl generate_nsn_sgsn_report.pl 20141226
Memory Usage :  77.70%   # at the starting stage, the usage of RAM
120 120 120 120 120 120 120 120 120 120
Rac Val :120
120 120 120
Memory Usage :  77.70%   # checking the RAM usage after opening every file
Memory Usage :  77.71%
Memory Usage :  77.71%
Memory Usage :  77.72%
Memory Usage :  77.75%
%val :120
%seen_cdr :212
%seen_var :4152
%pap_data :1858
%mcc_data :629
%mnc_data :1085
%sgsn_val :102895
%sgsn_name :102893
%pap_val :699127
%mcc_val :231276
%rac_val :4465260
%nsei_data :120
%nsvci_data :120
%nsvc_val :120
the code took:49 wallclock secs (48.76 usr 0.06 sys + 0.01 cusr 0.02 csys = 48.85 CPU)

Replies are listed 'Best First'.
Re: Alternate for "open"
by 1nickt (Canon) on Nov 16, 2015 at 06:44 UTC

    Please show the code you are using to open and read your file.

    open() shouldn't use up your memory, but reading the file might. Perhaps you are trying to read it all into memory at once? That will cause problems with large files; better to just handle one line at a time in memory.

    Try something like:

    my $filepath = '/path/to/huge/file';
    open my $fh, '<:encoding(UTF-8)', $filepath or die $!;
    while ( my $line = <$fh> ) {
        # process one line at a time
    }
    close $fh or die $!;
    HTH!

    The way forward always starts with a minimal test.
      foreach $file (@cdr_list) {
          chomp $file;
          open (FP,"$file") or die "Could not open $file\n";
          $first=1;
          while ($line=<FP>){
              chomp $line;

      Here is the code I am using to read the file. Actually I am running this program on a server which has 64GB of RAM. Whenever the program starts running, the memory usage jumps from around 25-30% to about 98%. I think it's because of the files. Am I correct, or do I have to concentrate on other parts?

        You could try to confirm your assumption by running the above minimal code, only reading the file and doing nothing else.

        Well, with the information you've added (the RAM on your box, the size of the files) it's clear that opening a file is not going to use up all your memory.

        Did you try Corion's suggestion and remove all code except the loop through the files and the open() and close() statements?

        From here it looks a lot like it's the rest of your code (that you didn't share) that is causing the problem. Once you've verified for yourself that it's not the open() call, you'll have to share the actual code you're using if you would like to get help.

        The way forward always starts with a minimal test.

        Your lack of use strict; use warnings; is likely hiding a problem from you

Re: Alternate for "open"
by 2teez (Vicar) on Nov 16, 2015 at 06:54 UTC

    Hi ravi45722

    ..I am dealing with bigger files. If I use open its occupying my primary memory..

    How is that so? Have you profiled your code?
    More to the point, I don't see how you are using open in the code shown in your post.

    If I haven't misunderstood your requirement, there are several alternatives for dealing with (munging) large files: File::Slurper, Path::Tiny and others.
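    For example, a minimal sketch using Path::Tiny to get a read handle and still process one line at a time (the file path here is made up):

        use strict;
        use warnings;
        use Path::Tiny;

        my $fh = path('/path/to/huge/file')->openr_utf8;   # plain read filehandle
        while ( my $line = <$fh> ) {
            chomp $line;
            # process one line at a time
        }

    (File::Slurper, by contrast, reads the whole file at once, so for files this size a line-by-line handle is the safer choice.)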

    If you tell me, I'll forget.
    If you show me, I'll remember.
    if you involve me, I'll understand.
    --- Author unknown to me
Re: Alternate for "open"
by Preceptor (Deacon) on Nov 16, 2015 at 17:24 UTC

    open doesn't consume a significant amount of memory. It will be something else. I would suggest the usual culprits in this scenario are:

    • foreach ( <$file_handle> ) { ... }, which reads the whole file into a list before iterating. You should use while instead (see the sketch below).
    • Too wide scope on something that's getting updated as part of your processing - and thus steadily growing as each file is processed.
    • use strict; and use warnings; are just generally a good idea - the snippet you quoted doesn't suggest you're doing that.

    Of course - without some SAMPLE CODE - we can only speculate as to what's causing your problem. It isn't open though.
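    As a minimal sketch of the first point (the file name is made up), only the while form reads one line at a time:

        use strict;
        use warnings;

        open my $fh, '<', 'sample.cdr' or die "Could not open sample.cdr: $!";

        # foreach my $line (<$fh>) { ... }   # slurps the whole file into a list first

        while ( my $line = <$fh> ) {         # reads one line per iteration instead
            chomp $line;
            # process $line
        }
        close $fh;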

      Yes, here is the clue: it's the hashes. I am building hashes to store all my result values. The output is two Excel workbooks (3G data, 2G data). Each contains 23 sheets, and each sheet contains about 11,222 values (31 columns, 362 rows) on average (not exactly). Does that really need 64 GB of RAM? If it does, how can we reduce it?

        You've now been given several suggestions:

        • use strict;
        • use warnings;
        • Test by simplifying the script so that it only opens the files
        • Make sure to use while not foreach to read your filehandles
        • Post verbatim snippets of the code here so it can be reviewed
        Which of these have you done?

        In particular, does your code contain use strict; and use warnings;?

        It's not a completely unreasonable idea to try to measure the memory footprint of your hash, but most monks here would not do that to find the problem. Better to simplify your code so you can identify the problem.
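        If you do decide to measure it, a minimal sketch with Devel::Size (the sample hash is made up):

            use strict;
            use warnings;
            use Devel::Size qw(total_size);

            my %rac_val = ( '2014122603' => { SGSN225374 => { 405 => { 25 => { 184 => 0 } } } } );

            # total_size() walks the whole structure, nested hashes included
            print 'rac_val: ', total_size(\%rac_val), " bytes\n";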

        Just remove everything until it runs properly, then start adding stuff back in. If it is a really large and ugly codebase, take the opportunity to refactor and move code out into modules. This is better practise for many reasons and will help you do this kind of debugging by making it easy to use and not use parts of the code.

        You could also:

        • If you suspect the hash is getting too big, comment out the code that populates it, run the program, and see if there's a difference.
        • Try running the program on only one file and see if there's a difference.
        • Try running the program on lots of very small files and see if there's a difference.
        • Consider loading your file data into a real database such as SQLite and working from there.
        • Look for memory leaks with Test::LeakTrace (a small sketch follows this list)
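        For the last point, a minimal Test::LeakTrace sketch (process_one_file is a made-up stand-in for your per-file processing):

            use strict;
            use warnings;
            use Test::More;
            use Test::LeakTrace;

            sub process_one_file {
                my ($file) = @_;
                my %local = ( $file => 1 );   # per-file work happens here
                return scalar keys %local;
            }

            no_leaks_ok { process_one_file('sample.cdr') } 'processing one file does not leak';
            done_testing();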

        There, now you have a bunch more suggestions. It will be nice to hear back from you when you've tried some of them and you are still stuck.

        The way forward always starts with a minimal test.

        It depends on how inefficiently you're storing the data - you can _certainly_ incur overheads - for example, XML in memory is around 10x the footprint of the file at rest.

        But at least now we've moved on from blaming open - check what you're inserting into the hash. How many key/value pairs? Are you creating nested data structures? (hash of arrays, etc.)? Because all these things add up - you _can_ expect the memory required to be larger than the raw input.
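        One quick way to answer the "how many key/value pairs" question for a nested hash-of-hashes is to count the leaves; a small sketch (the sample data is made up):

            use strict;
            use warnings;

            # Recursively count the leaf values in a nested hash-of-hashes.
            sub count_leaves {
                my ($ref) = @_;
                return 1 unless ref $ref eq 'HASH';
                my $n = 0;
                $n += count_leaves($_) for values %$ref;
                return $n;
            }

            my %rac_val = ( d1 => { s1 => { 405 => { 25 => { 184 => { VAR => 0 } } } } } );
            print count_leaves(\%rac_val), " leaf value(s)\n";

        Every level of nesting is another hash with its own bookkeeping, so the memory cost is driven by the number of intermediate hashes as much as by the leaf values themselves.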

Re: Alternate for "open"
by u65 (Chaplain) on Nov 16, 2015 at 11:38 UTC

    Are your data files binary or text? Can you show us a small sample input file?

    Update: I see by your code now that the files are text. Is it possible that they have such long lines that reading via the diamond operator is taking a huge chunk of memory?

    Update 2: Duh, I'm still asleep--please forgive my incoherent rambling about things I failed to notice in the OP.

Re: Alternate for "open"
by 1nickt (Canon) on Nov 17, 2015 at 15:02 UTC

    Hey there, thanks for posting that. You still haven't posted the relevant code portion, so we're still guessing here.

    (If your code is "too complicated" for you to be able to copy and paste 10 or 20 lines that contain the functionality you suspect for the memory hogging, then that's an indication that you should simplify the structure of the code).

    I'm not a sysadmin but it seems to me that if your 64GB RAM server is running at 75% memory usage before you start, it's working rather hard. (On the other hand 25% of 64Gb is still a lot and you should be able to process any number of text files if you code it right.) Currently my busiest server which runs three different daemons forking processes, scraping websites, loading and processing data, and pushing the data to external apis, is only using about 6GB RAM.

    So looking at the rest of the data you posted, it seems more likely than ever that you need a database for your data. Basically your hash is a database, but it's not up to the task.

    If you have 3800 files, and if you need all the data from all the files to accomplish your task, and if reading and temporarily processing a file can add 4Mb RAM usage, you really need to use a database.

    You probably already have an RDBMS on the server, but I'd start with SQLite anyway for its light memory footprint and ease of use.
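    A minimal sketch of the idea with DBI and DBD::SQLite (the table and column names are made up; the real schema would follow your counters):

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:SQLite:dbname=cdr.sqlite', '', '',
                               { RaiseError => 1, AutoCommit => 0 });

        $dbh->do(q{
            CREATE TABLE IF NOT EXISTS counters (
                cdr_date TEXT, sgsn_id TEXT, var TEXT, value INTEGER
            )
        });

        my $sth = $dbh->prepare(
            'INSERT INTO counters (cdr_date, sgsn_id, var, value) VALUES (?, ?, ?, ?)'
        );

        # inside the per-line loop, insert a row instead of growing a hash
        $sth->execute('2014122603', 'SGSN225374', 'SUCC_GPRS_ATTACH', 1);
        $dbh->commit;

        # aggregate later with SQL instead of in RAM
        my ($total) = $dbh->selectrow_array(
            'SELECT SUM(value) FROM counters WHERE var = ?', undef, 'SUCC_GPRS_ATTACH'
        );
        print "SUCC_GPRS_ATTACH total: $total\n";
        $dbh->disconnect;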

    It's very unlikely that you need to hold all the data in memory to do your work, and there are many good reasons why you shouldn't. For example, when you create a hash (or any data structure) in Perl, that memory is not released until the variable goes out of scope. So if you are creating a global hash to store the file data, and then reading from the hash once (for example, to add to an aggregate hash), and then not using the hash again, you're not getting that memory back before the program finishes. This can cause you to run out of memory quickly. So if your program looks anything like:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %h1;
    my %h2;
    # etc

    # a bunch of code here to setup and prepare,
    # maybe find the list of files

    foreach my $file ( @files ) {
        # open ...
        # split ...
        $h1{ $file } = $splitted[1];
        $h2{ $file } = $splitted[2];
        # ... and so on

        my $res1 = my_func( %h1 );    # don't do this
        my $res2 = my_func( %h2 );
        # ...
        if ( some condition ) {
            # do something with $res1 and $res2
        }

        # continue,
        # lots of code
        # working hard
        # but never using the hashes again
    }
    ... then you are allocating memory to the hashes before you need to (though they will be empty) and keeping the memory allocated long after the usefulness of the hash has expired.

    So declare and use your variables in the smallest possible scope.
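    Continuing the pseudocode above, a sketch of the same loop with the hashes scoped to a single iteration, so the memory can be reused after each file (my_func is the same placeholder as before):

        foreach my $file ( @files ) {
            my %h1;                        # declared inside the loop:
            my %h2;                        # released for reuse when this iteration ends
            # ... populate %h1 and %h2 from this file only ...
            my $res1 = my_func( \%h1 );    # pass a reference, as discussed below
            my $res2 = my_func( \%h2 );
            # do something with $res1 and $res2
        }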

    Also don't pass around actual data structures (as I do in the above example, to show bad practise), because then Perl has to make another copy of the data. Pass references to them, ie use:

    my $res = my_func( \%hash ); # not: my_func( %hash );

    To take that even further, consider off-loading the work of processing the files to another program, so that all system resources are released when the processing is done.
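    A minimal sketch of that idea (process_one_cdr.pl is a hypothetical worker script that writes a small per-file summary): run a separate process per file and let the operating system reclaim all of its memory when it exits:

        use strict;
        use warnings;

        my @cdr_list = @ARGV;    # the list of CDR files to process

        foreach my $file (@cdr_list) {
            system('perl', 'process_one_cdr.pl', $file) == 0
                or die "worker failed for $file: $?";
        }
        # afterwards, merge the small per-file summaries instead of holding everything in RAM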

    If the above tips seem random and maybe unrelated, it's because you haven't shared the code, so I'm just throwing out random and maybe unrelated tips.

    The way forward always starts with a minimal test.

      Sorry, I tested it on the local server; I can't run all my experiments on the main server. So forget about the starting memory usage and look at how the hash sizes grow. As you asked, I am posting the piece of code I am using to read the files.

      1) I am not passing any hashes to functions.
      2) I can't release the scope of those hashes because I need them to write the Excel output at the end of the code.

      foreach $file (@cdr_list) {
          chomp $file;
          my $memgthy = `free -m | awk 'NR==2{printf " %.2f%",\$3*100/\$2 }'`;
          print "Memory Usage : ",$memgthy,$/;
          open (FP,"$file") or die "Could not open $file\n";
          $first=1;
          while ($line=<FP>){
              chomp $line;
              if ($first==1){
                  ($sgsn_id,$x,$time,$x)=split(/\,/,$line);
                  push(@sgsn_list,$sgsn_id) unless $seen_sgs{$sgsn_id}++;
                  $cdr_date=substr($time,0,10);
                  $single_day=substr($time,0,8);
                  push (@date_list,$cdr_date) unless $seen_cdr{$cdr_date}++;
                  push (@single_day_list,$single_day) unless $seen{$single_day}++;
                  $first++;
              }
              else{
                  #FAIL_IN_RAU_MME_TO_2GSGSN,0;
                  $cmp=substr(reverse($line),0,1);
                  if ($cmp eq ";") {
                      ($var,$value,$x)=split(/\,/,$line);
                      push (@variable_list,$var) unless $seen_var{$var}++;
                      $ass_val{$var}=$value;
                      $pap_data{$cdr_date}{$sgsn_id}{$pap_id}=0;
                      if ($mcc ne "" && length($mcc)<4){
                          $mcc_data{$cdr_date}{$sgsn_id}{$mcc}=0;
                      }
                      if ($mnc ne ""){
                          $mnc_data{$cdr_date}{$sgsn_id}{$mcc}{$mnc}=0;
                      }
                      if ($rac ne ""){
                          $rac_data{$cdr_date}{$sgsn_id}{$mcc}{$mnc}{$rac}=0;
                      }
                      if ($nsei ne ""){
                          $nsei_data{$cdr_date}{$periodic_duration}{$sgsn_id}{$pap_id}{$nsei}=0;
                      }
                      if ($nsvci ne ""){
                          $nsvci_data{$cdr_date}{$periodic_duration}{$sgsn_id}{$pap_id}{$nsei}{$nsvci}=0;
                      }
                      foreach $var (@variable_list){
                          $value=$ass_val{$var};
                          if ($var eq "RTT_DUR_ATTACH_MIN" || $var eq "RTT_DUR_ATTACH_MAX"
                              || $var eq "PEAK_GB_PDP_Cont" || $var eq "PEAK_ATTACH_GB_USERS"
                              || $var eq "PEAK_ACTIVE_SUBS_PER_PAPU" || $var eq "PEAK_ACTIVE_GB_PDP_CONTEXTS"
                              || $var eq "DUR_MO_PDP_MOD_MIN" || $var eq "DUR_MO_PDP_MOD_MAX"
                              || $var eq "PEAK_ATTACH_IU_USERS" || $var eq "PEAK_ACTIVE_IU_PDP_CONTEXTS"
                              || $var eq "PEAK_IU_PDP_CONT" || $var eq "PEAK_LOAD_RATE_OF_OBJECT"
                              || $var eq "PEAK_GB_PDP_CONT"){
                              $sgsn_name{$single_day}{$sgsn_id}{$var}{$value}=0;
                              $sgsn_val{$cdr_date}{$sgsn_id}{$var}{$value}=0;
                              $pap_val{$cdr_date}{$sgsn_id}{$pap_id}{$var}{$value}=0;
                              $mcc_val{$cdr_date}{$sgsn_id}{$mcc}{$mnc}{$var}{$value}=0;
                              $rac_val{$cdr_date}{$sgsn_id}{$mcc}{$mnc}{$rac}{$var}{$value}=0;
                          }
                          else {
                              if ($mcc == 405){
                                  $circle_val{$single_day}{$mnc}{$var}=$value + $circle_val{$single_day}{$mnc}{$var};
                              }
                              $sgsn_name{$single_day}{$sgsn_id}{$var}=$value + $sgsn_name{$single_day}{$sgsn_id}{$var};
                              $sgsn_val{$cdr_date}{$sgsn_id}{$var}=$value + $sgsn_val{$cdr_date}{$sgsn_id}{$var};
                              $pap_val{$cdr_date}{$sgsn_id}{$pap_id}{$var}=$value + $pap_val{$cdr_date}{$sgsn_id}{$pap_id}{$var};
                              $mcc_val{$cdr_date}{$sgsn_id}{$mcc}{$mnc}{$var}=$value + $mcc_val{$cdr_date}{$sgsn_id}{$mcc}{$mnc}{$var};
                              $rac_val{$cdr_date}{$sgsn_id}{$mcc}{$mnc}{$rac}{$var}=$value + $rac_val{$cdr_date}{$sgsn_id}{$mcc}{$mnc}{$rac}{$var};
                              if ($var eq "IP_NSVC_PASSED_DATA_IN_BYTES"){
                                  $nsvc_val{$cdr_date}{$periodic_duration}{$sgsn_id}{$pap_id}{$nsei}{$nsvci}{$var}=$value + $nsvc_val{$cdr_date}{$periodic_duration}{$sgsn_id}{$pap_id}{$nsei}{$nsvci}{$var};
                              }
                          }
                      }
                      @variable_list=();
                      %seen_var=();
                      %ass_val=();
                      $pap_id="";
                      $mcc="";
                      $mnc="";
                      $rac="";
                      $nsei="";
                      $nsvci="";
                  }
                  else {
                      ($var,$value,$x)=split(/\,/,$line);
                      if ($var eq "PAPU_INDEX"){
                          $pap_id=$value;
                      }elsif ($var eq "MCC" || $var eq "IU_RA_MCC"){
                          $mcc=$value;
                      }elsif ($var eq "MNC" || $var eq "IU_RA_MNC"){
                          $mnc=$value;
                      }elsif ($var eq "RAC"){
                          $rac=$value;
                      }elsif ($var eq "NSEI"){
                          $nsei=$value;
                      }elsif ($var eq "NSVCI"){
                          $nsvci=$value;
                      }elsif ($var eq "PERIOD_DURATION"){
                          $periodic_duration=$value;
                          push (@periodic_list,$periodic_duration) unless $seen_dur{$periodic_duration}++;
                      }elsif ($var eq "OBJECT_NAME"){
                          $object_name=$value;
                          $object_data{$cdr_date}{$sgsn_id}{$object_name}=0;
                      }elsif ($var eq "OBJECT_INDEX"){
                          $object_index=$value;
                          $objectindex_data{$cdr_date}{$sgsn_id}{$object_name}{$object_index}=0;
                      }elsif ($var eq "PEAK_LOAD_RATE_OF_OBJECT"){
                          $peak_load_data{$cdr_date}{$sgsn_id}{$object_name}{$object_index}{$var}{$value}=0;
                      }elsif ($var eq "AVE_LOAD_RATE_SUM" || $var eq "AVE_LOAD_RATE_DEN"){
                          $peak_load_data{$cdr_date}{$sgsn_id}{$object_name}{$object_index}{$var}=
                              $peak_load_data{$cdr_date}{$sgsn_id}{$object_name}{$object_index}{$var} + $value;
                      } else {
                          push (@variable_list,$var) unless $seen_var{$var}++;
                          $ass_val{$var}=$value;
                      }
                  }
              }
          }
          close (FP);
          FP->flush();
      }
      my $total_size = total_size(\%val);
      print "\%val :",$total_size,$/;
      my $total_size = total_size(\%seen_cdr);
      print "\%seen_cdr :",$total_size,$/;
      my $total_size = total_size(\%seen_var);
      print "\%seen_var :",$total_size,$/;
      my $total_size = total_size(\%pap_data);
      print "\%pap_data :",$total_size,$/;
      my $total_size = total_size(\%mcc_data);
      print "%mcc_data :",$total_size,$/;
      my $total_size = total_size(\%mnc_data);
      print "\%mnc_data :",$total_size,$/;
      my $total_size = total_size(\%sgsn_val);
      print "\%sgsn_val :",$total_size,$/;
      my $total_size = total_size(\%sgsn_name);
      print "\%sgsn_name :",$total_size,$/;
      my $total_size = total_size(\%pap_val);
      print "\%pap_val :",$total_size,$/;
      my $total_size = total_size(\%mcc_val);
      print "\%mcc_val :",$total_size,$/;
      my $total_size = total_size(\%rac_val);
      print "\%rac_val :",$total_size,$/;
      my $total_size = total_size(\%nsei_data);
      print "\%nsei_data :",$total_size,$/;
      my $total_size = total_size(\%nsvci_data);
      print "\%nsvci_data :",$total_size,$/;
      my $total_size = total_size(\%nsvc_val);
      print "\%nsvc_val :",$total_size,$/;
      my $t1 = Benchmark->new;
      my $td = timediff($t1, $t0);
      print "the code took:",timestr($td),"\n";
