Alternate for "open"

by ravi45722 (Pilgrim)
on Nov 16, 2015 at 04:52 UTC

ravi45722 has asked for the wisdom of the Perl Monks concerning the following question:

-rw-r--r--. 1 mala mala 21M Mar 23 2015 SGSN225374.1.201412260345.206040
-rw-r--r--. 1 mala mala 21M Mar 23 2015 SGSN225374.1.201412260345.206059
-rw-r--r--. 1 mala mala 21M Mar 23 2015 SGSN225374.1.201412260345.818393
-rw-r--r--. 1 mala mala 21M Mar 23 2015 SGSN225374.1.201412260345.818410
-rw-r--r--. 1 mala mala 19M Mar 23 2015 SGSN225374.1.201412260345.818411

I am dealing with big files. If I use open, it occupies my primary memory. Is there any way to do this without occupying my memory?

FAIL_IN_RAU_MME_TO_2GSGSN,0;
4,12017,184,10373,405,25,100,0,0,0,5,0,25,64000,64000,
INT_ID,88320,
PERIOD_START_TIME,20141226034500,
PERIOD_STOP_TIME,20141226040003,
PERIOD_DURATION,15,
PAPU_INDEX,4,
LAC,12017,
RAC,184,
CI,10373,
SUCC_GPRS_ATTACH,1,
FAIL_GPRS_ATTACH,0,
SUCC_COMBINED_ATTACH,0,
FAIL_COMBINED_ATTACH,0,
SUCC_IMSI_ATTACH,0,
FAIL_IMSI_ATTACH,0,
GENERAL_UNDEF_ATTACH_FAILURE,0,
SUCC_INTRA_PAPU_RA_UPDAT,0,
FAIL_INTRA_PAPU_RA_UPDAT,0,
SUCC_INTRA_PAPU_RA_LA_UPDAT,0,
FAIL_INTRA_PAPU_RA_LA_UPDAT,0,
SUCC_INTRA_PAPU_RA_UPDAT_IMSI,0,
. . . .
FAIL_IN_RAU_MME_TO_2GSGSN,0;
4,12017,184,10383,405,25,100,0,0,0,5,0,25,64000,64000,

The above is sample data from my CDR. In it I need to detect the record-terminating lines such as FAIL_IN_RAU_MME_TO_2GSGSN,0; and for that I use:

$cmp=substr(reverse($line),0,1);
if ($cmp eq ";") {

But that event occurs only about once every 400 lines. Is there any better way to do that? Thanks in advance for every answer.
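For comparison, a sketch of two simpler ways to test for the trailing ";" (illustration only; $line is assumed to be already chomped):

my $line = "FAIL_IN_RAU_MME_TO_2GSGSN,0;";
if ( substr($line, -1) eq ';' ) {     # look at the last character directly, no reverse() needed
    print "end of record\n";
}
if ( $line =~ /;\z/ ) {               # the same test with an end-anchored regex
    print "end of record (regex)\n";
}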

Update:

As you suggested, I split the code into pieces and ran just the file-opening and file-reading part of my original code against sample data (5 files). Here I am printing the size of the hashes and the memory (RAM) usage while reading the files. As you said, reading the files is not taking any noticeable memory. But the hashes are using huge amounts of memory.

%rac_val :4465260 - that is roughly 4 MB, and I got it for just 5 files. But I have nearly 3800 files to process. This is what is eating my memory. How do I manage this?

Output:

perl generate_nsn_sgsn_report.pl 20141226
Memory Usage :  77.70%   # at the starting stage, the usage of RAM
120 120 120 120 120 120 120 120 120 120
Rac Val :120
120 120 120
Memory Usage :  77.70%   # checking the RAM usage after opening every file
Memory Usage :  77.71%
Memory Usage :  77.71%
Memory Usage :  77.72%
Memory Usage :  77.75%
%val :120
%seen_cdr :212
%seen_var :4152
%pap_data :1858
%mcc_data :629
%mnc_data :1085
%sgsn_val :102895
%sgsn_name :102893
%pap_val :699127
%mcc_val :231276
%rac_val :4465260
%nsei_data :120
%nsvci_data :120
%nsvc_val :120
the code took:49 wallclock secs (48.76 usr 0.06 sys + 0.01 cusr 0.02 csys = 48.85 CPU)

Replies are listed 'Best First'.
Re: Alternate for "open"
by 1nickt (Canon) on Nov 16, 2015 at 06:44 UTC

    Please show the code you are using to open and read your file.

    open() shouldn't use up your memory, but reading the file might. Perhaps you are trying to read it all into memory at once? That will cause problems with large files; better to just handle one line at a time in memory.

    Try something like:

    my $filepath = '/path/to/huge/file';
    open my $fh, '<:encoding(UTF-8)', $filepath or die $!;
    while ( my $line = <$fh> ) {
        # process one line at a time
    }
    close $fh or die $!;
    HTH!

    The way forward always starts with a minimal test.
      foreach $file (@cdr_list) {
          chomp $file;
          open (FP,"$file") or die "Could not open $file\n";
          $first=1;
          while ($line=<FP>){
              chomp $line;

      Here is the code I am using to read the file. Actually I am running this program on a server which has 64GB of RAM. Whenever the program starts running, the memory usage jumps from around 25-30% to about 98%. I think it's because of the files. Am I correct, or do I have to concentrate on other parts?

        You could try to confirm your assumption by running the above minimal code, only reading the file and doing nothing else.

        Well, with the information you've added (the RAM on your box, the size of the files) it's clear that opening a file is not going to use up all your memory.

        Did you try Corion's suggestion and remove all code except the loop through the files and the open() and close() statements?

        From here it looks a lot like it's the rest of your code (that you didn't share) that is causing the problem. Once you've verified for yourself that it's not the open() call, you'll have to share the actual code you're using if you would like to get help.

        The way forward always starts with a minimal test.

        Your lack of use strict; use warnings; is likely hiding a problem from you

Re: Alternate for "open"
by 2teez (Vicar) on Nov 16, 2015 at 06:54 UTC

    Hi ravi45722

    ..I am dealing with bigger files. If I use open its occupying my primary memory..

    How is that so? Have you profiled your code?
    More to the point, I don't see how you are using open in the code shown in your post.

    If I haven't misunderstood your requirement, there are several alternatives for dealing with (munging) large files: File::Slurper, Path::Tiny and others.
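    For example, a minimal sketch using Path::Tiny to get a read handle and still process one line at a time (the file path here is made up):

        use strict;
        use warnings;
        use Path::Tiny;

        my $fh = path('/path/to/huge/file')->openr_utf8;   # plain read filehandle
        while ( my $line = <$fh> ) {
            chomp $line;
            # process one line at a time
        }

    (File::Slurper, by contrast, reads the whole file at once, so for files this size a line-by-line handle is the safer choice.)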

    If you tell me, I'll forget.
    If you show me, I'll remember.
    if you involve me, I'll understand.
    --- Author unknown to me
Re: Alternate for "open"
by Preceptor (Deacon) on Nov 16, 2015 at 17:24 UTC

    open doesn't consume a significant amount of memory. It will be something else. I would suggest the usual culprits in this scenario are:

    • foreach ( <$file_handle> ) { ... }, which reads the whole file into a list before iterating. You should use while instead (see the sketch below).
    • Too wide scope on something that's getting updated as part of your processing - and thus steadily growing as each file is processed.
    • use strict; and use warnings; are just generally a good idea - the snippet you quoted doesn't suggest you're doing that.

    Of course - without some SAMPLE CODE - we can only speculate as to what's causing your problem. It isn't open though.
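    As a minimal sketch of the first point (the file name is made up), only the while form reads one line at a time:

        use strict;
        use warnings;

        open my $fh, '<', 'sample.cdr' or die "Could not open sample.cdr: $!";

        # foreach my $line (<$fh>) { ... }   # slurps the whole file into a list first

        while ( my $line = <$fh> ) {         # reads one line per iteration instead
            chomp $line;
            # process $line
        }
        close $fh;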

      Yes, here is the clue: it's the hashes. I am building hashes to store all my result values. The output is two Excel workbooks (3G data, 2G data). Each contains 23 sheets, and each sheet contains about 11,222 values (31 columns, 362 rows) on average (not exactly). Does that really need 64 GB of RAM? If it does, how can we reduce it?

        You've now been given several suggestions:

        • use strict;
        • use warnings;
        • Test by simplifying the script so that it only opens the files
        • Make sure to use while not foreach to read your filehandles
        • Post verbatim snippets of the code here so it can be reviewed
        Which of these have you done?

        In particular, does your code contain use strict; and use warnings;?

        It's not a completely unreasonable idea to try to measure the memory footprint of your hash, but most monks here would not do that to find the problem. Better to simplify your code so you can identify the problem.
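        If you do decide to measure it, a minimal sketch with Devel::Size (the sample hash is made up):

            use strict;
            use warnings;
            use Devel::Size qw(total_size);

            my %rac_val = ( '2014122603' => { SGSN225374 => { 405 => { 25 => { 184 => 0 } } } } );

            # total_size() walks the whole structure, nested hashes included
            print 'rac_val: ', total_size(\%rac_val), " bytes\n";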

        Just remove everything until it runs properly, then start adding stuff back in. If it is a really large and ugly codebase, take the opportunity to refactor and move code out into modules. This is better practise for many reasons and will help you do this kind of debugging by making it easy to use and not use parts of the code.

        You could also:

        • If you suspect the hash is getting too big, comment out the code that populates it, run the program, and see if there's a difference.
        • Try running the program on only one file and see if there's a difference.
        • Try running the program on lots of very small files and see if there's a difference.
        • Consider loading your file data into a real database such as SQLite and working from there.
        • Look for memory leaks with Test::LeakTrace (a small sketch follows this list)
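        For the last point, a minimal Test::LeakTrace sketch (process_one_file is a made-up stand-in for your per-file processing):

            use strict;
            use warnings;
            use Test::More;
            use Test::LeakTrace;

            sub process_one_file {
                my ($file) = @_;
                my %local = ( $file => 1 );   # per-file work happens here
                return scalar keys %local;
            }

            no_leaks_ok { process_one_file('sample.cdr') } 'processing one file does not leak';
            done_testing();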

        There, now you have a bunch more suggestions. It will be nice to hear back from you when you've tried some of them and you are still stuck.

        The way forward always starts with a minimal test.

        It depends on how inefficiently you're storing the data - you can _certainly_ incur overheads - for example, XML in memory is around 10x the footprint of the file at rest.

        But at least now we've moved on from blaming open - check what you're inserting into the hash. How many key/value pairs? Are you creating nested data structures? (hash of arrays, etc.)? Because all these things add up - you _can_ expect the memory required to be larger than the raw input.
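        One quick way to answer the "how many key/value pairs" question for a nested hash-of-hashes is to count the leaves; a small sketch (the sample data is made up):

            use strict;
            use warnings;

            # Recursively count the leaf values in a nested hash-of-hashes.
            sub count_leaves {
                my ($ref) = @_;
                return 1 unless ref $ref eq 'HASH';
                my $n = 0;
                $n += count_leaves($_) for values %$ref;
                return $n;
            }

            my %rac_val = ( d1 => { s1 => { 405 => { 25 => { 184 => { VAR => 0 } } } } } );
            print count_leaves(\%rac_val), " leaf value(s)\n";

        Every level of nesting is another hash with its own bookkeeping, so the memory cost is driven by the number of intermediate hashes as much as by the leaf values themselves.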

Re: Alternate for "open"
by u65 (Chaplain) on Nov 16, 2015 at 11:38 UTC

    Are your data files binary or text? Can you show us a small sample input file?

    Update: I see by your code now that the files are text. Is it possible that they have such long lines that reading via the diamond operator is taking a huge chunk of memory?

    Update 2: Duh, I'm still asleep--please forgive my incoherent rambling about things I failed to notice in the OP.

Re: Alternate for "open"
by 1nickt (Canon) on Nov 17, 2015 at 15:02 UTC

    Hey there, thanks for posting that. You still haven't posted the relevant code portion, so we're still guessing here.

    (If your code is "too complicated" for you to be able to copy and paste 10 or 20 lines that contain the functionality you suspect for the memory hogging, then that's an indication that you should simplify the structure of the code).

    I'm not a sysadmin but it seems to me that if your 64GB RAM server is running at 75% memory usage before you start, it's working rather hard. (On the other hand 25% of 64Gb is still a lot and you should be able to process any number of text files if you code it right.) Currently my busiest server which runs three different daemons forking processes, scraping websites, loading and processing data, and pushing the data to external apis, is only using about 6GB RAM.

    So looking at the rest of the data you posted, it seems more likely than ever that you need a database for your data. Basically your hash is a database, but it's not up to the task.

    If you have 3800 files, and if you need all the data from all the files to accomplish your task, and if reading and temporarily processing a file can add 4Mb RAM usage, you really need to use a database.

    You probably already have an RDBMS on the server, but I'd start with SQLite anyway for its light memory footprint and ease of use.
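    A minimal sketch of the idea with DBI and DBD::SQLite (the table and column names are made up; the real schema would follow your counters):

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:SQLite:dbname=cdr.sqlite', '', '',
                               { RaiseError => 1, AutoCommit => 0 });

        $dbh->do(q{
            CREATE TABLE IF NOT EXISTS counters (
                cdr_date TEXT, sgsn_id TEXT, var TEXT, value INTEGER
            )
        });

        my $sth = $dbh->prepare(
            'INSERT INTO counters (cdr_date, sgsn_id, var, value) VALUES (?, ?, ?, ?)'
        );

        # inside the per-line loop, insert a row instead of growing a hash
        $sth->execute('2014122603', 'SGSN225374', 'SUCC_GPRS_ATTACH', 1);
        $dbh->commit;

        # aggregate later with SQL instead of in RAM
        my ($total) = $dbh->selectrow_array(
            'SELECT SUM(value) FROM counters WHERE var = ?', undef, 'SUCC_GPRS_ATTACH'
        );
        print "SUCC_GPRS_ATTACH total: $total\n";
        $dbh->disconnect;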

    It's very unlikely that you need to hold all the data in memory to do your work, and there are many good reasons why you shouldn't. For example, when you create a hash (or any data structure) in Perl, that memory is not released until the variable goes out of scope. So if you are creating a global hash to store the file data, and then reading from the hash once (for example, to add to an aggregate hash), and then not using the hash again, you're not getting that memory back before the program finishes. This can cause you to run out of memory quickly. So if your program looks anything like:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %h1;
    my %h2;
    # etc

    # a bunch of code here to setup and prepare,
    # maybe find the list of files

    foreach my $file ( @files ) {
        # open ...
        # split ...
        $h1{ $file } = $splitted[1];
        $h2{ $file } = $splitted[2];
        # ... and so on

        my $res1 = my_func( %h1 );    # don't do this
        my $res2 = my_func( %h2 );
        # ...
        if ( some condition ) {
            # do something with $res1 and $res2
        }

        # continue,
        # lots of code
        # working hard
        # but never using the hashes again
    }
    ... then you are allocating memory to the hashes before you need to (though they will be empty) and keeping the memory allocated long after the usefulness of the hash has expired.

    So declare and use your variables in the smallest possible scope.
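    Continuing the pseudocode above, a sketch of the same loop with the hashes scoped to a single iteration, so the memory can be reused after each file (my_func is the same placeholder as before):

        foreach my $file ( @files ) {
            my %h1;                        # declared inside the loop:
            my %h2;                        # released for reuse when this iteration ends
            # ... populate %h1 and %h2 from this file only ...
            my $res1 = my_func( \%h1 );    # pass a reference, as discussed below
            my $res2 = my_func( \%h2 );
            # do something with $res1 and $res2
        }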

    Also don't pass around actual data structures (as I do in the above example, to show bad practise), because then Perl has to make another copy of the data. Pass references to them, ie use:

    my $res = my_func( \%hash ); # not: my_func( %hash );

    To take that even further, consider off-loading the work of processing the files to another program, so that all system resources are released when the processing is done.
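    A minimal sketch of that idea (process_one_cdr.pl is a hypothetical worker script that writes a small per-file summary): run a separate process per file and let the operating system reclaim all of its memory when it exits:

        use strict;
        use warnings;

        my @cdr_list = @ARGV;    # the list of CDR files to process

        foreach my $file (@cdr_list) {
            system('perl', 'process_one_cdr.pl', $file) == 0
                or die "worker failed for $file: $?";
        }
        # afterwards, merge the small per-file summaries instead of holding everything in RAM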

    If the above tips seem random and maybe unrelated, it's because you haven't shared the code, so I'm just throwing out random and maybe unrelated tips.

    The way forward always starts with a minimal test.

      Sorry, I tested it on the local server; I can't run all my experiments on the main server. So forget about the starting memory usage and look at how the hash sizes grow. As you asked, I am posting the piece of code I am using to read the files.

      1) I am not passing any hashes to functions.
      2) I can't release the scope of those hashes because I need them to write the Excel output at the end of the code.

      foreach $file (@cdr_list) {
          chomp $file;
          my $memgthy = `free -m | awk 'NR==2{printf " %.2f%",\$3*100/\$2 }'`;
          print "Memory Usage : ",$memgthy,$/;
          open (FP,"$file") or die "Could not open $file\n";
          $first=1;
          while ($line=<FP>){
              chomp $line;
              if ($first==1){
                  ($sgsn_id,$x,$time,$x)=split(/\,/,$line);
                  push(@sgsn_list,$sgsn_id) unless $seen_sgs{$sgsn_id}++;
                  $cdr_date=substr($time,0,10);
                  $single_day=substr($time,0,8);
                  push (@date_list,$cdr_date) unless $seen_cdr{$cdr_date}++;
                  push (@single_day_list,$single_day) unless $seen{$single_day}++;
                  $first++;
              }
              else{
                  #FAIL_IN_RAU_MME_TO_2GSGSN,0;
                  $cmp=substr(reverse($line),0,1);
                  if ($cmp eq ";") {
                      ($var,$value,$x)=split(/\,/,$line);
                      push (@variable_list,$var) unless $seen_var{$var}++;
                      $ass_val{$var}=$value;
                      $pap_data{$cdr_date}{$sgsn_id}{$pap_id}=0;
                      if ($mcc ne "" && length($mcc)<4){
                          $mcc_data{$cdr_date}{$sgsn_id}{$mcc}=0;
                      }
                      if ($mnc ne ""){
                          $mnc_data{$cdr_date}{$sgsn_id}{$mcc}{$mnc}=0;
                      }
                      if ($rac ne ""){
                          $rac_data{$cdr_date}{$sgsn_id}{$mcc}{$mnc}{$rac}=0;
                      }
                      if ($nsei ne ""){
                          $nsei_data{$cdr_date}{$periodic_duration}{$sgsn_id}{$pap_id}{$nsei}=0;
                      }
                      if ($nsvci ne ""){
                          $nsvci_data{$cdr_date}{$periodic_duration}{$sgsn_id}{$pap_id}{$nsei}{$nsvci}=0;
                      }
                      foreach $var (@variable_list){
                          $value=$ass_val{$var};
                          if ($var eq "RTT_DUR_ATTACH_MIN" || $var eq "RTT_DUR_ATTACH_MAX"
                              || $var eq "PEAK_GB_PDP_Cont" || $var eq "PEAK_ATTACH_GB_USERS"
                              || $var eq "PEAK_ACTIVE_SUBS_PER_PAPU" || $var eq "PEAK_ACTIVE_GB_PDP_CONTEXTS"
                              || $var eq "DUR_MO_PDP_MOD_MIN" || $var eq "DUR_MO_PDP_MOD_MAX"
                              || $var eq "PEAK_ATTACH_IU_USERS" || $var eq "PEAK_ACTIVE_IU_PDP_CONTEXTS"
                              || $var eq "PEAK_IU_PDP_CONT" || $var eq "PEAK_LOAD_RATE_OF_OBJECT"
                              || $var eq "PEAK_GB_PDP_CONT"){
                              $sgsn_name{$single_day}{$sgsn_id}{$var}{$value}=0;
                              $sgsn_val{$cdr_date}{$sgsn_id}{$var}{$value}=0;
                              $pap_val{$cdr_date}{$sgsn_id}{$pap_id}{$var}{$value}=0;
                              $mcc_val{$cdr_date}{$sgsn_id}{$mcc}{$mnc}{$var}{$value}=0;
                              $rac_val{$cdr_date}{$sgsn_id}{$mcc}{$mnc}{$rac}{$var}{$value}=0;
                          }
                          else {
                              if ($mcc == 405){
                                  $circle_val{$single_day}{$mnc}{$var}=$value + $circle_val{$single_day}{$mnc}{$var};
                              }
                              $sgsn_name{$single_day}{$sgsn_id}{$var}=$value + $sgsn_name{$single_day}{$sgsn_id}{$var};
                              $sgsn_val{$cdr_date}{$sgsn_id}{$var}=$value + $sgsn_val{$cdr_date}{$sgsn_id}{$var};
                              $pap_val{$cdr_date}{$sgsn_id}{$pap_id}{$var}=$value + $pap_val{$cdr_date}{$sgsn_id}{$pap_id}{$var};
                              $mcc_val{$cdr_date}{$sgsn_id}{$mcc}{$mnc}{$var}=$value + $mcc_val{$cdr_date}{$sgsn_id}{$mcc}{$mnc}{$var};
                              $rac_val{$cdr_date}{$sgsn_id}{$mcc}{$mnc}{$rac}{$var}=$value + $rac_val{$cdr_date}{$sgsn_id}{$mcc}{$mnc}{$rac}{$var};
                              if ($var eq "IP_NSVC_PASSED_DATA_IN_BYTES"){
                                  $nsvc_val{$cdr_date}{$periodic_duration}{$sgsn_id}{$pap_id}{$nsei}{$nsvci}{$var}=$value + $nsvc_val{$cdr_date}{$periodic_duration}{$sgsn_id}{$pap_id}{$nsei}{$nsvci}{$var};
                              }
                          }
                      }
                      @variable_list=();
                      %seen_var=();
                      %ass_val=();
                      $pap_id="";
                      $mcc="";
                      $mnc="";
                      $rac="";
                      $nsei="";
                      $nsvci="";
                  }
                  else {
                      ($var,$value,$x)=split(/\,/,$line);
                      if ($var eq "PAPU_INDEX"){
                          $pap_id=$value;
                      }elsif ($var eq "MCC" || $var eq "IU_RA_MCC"){
                          $mcc=$value;
                      }elsif ($var eq "MNC" || $var eq "IU_RA_MNC"){
                          $mnc=$value;
                      }elsif ($var eq "RAC"){
                          $rac=$value;
                      }elsif ($var eq "NSEI"){
                          $nsei=$value;
                      }elsif ($var eq "NSVCI"){
                          $nsvci=$value;
                      }elsif ($var eq "PERIOD_DURATION"){
                          $periodic_duration=$value;
                          push (@periodic_list,$periodic_duration) unless $seen_dur{$periodic_duration}++;
                      }elsif ($var eq "OBJECT_NAME"){
                          $object_name=$value;
                          $object_data{$cdr_date}{$sgsn_id}{$object_name}=0;
                      }elsif ($var eq "OBJECT_INDEX"){
                          $object_index=$value;
                          $objectindex_data{$cdr_date}{$sgsn_id}{$object_name}{$object_index}=0;
                      }elsif ($var eq "PEAK_LOAD_RATE_OF_OBJECT"){
                          $peak_load_data{$cdr_date}{$sgsn_id}{$object_name}{$object_index}{$var}{$value}=0;
                      }elsif ($var eq "AVE_LOAD_RATE_SUM" || $var eq "AVE_LOAD_RATE_DEN"){
                          $peak_load_data{$cdr_date}{$sgsn_id}{$object_name}{$object_index}{$var}=
                              $peak_load_data{$cdr_date}{$sgsn_id}{$object_name}{$object_index}{$var} + $value;
                      } else {
                          push (@variable_list,$var) unless $seen_var{$var}++;
                          $ass_val{$var}=$value;
                      }
                  }
              }
          }
          close (FP);
          FP->flush();
      }
      my $total_size = total_size(\%val);
      print "\%val :",$total_size,$/;
      my $total_size = total_size(\%seen_cdr);
      print "\%seen_cdr :",$total_size,$/;
      my $total_size = total_size(\%seen_var);
      print "\%seen_var :",$total_size,$/;
      my $total_size = total_size(\%pap_data);
      print "\%pap_data :",$total_size,$/;
      my $total_size = total_size(\%mcc_data);
      print "%mcc_data :",$total_size,$/;
      my $total_size = total_size(\%mnc_data);
      print "\%mnc_data :",$total_size,$/;
      my $total_size = total_size(\%sgsn_val);
      print "\%sgsn_val :",$total_size,$/;
      my $total_size = total_size(\%sgsn_name);
      print "\%sgsn_name :",$total_size,$/;
      my $total_size = total_size(\%pap_val);
      print "\%pap_val :",$total_size,$/;
      my $total_size = total_size(\%mcc_val);
      print "\%mcc_val :",$total_size,$/;
      my $total_size = total_size(\%rac_val);
      print "\%rac_val :",$total_size,$/;
      my $total_size = total_size(\%nsei_data);
      print "\%nsei_data :",$total_size,$/;
      my $total_size = total_size(\%nsvci_data);
      print "\%nsvci_data :",$total_size,$/;
      my $total_size = total_size(\%nsvc_val);
      print "\%nsvc_val :",$total_size,$/;
      my $t1 = Benchmark->new;
      my $td = timediff($t1, $t0);
      print "the code took:",timestr($td),"\n";
