Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW
 
PerlMonks  

How to process with huge data's

by arivu198314 (Sexton)
on Oct 04, 2011 at 13:54 UTC ( [id://929550]=perlquestion: print w/replies, xml ) Need Help??

arivu198314 has asked for the wisdom of the Perl Monks concerning the following question:

I have written a script, but the input is huge so my script taking days to finish

Its a kind of network log report

The input lines exceeds 700000

Input: c 1 2 q 2 5 c 3 9 q 6 7 c 2 6 q 4 5 c 1 5 c 8 7 c 2 6 c 6 3 c 4 8 c 8 1 c 3 9 c 1 2 q 4 3 c 8 4

here 'c' means connection, 'q' means query. We need to cumulate all the connections and while reading query we need to answer the query, whether the systems are connected or not. Finally we need to display how many query answered and failed. The answer for the input is 1,3, (q 4 3 connected and remaining queries are failed)

my code is here

my $queryCount=0; my $answeredCount=0; my $connectionText; open(INP, $ARGV[0]); while(<INP>) { chomp($_); $inputtext=$_; if ($inputtext=~m/q (\d+) (\d+)/) { $answeredCount++ if ($connectionText=~m/(?:\b$1\b[^\|]*\b$2\b| +\b$2\b[^\|]*\b$1\b)/); $queryCount++; } elsif ($inputtext=~m/c (\d+) (\d+)/) { my $fnum=$1; my $snum=$2; if ($connectionText=~m/\b$fnum\b[^\|]*\b$snum\b/) {} elsif ($connectionText=~m/\b$fnum\b/ and $connectionText=~m/\b +$snum\b/) { $connectionText=~s/\b$fnum\b(.*?)\|([^\|]*\b$snum\b.*?)(\| +|$)/$fnum.','.$2.','.$1.$3/e; $connectionText=~s/\b$snum\b(.*?)\|([^\|]*\b$fnum\b.*?)(\| +|$)/$snum.','.$2.','.$1.$3/e; } elsif ($connectionText=~m/\b(?:$fnum|$snum)\b/) { $connectionText=~s/\b(?:$fnum|$snum)\b(.*?)\|/$fnum.','.$s +num.','.$1.'|'/e; } else { $connectionText.=$fnum.",".$snum."|"; } $connectionText=~s/,,/,/g; $connectionText=~s/,\|/\|/g; } } close (INP); print "$answeredCount,".($queryCount-$answeredCount)."\n";

Replies are listed 'Best First'.
Re: How to process with huge data's
by suaveant (Parson) on Oct 04, 2011 at 14:48 UTC
    Well, you are using regular expressions heavily, which is a sure way to slow things down in the long haul, especially if they aren't designed properly. The REAL problem I think however, is the way you maintain the list of connections. You just keep appending to the $connectionText string and then doing multiple lookups/replaces on it. As the string gets longer and longer your regexps will get slower and slower. 700,000 short lines should not even really take hours to process unless on a real slow machine.

    My first inclination would be to use a hash to store the connection data, though if you run into memory issues you may have to move to berkeleydb or sqlite or some other lookup tool that uses disk storage.

    I ran the following perl -e 'for(1..700_000){$f{$_} = [1,2]} print "Done\n";sleep 10;' and only saw 170MB of RAM usage, so I'd guess you'd be fine using a hash.

    Whatever you do, rethink your regexp use, you are using a sledgehammer to put in a screw.

                    - Ant
                    - Some of my best work - (1 2 3)

Re: How to process with huge data's
by TomDLux (Vicar) on Oct 04, 2011 at 17:44 UTC

    700,000 lines is a small input file ...OK, a medium sized input file, especially when each line is 6 characters. It should take a few seconds to read and process.

    Using the default variable $_ is great when you have a few lines of code, but when you have a real program, I find it better to use real variables all the time. For one thing, when you extend the code, you may introduce inner or outer loops which re-assign $_, so you are forced to use real variables anyway. And in your case, you explicity specify the variable you are chomping, so there's no saving. And since you assign $_ to $inputtext, why not use that variable right from the while loopo:

    while ( my $input_text = <INP> ) { ....

    Regular expressions are great, when you need them, but they're not always the best solution. I would split the input when I receive it into it's separate components:

    while ( my $line = <INP> ) { chomp $input; my ( $cmd, $fnum, $snum ) = split ' ', $line; if ( $cmd eq 'q' ) { $answered++ if contains_pair( $connectionText, $fnum, $snum ) or contains_pair( $connectionText, $snum, $fnum ); $queries++; } ...

    Encapsulating a search into a routine is better than having it inline, because the name or the routine provides some documentation. As well, splitting a complicated search into a pair of simpler searches is much easier to understand, and can be faster, too.

    I think what you're trying to do is record pairs of numbers$fnum,$snum, and separate the pairs by vertical pipes. This is part of what is slowing your program down. For each line of input, all 700,000 of them, each regular expression searches linearly through the connectionText string... making your program complexity N^2. Instead of storing it as a string, you should use an array of pairs, or a hash of pairs, or a two layer hash.

    But if you WERE going to stick with a string, you can simplify the last three lines by checking for blank components on the simple pair, rather than checking the whole string ... after all, you know after each pass that the connectionText has no half-pairs, so you don't need to check the whole string, only the new addition:

    # instead of ... else { $connectionText.=$fnum.",".$snum."|"; } $connectionText=~s/,,/,/g; $connectionText=~s/,\|/\|/g; # how about ... else { my $pair = join ',', grep { defined } ( $fnum, $snum ); $connectiontext .= $pair . '|'; }

    Given a list, ($fnum, $snum), grep for the defined components, discarding the rest. (This is a place where default variables ARE useful!) join() the resulting list with commas ... that is place commas BETWEEN components. If there is only one component, there's no comma, so no mess to clean up. But in any case, have you encountered this situation, or is it merely fear that makes you handle missing parts. because, you know, you assign $fnum and $snum from regular expressions that match numerals. If there is no numberal, there would be no match, so else() test would fail, and you'd never enter the else block at all.

    I don't think your "trailing pipe or end of string" search (\||$) will ever find anything but a trailing pipe, since you add it with each addition, and preserve it with the changes. So why not simply check for and leave in place the trailing pipe?

    But it would be a WHOLE lot simpler to use data structures instead of strings. If you use an array of pairs, you can grep through the list of pairs to search for those that match:

    push @allPairs, grep { defined } ( $fnum, $snum ); # and at another time .. # find half-pairs, which I think is impossible my @halves = grep { scalar @$_ == 1 && ( $_->[0] == $fnum || $_->[0] = += $snum ) } @allpairs; # find true pairs with a leading fnum my @matches = grep ( scalar @$_ == 2 && $_->[0] == $fnum } @allpairs # or a trailing fnum my @matches = grep ( scalar @$_ == 2 && $_->[1] == $fnum } @allpairs

    Searching the entire array for each line is still an N^2 algorithmn, but at least cleaner than searching the string character by character. Using hashes would definitely be faster.

    It seems you're adding things to what I called 'pairs', to form growning lists. So maybe you want a hash of hashes, or a hash that uses $fnum as the key, and has a n array of $snum as the associated value. If you need to be able to find the sets that include a particular $snum, you could also have similar reverse data structure, keyed on $snum with an array of $fnum that use that $snum. Lots of different ways to generate data structures, depending on your needs, and all of them are better than a long string.

    As Occam said: Entia non sunt multiplicanda praeter necessitatem.

      Thanks for your reply

      After consulting all the replies, I have moved into array in hashes

      Please find my code below, but still i'm struggling lot since it is taking long time to finish this task.

      #!/usr/bin/perl use Data::Dumper; my $queryCount=0; my $answeredCount=0; my @connectionValues; open(INP, $ARGV[0]); while(my $inputtext=<INP>) { chomp($inputtext); my ( $cmd, $fnum, $snum ) = split ' ', $inputtext; if ($cmd eq 'q') { for( my $arrVal=0; $arrVal < @connectionValues; $arrVal++) { if (exists $connectionValues[$arrVal]{$fnum} and exists $c +onnectionValues[$arrVal]{$snum}) { $answeredCount++; } } $queryCount++; } elsif ($cmd eq 'c') { if ($fnum ~~ @connectionValues or $snum ~~ @connectionValues) { my $fHashVal; my $sHashVal; my $valid=0; for( my $arrVal=0; $arrVal < @connectionValues; $arrVal++) { next if (!$connectionValues[$arrVal]); if (exists $connectionValues[$arrVal]{$fnum} and not e +xists $connectionValues[$arrVal]{$snum}) { if ($valid) { $connectionValues[$arrVal]={%{$connectionValue +s[$arrVal]}, %{$connectionValues[$sHashVal]}}; delete $connectionValues[$sHashVal]; $sHashVal=0; } else { $connectionValues[$arrVal]{$snum}=1; $fHashVal=$arrVal; $valid=1; } } elsif (exists $connectionValues[$arrVal]{$snum} and no +t exists $connectionValues[$arrVal]{$fnum}) { if ($valid) { $connectionValues[$arrVal]={%{$connectionValue +s[$arrVal]}, %{$connectionValues[$fHashVal]}}; delete $connectionValues[$fHashVal]; $fHashVal=0; } else { $connectionValues[$arrVal]{$fnum}=1; $sHashVal=$arrVal; $valid=1; } } } } else { push(@connectionValues, {$fnum=>1, $snum=>1}); } } } close (INP); print "$answeredCount,".($queryCount-$answeredCount)."\n";

        Let's start by generating specifications for what you're trying to achieve.

        First start with a short, one-line summary of the project. I'll make one up ... Summarize activity of users at my web site.

        Then generate a description, in English, of what you want to achieve. I'll make stuff up, guessing what i think you might be thinking

        • Read the data file, line by line, specifying the edges of a graph.
          • Split the line into a command, the 'from' node, and the 'to' node.
        • For a connection 'command', if the 'from' node and 'to' node have not been seen, add them.
          • If the 'from' node has been seen, but the 'to' has not, then ... do something
          • if the 'from' node has not been seen, but the 'to' node has, then ... do something

        Once you have an idea of what you want to achieve, then you can consider how to implement it.

        At the moment, if you are asked to "connect 1 2", you create a hash { 1 => 1, 2 => 1}. How do you know whether that is "connect 1 2" or "connect 2 1"?

        If this is homework, you should ask your teacher or TA.

        As Occam said: Entia non sunt multiplicanda praeter necessitatem.

Re: How to process with huge data's
by Anonymous Monk on Oct 04, 2011 at 14:48 UTC

    I don't see major faults in your script, but that much data should easily be calculated within a few minutes, if not seconds. However, here are a few problems.

    Drop the $inputtext = $_ assignment, and change your top-level regexps to have an anchor in the beginning:

    if (m/^q (\d+) (\d+)/)

    Second, the m/\b$fnum\b[^\|]*\b$snum\b/ regexps are forcing the regexp engine to recompile that on almost every run (as the variables change). This is highly inefficient -- the whole if/elsif/else chain should be parsed differently. I'm not sure what the purpose of that block is, but at the very least you can change some of $fnum/$snum into \d+

      Er, one major fault after all: You seem to be building a huge string called $connectionText, and running loads of regexps on it. You should look into replacing it -- a hash might be suitable.

Re: How to process with huge data's
by i5513 (Pilgrim) on Oct 04, 2011 at 15:45 UTC
    Hi,

    I really don't know what are your nodes, and when they disconnect from others

    With your example data, you can simplify to:

    use strict; use warnings; my $queryCount=0; my $answeredCount=0; my %connected=(); while(<>) { print; chomp($_); if (/q (\d+) (\d+)/) { $answeredCount++ if ( defined $connected{$1} and $connected{$2 +}); $queryCount++; } elsif (/c (\d+) (\d+)/) { $connected{$1}=1; $connected{$2}=1; } } print "$answeredCount,".($queryCount-$answeredCount)."\n";

    I suspect you should use a hash more complex may be:

    $connected{firstSet}{$n1}=1; # or better $connected{1}{x}=1; $connected{secondSet}{$n1}=1; # or better $connected{2}{x}=1;

    And detect when you have to delete one node from connected in the correspondient set (delete $connected{x}{N})

    I hope that help, be sure to read perldata

    Regards,

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://929550]
Approved by marto
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others goofing around in the Monastery: (6)
As of 2024-03-28 22:54 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found