http://qs321.pair.com?node_id=146249

PyroX has asked for the wisdom of the Perl Monks concerning the following question:

Hello All. Here is the problem: I want to open a big giant file, read all lines that contain the text fwa into an array, then take that array and find out how many duplicates are in it, listing the duplicates and their counts. Pretty simple, but then again...
#!/usr/bin/perl
system("clear");
open(FILE,"/var/log/everything");
my $i=0;
while($input=<FILE>){
    if($input=~/fwa/i){
        $i++;
        @parsed[$i]=$input;
    }
}
print "\n $i entities";
foreach my $test (@parsed){
    my $data="";
    my $x="";
    my $t="";
    foreach my $final (@parsed){
        if($final eq $test){
            $x++;
            my $data='valid_time';
        }
        $t++;
    }
    if($data eq "valid_time"){
        print "$x $test";
    }
}
So you can see, I load the file and then go through two foreach loops: the first gets a line, the second compares that line to each of the other lines, returns true if it is found again, and outputs the count. This does not work at all, and the RAM usage is 23MB, which is not good. Can anyone think of a better way? Is there a simple array function to do this? Can I sort the array somehow first? Thanks in advance.

Replies are listed 'Best First'.
Re: File Search And Compare
by particle (Vicar) on Feb 18, 2002 at 21:46 UTC
    here's some well-behaved code that should do what you want. notice use strict;, -w, and error checking. notice the use of a hash (anytime you think unique, think hash.) also note, close filehandles when you're done with them.

    by the way, i've tested this, and it works for me.

    oh, and you should use the reply link to the right of the post, to make sure the author you're replying to sees the response.

    #!/usr/bin/perl -w
    use strict;
    $|++;

    use FileHandle;
    my $FILE = new FileHandle;

    # three-argument open, with error handling
    open($FILE,"<","/var/log/everything")
        or die "ERROR: can't open file! $!";

    # create variables:
    #   $pattern - pattern for regular expression
    #   %parsed  - hash, keys are lines containing pattern,
    #              values are counter of times seen
    #   $i       - counter of total lines matching pattern
    my $pattern = 'fwa';
    my (%parsed, $i);

    while(<$FILE>)
    {                        # read line from filehandle, assign to special variable $_
        if( /$pattern/i      # search $_ for 'fwa' (case-insensitive)
            && $i++ )        # and increment counter ($i) if found
        {
            chomp $_;            # remove newline
            $parsed{ $_ }++;     # use line as hash key, increment times seen
        }
    }
    close($FILE);

    # print output: total entities, and sorted number of each
    print "\n $i entities\n";
    print "$_ x $parsed{$_}\n" for sort keys %parsed;

    ~Particle

Re: File Search And Compare
by impossiblerobot (Deacon) on Feb 18, 2002 at 21:26 UTC
    I think this node by Ovid should help; it's an answer to pretty much the same question.

    I found it using Super Search.

    Update: PyroX, I had hoped you would use Ovid's code as example of how to do what you were trying to do (not just plug his code into yours without understanding why it worked).

    Unfortunately, it looks like you got confused by his use of Perl's built-in DATA filehandle (which is often used in demonstration versions of programs on this site).

    The sample code looks like this:
    # open LOG, "< $log" or die "Can't open $log: $!";
    while (<DATA>){
        push (@data, $_) if $_ =~ /$ip/;
    }
    # close LOG;
    Ovid has commented out the lines that open the external file to be read, and is instead reading from the __DATA__ section that appears at the bottom of the same file. To make this work with an external file, you would uncomment those lines and change the filehandle name in the input operator (<>), as follows:
    open LOG, "< $log" or die "Can't open $log: $!";   # Uncommented
    while (<LOG>){                                     # Changed filehandle name
        push (@data, $_) if $_ =~ /$ip/;
    }
    close LOG;                                         # Uncommented
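    As a self-contained illustration of the __DATA__ idiom (the log lines below are invented for the example, and the pattern is the fwa text from the original question rather than Ovid's $ip):

```perl
#!/usr/bin/perl -w
use strict;

# The DATA filehandle reads whatever follows the __DATA__ token
# at the bottom of this same file -- no external file needed.
my @data;
while (<DATA>) {
    push @data, $_ if /fwa/i;
}
print scalar(@data), " matching lines\n";   # prints "2 matching lines"

__DATA__
Feb 18 00:12:14 host fwa100 up
Feb 18 00:12:15 host unrelated event
Feb 18 00:12:16 host FWA100 up
```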
    I hope this makes things more clear. :-)

    Impossible Robot
Re: File Search And Compare
by Kozz (Friar) on Feb 18, 2002 at 21:31 UTC
    PyroX:
    I think that what you're really looking for is a hash-based solution. If you were to tie() a hash using the DB_File module (if indeed this is a real MONSTER of a file), this would use the disk rather than memory (correct me if I'm wrong, most wise monks!).

    Perhaps something like
    open(FILE, "< /var/log/everything")
        or die "Could not read file: $!";
    while($input=<FILE>){
        if($input =~ /fwa/i){
            $tied_hash_ref->{ lc($input) }++;
        }
    }
    Notice that I've used lc() to lower-case the text in the line. Otherwise the hash would contain separate values for "fwa100" vs "FWA100" vs "fWa100". Remove this if you desire to keep them separate.

    You could then iterate over this tied hash, printing the key/value pairs. Though to be honest, I've not had a great need for DB_File much, and would welcome other monks to contribute usage examples. ;)
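    In that spirit, a minimal sketch of the tie() idea; this is untested against a real monster log, and the sample lines and the /tmp/fwa_counts.db path are invented for the example:

```perl
#!/usr/bin/perl -w
use strict;
use Fcntl;      # for O_RDWR, O_CREAT
use DB_File;    # ties a hash to an on-disk Berkeley DB file

# Invented sample lines; in real use these would come from the log.
my @lines = (
    "Feb 18 fwa100 up\n",
    "Feb 18 unrelated event\n",
    "Feb 18 FWA100 up\n",
);

# The hash data lives in the file, not in RAM, so a huge log
# won't blow up memory the way an in-memory array does.
my %counts;
my $db = '/tmp/fwa_counts.db';    # hypothetical path
tie %counts, 'DB_File', $db, O_RDWR|O_CREAT, 0644, $DB_HASH
    or die "Cannot tie $db: $!";

for my $line (@lines) {
    next unless $line =~ /fwa/i;
    chomp(my $key = lc $line);    # lower-case so fwa100 and FWA100 collapse
    $counts{$key}++;
}

print "$counts{$_} $_\n" for sort keys %counts;   # prints "2 feb 18 fwa100 up"
untie %counts;
```

    A plain in-memory hash works identically and is simpler if the data fits in RAM; the tie only changes where the hash lives.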
Re: File Search And Compare
by PyroX (Pilgrim) on Feb 18, 2002 at 21:42 UTC

    One Item I have changed, which worked a tiny bit better, but is still too craptastic to ever be used:

    #!/usr/bin/perl
    system("clear");
    open(FILE,"/var/log/everything");
    my $i=0;
    while($input=<FILE>){
        if($input=~/fwa/i){
            $i++;
            @parsed[$i]=$input;
        }
    }
    print "\n $i entities";
    foreach my $test (@parsed){
        my $data="";
        my $x="";
        my $t="";
        foreach my $final (@parsed){
            if($final eq $test){
                $x++;
                my $data='valid_time';
            }
            $t++;
        }
        if($x>1){
            print "$x $test";
        }
    }

    I changed the output test; everything was being returned because every line exists at least once (as itself).
Re: File Search And Compare
by zengargoyle (Deacon) on Feb 19, 2002 at 06:51 UTC
    If you're lucky your timestamps are fixed width and you can use substr.
    my $ts_fmt = 'MMM DD HH:MM:SS ';
    my $line   = 'Feb 18 00:12:14 foo bar bat baz fwa';
    my $time   = substr($line, 0, length $ts_fmt, '');
    chop $time;   # pesky space..

    # $time is the time part.
    # $line holds the part after the time is removed.
    If you're searching for a fixed string, index() might be faster than a regex.
    my $match = 'fwa';
    if ( -1 != index($line, $match) ) {
        # matched!
        $seen_lines{$line}++;
    }
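    If raw speed matters, the core Benchmark module can compare the two approaches on your actual data; a sketch using the sample line above:

```perl
#!/usr/bin/perl -w
use strict;
use Benchmark qw(cmpthese);

my $line  = 'Feb 18 00:12:14 foo bar bat baz fwa';
my $match = 'fwa';

# Run each sub for about one CPU-second apiece and
# print a table comparing their rates.
cmpthese( -1, {
    'index' => sub { my $hit = ( -1 != index($line, $match) ) },
    'regex' => sub { my $hit = ( $line =~ /fwa/ ) },
});
```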
Re: File Search And Compare
by PyroX (Pilgrim) on Feb 18, 2002 at 22:18 UTC
    OK!

    But now I have a new problem. I think that may work, but there is a timestamp in each line, so I need to split before processing the line. The split needs to be a regular-expression split on ':' followed by any number 01-60 and a space, together:

    ':34 ' or ':21 ' or ':57 ' would all work; this is the seconds in the timestamp, of course. That should leave us with an array with [0] (the trash) and [1] (the goodies).

    I tried inserting something like:
    # create variables:
    #   $pattern - pattern for regular expression
    #   %parsed  - hash, keys are lines containing pattern,
    #              values are counter of times seen
    #   $i       - counter of total lines matching pattern
    my $pattern = 'fwa';
    my (%parsed, $i);
    while(<$FILE>){
        # read line from filehandle, assign to special variable $_
        if(/$pattern/i && $i++)
        {
            chomp $_;                     # remove newline
            @new=split(/:[01-60] /,$_);
            $_=$new[1];
            $parsed{ $_ }++;              # use line as hash ect ect ect.......


    But that didn't work, and I am unsure of both the regex and how it ties in with your code. Any more help would be much appreciated.
      you don't want to use split like that, it won't do what you want. can you include at least one line of input data? it's rather hard to debug this sort of error without it. you should probably use a regular expression, but i can't say without sample data.

      ~Particle
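      For what it's worth: inside a pattern, [01-60] is a character class that matches the single digits 0, 1, and 6, not the numbers 01 through 60; two-digit seconds are conventionally matched with [0-5][0-9]. A sketch against an invented syslog-style line:

```perl
#!/usr/bin/perl -w
use strict;

# Invented syslog-style line for demonstration.
my $line = 'Feb 18 00:12:34 firewall fwa100 accepted packet';

# [0-5][0-9] matches two-digit seconds 00-59; the limit of 2
# keeps any later colons in the line from producing extra splits.
my ($trash, $goodies) = split /:[0-5][0-9] /, $line, 2;
print "$goodies\n";   # prints "firewall fwa100 accepted packet"
```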

Re: File Search And Compare
by PyroX (Pilgrim) on Feb 18, 2002 at 21:37 UTC
    Kozz:

    I am interested in more info on your idea, will look, but know anything more? The file is a huge 220,000 lines per day, so by the end of the week it would be gargantuan. I think I am going to do a daily rotate though. Keep em coming guys.
Re: File Search And Compare
by PyroX (Pilgrim) on Feb 19, 2002 at 18:14 UTC
    Thanks Everyone, here is the final product, which seems to be working very well so far.

    #!/usr/bin/perl -w
    use strict;
    $|++;

    use FileHandle;
    my $FILE = new FileHandle;

    open($FILE,"<","/var/log/$ARGV[0]")
        or die "ERROR: can't open file! $!";

    my $pattern = 'fw1';
    my (%parsed, $i);

    while(<$FILE>){
        if(/$pattern/i && $i++){
            my $z=0;
            my $out="";
            chomp $_;
            my @new=split(/[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/,$_);
            foreach my $piece (@new){
                $z++;
                chomp $piece;
                $out.="$piece";
            }
            @new=split(/service/,$out);
            $out=$new[0];
            print "$out\n";
        }
    }
    close($FILE);


    Thanks Again!
      Except one small oops: $i++ returns 0 on the first post-increment, so the if (/pat/ && ... test fails the first time a line matches.

      Update: My oops! That seems to be what you want, or something like it. Your original seemed to look for duplicate _full_ lines. But nevermind...

        p
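      A tiny demonstration of the post-increment gotcha:

```perl
#!/usr/bin/perl -w
use strict;

my $i = 0;
my $first  = $i++ ? "true" : "false";   # post-increment yields the old value, 0; then $i becomes 1
my $second = $i++ ? "true" : "false";   # now it yields 1
print "first: $first, second: $second\n";   # prints "first: false, second: true"
# Pre-increment (++$i) yields the new value, so the first test would be true.
```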
Re: File Search And Compare
by PyroX (Pilgrim) on Feb 22, 2002 at 02:42 UTC
    Yea, I should make note of that change: this gets the lines with the text, and I pipe the output to 'uniq -cid' to tell me the count of similar lines.
Re: File Search And Compare
by PyroX (Pilgrim) on Feb 18, 2002 at 21:33 UTC
    If only that worked:
    Name "main::DATA" used only once: possible typo at ./pix2 line 14. readline() on closed filehandle main::DATA at ./pix2 line 14.