http://qs321.pair.com?node_id=1089262

biboshakan has asked for the wisdom of the Perl Monks concerning the following question:

What's up, PerlMonks, hope all is good. I'm pretty new to Perl and I've been assigned to write something for text processing. Basically, my inputs are two text files: one contains 3 million numbers, the other 3000. We are interested in the first 8 digits of each number in the 3-million-number file, and we want to see how many devices match the 8-digit numbers in the 3000-number file. The output should be the top 100 devices, since taking only the first 8 digits produces lots of repetition. Hope someone can help me and maybe even give me some code. Thanks a lot :) ....

Thanks for the help, guys. I started doing it and it kind of worked, except when I use strict, and I can't figure out why. Could you help me out? This is the code I have so far:

open (IMEI, 'IMEI.txt');
open (TAC, 'tac.txt');
%mapToModel =();
%keyCount = ();
while (<TAC>){
    $key = substr $_,0,8;
    $model = substr $_,9;
    $keyCount{$key} = 0;
    $mapToModel{$key} = $model;
}
close (TAC);
while (<IMEI>){
    $subs = substr $_, 0, 8;
    if(exists $keyCount{$subs}){
        $keyCount{$subs} = $keyCount{$subs}+1;
    }
}
close (IMEI);
foreach $key (keys %keyCount){
    if (exists $mapToModel{$key}){
        $model = $mapToModel{$key};
        $count = $keyCount{$key};
        if( $count !=0){
            print "$count\t$model";
        }
    }
}

Replies are listed 'Best First'.
Re: Perl text processing
by toolic (Bishop) on Jun 09, 2014 at 12:45 UTC
    • Read perlintro.
    • Write some code.
    • When you encounter problems, post your code and ask specific questions.
Re: Perl text processing
by davido (Cardinal) on Jun 09, 2014 at 15:47 UTC

    Open the 2nd file (the 3000 number file). Read the first eight digits of each entry into a hash, where the eight-digit number is a hash key.

    Open the 3000000 number file. For each entry in that file, take the first eight digits, and check for the existence of a matching hash key. If one exists, increment the value indexed by that key.

    Sort the keys by their corresponding value.

    Print the first 100 entries from the sorted list of hash keys.

    Start with this:

    #!/usr/bin/perl
    # ...or whatever shebang line your system requires
    use strict;
    use warnings;
    # Follow the steps outlined above...

    perlintro will contain everything you need for this. It takes about a half hour to read through it. After having read perlintro, and after getting started, you will probably come up with more specific questions. When asked along with code to illustrate your sticking points, we'll be able to help you along.
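
    The four steps above can be sketched roughly as follows. This is only an illustration, not Dave's code: the helper name top_prefixes and the sample numbers are made up, and small in-memory lists stand in for the two files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): count how often each known
# 8-digit prefix occurs and return the $limit most frequent ones as
# [prefix, count] pairs.
sub top_prefixes {
    my ($known, $numbers, $limit) = @_;

    # Step 1: one hash key per known prefix, every count starting at zero.
    my %count;
    $count{ substr($_, 0, 8) } = 0 for @$known;

    # Step 2: increment only when the prefix already exists as a key.
    for my $number (@$numbers) {
        my $prefix = substr($number, 0, 8);
        $count{$prefix}++ if exists $count{$prefix};
    }

    # Step 3: sort keys by value, highest count first (ties by prefix).
    my @sorted = sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;

    # Step 4: keep at most $limit entries.
    splice @sorted, $limit if @sorted > $limit;
    return map { [ $_, $count{$_} ] } @sorted;
}

# Made-up sample data standing in for the two input files.
my @known   = qw(35209900 01326300 86723401);
my @numbers = qw(352099001234567 013263009876543 352099007654321 999999990000000);

printf "%s\t%d\n", @$_ for top_prefixes(\@known, \@numbers, 2);
```

    For the real task the two lists would be read from the files line by line, and the limit would be 100.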


    Dave

Re: Perl text processing
by wjw (Priest) on Jun 09, 2014 at 15:31 UTC

    Start with this:

    #!/usr/bin/perl
    use strict;
    use warnings;
    # pseudo code from here on
    # open the file with categories
    # read the categories in (probably to a hash where the key is the category and the value 0)
    # close the categories file (you have your hash, you don't need to read from the file anymore)
    # open the file with 3000k entries
    # read the file line by line
    # for each line read
    #     trim to the first 8 characters
    #     look for that value in the hash keys
    #     increment the value of the hash key that is matched if any
    # you now have a hash with category as the key, and the number found as the value
    # you should be able to figure out how to find the top 100 values and print out the key and value for each of them or store them to file
    You can do this assignment with nothing but the basic Perl functionality.

    Run the program using the -d (debugger) and learn to use that tool to examine and learn what those hashes and any other variables look like. It is a quick tool to learn to use if just doing basic examining for self-enlightenment.

    PerlDoc is your friend. Tutorials like perldsc, perlop, perlfunc will all help you solve this pretty quickly, including example code much of the time.

    Hope you find this helpful...

    Update:

    Note that by using a hash, you eliminate the possibility of there being duplicate categories, simplifying and possibly making the effort more efficient.

    Restated the increment step for clarity (I hope)
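
    The point about duplicate categories can be seen in a tiny example. The category values below are invented for illustration; assigning to the same hash key twice simply overwrites the earlier entry.

```perl
use strict;
use warnings;

# Made-up category list containing one repeated entry.
my @categories = qw(35209900 01326300 35209900);

my %count;
$count{$_} = 0 for @categories;    # the repeated key overwrites itself

# keys in scalar context returns the number of distinct keys.
my $unique = scalar keys %count;
print "$unique unique categories out of ", scalar @categories, "\n";
```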

    ...the majority is always wrong, and always the last to know about it...

    Insanity: Doing the same thing over and over again and expecting different results...

    A solution is nothing more than a clearly stated problem...otherwise, the problem is not a problem, it is a fact

Re: Perl text processing
by Bloodnok (Vicar) on Jun 09, 2014 at 12:48 UTC
    Further to toolic's suggestions, at the very least you could, and probably should, provide some sample data.

    A user level that continues to overstate my experience :-))
Re: Perl text processing
by neilwatson (Priest) on Jun 09, 2014 at 12:47 UTC
    Happy to help, but you have to show some work. How far did you get on your own? What particular problem are you having?

    Neil Watson
    watson-wilson.ca

Re: Perl text processing
by vinoth.ree (Monsignor) on Jun 09, 2014 at 13:19 UTC
    Hi biboshakan first of all welcome to PerlMonks!

    i've been assigned to write something for text processing.

    This is your task. First try it yourself; if you struggle, come back with the code you tried and we will be able to help you better.


    All is well

      All good, it's working now after I've initialized the values :) the output is all the phone models in Uganda with the number of users having these phones. However, there are duplicates of models, with different values. Can you think of a way to add those values and have a unique model with all the values added up?

      #!/usr/bin/perl
      use strict;
      use warnings;
      use diagnostics;
      open (IMEI, 'IMEI.txt');
      open (TAC, 'tac.txt');
      my %mapToModel;
      my %keyCount;
      my $key;
      my $model;
      my $subs;
      my $keyCount;
      my $mapToModel;
      my $count;
      my %dictionary;
      while (<TAC>){
          $key = substr $_,0,8;
          $model = substr $_,9;
          $keyCount{$key} = 0;
          $mapToModel{$key} = $model;
      }
      close (TAC);
      while (<IMEI>){
          $subs = substr $_, 0, 8;
          if(exists $keyCount{$subs}){
              $keyCount{$subs} = $keyCount{$subs} +1;
          }
      }
      close (IMEI);
      foreach my $key (keys %keyCount){
          if (exists $mapToModel{$key}){
              $model = $mapToModel{$key};
              $count = $keyCount{$key};
              foreach my$model (keys )
              if( $count !=0){
                  print "$count\t$model";
              }
          }
      }
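
      One possible way to merge the duplicate models is a second hash keyed by model name that accumulates the per-prefix counts. The sketch below reuses the names %mapToModel and %keyCount from the post; %modelCount and the data are made up. It also assumes the model names were chomped when read, since a trailing newline or stray whitespace would keep otherwise identical names apart.

```perl
use strict;
use warnings;

# Made-up data: two prefixes that map to the same model name.
my %mapToModel = (
    '35209900' => 'Model A',
    '35209901' => 'Model A',
    '01326300' => 'Model B',
);
my %keyCount = (
    '35209900' => 5,
    '35209901' => 7,
    '01326300' => 3,
);

# Accumulate the per-prefix counts under the model name.
my %modelCount;
for my $key (keys %keyCount) {
    next unless exists $mapToModel{$key};
    $modelCount{ $mapToModel{$key} } += $keyCount{$key};
}

# Print each model once, highest total first, skipping zero counts.
for my $model (sort { $modelCount{$b} <=> $modelCount{$a} or $a cmp $b }
               keys %modelCount) {
    next if $modelCount{$model} == 0;
    print "$modelCount{$model}\t$model\n";
}
```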
Re: Perl text processing
by perlfan (Vicar) on Jun 09, 2014 at 14:34 UTC
    From an efficiency point of view, it would be better to encode your 3000 categories into a trie and then run your 3 million entries over it. It then becomes a matter of traversing the trie 3 million times rather than making 3,000,000 x 3,000 compares. Sorry, I have no code to give.
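
    A trie of this kind could be built from nested hashes, one level per digit. The sketch below is only an illustration: the helper names trie_insert, trie_count and trie_get, the 'count' marker key, and the sample numbers are all invented here. Note that a plain hash keyed by the 8-digit prefix already needs just one lookup per entry, so a trie would mainly pay off if the prefixes had varying lengths.

```perl
use strict;
use warnings;

# Insert a digit string into a trie of nested hashes. The key 'count'
# marks a complete prefix and holds its counter (it cannot clash with
# the single-digit keys).
sub trie_insert {
    my ($trie, $digits) = @_;
    my $node = $trie;
    for my $digit (split //, $digits) {
        $node->{$digit} //= {};
        $node = $node->{$digit};
    }
    $node->{count} = 0;
}

# Walk the trie along the leading digits of $number and bump the counter
# of the first complete prefix reached. Returns 1 on a match, else 0.
sub trie_count {
    my ($trie, $number) = @_;
    my $node = $trie;
    for my $digit (split //, $number) {
        $node = $node->{$digit} or return 0;
        if (exists $node->{count}) {
            $node->{count}++;
            return 1;
        }
    }
    return 0;
}

# Look up the counter stored for a complete prefix (undef if absent).
sub trie_get {
    my ($trie, $digits) = @_;
    my $node = $trie;
    for my $digit (split //, $digits) {
        $node = $node->{$digit} or return;
    }
    return $node->{count};
}

# Made-up sample data.
my %trie;
trie_insert(\%trie, $_) for qw(35209900 01326300);
trie_count(\%trie, $_)  for qw(352099001234567 352099007654321 999999990000000);
print "$_\t", trie_get(\%trie, $_), "\n" for qw(35209900 01326300);
```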
Re: Perl text processing
by Utilitarian (Vicar) on Jun 09, 2014 at 15:17 UTC
    Let's call the file with the interesting categories imei_tags.txt and the file with 3 million devices imeis.txt then something like the following would do the job...
    WARNING: untested, off-the-top-of-my-head code
    #! /usr/bin/perl
    use strict;
    use warnings;
    open (my $imeitag , '<', '/path/to/imei_tags.txt');
    my %devices;
    while(<$imeitag>){
        $devices{$1}=0 if /^\s*(\d{8}$/;
    }
    close $imeitag;
    open (my $imei, '<', '/path/to/imeis.txt');
    while(<$imei>){
        my $imeitag=$1 if /^\s*(\d{8}\d+\s*$/
        $devices{$imeitag}++ if defined $devices{$imeitag};
    }
    close $imei;
    my $count=0;
    for my $imeitag (sort {$devices{$a}<=>$devices{$b}} keys %devices){
        print "$imeitag\n";
        $count++;
        last if $count >=100;
    }
    print "Good ",qw(night morning afternoon evening)[(localtime)[2]/6]," fellow monks."

      Some untested thoughts:

      open (...);

      The return status of the open statements is not checked, and "use autodie;" is not used as an alternative.

      $devices{$1}=0 if /^\s*(\d{8}$/;

      The capture group (\d{8}$ is not closed.

      my $imeitag=$1 if /^\s*(\d{8}\d+\s*$/

      The capture group (\d{8}\d+\s*$ is not closed; the statement is not terminated (missing semicolon); and "my $imeitag ... if ... ;" creates a lexical conditionally (the pre-state static variable hack).

      for my $imeitag (sort {$devices{$a}<=>$devices{$b}} keys %devices){
          print "$imeitag\n";
          $count++;
          last if $count >=100;
      }

      Sorts keys of hash in ascending numerical order, but then prints first 100 keys, which does not seem in accord with requirement to "output ... the top 100 devices" (whatever "top" may exactly mean in the context of the OP).
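
      A sketch that addresses the points above might look like this. It is not Utilitarian's code: the helper name top_tags and the sample data are made up, in-memory filehandles stand in for the two files, and "top" is assumed to mean highest count.

```perl
use strict;
use warnings;

# Hypothetical helper: reads tags and IMEIs from two open filehandles and
# returns up to $limit [tag, count] pairs, highest count first.
sub top_tags {
    my ($tag_fh, $imei_fh, $limit) = @_;

    my %devices;
    while (my $line = <$tag_fh>) {
        $devices{$1} = 0 if $line =~ /^\s*(\d{8})$/;    # capture group closed
    }

    while (my $line = <$imei_fh>) {
        next unless $line =~ /^\s*(\d{8})\d+\s*$/;      # skip non-matching lines
        my $tag = $1;                                   # declared unconditionally
        $devices{$tag}++ if exists $devices{$tag};
    }

    # Descending sort, so the first entries are the most frequent tags.
    my @sorted = sort { $devices{$b} <=> $devices{$a} or $a cmp $b } keys %devices;
    $#sorted = $limit - 1 if @sorted > $limit;
    return map { [ $_, $devices{$_} ] } @sorted;
}

# Made-up sample data; real code would open the two files instead.
my $tags  = "35209900\n01326300\n";
my $imeis = "352099001234567\n013263009876543\n352099007654321\nnot a number\n";

# Every open is checked, as the first point above asks.
open my $tag_fh,  '<', \$tags  or die "Cannot open tag data: $!";
open my $imei_fh, '<', \$imeis or die "Cannot open IMEI data: $!";
my @top = top_tags($tag_fh, $imei_fh, 100);
close $tag_fh;
close $imei_fh;

print "$_->[0]\t$_->[1]\n" for @top;
```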