PerlMonks  

Excluding words with more than three a

by Anonymous Monk
on Dec 20, 2001 at 05:54 UTC ( #133368=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I believe this is a simple question, but I ran in troubles when I tried to resolve it. Here is the problem: I've a list of words (a dictionary), and I need to exclude from it all the words that have more than three a, or more than two b, or more than two k. How can this be done? Many thanks for any advise. Best regards, Richard

Replies are listed 'Best First'.
Re: Excluding words with more than three a
by japh (Friar) on Dec 20, 2001 at 06:02 UTC
    perl -ne 'print unless (/a.*a.*a/ || /b.*b/ || /k.*k/)' dictionary.txt
Re: Excluding words with more than three a
by wog (Curate) on Dec 20, 2001 at 06:05 UTC
    perl -ne'y/a//>3|y/b//>2|y/k//>2||print' file
    # 12345678901234567890123456789012345
    #          1         2         3
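For readers who have not met the idiom: the one-liner above relies on y/// (a synonym for tr///) returning the number of matching characters when the replacement list is empty, without changing the string. Below is a minimal spelled-out sketch of the same test. The helper name too_many and the sample words are made up for illustration and do not come from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): true if the word has
# more than three a's, more than two b's, or more than two k's.
sub too_many {
    my ($word) = @_;

    # tr/// with an empty replacement list only counts; $word is not modified.
    my $over = ( $word =~ tr/a// ) > 3
            || ( $word =~ tr/b// ) > 2
            || ( $word =~ tr/k// ) > 2;

    return $over ? 1 : 0;
}

# Sample words chosen for illustration only.
for my $word (qw(banana abracadabra kickback bubble)) {
    printf "%-12s %s\n", $word, too_many($word) ? 'excluded' : 'kept';
}
```

The sketch uses || where the one-liner uses |. Both give the same answer here; || simply stops at the first test that is true.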
      Thank you, wog. I was looking for code similar to your approach, but I failed to find any suitable solution. I've tweaked and wrapped the code a bit, just to run it under MacPerl.
      #!perl -w
      use strict;

      my $tmpdir   = 'Disco:Carpeta_Idioma:';
      my $file     = "$tmpdir" . 'diccionario';
      my $file_out = "$tmpdir" . 'diccionario2';

      open(IN, "<$file") || die "Can't open $file: $!\n";
      open(OUT, ">$file_out") || die "Can't open $file_out: $!\n";

      while (<IN>) {
          print OUT unless (
              y/a//>3 | y/b//>1 | y/c//>1 | y/d//>2 | y/e//>8 | y/g//>1 |
              y/i//>3 | y/j//>1 | y/l//>2 | y/m//>3 | y/n//>3 | y/o//>4 |
              y/r//>3 | y/s//>6 | y/t//>1 | y/u//>2 | y/v//>2
          )
      }

      close IN;
      close OUT;
      The code snippet looked nice indented on my Mac. Hope it looks fine here too. Best regards, Richard
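A long chain of y/// tests like the one above could also be driven by a table. This is only a sketch, assuming the same per-letter limits as the script above; the helper name within_limits and the sample words are hypothetical. Because tr/// does not interpolate variables, this version tallies letters in a hash instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Maximum allowed occurrences per letter (values copied from the script above).
my %max = (
    a => 3, b => 1, c => 1, d => 2, e => 8, g => 1, i => 3, j => 1, l => 2,
    m => 3, n => 3, o => 4, r => 3, s => 6, t => 1, u => 2, v => 2,
);

# Hypothetical helper: true if no letter listed in %max exceeds its limit.
# Letters that are not listed in %max are not restricted, as in the original chain.
sub within_limits {
    my ($word) = @_;

    my %seen;
    $seen{$_}++ for split //, $word;

    for my $letter ( keys %max ) {
        return 0 if ( $seen{$letter} || 0 ) > $max{$letter};
    }
    return 1;
}

# Sample words chosen for illustration only.
print "$_\n" for grep { within_limits($_) } qw(sabado lunes tatami bebe);
```

Changing a limit then means editing one number in %max rather than one term of a long expression.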
(crazyinsomniac) Re: Excluding words with more than three a
by crazyinsomniac (Prior) on Dec 20, 2001 at 10:44 UTC
    Some monks have smelled homework ...
    Hi Monks, I believe this is a simple question, but I ran in troubles when I tried to resolve it. Here is the problem: I've a list of words (a dictionary), and I need to exclude from it all the words that have more than three a, or more than two b, or more than two k. How can this be done? Many thanks for any advise. Best regards, Richard
    I did not like this post because when I read "this is a simple question" and "I ran in troubles when I tried to resolve it" all I can respond with is "liar liar, fingers on fire".

    As this is a site for "learning" perl, and enhancing your skill, after I see "How can this be done?", my first thought is "write a program", immediately after which I think "using perl" and then I think of a couple of potential strategies in non specific programming terms (pseudo code) and then I think in terms of accomplishing this with perl.

    Since you don't even attempt to solve the problem, and you do not seem to be able to express your problem in programming terms, I assume this is a homework question (since practically nobody who is very new to perl would come up with such a problem for themselves to solve).

    I sincerely doubt you even know how to write a simple "Hello world" program. I may be wrong, and if I am, I'd like you to tell me. I don't think I am, but it could be that your question-asking strategy is a little flawed.

    Some monks replied the way I would have (if they hadn't) with some reading material, giving you everything you needed to know (strategy, functions, everything) without literally writing the program for you.

    This is good, this is good advice. You say "Many thanks for any advise" which makes me a little sad because you cannot effectively get any advice that will do you any good (you might get a solution, but I seriously doubt it would do you much good).

    Judging from the way you asked your question, you'll just grab a working solution without knowing how or even why it works. This is not learning. If you truly wanted to learn, you would've (should've) described what strategy you tried that failed you, so that someone can help you with appropriate advice (maybe you had a poor strategy, didn't know how to use a certain function, didn't know the syntax for ...).

    I hope you read this and rethink how you'll ask the question next time.

    Here are a few links I think everybody should visit:
    How to RTFM
    On asking for help
    Don't just provide a module name

     
    ___crazyinsomniac_______________________________________
    Disclaimer: Don't blame. It came from inside the void

    perl -e "$q=$_;map({chr unpack qq;H*;,$_}split(q;;,q*H*));print;$q/$q;"

      Hi crazyinsomniac,

      English is my second language, so I am sorry if my poor English misled you into such a funny exegesis of my post. Although I found some good points in it, other points were really dead wrong. Anyway, since some of the advice you offered seemed good to me, I stand corrected and will try to do better with future questions. Thanks for bringing it up.

      Also, you wondered about my perl skills. I am aware that they are not strong -- I'm only starting with perl and no, my question wasn't a homework assignment, as I am not taking perl lessons -- but they are not so weak that I couldn't write a "Hello world" program, as you unfortunately guessed. Here is a code snippet, the one I wrote to find out how many times each letter appears in the names of the days (please read my other replies to know more about the target problem). It runs fine under MacPerl, so if you'd like to check it out, you may need to change it a bit (I guess only the first line).

      #!perl -w
      use strict;

      my %semanal;

      while (<DATA>) {
          chomp;
          foreach my $letra (split //) {
              $semanal{$letra}++;
          }
      }

      foreach my $pal (sort keys %semanal) {
          print "$pal\t$semanal{$pal}\n";
      }

      __DATA__
      lunes
      martes
      miercoles
      jueves
      viernes
      sabado
      domingo

      output:

      a   3
      b   1
      c   1
      d   2
      e   8
      g   1
      i   3
      j   1
      l   2
      m   3
      n   3
      o   4
      r   3
      s   6
      t   1
      u   2
      v   2
      Thank you.

      Best regards,
      Richard
        Hey, I'm glad I was wrong (well, on some points). That's pretty good actually (as knowledge of m// and y// escaped me for a long time when I started learning perl) and that is how I would've approached it about two years ago. In terms of efficiency, and this is without any benchmarks of this particular problem, but from previous knowledge, your best bet would be to use y// aka tr//.
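The efficiency claim can be checked rather than taken on faith. Here is a rough sketch using the core Benchmark module to compare counting one letter with tr/// against the split-and-hash approach. The sub names are invented for illustration, the word list is the day names from the post above, and the timings will vary by machine and perl version.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @words = qw(lunes martes miercoles jueves viernes sabado domingo);

# Count the letter 'e' across all words with tr///.
sub count_tr {
    my $total = 0;
    for my $word (@words) {
        $total += ( $word =~ tr/e// );
    }
    return $total;
}

# Count the letter 'e' by splitting into characters and tallying in a hash.
sub count_split {
    my %seen;
    for my $word (@words) {
        $seen{$_}++ for split //, $word;
    }
    return $seen{e} || 0;
}

# Run each sub for at least one CPU second and print a comparison table.
cmpthese( -1, { 'tr' => \&count_tr, 'split' => \&count_split } );
```

Both subs return the same count, so the comparison is like for like. Only the relative rates in the table matter.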

         

(ichimunki) Re: Excluding words with more than three a
by ichimunki (Priest) on Dec 20, 2001 at 06:08 UTC
    I suggest reading 'perldoc perlopentut', 'perldoc -f grep' and 'perldoc -f tr'. Of course 'perldoc perlrun' might give some clues as to how you might avoid reading 'perldoc perlopentut' (as it's rather long and dull) and simply write this whole thing as a one-liner from the command line.
Re: Excluding words with more than three a
by merlyn (Sage) on Dec 20, 2001 at 06:35 UTC
      Does it remind you of bananas, like old-school purple mimeographs?
      No, not really. I often need to massage text, not only for work-related issues but sometimes for pleasure too, that is, to solve word games or at least to find a clue. Having heard elsewhere about Perl's superb capabilities for dealing with text, I got involved with this language in mid-2001. Since then I have read what I could in my free time, but unfortunately it seems that is not enough yet.

      In case you are wondering, my question was related to a word game (in Spanish): reorder all the letters of the names of the days and find four words (or fewer, though that is unlikely).

      At first I tried to find out how many words match those criteria, so I began to filter a 100,000-word list, excluding words with unwanted letters and words with fewer than 7 letters. But when I tried to filter out words with more than the allowed number of each letter (that is, words with more than 3 a's, etc.), I was unable to code a working solution. So I asked for help.

      After running the solution provided by wog, I still have almost 33,000 words in the output file, so I must think of more filters to apply. And no, I won't even try to solve the whole problem with perl myself, since, as you have surely guessed already, it is well beyond my current perl skills.

      If you would like to post some suggestions or other advice, I will be very pleased to read them.

      Thank you.

      Best regards,
      Richard
Re: Excluding words with more than three a
by Spenser (Friar) on Dec 20, 2001 at 06:37 UTC

    How about this:

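    #!/usr/bin/perl -w
    use strict;

    open(WORDS, "<dictionary.txt") || die "Cannot open dictionary file: $!";

    # Keep words unless they have two or more b's, three or more a's,
    # or two or more k's.
    my @keep;
    foreach (<WORDS>) {
        if (!/b.*b/ && !/a.*a.*a/ && !/k.*k/) {
            push(@keep, $_);
        }
    }
    close(WORDS);

    print @keep;
    exit;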
Re: Excluding words with more than three a
by atcroft (Abbot) on Dec 20, 2001 at 11:20 UTC

    Below is my humble offering (including test code) to this question. If it does not serve your need, may it at least give those wiser a chance to point out mistakes to avoid.

    #!/usr/bin/perl -- -l -w
    use strict;

    my @mylist = qw(a b k aa bab kick joke aaaaaaa babab kickker);
    my @ignore = qw(a.*a.*a.*a b.*b.*b k.*k.*k);

    print(join(' ', @mylist), "\n");    # Display, space-separated

    foreach my $toignore (@ignore) {
        @mylist = grep(!/$toignore/i, @mylist);
    }

    print(join(' ', @mylist), "\n");    # Display, space-separated

    Am I incorrect in seeing that you want to ignore if count(a) > 3, count(b) > 2, or count(k) > 2, meaning that 3 a's, 2 b's, and/or 2 k's are allowable? Basically, I'm filtering @mylist using the grep() function to match against anything not containing the current pattern I'm looking at, case-insensitively. Also, since my @ignore list is the patterns to match, would they be better served by qr() rather than qw()? (Still learning those types of functions.) Are those grep()s going to tax the machine badly this way?
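On the qr() question, here is a minimal sketch that assumes the same test words and thresholds as the code above. Each pattern is compiled once with qr//, and every word is checked against all patterns in a single grep, so the list is not re-filtered once per pattern. The sub name is_allowed is invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @mylist = qw(a b k aa bab kick joke aaaaaaa babab kickker);

# Precompiled patterns: four a's, three b's, or three k's, case-insensitively.
my @ignore = ( qr/a.*a.*a.*a/i, qr/b.*b.*b/i, qr/k.*k.*k/i );

# Hypothetical helper: true if the word matches none of the patterns.
sub is_allowed {
    my ($word) = @_;
    for my $pattern (@ignore) {
        return 0 if $word =~ $pattern;
    }
    return 1;
}

my @kept = grep { is_allowed($_) } @mylist;
print join( ' ', @kept ), "\n";
```

With the sample list this keeps a, b, k, aa, bab, kick and joke, and drops aaaaaaa, babab and kickker.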

      Aha! Even if you are not our original AnonyMonk querent, we at least from atcroft get code which purports to work but for which the author is still asking questions! Excellent.

      While there is good reason to avoid *any* premature optimization at the "tax the machine" level... we have some algorithmic improvements we might make here to enhance readability and future maintenance. To wit: you are setting up two loops, the foreach, then the grep-- as such testing each word as many times as there are tests. In some sense this is necessary since we have more than one test we need to apply to each word, in another sense it probably isn't the best way to go about it.

      I also think we might be able to avoid keeping the whole list in memory (although I understand we might hard code the list into our script for testing). We'd really like to be able to adapt our technique to streams of data, so we don't necessarily have to have all the data at once and then we could potentially filter an infinite list.

      With this in mind, let's see if we can't find a test that, in as few steps as possible, will evaluate a single word (for the purposes of this demo, I'm going to assume we're looking for words which have double characters rather than the multiples shown; we can easily adjust the REs again later to do the larger tests). Working with your code we might come up with something like $my_word =~ m/a.*a/. Which is great, except that .* is a greedy little thing: it tries to match as much as possible. If we feed it 'ababab', the engine starts at the first 'a', lets .* swallow the rest of the string, and then backs off one character at a time until an 'a' can follow, so it ends up matching 'ababa'. If we use $my_word =~ m/a.*?a/ instead, .*? matches as little as possible, and the RE engine stops as soon as it reaches the next 'a', matching 'aba' in this case. Both patterns give the same yes-or-no answer; the non-greedy one just tends to get there with less work.
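To see the difference between the greedy and non-greedy quantifier concretely, here is a small sketch that captures what each pattern actually matches on 'ababab'. The variable names are mine, not from the post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $word = 'ababab';

# Capture the matched text so we can see how far each quantifier went.
my ($greedy) = $word =~ /(a.*a)/;     # .*  takes as much as it can
my ($lazy)   = $word =~ /(a.*?a)/;    # .*? takes as little as it can

print "greedy: $greedy\n";    # greedy: ababa
print "lazy:   $lazy\n";      # lazy:   aba
```

Both patterns succeed on this word, so as a plain true/false test they are interchangeable. They differ only in which part of the string ends up matched.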

      So now we've got the first test... how to do the other tests? We could set up a truth value, like $is_matching, and then do a series of if( $my_word =~ m/a.*?a/ ) { $is_matching = 1; } statements, changing the contents of the regex each time, but then we duplicate our $is_matching blocks. So we might try if ( $my_word =~ m/a.*?a/ || $my_word =~ m/b.*?b/ ) { $is_matching = 1; }. But we can probably skip the truth condition now and just put our action into the block. So for this exercise print "$my_word\n" if (...) (note we're switching the block to the front and putting if at the end). Also, we're separating these matches and using || to short-circuit. Since our condition is true if any part is true, this stops testing as soon as it finds a true part.

      Finally we need to get our test into a block worth worrying about. We might make it a portable sub, we might make it an anonymous code reference (almost a sub), and we might just put it right in the block of some other list iterator like foreach, map, or grep-- and yes, we could even call perl with the -n flag and have it feed lines from STDIN to our program.
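As a rough illustration of those options, the sketch below writes the same double-letter test once as a named sub and once as an anonymous code reference, then uses both inside grep. The names has_double and $has_double_ref and the sample words are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Named sub: true if the word contains two a's or two b's.
sub has_double {
    my ($word) = @_;
    my $hit = $word =~ /a.*?a/ || $word =~ /b.*?b/;
    return $hit ? 1 : 0;
}

# The same test as an anonymous code reference.
my $has_double_ref = sub {
    my ($word) = @_;
    my $hit = $word =~ /a.*?a/ || $word =~ /b.*?b/;
    return $hit ? 1 : 0;
};

# Sample words chosen for illustration only.
my @words = qw(banana kiwi bubble pear);

# Either form can serve as the test inside grep.
my @plain  = grep { !has_double($_) } @words;
my @double = grep { $has_double_ref->($_) } @words;

print "no doubles: @plain\n";
print "doubles:    @double\n";
```

The code reference is handy when the test has to be passed around, for example as an argument to another sub.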

      But I think we might opt for a simple while( <STDIN> ) { ... test ... } approach. This has the same effect on our program as calling it with perl -n but lets us forget to do that. Then we can put our program into a stream easily. On my W2K machine I type type dict | perl filter.pl > dict2 (on Unix/Linux this would be cat dict | ./filter.pl > dict2-- these both take a dict file, filter it, and write it to a dict2 file). And since we've written the program fairly usefully, we might later modify it so that we can call it with perl filter.pl dict dict2... but I digress.

      Here is our program so far:
      #!/usr/bin/perl -w
      use strict;

      while( <STDIN> ) {
          print if not( /a.*?a/ || /b.*?b/ );
      }


      Yes, we got rid of $my_word. We can rely on the fact that the while is assigning each line from STDIN to $_ and giving that to our block and the match is going to default to comparing our expression against $_ if no variable is given-- and print will print $_ if no argument is given, too.

      There are further things we might eventually consider to optimize this routine for the CPU (after we're done writing it, though). One that is often suggested is the /o flag on regular expressions, which tells perl to compile a pattern only once. Note, though, that /o only matters for patterns that interpolate variables; literal patterns like ours are compiled just once anyway. If we change the print to a return, we also have a block which is easy to put inside a grep or map to take a list and assign to another list. And we could go on forever... and my apologies if parts of this rather long post were oversimplified.
Re: Excluding words with more than three a
by Anonymous Monk on Dec 20, 2001 at 06:45 UTC
    This shud werk:
    open(FH, "dikshunary") or die "Can't open dictionary: $!";
    while (<FH>) {
        my ($a, $b, $k) = (0,0,0);
        # Walk the line one character at a time (indexes 0 .. length-1).
        for my $i (0 .. length($_) - 1) {
            $a=$a+1 if (split '', $_)[$i] eq "a";
            $b=$b+1 if (split '', $_)[$i] eq "b";
            $k=$k+1 if (split '', $_)[$i] eq "k";
        }
        next if $a > 3 or $b > 2 or $k > 2;
        print;
    }
