Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery
 
PerlMonks  

grep help

by sickboy (Initiate)
on Jan 22, 2002 at 04:00 UTC ( [id://140533]=perlquestion: print w/replies, xml ) Need Help??

sickboy has asked for the wisdom of the Perl Monks concerning the following question:

This node falls below the community's threshold of quality. You may see it by logging in.

Replies are listed 'Best First'.
Re: grep help
by japhy (Canon) on Jan 22, 2002 at 09:15 UTC
    You don't really get it. ar0n tries to help you, and you insult him. I'm going to critique your code for you, and tell you how I would do it.

    Your code, as you posted, does not work. Your hash, %friends has the same hash reference in all its values. Why? Because you didn't declare $sucker as a lexical; if you understand the consequences, you'd know why you'd have to.

    Your use of regexes can be totally wiped out, and in particular, your regex /^j.*/i goes overboard and could merely have been /^j/.

    You're sorting the keys of %friends, which all happen to be numbers, but would end up being sorted ASCIIbetically instead of numerically, so the keys would be (1,10,11,12, ... ,2,20,21, ... ,3,4, ...). You probably don't want that. You should be using an array instead of a hash.

    Here is how I would write your code.

    my (@friends, @results); for (@raw_list) { my %who; @who{qw( FIRSTNAME LASTNAME HAIRCOLOR )} = split /,/; push @friends, \%who; } @results = grep { lc($_->{HAIRCOLOR}) eq 'brown' and lc(substr($_->{FIRSTNAME})) eq 'j' } @friends;
    Don't insult my friends. That's a ground rule here.

    _____________________________________________________
    Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

      Hi Japhy. Didnt see this thread until I had finished a long reply to sickboys original.

      But, it turns out the lc stuff is bad advice. Its actually faster with a regex. Anyway, see below...

      Yves / DeMerphq
      --
      When to use Prototypes?

    A reply falls below the community's threshold of quality. You may see it by logging in.
(ar0n) Re: grep help
by ar0n (Priest) on Jan 22, 2002 at 04:19 UTC

    It's an okay way, but you don't really need regular expressions to match the haircolor or the firstname, since you're just matching string constants ("brown" and "j").

    And since the keys are just indices, it's probably better to use an array:

    my @friends; for (@raw_list) { my @friend = split /,/; push @friends, { FIRSTNAME => $friend[0], LASTNAME => $friend[1], HAIRCOLOR => $friend[2] }; } my @results = grep { lc $_->{HAIRCOLOR} eq 'brown' and 'j' eq substr(lc $_->{FIRSTNAME}, 0, 1) } @friends;

    The new line in the grep block, is, of course, only a style decision.


    Update: Yes, I have tried it, and it works for me. What error is it giving?

    Update 2: You've changed the format of the input file.

    When splitting, spaces matter. Have you tried your own code with "john, smith, brown"? It doesn't work.

    Anyhow, replace the split statement above with split /\s*,\s*/


    [ ar0n -- want job (boston) ]

    A reply falls below the community's threshold of quality. You may see it by logging in.
    A reply falls below the community's threshold of quality. You may see it by logging in.
Re: grep help
by demerphq (Chancellor) on Jan 23, 2002 at 00:47 UTC
    Hi. Well, a couple of minor comments.

    First, I would say that it might be a good idea to get out of the habit of doing $$ for non simple refs. They are confusing to read and dont work for more complicated structures. Use -> as a dereference operator and your code will be more readable and there are only a very few scenarios where they cant be used.

    Second you are sorting your data prior to the grep, which is ok for small data sets but doesnt scale well. (Infact a close inspection of the way you are doing this indicates you actually pass over your entire data set 3 times, with the sort being treated as 1 pass which is an underestimate.) Instead sort your data AFTER the grep, this is better because there should be less records to sort. Unfortunately with your current data structure this isnt too easy, you might for instance consider putting the id inside the record as well as being the key to the hash containing the records.

    Third, assuming you do want the list sorted by key you might want to be aware that currently you are getting a lexicographical sort order and not a numeric order (ie lexicographical:1<10<2<20 numeric 1<2<10<20)

    Fourth minor point but, /^j.*/ is the same as /^j/. While its not bad (optimizer will sort it out I bet, er maybe not, after all) errors are minimized by saying what you mean precisely.

    Anyway, ill give you a couple of variants, along with some benchmarking code to show what I mean.

    use strict; use warnings; use Data::Dumper; use Benchmark 'cmpthese'; our %friends; our @results; my $counter=0; my @DATA=<DATA>; chomp @DATA; sub setup_data { my $repeat=shift || 1; for (1..$repeat) { foreach (@DATA) { $friends{++$counter}={}; @{$friends{$counter}}{qw(ID FIRSTNAME LASTNAME HAIR)} = ($c +ounter,split(/,/, $_)); } } print Dumper(\%friends) unless $repeat>1; } sub sickboy { @results=grep ($$_{HAIR}=~ /brown/i && $$_{FIRSTNAME}=~/^j.*/i, @friends{sort keys %friends}); } sub sickboy_rewrite { # Minor rewrite using -> and non superfluous regex, with /o modifier a +s well @results=grep ($_->{HAIR}=~ /brown/io && $_->{FIRSTNAME}=~/^j/i, @friends{sort keys %friends}); } sub internal_id { # assuming that the ID is contained in the record, as well as being th +e key # uncomment the next line to see how this could work with your existin +g code # $friends{$_}->{ID}=$_ foreach keys %friends; @results=sort { $a->{ID} <=> $a->{ID} } #numeric sort grep { $_->{HAIR}=~/brown/i && $_->{FIRSTNAME}=~/^j/i } values %friends; } sub no_internal_id { # assuming there is _no_ way to add an ID to each record, but sort is +still after the grep @results=map { $friends{$_} } sort { $a <=> $b } grep { $friends{$_}->{HAIR}=~/brown/i && $friends{$_}->{FIRSTNAME}=~/^j/i } keys %friends; } sub test_and_display { local $\="\n"; sickboy; print Dumper(\@results); sickboy_rewrite; print Dumper(\@results); internal_id; print Dumper(\@results); no_internal_id; print Dumper(\@results); } # for debugging # setup_data; # test_and_display; for my $x (1,10,100,1_000,10_000,10_000) { setup_data($x); print "\n\nBenchmarking hash of ",scalar keys %friends," records.\ +n"; cmpthese(-30, {sickboy=>\&sickboy, sickboy_rewrite=>\&sickboy_rewrite, internal_id=>\&internal_id, no_internal_id=>\&no_internal_id, }); } __DATA__ John,Brown,red Fred,Flintstone,black Jane,Brown,blond Betty,Rubble,black John,Kennedy,brown Jodi,Brown,brown Bill,Black,blond June,Green,brown John,McEnroe,brown Erin,White,brown Sean,Comb,green Lexius,Luser,black Monty,Python,balding Joe,Bloggs,white Ian,Flemming,red Zoe,Morte,blonde Peter,Phillips,chrome
    Running this code you will see that for trivially small data sets your algorithm is the fastest with the internal_id variant marginally slower. However as the data set scales up the performance difference becomes more and more marked. At 20k records internal_id is around 80% faster and even no_internal_id() surpassing your original. Anyway, the benchmark results are as follows, and I think its pretty clear which variant I would use...
    Benchmarking hash of 17 records.
    Benchmark: running internal_id, no_internal_id, sickboy, sickboy_rewrite, 
               each for at least 30 CPU seconds...
    internal_id: 32 wallclock secs (31.25 usr +  0.00 sys = 31.25 CPU) @ 12040.13/s (n=376254)
    no_internal_id: 33 wallclock secs (31.66 usr +  0.02 sys = 31.67 CPU) @ 8681.97/s (n=274984)
       sickboy: 31 wallclock secs (31.38 usr +  0.00 sys = 31.38 CPU) @ 13151.01/s (n=412613)
    sickboy_rewrite: 32 wallclock secs (31.59 usr +  0.00 sys = 31.59 CPU) @ 13379.41/s (n=422709)
                       Rate no_internal_id  internal_id      sickboy sickboy_rewrite
    no_internal_id   8682/s             --         -28%         -34%            -35%
    internal_id     12040/s            39%           --          -8%            -10%
    sickboy         13151/s            51%           9%           --             -2%
    sickboy_rewrite 13379/s            54%          11%           2%              --
    
    
    Benchmarking hash of 187 records.
    Benchmark: running internal_id, no_internal_id, sickboy, sickboy_rewrite, 
               each for at least 30 CPU seconds...
    internal_id: 32 wallclock secs (31.64 usr +  0.00 sys = 31.64 CPU) @ 1240.35/s (n=39246)
    no_internal_id: 33 wallclock secs (31.48 usr +  0.00 sys = 31.48 CPU) @ 832.23/s (n=26202)
       sickboy: 32 wallclock secs (31.80 usr +  0.00 sys = 31.80 CPU) @ 1116.87/s (n=35513)
    sickboy_rewrite: 32 wallclock secs (31.66 usr +  0.00 sys = 31.66 CPU) @ 1143.45/s (n=36197)
                      Rate no_internal_id       sickboy sickboy_rewrite  internal_id
    no_internal_id   832/s             --          -25%            -27%         -33%
    sickboy         1117/s            34%            --             -2%         -10%
    sickboy_rewrite 1143/s            37%            2%              --          -8%
    internal_id     1240/s            49%           11%              8%           --
    
    
    Benchmarking hash of 1887 records.
    Benchmark: running internal_id, no_internal_id, sickboy, sickboy_rewrite, 
               each for at least 30 CPU seconds...
    internal_id: 33 wallclock secs (31.76 usr +  0.00 sys = 31.76 CPU) @ 77.54/s (n=2463)
    no_internal_id: 32 wallclock secs (31.61 usr +  0.00 sys = 31.61 CPU) @ 51.47/s (n=1627)
       sickboy: 33 wallclock secs (31.77 usr +  0.00 sys = 31.77 CPU) @ 61.61/s (n=1957)
    sickboy_rewrite: 33 wallclock secs (31.48 usr +  0.00 sys = 31.48 CPU) @ 62.16/s (n=1957)
                      Rate no_internal_id       sickboy sickboy_rewrite  internal_id
    no_internal_id  51.5/s             --          -16%            -17%         -34%
    sickboy         61.6/s            20%            --             -1%         -21%
    sickboy_rewrite 62.2/s            21%            1%              --         -20%
    internal_id     77.5/s            51%           26%             25%           --
    
    
    Benchmarking hash of 18887 records.
    Benchmark: running internal_id, no_internal_id, sickboy, sickboy_rewrite, 
               each for at least 30 CPU seconds...
    internal_id: 32 wallclock secs (31.45 usr +  0.00 sys = 31.45 CPU) @  7.28/s (n=229)
    no_internal_id: 33 wallclock secs (31.95 usr +  0.00 sys = 31.95 CPU) @  4.54/s (n=145)
       sickboy: 31 wallclock secs (30.23 usr +  0.00 sys = 30.23 CPU) @  4.04/s (n=122)
    sickboy_rewrite: 31 wallclock secs (31.33 usr +  0.00 sys = 31.33 CPU) @  4.05/s (n=127)
                      Rate       sickboy sickboy_rewrite no_internal_id  internal_id
    sickboy         4.04/s            --             -0%           -11%         -45%
    sickboy_rewrite 4.05/s            0%              --           -11%         -44%
    no_internal_id  4.54/s           12%             12%             --         -38%
    internal_id     7.28/s           80%             80%            60%           --
    
    
    Benchmarking hash of 188887 records.
    Benchmark: running internal_id, no_internal_id, sickboy, sickboy_rewrite, 
               each for at least 30 CPU seconds...
    internal_id: 31 wallclock secs (30.73 usr +  0.00 sys = 30.73 CPU) @  0.72/s (n=22)
    no_internal_id: 31 wallclock secs (31.00 usr +  0.02 sys = 31.01 CPU) @  0.42/s (n=13)
       sickboy: 32 wallclock secs (31.24 usr +  0.00 sys = 31.24 CPU) @  0.29/s (n=9)
    sickboy_rewrite: 32 wallclock secs (31.11 usr +  0.00 sys = 31.11 CPU) @  0.29/s (n=9)
                    s/iter       sickboy sickboy_rewrite no_internal_id  internal_id
    sickboy           3.47            --             -0%           -31%         -60%
    sickboy_rewrite   3.46            0%              --           -31%         -60%
    no_internal_id    2.39           45%             45%             --         -41%
    internal_id       1.40          148%            147%            71%           --
    
    
    Benchmarking hash of 358887 records.
    Benchmark: running internal_id, no_internal_id, sickboy, sickboy_rewrite, 
               each for at least 30 CPU seconds...
    internal_id: 32 wallclock secs (30.25 usr +  0.00 sys = 30.25 CPU) @  0.36/s (n=11)
    no_internal_id: 34 wallclock secs (32.20 usr +  0.17 sys = 32.37 CPU) @  0.22/s (n=7)
       sickboy: 35 wallclock secs (33.37 usr +  0.02 sys = 33.39 CPU) @  0.15/s (n=5)
    sickboy_rewrite: 34 wallclock secs (33.41 usr +  0.00 sys = 33.41 CPU) @  0.15/s (n=5)
                    s/iter sickboy_rewrite       sickboy no_internal_id  internal_id
    sickboy_rewrite   6.68              --           -0%           -31%         -59%
    sickboy           6.68              0%            --           -31%         -59%
    no_internal_id    4.62             44%           44%             --         -41%
    internal_id       2.75            143%          143%            68%           --
    
    Anyway, hope that helps.

    PS:I realize that much of this has already been said by japhy, originally this post was going to be to the other thread, but it seems more topical here (er, at least now that I know this thread exists).
    BTW:I have the feeling that you should adopt a slightly less antagonistic attitude. You'd get less hassle and much more respect.

    Yves / DeMerphq
    --
    When to use Prototypes?

Re: grep help
by goldclaw (Scribe) on Jan 22, 2002 at 04:08 UTC
    try replacing @friends{sort keys %friends} with values %friends. Even better, dont use a hash at all for %friends, use an array instead.

    The grep is otherwise perfectly OK.

    goldclaw

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://140533]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having a coffee break in the Monastery: (3)
As of 2024-04-16 15:41 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found