grep help

sickboy has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: grep help by japhy (Canon) on Jan 22, 2002 at 09:15 UTC
You don't really get it. ar0n tries to help you, and you insult him. I'm going to critique your code for you, and tell you how I would do it. Your code, as you posted, does not work. Your hash, `%friends` has the same hash reference in all its values. Why? Because you didn't declare `$sucker` as a lexical; if you understand the consequences, you'd know why you'd have to. Your use of regexes can be totally wiped out, and in particular, your regex `/^j./i` goes overboard and could merely have been `/^j/`. You're sorting the keys of `%friends`, which all happen to be numbers, but would end up being sorted ASCIIbetically instead of numerically, so the keys would be (1,10,11,12, ... ,2,20,21, ... ,3,4, ...). You probably don't want that. You should be using an array instead of a hash. Here is how I would write your code. `my (@friends, @results); for (@raw_list) { my %who; @who{qw( FIRSTNAME LASTNAME HAIRCOLOR )} = split /,/; push @friends, \%who; } @results = grep { lc($_->{HAIRCOLOR}) eq 'brown' and lc(substr($_->{FIRSTNAME})) eq 'j' } @friends;` [download] Don't insult my friends. That's a ground rule here. _____________________________________________________ Jeff`[japhy]`Pinyan: Perl, regex, and perl* hacker. `s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;`	[reply] [d/l]
Re: Re: grep help by demerphq (Chancellor) on Jan 23, 2002 at 00:20 UTC
Hi Japhy. Didnt see this thread until I had finished a long reply to sickboys original. But, it turns out the lc stuff is bad advice. Its actually faster with a regex. Anyway, see below... Yves / DeMerphq -- When to use Prototypes?	[reply]
A reply falls below the community's threshold of quality. You may see it by logging in.
(ar0n) Re: grep help by ar0n (Priest) on Jan 22, 2002 at 04:19 UTC
It's an okay way, but you don't really need regular expressions to match the haircolor or the firstname, since you're just matching string constants ("brown" and "j"). And since the keys are just indices, it's probably better to use an array: `my @friends; for (@raw_list) { my @friend = split /,/; push @friends, { FIRSTNAME => $friend[0], LASTNAME => $friend[1], HAIRCOLOR => $friend[2] }; } my @results = grep { lc $_->{HAIRCOLOR} eq 'brown' and 'j' eq substr(lc $_->{FIRSTNAME}, 0, 1) } @friends;` [download] The new line in the grep block, is, of course, only a style decision. Update: Yes, I have tried it, and it works for me. What error is it giving? Update 2: You've changed the format of the input file. When splitting, spaces matter. Have you tried your own code with `"john, smith, brown"`? It doesn't work. Anyhow, replace the split statement above with `split /\s,\s/` [ ar0n -- want job (boston) ]	[reply] [d/l]
A reply falls below the community's threshold of quality. You may see it by logging in.
A reply falls below the community's threshold of quality. You may see it by logging in.
Re: grep help by demerphq (Chancellor) on Jan 23, 2002 at 00:47 UTC
Hi. Well, a couple of minor comments. First, I would say that it might be a good idea to get out of the habit of doing $$ for non simple refs. They are confusing to read and dont work for more complicated structures. Use -> as a dereference operator and your code will be more readable and there are only a very few scenarios where they cant be used. Second you are sorting your data prior to the grep, which is ok for small data sets but doesnt scale well. (Infact a close inspection of the way you are doing this indicates you actually pass over your entire data set 3 times, with the sort being treated as 1 pass which is an underestimate.) Instead sort your data AFTER the grep, this is better because there should be less records to sort. Unfortunately with your current data structure this isnt too easy, you might for instance consider putting the id inside the record as well as being the key to the hash containing the records. Third, assuming you do want the list sorted by key you might want to be aware that currently you are getting a lexicographical sort order and not a numeric order (ie lexicographical:1<10<2<20 numeric 1<2<10<20) Fourth minor point but, /^j./ is the same as /^j/. While its not bad (optimizer will sort it out I bet, er maybe not, after all) errors are minimized by saying what you mean precisely. Anyway, ill give you a couple of variants, along with some benchmarking code to show what I mean. use strict; use warnings; use Data::Dumper; use Benchmark 'cmpthese'; our %friends; our @results; my $counter=0; my @DATA=<DATA>; chomp @DATA; sub setup_data { my $repeat=shift \|\| 1; for (1..$repeat) { foreach (@DATA) { $friends{++$counter}={}; @{$friends{$counter}}{qw(ID FIRSTNAME LASTNAME HAIR)} = ($c +ounter,split(/,/, $_)); } } print Dumper(\%friends) unless $repeat>1; } sub sickboy { @results=grep ($$_{HAIR}=~ /brown/i && $$_{FIRSTNAME}=~/^j./i, @friends{sort keys %friends}); } sub sickboy_rewrite { # Minor rewrite using -> and non superfluous regex, with /o modifier a +s well @results=grep ($_->{HAIR}=~ /brown/io && $_->{FIRSTNAME}=~/^j/i, @friends{sort keys %friends}); } sub internal_id { # assuming that the ID is contained in the record, as well as being th +e key # uncomment the next line to see how this could work with your existin +g code # $friends{$_}->{ID}=$_ foreach keys %friends; @results=sort { $a->{ID} <=> $a->{ID} } #numeric sort grep { $_->{HAIR}=~/brown/i && $_->{FIRSTNAME}=~/^j/i } values %friends; } sub no_internal_id { # assuming there is _no_ way to add an ID to each record, but sort is +still after the grep @results=map { $friends{$_} } sort { $a <=> $b } grep { $friends{$_}->{HAIR}=~/brown/i && $friends{$_}->{FIRSTNAME}=~/^j/i } keys %friends; } sub test_and_display { local $\="\n"; sickboy; print Dumper(\@results); sickboy_rewrite; print Dumper(\@results); internal_id; print Dumper(\@results); no_internal_id; print Dumper(\@results); } # for debugging # setup_data; # test_and_display; for my $x (1,10,100,1_000,10_000,10_000) { setup_data($x); print "\n\nBenchmarking hash of ",scalar keys %friends," records.\ +n"; cmpthese(-30, {sickboy=>\&sickboy, sickboy_rewrite=>\&sickboy_rewrite, internal_id=>\&internal_id, no_internal_id=>\&no_internal_id, }); } __DATA__ John,Brown,red Fred,Flintstone,black Jane,Brown,blond Betty,Rubble,black John,Kennedy,brown Jodi,Brown,brown Bill,Black,blond June,Green,brown John,McEnroe,brown Erin,White,brown Sean,Comb,green Lexius,Luser,black Monty,Python,balding Joe,Bloggs,white Ian,Flemming,red Zoe,Morte,blonde Peter,Phillips,chrome [download] Running this code you will see that for trivially small data sets your algorithm is the fastest with the internal_id variant marginally slower. However as the data set scales up the performance difference becomes more and more marked. At 20k records internal_id is around 80% faster and even no_internal_id() surpassing your original. Anyway, the benchmark results are as follows, and I think its pretty clear which variant I would use... _{Benchmarking hash of 17 records. Benchmark: running internal_id, no_internal_id, sickboy, sickboy_rewrite, each for at least 30 CPU seconds... internal_id: 32 wallclock secs (31.25 usr + 0.00 sys = 31.25 CPU) @ 12040.13/s (n=376254) no_internal_id: 33 wallclock secs (31.66 usr + 0.02 sys = 31.67 CPU) @ 8681.97/s (n=274984) sickboy: 31 wallclock secs (31.38 usr + 0.00 sys = 31.38 CPU) @ 13151.01/s (n=412613) sickboy_rewrite: 32 wallclock secs (31.59 usr + 0.00 sys = 31.59 CPU) @ 13379.41/s (n=422709) Rate no_internal_id internal_id sickboy sickboy_rewrite no_internal_id 8682/s -- -28% -34% -35% internal_id 12040/s 39% -- -8% -10% sickboy 13151/s 51% 9% -- -2% sickboy_rewrite 13379/s 54% 11% 2% -- Benchmarking hash of 187 records. Benchmark: running internal_id, no_internal_id, sickboy, sickboy_rewrite, each for at least 30 CPU seconds... internal_id: 32 wallclock secs (31.64 usr + 0.00 sys = 31.64 CPU) @ 1240.35/s (n=39246) no_internal_id: 33 wallclock secs (31.48 usr + 0.00 sys = 31.48 CPU) @ 832.23/s (n=26202) sickboy: 32 wallclock secs (31.80 usr + 0.00 sys = 31.80 CPU) @ 1116.87/s (n=35513) sickboy_rewrite: 32 wallclock secs (31.66 usr + 0.00 sys = 31.66 CPU) @ 1143.45/s (n=36197) Rate no_internal_id sickboy sickboy_rewrite internal_id no_internal_id 832/s -- -25% -27% -33% sickboy 1117/s 34% -- -2% -10% sickboy_rewrite 1143/s 37% 2% -- -8% internal_id 1240/s 49% 11% 8% -- Benchmarking hash of 1887 records. Benchmark: running internal_id, no_internal_id, sickboy, sickboy_rewrite, each for at least 30 CPU seconds... internal_id: 33 wallclock secs (31.76 usr + 0.00 sys = 31.76 CPU) @ 77.54/s (n=2463) no_internal_id: 32 wallclock secs (31.61 usr + 0.00 sys = 31.61 CPU) @ 51.47/s (n=1627) sickboy: 33 wallclock secs (31.77 usr + 0.00 sys = 31.77 CPU) @ 61.61/s (n=1957) sickboy_rewrite: 33 wallclock secs (31.48 usr + 0.00 sys = 31.48 CPU) @ 62.16/s (n=1957) Rate no_internal_id sickboy sickboy_rewrite internal_id no_internal_id 51.5/s -- -16% -17% -34% sickboy 61.6/s 20% -- -1% -21% sickboy_rewrite 62.2/s 21% 1% -- -20% internal_id 77.5/s 51% 26% 25% -- Benchmarking hash of 18887 records. Benchmark: running internal_id, no_internal_id, sickboy, sickboy_rewrite, each for at least 30 CPU seconds... internal_id: 32 wallclock secs (31.45 usr + 0.00 sys = 31.45 CPU) @ 7.28/s (n=229) no_internal_id: 33 wallclock secs (31.95 usr + 0.00 sys = 31.95 CPU) @ 4.54/s (n=145) sickboy: 31 wallclock secs (30.23 usr + 0.00 sys = 30.23 CPU) @ 4.04/s (n=122) sickboy_rewrite: 31 wallclock secs (31.33 usr + 0.00 sys = 31.33 CPU) @ 4.05/s (n=127) Rate sickboy sickboy_rewrite no_internal_id internal_id sickboy 4.04/s -- -0% -11% -45% sickboy_rewrite 4.05/s 0% -- -11% -44% no_internal_id 4.54/s 12% 12% -- -38% internal_id 7.28/s 80% 80% 60% -- Benchmarking hash of 188887 records. Benchmark: running internal_id, no_internal_id, sickboy, sickboy_rewrite, each for at least 30 CPU seconds... internal_id: 31 wallclock secs (30.73 usr + 0.00 sys = 30.73 CPU) @ 0.72/s (n=22) no_internal_id: 31 wallclock secs (31.00 usr + 0.02 sys = 31.01 CPU) @ 0.42/s (n=13) sickboy: 32 wallclock secs (31.24 usr + 0.00 sys = 31.24 CPU) @ 0.29/s (n=9) sickboy_rewrite: 32 wallclock secs (31.11 usr + 0.00 sys = 31.11 CPU) @ 0.29/s (n=9) s/iter sickboy sickboy_rewrite no_internal_id internal_id sickboy 3.47 -- -0% -31% -60% sickboy_rewrite 3.46 0% -- -31% -60% no_internal_id 2.39 45% 45% -- -41% internal_id 1.40 148% 147% 71% -- Benchmarking hash of 358887 records. Benchmark: running internal_id, no_internal_id, sickboy, sickboy_rewrite, each for at least 30 CPU seconds... internal_id: 32 wallclock secs (30.25 usr + 0.00 sys = 30.25 CPU) @ 0.36/s (n=11) no_internal_id: 34 wallclock secs (32.20 usr + 0.17 sys = 32.37 CPU) @ 0.22/s (n=7) sickboy: 35 wallclock secs (33.37 usr + 0.02 sys = 33.39 CPU) @ 0.15/s (n=5) sickboy_rewrite: 34 wallclock secs (33.41 usr + 0.00 sys = 33.41 CPU) @ 0.15/s (n=5) s/iter sickboy_rewrite sickboy no_internal_id internal_id sickboy_rewrite 6.68 -- -0% -31% -59% sickboy 6.68 0% -- -31% -59% no_internal_id 4.62 44% 44% -- -41% internal_id 2.75 143% 143% 68% --} Anyway, hope that helps. `PS:I realize that much of this has already been said by japhy, originally this post was going to be to the other thread, but it seems more topical here (er, at least now that I know this thread exists). BTW:I have the feeling that you should adopt a slightly less antagonistic attitude. You'd get less hassle and much more respect.` Yves / DeMerphq -- When to use Prototypes?	[reply] [d/l]
Re: grep help by goldclaw (Scribe) on Jan 22, 2002 at 04:08 UTC
try replacing `@friends{sort keys %friends}` with `values %friends`. Even better, dont use a hash at all for %friends, use an array instead. The grep is otherwise perfectly OK. goldclaw	[reply] [d/l] [select]


Welcome to the Monastery
	PerlMonks