Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW
 
PerlMonks  

Yet Another threaded find|grep perl utility

by QuillMeantTen (Friar)
on Nov 04, 2015 at 11:10 UTC ( [id://1146897]=CUFP: print w/replies, xml ) Need Help??

Greetings, fellow monks
Today I was given a C assignment : writing a program that once given a top directory to search, a filename to search for and a max number of coexisting threads will return the absolute path of the file if it is found inside the directory or the paths of multiple similarly named files

The assignment is quite specific on the way it should be done:

A main thread creates a named pipe and then uses it to receive subdirectory paths. Those subdirectory paths are used to create new threads that will look for the specified filename inside the directory and send back subdirectories to the main thread.

I decided to indulge in my new favorite vice : before getting down and dirty with C, write a high level Perl prototype to spot the potential bugs and hurdles and make the C code writing easier.
Here it is :D
I hope you enjoy reading it, it is by no mean a better mousetrap, just my take on an interesting problem under specific constraints.

As usual, even if that last line makes this post improper for CUFP (at least for one person who told me so):
If you see bad practices, things that can be made better, do post about them, no point in making mistakes if I dont learn from them :)

Update: The teacher updated the requirements, making it a bit more fun, now I have to look for a string in any file named *.txt or if both string and filename specified find said file and look for target string inside it.

Also I started using getopt::std to make option handling cleaner also the script now displays the permissions for each file/dir found. enjoy!

After re reading the assignment I put in some more changes. I am almost through with the C implementation. The striking difference is the number of lines written: 500 to 205 in Perl. Here is the last version of the perl prototype:

#!/usr/bin/perl use feature 'state'; use strict; use warnings; use autodie; use threads; use threads::shared; use POSIX qw(mkfifo); use Carp qw(croak carp confess); use File::Spec; use Getopt::Std; use File::Touch; sub look_for_string{ my ($target,$file) = @_; my @target_chars = split(//,$target); open my $fh, '<',$file; my $readchars; my $line = 0; while(read($fh,$readchars,512)!= 0){ my @chars = split(//,$readchars);#this is a prototype for a c +program, so lets if($chars[$#chars] ne "\n" && $chars[$#chars] ne " " && $chars[$#chars] ne "\0"){ my $i = 1; my $curpos = tell($fh); while($chars[$#chars-$i] ne "\n" && $chars[$#chars-$i] ne +" " && $chars[$#chars-$i] ne "\0"){ $i++; } seek($fh,$curpos,$i);#if word is cutoff we rewind } NEXT_CHAR:for my $i ($#target_chars .. $#chars){ if($chars[$i] eq "\n"){ $line++; } if($chars[$i] eq "$target_chars[$#target_chars]"){ for my $j ( 1 .. $#target_chars){ if($target_chars[$#target_chars-$j] ne "$chars[$i-$j +]"){ next NEXT_CHAR; } } close $fh; return $line;#return line number where found } } } close $fh; return undef; } sub printattr{ my $file = shift; my @fileattr = split(//,(stat($file))[2]); my @printable = qw( --- --x -w- -wx r-- r-x rw- rwx ); print "["; for my $attr (@fileattr){ print $printable[$attr % 8]; } print "]\n"; } sub explore{ my ($dir,$lock,$backpipe,$backlock,$target,$text_target) = @_; my @files = glob "$dir/*"; my $found=undef; foreach my $file (@files){ if(-d $file){ if($file =~ m/\.txt\z/){ if(defined $target && $file=~m/$target/gxms){ print $file." <REP> "; printattr($file); } } lock($backlock); open my $fh,'>',$backpipe; print $fh "$file\n"; close $fh; } elsif(defined $target){ if($file eq $dir."/".$target){ if(defined $text_target && $file =~ m/\.txt\z/ && ! -e "$file.lck"){#check if file already opened by other th +read #in case some symlink sent it here lock($lock); touch("$file.lck"); $found = look_for_string($text_target,$file); unlink("$file.lck"); } lock($lock); if(defined $found){ print "found $text_target at line $found of file:$ +file "; printattr($file); } } } else{ if(defined $text_target && $file =~ m/\.txt\z/ && ! -e "$f +ile.lck"){ touch("$file.lck"); $found = look_for_string($text_target,$file); unlink("$file.lck"); } lock($lock); if(defined $found){ print "found $text_target at line $found of file:$file + "; printattr($file); } } } threads->exit(); } sub nb_running_threads{ my $threads = shift; my $res = 0; for my $th (@$threads){ $res += $th->is_running(); } return $res; } sub main{ my $args = shift; my $tmp= File::Spec->rel2abs($args->[1]); my $target = $args->[0]; my $max_threads = $args->[2]; my $text_target = $args->[3]; my %explored;#hash containing already explored paths if(!defined($max_threads)){ $max_threads = 8; } my $backpipe = "backpipe"; if(-e $backpipe){ unlink $backpipe; } mkfifo($backpipe,0700) or croak "could not make backpipe: $!"; $backpipe = File::Spec->rel2abs($backpipe); my $lock :shared; my $backlock :shared; my @thread_ids; $thread_ids[0] = threads->create('explore',$tmp,\$lock,$backpipe,\$backlock,$target +,$text_target);#now the #first exploring thread to get the pump primed my $thc = 1; open my $fhbk, '<',$backpipe; while(nb_running_threads(\@thread_ids)){ while(<$fhbk>){ chomp($_); my $tmp_c; if($thc == $max_threads -1 ){ print "waiting for thread to die\n"; while(!defined($tmp_c)){ for my $i (0 .. $#thread_ids){ if(! $thread_ids[$i]->is_running()){ $thread_ids[$i]->join(); $tmp_c = $i; last; } } } } if(defined($tmp_c) && !defined $explored{$_}){ $explored{$_} = 1; $thread_ids[$tmp_c] = threads->create('explore',$_,\$lock,File::Spec->rel2ab +s($backpipe),\$backlock,$target,$text_target); } elsif(!defined $explored{$_}){ $explored{$_} = 1; $thread_ids[$thc] = threads->create('explore',$_,\$lock,File::Spec->rel2ab +s($backpipe),\$backlock,$target,$text_target); $thc++; } } } #lets close shop for my $th (@thread_ids){ $th->join(); } #cleanup unlink $backpipe; } our ($opt_f,#filename switch $opt_d,#dir switch $opt_t,#thread number switch $opt_m);#target text switch getopts('f:d:t:m:'); my @shortargs= ($opt_f,$opt_d,$opt_t,$opt_m); if(!defined($shortargs[1]) || (!defined $shortargs[0] && !defined $shortargs[3])){ croak <<"END" use me -f filename -d topdir -t max_thread_number and -m target_te +xt END } main(\@shortargs);

Replies are listed 'Best First'.
Re: Yet Another threaded find|grep perl utility
by hippo (Bishop) on Nov 04, 2015 at 13:59 UTC
    I decided to indulge in my new favorite vice : before getting down and dirty with C, write a high level Perl prototype to spot the potential bugs and hurdles and make the C code writing easier.

    That's an excellent strategy (hardly a "vice" but I think you knew that :-) and one which would serve anybody well. It might be interesting for you to benchmark both solutions to each problem you tackle in this way as that will give you a feel for what level of performance boost you could get with C compared with the extra development effort - a useful experience to acquire.

      Great idea, I finished updating my prototype now that the assignment has been modified one last time, I will post the C code and the comparison here!

Re: Yet Another threaded find|grep perl utility
by Athanasius (Archbishop) on Nov 05, 2015 at 15:54 UTC

    Hello QuillMeantTen,

    If you see ... things that can be made better, do post about them ...

    This sub caught my eye:

    sub printattr{ my $file = shift; my @fileattr = split(//,(stat($file))[2]); print "["; for my $attr (@fileattr){ if($attr eq 0){ print "---"; } elsif($attr eq 1){ print "--x"; } elsif($attr eq 2){ print "-w-"; } elsif($attr eq 3){ print "-wx"; } elsif($attr eq 4){ print "r--"; } elsif($attr eq 5){ print "r-x"; } elsif($attr eq 6){ print "rw-"; } else{ print "rwx" } } print "]\n"; }

    It can be written more succinctly using an array (or, better, a persistent array reference) for lookup:

    use feature 'state'; sub printattr { state $attrs = [ qw( --- --x -w- -wx r-- r-x rw- rwx ) ]; state $maxi = $#$attrs; my ($file) = @_; print '['; print $attrs->[$_ < $maxi ? $_ : $maxi] for split //, (stat $file) +[2]; print "]\n"; }

    Update: Since each array index is limited to the range 0 .. 9, it’s simpler and more efficient to just extend the array:

    sub print_attr { state $attrs = [ qw( --- --x -w- -wx r-- r-x rw- rwx rwx rwx) ]; my ($file) = @_; print '['; print $attrs->[$_] for split //, (stat $file)[2]; print "]\n"; }

    Hope that’s of interest,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      Great advice, I found my first solution quite unelegant, I'll rewrite it that way.

Re: Yet Another threaded find|grep perl utility
by BrowserUk (Patriarch) on Nov 05, 2015 at 16:06 UTC

    This:sub nb_running_threads() can be replaced by:

    my $running = threads->list( threads::running );

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Yet Another threaded find|grep perl utility
by Anonymous Monk on Nov 05, 2015 at 16:56 UTC

    In C, one need not search character by character either. There are strstr and memmem (index in perl).

    And... "I've updated the requirements once, pray that I update them no further." Better ask your teacher if the search string may include newlines or not. In case there's no line length limit, one might (ab)use the getline on the C side.

Re: Yet Another threaded find|grep perl utility
by Preceptor (Deacon) on Nov 06, 2015 at 12:21 UTC

    Be slightly cautious when trying to parallelise disk IO. You can end up with some rather unexpected consequences. You can't change the laws of physics - a rotating platter and moving head still has to move to find data.

    So if you start reading from mulitple points on the disk at once, you will create contention and slow down the overall result because of the seek time overhead. You will also make it harder for your OS to predict and prefetch data into cache, which can also reduce performance. Depends on what other workloads and limiting factors are present though - typically the storage subsystem is one of the slowest in your computer system

Re: Yet Another threaded find|grep perl utility
by QuillMeantTen (Friar) on Nov 09, 2015 at 07:31 UTC

    File reading wont have to be done in parallel, it will have to be done on discrete 512 bytes blocks though.

    I am messing around with the prototype in multiple branches so excuse me if it has not yet all the changes proposed here, they are in other branches staged for merging :)

    Also THANK YOU SO MUCH FOR STRSTR I had no idea such thing did exist!

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: CUFP [id://1146897]
Approved by LanX
Front-paged by kcott
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chilling in the Monastery: (4)
As of 2024-04-19 04:24 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found