PerlMonks
Piping many individual files into a single perl script

by kelder (Novice)
on Sep 28, 2008 at 01:01 UTC ( [id://714088]=perlquestion: print w/replies, xml ) Need Help??

kelder has asked for the wisdom of the Perl Monks concerning the following question:

I'm relatively new to perl and need some help. Basically, I have 10000 files containing a randomly generated set of strings that I need to check against a preset string. For each "data" input file, the program prints the number of times the preset string occurs to a "results" output file, with an individual line for each file that is checked. For example, the input/output might look like this:
-file1-
stringa
stringa
stringb
stringc
stringc

-file2-
stringa
stringb
stringb
stringb
stringc

output:

File  A  B
1     2  1
2     1  3
I've written all the code to check the files, but right now I'm at a loss as to how to pipe each of the input files into the program. Is there a command for piping each file from a given directory into the program or something similar I could put at the beginning of the code? I apologize in advance if this ends up being a RTFM post, but I feel like I'm just missing something in the manuals.

Replies are listed 'Best First'.
Re: Piping many individual files into a single perl script
by BrowserUk (Patriarch) on Sep 28, 2008 at 01:23 UTC

    If you're running on Win32, ikegami's shell solution won't work for you. However, you can achieve a similar effect by adding @ARGV = map glob, @ARGV; at the top of your program.

    A nice way to process a list of files is:

    #! perl -slw
    use strict;

    BEGIN{ @ARGV = map glob, @ARGV; }

    while( <> ) {
        ## here, $_ is every line of every file that matches the
        ## wildcarded paths supplied on the command line
        if( /the search string/ ) {
            ## Do something
        }
    }

    So c:\>theScriptAbove.pl *.c *.h would read all C source and header files in the current directory, and search each line of all of those files for "the search string".


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      So I tried the code you provided, and while it did pipe all the files in, it listed them all on one line in the results file.
      A sample of what I received:
      File#   A    B   C
      1       5    6   3
      
      What I wanted:
      File#   A    B   C
      1       2    1   0
      2       2    4   2
      3       1    1   1
      
      This is my code:
      #!C:\Perl
      use strict;

      BEGIN{ @ARGV = map glob, @ARGV; }

      open(RES, ">>results.txt");
      print RES "File Number  A  A%  B  B%  Null  Null%\n";

      my $A = 0;   # these three lines set my initial counts at zero
      my $B = 0;
      my $null = 0;
      my $filenum = 0;

      while (<>) {
          chomp($_);
          if    ($_ eq "stringa") { $A++; }
          elsif ($_ eq "stringb") { $B++; }
          else                    { $null++; }
      }

      my $popa = $A/1000;   # these lines determine what percent of the population the strings represent
      my $popa = sprintf('%.2f', $popa);   # cut the percentages to two decimal places
      my $popb = $B/1000;
      my $popb = sprintf('%.2f', $popb);
      my $popnull = $null/1000;
      my $popnull = sprintf('%.2f', $popnull);
      my $filenum++;   # add one to my file number

      print RES "$filenum  $A  $popa  $B  $popb  $null  $popnull\n";   # print the results out to the "results" file
      What am I doing wrong? Edit: Thanks for the help so far!

        You need to detect the end of each individual file, print your results for that file and reset the counts. See the explanation of eof(ARGV) in perlfunc:

        #!C:\Perl
        use strict;

        BEGIN{ @ARGV = map glob, @ARGV; }

        open(RES, ">>results.txt");
        print RES "File Number  A  A%  B  B%  Null  Null%\n";

        my $A = 0;   # these three lines set my initial counts at zero
        my $B = 0;
        my $null = 0;
        my $filenum = 0;

        while( <> ) {
            chomp($_);
            if    ($_ eq "stringa") { $A++; }
            elsif ($_ eq "stringb") { $B++; }
            else                    { $null++; }

            if( eof( ARGV ) ) {   ## true after the end of each individual file
                my $popa    = sprintf( '%.2f', $A / 1000 );
                my $popb    = sprintf( '%.2f', $B / 1000 );
                my $popnull = sprintf( '%.2f', $null / 1000 );
                $filenum++;   ## add one to the file number (no "my" here!)
                print RES "$filenum  $A  $popa  $B  $popb  $null  $popnull\n";
                $A = $B = $null = 0;   ## reset counts for the next file
            }
        }

        As I mentioned above, since OS X is a *nix-like system, you probably don't need the @ARGV = map glob, @ARGV, as the shell will take care of that for you. (Though it probably won't do any harm.)

        Also, in several places in your code you do:

        ...
        my $var = ...;
        my $var = sprintf ... $var;
        ...

        If you are running with strict and warnings, you should be getting messages of the form '"my" variable $var masks earlier declaration in same scope at ...'. Don't ignore them; they are there for a purpose.
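
        The masking warning described above is easy to reproduce; a minimal two-line case (the variable name $pct is hypothetical, not from the OP's code):

        ```perl
        #!/usr/bin/perl
        use strict;
        use warnings;

        my $pct = 42 / 1000;                 # first declaration: 0.042
        my $pct = sprintf( '%.2f', $pct );   # second "my" masks the first and triggers the warning
        print "$pct\n";                      # prints 0.04
        ```

        The fix is simply to drop the second "my" and reassign to the existing variable.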


        I'm a little confused. You said you are running on macosx, but your code starts with:
        #!C:\Perl
        That makes no sense, and it means you can only run the script with a command line like this:
        perl path/file_name_of_script arg1 ...
        (where the "path/" part is only needed if the script is not in your shell's current working directory). I would use this as the initial "shebang" line:
        #!/usr/bin/perl
        because macosx is really a unix OS, and in unix, perl is typically found in the /usr/bin/ directory; macosx definitely has perl in that location. With that as the shebang line, and doing the shell command "chmod +x file_name_of_script", the script becomes available as a shell command:
        path/file_name_of_script arg1 ...
        where the "path/" part is only needed if your shell PATH variable does not include the directory where the script is stored.

        As for your question about iterating over a list of file names, a method that I find useful goes like this: the perl script expects as input a list of file names, loads those into an array, and then iterates over the array. At each iteration, if there's a problem with the file or its contents, issue a warning and skip on to the next file in the list; e.g.:

        #!/usr/bin/perl
        use strict;
        use Getopt::Long;

        my $Usage = "Usage: $0 [-p path] filename.list\n  or:  ls [path] | $0 [-p path]\n";
        my $path = '.';
        die $Usage unless ( GetOptions( 'p=s' => \$path ) and -d $path );
        die $Usage if (( @ARGV and !-f $ARGV[0] ) or ( @ARGV==0 and -t ));
        # need file name args or pipeline input

        my @file_list = <>;   # read all input as a list of file names
        chomp @file_list;     # get rid of line-feeds

        for my $name ( @file_list ) {
            my $file = "$path/$name";
            if ( ! -f $file ) {
                warn "input value '$file' does not seem to be a data file; skipped\n";
                next;
            }
            if ( ! open( I, "<", $file )) {
                warn "open failed for input file '$file'; skipped\n";
                next;
            }
            ...
        }
        There are already very good shell command tools for creating a list of file names ("ls", "find"), and for filtering lists ("grep"), so I'm inclined not to rewrite those things in a perl script that is supposed to process a list of file names.

        The exception to that rule is when the script is really intended for a specific task that always involves a specific location and/or filter for getting its list of file names to work on, because in that case, I'd rather not have to repeat the selection process on the command line every time I run the script.

        $A and $B are the running totals for all of the files. You either need to make them arrays (indexed by file), or you need to print the totals when you reach the end of a file (after which, you would reset the variables to zero).

        I like BrowserUk's solution below, except that I'd probably rewrite it (mostly because I didn't like the chained if-elsifs) in a manner similar to this (untested):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use 5.010;

        BEGIN{ @ARGV = map glob, @ARGV }

        print "File Number A A% B B% Null Null%";

        my $default = '';   # set to something sensible; the empty string seems good
        my @allowed = ( qw/stringa stringb/, $default );
        my ( %count, $filenum );

        while (<>) {
            chomp;
            $count{ $_ ~~ @allowed ? $_ : $default }++;
            if (eof) {
                $filenum++;
                say "$filenum ", join ' ' =>
                    map { my $x = $count{$_}; $x, sprintf( '%.2f', $x / 1000 ) } @allowed;
                @count{@allowed} = (0) x @allowed;
            }
        }
        __END__

        I threw in some 5.10-isms in the course of doing so, but it wouldn't be terribly different in pre-5.10 perls.
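
        For the record, since the ~~ smart match is a 5.10 feature, the classification step could be done pre-5.10 with a lookup hash instead; a minimal sketch (reusing the names from the snippet above; bucket_for is a hypothetical helper):

        ```perl
        #!/usr/bin/perl
        use strict;
        use warnings;

        my $default = '';    # same default bucket as in the snippet above
        my %allowed = map { $_ => 1 } ( 'stringa', 'stringb', $default );

        # Classify one input line: known strings keep their own bucket,
        # anything else falls into the default bucket.
        sub bucket_for {
            my ($line) = @_;
            return exists $allowed{$line} ? $line : $default;
        }
        ```

        Inside the while loop, $count{ bucket_for($_) }++ would then replace the smart-match expression.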

        --
        If you can't understand the incipit, then please check the IPB Campaign.
      Actually, I'm running this script on a Mac running OS/X. Is the script still the same?
Re: Piping many individual files into a single perl script
by ikegami (Patriarch) on Sep 28, 2008 at 01:13 UTC

    As an alternative to iterating over @ARGV and opening each file yourself, you could use a combination of <> and $ARGV:

    my %total;
    my %by_file;

    while (<>) {
        chomp;
        $total{$_}++;
        $by_file{$ARGV}{$_}++;
    }

    A simple output routine:

    my @strings = sort keys %total;

    for my $string ( @strings ) {
        print( "\t$string" );
    }
    print( "\n" );

    for my $file ( keys %by_file ) {
        print( $file );
        for my $string ( @strings ) {
            print( "\t$by_file{$file}{$string}" );
        }
        print( "\n" );
    }

    The file names are passed to the script in the same fashion as in my first post.

Re: Piping many individual files into a single perl script
by ikegami (Patriarch) on Sep 28, 2008 at 01:05 UTC

    I'm at a loss as to how to pipe each of the input files into the program.

    program.pl file*

    which is the same as

    program.pl file1 file2 ...

    Then get the names of the files from @ARGV
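
    A minimal sketch of that approach, assuming the OP's "stringa" as the target string (count_matches is a hypothetical helper name, not part of the OP's code):

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical helper: count the lines in one file that equal a target string.
    sub count_matches {
        my ( $file, $target ) = @_;
        open( my $fh, '<', $file ) or do { warn "can't open '$file': $!\n"; return };
        my $count = 0;
        while ( my $line = <$fh> ) {
            chomp $line;
            $count++ if $line eq $target;
        }
        close $fh;
        return $count;
    }

    # One result line per file named on the command line:
    for my $file (@ARGV) {
        my $n = count_matches( $file, 'stringa' );
        print "$file\t$n\n" if defined $n;
    }
    ```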

    Update: Oops, not quite the same. On Windows, you'll need to do

    use File::Glob qw( bsd_glob );
    @ARGV = map bsd_glob($_), @ARGV;

      As a very minor side note: since the OP mentions something like "10000 files", I should add that some shells do have problems with a large number of files, although I can't remember exactly how large "large" is. I have occasionally seen error messages along the lines of "command line too long". In that case, adopting the same technique as for Windows would be a cure...

      use File::Glob qw( bsd_glob );
      @ARGV = map bsd_glob($_), @ARGV;

      This sounds very wrong [Update: but it is right, see my reply to ikegami's comment] unless one has a good reason to do so: since we're on Windows, we most probably want DOS/Windows-like globbing, and glob is dwimmy enough to select its own correct "incarnation". In all of my scripts that may want globbing, whether written for Windows or "ported" (what a big word!) there from *NIX, I include the same code as BrowserUk's. Sometimes (depending on how "important" the app will be...) I also provide "standard" -i and -o command-line switches for input and output, since shell redirection has some very minor, but not null, deficiencies.


        I include the same code as BrowserUk's

        But that uses bsd_glob as well. And worse yet, a version that breaks when a space is present.

        It's not exactly a number of files, it's a number of characters :) For example, on WinXP it's 8191; on Win2k/NT4 it's 2047.

Node Type: perlquestion [id://714088]
Approved by ikegami