Perl program to search files

harangzsolt33 has asked for the wisdom of the Perl Monks concerning the following question:

I have created a little perl script that searches documents for a kind of binary string and lists the files that contain that exact string. Can you make suggestions on how I could improve this program or what I could have done to make this better? Also, if you see some obvious bugs in my program, would you please tell me? Thank you! Here is my script:

Edit: Okay, as requested, I am inserting the code here instead of just a link. And thank you for your ideas and advice. I will read through them more carefully tomorrow and try to implement them! Thank you!


#!/usr/bin/perl

#
# This Perl program looks for a certain binary string
# in a file and lists the files found and the total number
# of files found. I tested this script in TinyPerl 5.8.
#
#################################################################

use strict;
use warnings;

my $FIND            = ".\nThis is";
my $CASE_SENSITIVE  = 0;
my $FOLDER          = 'H:\\HOME';
my $RECURSIVE       = 1;
my @INCLUDE         = qw( *.txt *.doc *.xls *.ppt *.log *.htm *.html *
+.xml *. *.eml *.pdf *.rtf *.bas *.c *.h *.pl *.cgi *.bak *.reg *.ini 
+*.inf *.ion *.1st *.lst *.pas *.py *.cpp *.php *.xfm *.js *.asm *.rt 
+*.old );
my $VERBOSE         = 0;
my $PAGE_BREAK      = 0;

################################################################# Main
###
###    PROGRAM STARTS HERE:
##
#

my $TOTAL = 0;
my $BUFF_SIZE = 4000000;
my $LINUX = (index(uc($^O), 'WIN') < 0) ? 1 : 0;
$CASE_SENSITIVE or $FIND = uc($FIND);
CheckDIR($FOLDER);
print "\n", '-' x 72;
print "\n $TOTAL file(s) found with matching string\n";
exit;

################################################################# Sear
+chFile
#
# This function is automatically called by CheckDIR() every time
# a file is found. This function gets the path & name of the file.
#
#   Usage: SearchFile(PATH, NAME) <-- Called by CheckDIR()
#
sub SearchFile
{
  my $PATH = $_[0];
  my $NAME = $_[1];
  my $FULLNAME = "$PATH$NAME";

  (-f $FULLNAME) or return;  # not a plain file?
  my $L = -s $FULLNAME;      # Get file size
  $L or return;              # file size == 0?

  # Is the search string longer than
  # the file we're searching?
  $L >= length($FIND) or return;

  my $F;
  my $START = 0;
  my $FOUND = 0;
  my $BUFFER;
  open $F, "<$FULLNAME" or return;
  binmode $F;
  while ($START < $L)
  {
    if ($START > $L) { last; }
    read $F, $BUFFER, $BUFF_SIZE, $START;
    if (defined $BUFFER)
    {
      if (!$CASE_SENSITIVE) { $BUFFER = uc($BUFFER); }
      if (index($BUFFER, $FIND) >= 0)
      { $FOUND++; last; }
    }
    else
    { last; }
    if (length($BUFFER) < $BUFF_SIZE) { last; }
    $START += $BUFF_SIZE - length($FIND);
  }
  close $F;
  if ($FOUND)
  {
    $TOTAL++;
    if ($PAGE_BREAK)
    {
      if ($TOTAL % $PAGE_BREAK == 0)
      { print "\nPress ENTER to continue..."; <STDIN>; }
    }
    PRINT("FOUND: $FULLNAME");
  }
}
################################################################# Chec
+kDIR
#
# This function reads the contents of a folder and calls
# SearchFile() for each file that was found.
#
#   Usage: CheckDIR(PATH)
#
sub CheckDIR
{
  @_ or return;
  my $FULLNAME;
  my $PATH = shift;
  defined $PATH or return;
  length($PATH) or return;

  if ($VERBOSE) { PRINT("SEARCHING: ", FormatPath($PATH)); }
  $PATH = AddSeparator($PATH);
  opendir(my $DIR, $PATH) or return;
  while (my $NAME = readdir $DIR)
  {
    $FULLNAME = "$PATH$NAME";
    if (-d($FULLNAME))
    {
      # Check into subdirectory if RECURSIVE == 1
      # Skip directory if its name starts with "."
      if ($RECURSIVE)
      { CheckDIR($FULLNAME) unless (vec($NAME, 0, 8) == 46); }
      next;
    }

    # Scan file contents only if the file name
    # matches a certain pattern.
    foreach my $W (@INCLUDE)
    {
      if (isMatch($NAME, $W))
      { SearchFile($PATH, $NAME); }
    }
  }
  closedir $DIR;
}
################################################################# isMa
+tch
#
# This function returns 1 if a filename matches a certain
# wildcard pattern. There may be several question marks in
# the search pattern, but only one asterisk is allowed!
# The asterisk may be at the beginning, the end, or in the
# middle of the pattern. A question mark matches any
# character. An asterisk matches zero or more characters.
# The matching is NOT case sensitive!
#
#   Usage: INTEGER = isMatch(FILENAME, WILDCARD)
#
# Example: isMatch("New_Document.txt", "n*.txt") ---> 1
#
sub isMatch
{
  @_ > 1 or return 0;
  my $F = shift; defined $F or return 0; length($F) or return 0;
  my $W = shift; defined $W or return 0; length($W) or return 0;
  $F = uc($F);
  $W = uc($W);

  # If there aren't any wildcards at all...
  if (CountChars($W, '*?') == 0) { return ($F eq $W) ? 1 : 0; }

  # Match what's before the asterisk...
  return 0 unless (_isMatch($F, $W, 1));

  # Match what comes after the asterisk...
  return _isMatch($F, $W, -1);
}
################################################################# _isM
+atch
#
# This private function is called by isMatch()
# This function compares two strings and returns 1 if
# both strings match until the first asterisk.
# This function can start comparing strings starting
# from the beginning or starting from the end!
# DIRECTION must be either 1 or -1.
#
#   Usage: INTEGER = _isMatch(FILENAME, WILDCARD, DIRECTION)
#
sub _isMatch
{
  my $F = shift; my $f; my $LF = length($F)-1;
  my $W = shift; my $w; my $LW = length($W)-1;
  my $DIR = shift;
  my $STOP = $LW; my $START = 0; my $FSTART = 0;
  if ($DIR < 0) { $STOP = 0; $START = $LW; $FSTART = $LF; }
  while ($START != $STOP)
  {
    $w = vec($W, $START, 8);  # Grab byte from wildcard pattern
    $f = vec($F, $FSTART, 8); # Grab byte from filename
    $START += $DIR;
    $FSTART += $DIR;
    if ($w == 42)   # ASTERISK?
    { return 1; }
    else
    {
      # If the character is "?" then skip, but if
      # it's not "?", then the characters must match.
      if ($w != 63) { ($f == $w) or return 0; }
    }
  }
  return 1;
}
################################################################# PRIN
+T
#
# This function prints the total number of matches found
# so far followed by some text and a new line character.
#
#   Usage: PRINT(STRING)
#
sub PRINT { print ' ', RSPACE(8, $TOTAL), @_, "\n"; }
################################################################# Form
+atPath
#
# This function replaces the \\ or / characters in a path
# to match the current OS's separator character.
#
#   Usage: PATH = FormatPath(PATH)
#
sub FormatPath
{
  @_ or return '';
  my $PATH = $_[0];
  length($PATH) or return '';
  if ($LINUX)
  { $PATH =~ tr|\\|/|; }
  else
  { $PATH =~ tr|/|\\|; }
  return $PATH;
}
################################################################# AddS
+eparator
#
# This function makes sure that the given path ends with
# a / or \ depending on the current OS.
#
#   Usage: PATH = AddSeparator(PATH)
#
sub AddSeparator
{
  @_ or return '';
  my $PATH = $_[0];
  length($PATH) or return '';
  my $c = vec($PATH, length($PATH)-1, 8);
  return FormatPath($PATH . (($c == 47 || $c == 92) ? '' : '/') );
}
################################################################# RSPA
+CE
#
# This function adds spaces to the end of a string list to
# make sure it's exactly N characters long. If the string
# is longer than N, then just returns the string itself.
#
#   Usage: STRING = RSPACE(N, STRING, [STRINGS...])
#
sub RSPACE
{
  @_ > 1 or return '';
  my $LEN = shift;
  my $S = join('', @_);
  my $L = length($S);
  $L < $LEN or return $S;
  return $S . (' ' x ($LEN - $L));
}
################################################################# Coun
+tChars
#
# This function counts how many times STRING contains
# any of the characters of SUBSTR.
#
#   Usage: INTEGER = CountChars(STRING, SUBSTR)
#
sub CountChars
{
  @_ > 1 or return 0;
  my $S = shift; defined $S or return 0; length($S) or return 0;
  my $L = shift; defined $L or return 0; length($L) or return 0;
  my $P;
  my $c;
  my $i = length($L);
  my $N = 0;
  while ($i-- > 0)
  {
    $P = 0;
    $c = substr($L, $i, 1);
    while (($P = 1+index($S, $c, $P)) > 0)
    { $N++; }
  }
  return $N;
}
#################################################################
[download]

Btw I tested the program, and it seems to work fine.

Comment on Perl program to search files Download Code

Replies are listed 'Best First'.

Re: Perl program to search files
by haukex (Archbishop) on Dec 26, 2018 at 10:07 UTC

if you see some obvious bugs in my program, would you please tell me?

Unfortunately your algorithm suffers from an issue: when the text to be found happens to be split across two buffers, the match is not found. To demonstrate, try setting $BUFF_SIZE = 4096, and then generating a test file with e.g. perl -e 'print "x"x4095, ".\nThis is", "x"x5000' >test.txt - the match won't be found, but if you either set $BUFF_SIZE = 4095, or change the "x"x4095 to "x"x4096, the match is found. A typical solution to this kind of problem would be to implement a sliding window (Update 2: see reply by vr below), making sure that the buffers are large enough so that the search string can be contained in them. Alternatively, if the files are always large enough to fit into memory, of course it's possible to just load the entire file into memory at once. You might want to have a look at these nodes, for example: Matching in huge files and Re: Search hex string in vary large binary file.

Can you make suggestions on how I could improve this program or what I could have done to make this better?

As Athanasius already mentioned, there are some things you're re-implementing. This is fine as a learning exercise, but I think it also helps to be aware of these kinds of things. It also gives you something to test your own implementations against (which I would recommend doing).

sub CountChars: How can I count the number of occurrences of a substring within a string?
sub RSPACE and sub PRINT: sprintf/printf (as mentioned by Athanasius)
sub AddSeparator and sub FormatPath: the core module File::Spec, in this case especially catdir and catfile (I prefer File::Spec::Functions; there's also the CPAN module Path::Class)
sub isMatch: maybe Text::Glob (there are a few others on CPAN)
sub CheckDIR: the core module File::Find (or maybe File::Find::Rule from CPAN)

In addition, I agree that variable names in all uppercase should be reserved for constants, IMO including variables that get set once at the top of the program and aren't supposed to be changed, so e.g. IMO $LINUX is fine, but e.g. $START is IMO not a good choice. Also I agree that magic numbers aren't good, although I would take it a step further, e.g. my $ASTERISK = ord("*");, my $DOT = ord(".");, etc. Some more descriptive variable names would be helpful in understanding the code as well, e.g. sub CountChars is all one-letter variables.

A couple of other thoughts/issues:

my $LINUX = (index(uc($^O), 'WIN') < 0) ? 1 : 0; - this will misdetect darwin as Windows (or rather, "not Linux", despite it being a *NIX OS), and there are lots of other OSes that are neither Linux nor Windows. You might just want to stick to checking against "MSWin32" - but as I said above really the best solution here IMO is to leave the handling of filenames to a module.
opendir(my $DIR, $PATH) or return; and open $F, "<$FULLNAME" or return; - you might want to consider reporting this to the user instead of silently skipping files/dirs.
You might want to have a look at Getopt::Long to be able to handle command-line arguments instead of editing the variables in the source.
Update: Instead of a long list of globs in @INCLUDE, you could build a single regex to match against the filenames. See e.g. Building Regex Alternations Dynamically.

[reply]
[d/l]
[select]

Re^2: Perl program to search files

by vr (Curate) on Dec 26, 2018 at 12:12 UTC

Minor note:

A typical solution to this kind of problem would be to implement a sliding window

But he is trying to slide:

$START += $BUFF_SIZE - length($FIND);

Unfortunately, it seems that his 4th argument to read is understood as "offset into file", while, of course, it is "offset into buffer". If this

read $F, $BUFFER, $BUFF_SIZE, $START;

is replaced with

seek $F, $START, 0;
read $F, $BUFFER, $BUFF_SIZE;
[download]

then his sliding window works.

[reply]
[d/l]
[select]

Re^3: Perl program to search files

by haukex (Archbishop) on Dec 26, 2018 at 12:47 UTC

But he is trying to slide

Ah, you are correct, thank you!

[reply]

Re: Perl program to search files
by Athanasius (Archbishop) on Dec 26, 2018 at 04:31 UTC

Hello harangzsolt33,

One thing I notice is that your sub RSPACE is essentially duplicating part of the functionality already present in Perl’s printf and sprintf builtin functions. In fact, your functions RSPACE and PRINT can be removed altogether, and the two places where PRINT is called can be written as follows:

# (1) in sub SearchFile:

PRINT("FOUND: $FULLNAME");                       # original
printf " %-8dFOUND: %s\n", $TOTAL, $FULLNAME;    # replacement

# (2) in sub CheckDIR:

if ($VERBOSE) { PRINT("SEARCHING: ", FormatPath($PATH)); }            
+   # original
printf " %-8dSEARCHING: %s\n", $TOTAL, FormatPath($PATH) if $VERBOSE; 
+   # replacement
[download]

Note the use of a statement mofifier in the second case. In general, statement modifiers should be preferred over compound statements whenever a single statement is involved, as this simplifies the code.

Some of your coding conventions could also be improved (IMHO):

The use of all-uppercase variable names should be reserved for constants.

Magic numbers (even ASCII values) should be replaced by constants:

use Const::Fast;

const my $ASTERISK => 42;
const my $DOT      => 46;
const my $QUESTION => 63;
# etc.
[download]

In a script intended to be called directly from the command line, and not used as a module, the distinction between “public” and “private” functions is pointless. (Conversely, if the script is intened to be called as a module, it should begin with an appropriate package declaration.)

I haven’t studied your code in detail. Overall, it looks good: well thought-out and implemented.

Regarding posting:
(1) It would be better to include your code in the question itself (inside <readmore> tags!) than to provide a web address, as the latter is likely to become invalid and so render this thread incomprehensible to future readers.
(2) You say you have tested the program. Adding your test code (again, in <readmore> tags, of course) would aid the monks in evaluating your code.

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

[reply]
[d/l]
[select]

Re^2: Perl program to search files

by Tux (Canon) on Dec 28, 2018 at 13:18 UTC

Note the use of a statement modifier in the second case. In general, statement modifiers should be preferred over compound statements whenever a single statement is involved, as this simplifies the code.

I STRONGLY oppose to this statement. This might be true for you, but I personally hate statement modifiers used for this, as it causes me to read code backwards, which is a maintenance nightmare.

IF simplification to one-liners is required, I use it the other way round (TIMTOWTDI)

if ($VERBOSE) { PRINT ("SEARCHING: ", FormatPath ($PATH)); }          
+    # original

$VERBOSE and printf "" %-8dSEARCHING: %s\n", $TOTAL, FormatPath ($PATH
+);  # Readable and understandable

printf " %-8dSEARCHING: %s\n", $TOTAL, FormatPath ($PATH) if $VERBOSE;
+    # unmaintanable
[download]

Enjoy, Have FUN! H.Merijn

[reply]
[d/l]

Re^3: Perl program to search files

by eyepopslikeamosquito (Archbishop) on Dec 28, 2018 at 23:05 UTC

Agree. This seems to be a matter of personal taste as indicated by the discussion at: A matter of style: how to perform a simple action based on a simple condition?

In addition to preferring block-if to postfix-if I always use the multi-line form of block-if. Why? Consider changing:

if ($VERBOSE) {
   PRINT("SEARCHING ...");
}
[download]

if ($VERBOSE) {
   # Do something else here ... or just add a clarifying comment
   PRINT("SEARCHING ...");
}
[download]

block-if

one line

postfix-if

[reply]
[d/l]
[select]

Re: Perl program to search files
by Anonymous Monk on Dec 28, 2018 at 14:13 UTC

A great example of how not to write even basic programs. You code also doesn't do what you claim it does, throw in reinventing the wheel, and making it worse than established methods.

[reply]


Problems? Is your data what you think it is?
	PerlMonks