http://qs321.pair.com?node_id=1227704


in reply to Perl program to search files

if you see some obvious bugs in my program, would you please tell me?

Unfortunately your algorithm suffers from an issue: when the text to be found happens to be split across two buffers, the match is not found. To demonstrate, try setting $BUFF_SIZE = 4096, and then generating a test file with e.g. perl -e 'print "x"x4095, ".\nThis is", "x"x5000' >test.txt - the match won't be found, but if you either set $BUFF_SIZE = 4095, or change the "x"x4095 to "x"x4096, the match is found. A typical solution to this kind of problem would be to implement a sliding window (Update 2: see reply by vr below), making sure that the buffers are large enough so that the search string can be contained in them. Alternatively, if the files are always large enough to fit into memory, of course it's possible to just load the entire file into memory at once. You might want to have a look at these nodes, for example: Matching in huge files and Re: Search hex string in vary large binary file.

Can you make suggestions on how I could improve this program or what I could have done to make this better?

As Athanasius already mentioned, there are some things you're re-implementing. This is fine as a learning exercise, but I think it also helps to be aware of these kinds of things. It also gives you something to test your own implementations against (which I would recommend doing).

In addition, I agree that variable names in all uppercase should be reserved for constants, IMO including variables that get set once at the top of the program and aren't supposed to be changed, so e.g. IMO $LINUX is fine, but e.g. $START is IMO not a good choice. Also I agree that magic numbers aren't good, although I would take it a step further, e.g. my $ASTERISK = ord("*");, my $DOT = ord(".");, etc. Some more descriptive variable names would be helpful in understanding the code as well, e.g. sub CountChars is all one-letter variables.

A couple of other thoughts/issues:

Replies are listed 'Best First'.
Re^2: Perl program to search files
by vr (Curate) on Dec 26, 2018 at 12:12 UTC

    Minor note:

    A typical solution to this kind of problem would be to implement a sliding window

    But he is trying to slide:

    $START += $BUFF_SIZE - length($FIND);

    Unfortunately, it seems that his 4th argument to read is understood as "offset into file", while, of course, it is "offset into buffer". If this

    read $F, $BUFFER, $BUFF_SIZE, $START;

    is replaced with

    seek $F, $START, 0; read $F, $BUFFER, $BUFF_SIZE;

    then his sliding window works.

      But he is trying to slide

      Ah, you are correct, thank you!