Beefy Boxes and Bandwidth Generously Provided by pair Networks
Come for the quick hacks, stay for the epiphanies.
 
PerlMonks  

Immediately writing the results of search-and-replace

by Anonymous Monk
on Aug 05, 2022 at 17:56 UTC ( [id://11145971]=perlquestion: print w/replies, xml ) Need Help??

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Most gracious monks: I am a Perl novice working on a script that takes a CSV file as input, uses the first element of each row as a string to search on, and the second element of each row as the replacement string. In the event that one search term appears in multiple rows, these are merged together so that the second element becomes an array of replacement candidates, reduced down to unique items. In either case, the user is prompted to adjudicate the replacement (either yes/no for a single replacement candidate, or a numeric input for multiple). I have the core functionality working, but some problems crop up when doing replacements on search strings for which other search strings are substrings.

For example, if my script finds that a file contains the search string 'IP whitelist' on a given line, even if I confirm that I want to replace this with 'IP access list', it will then prompt me to replace just 'whitelist' with one of its various candidates. What I would expect to happen in that case is that the replacement of 'IP whitelist' with 'IP access list' happens before the search for the 'whitelist' key is initiated, preventing it from finding a match there. Doing some digging, it seemed that setting up autoflushing would solve my problem, but either I've misunderstood and that is not a solution to the situation, or I have implemented it incorrectly. Here is the subroutine that performs the actual search-and-replace:
sub search_and_replace { open my $target_file, "<", $_; $| = 1; my $filename = $_; while (my $target_string = <$target_file>) { for my $row (@table) { my $search = $row->[0]; my $replacement = $row->[1]; if ((lc $target_string) =~ (lc $search)) { print "Found $search in $filename in the following context +:\n"; print "$target_string\n"; if (ref($replacement) eq 'ARRAY' && length $replacement > +1) { print "Choose a replacement candidate by typing the appr +opriate number, or else type '0' to skip.\n"; my $count = 1; for my $value (@$replacement) { print "$count\) $value\n"; $count++; } my $choice = <STDIN>; if ($choice >= 1 && $choice <= (length $replacement)) { my $replace_choice = $replacement->[$choice]; edit_file { s/$search/$replace_choice/gi} $filename; } else { print "Skipping this occurrence without replacing.\n"; } } else { print "Confirm replacement with $replacement by typing y +, or skip by typing any other key.\n"; my $choice = <STDIN>; if ($choice eq 'y' || 'Y') { edit_file { s/$search/$replacement/gi } $filename; } else { print "Skipping this occurrence without replacing.\n"; } } } } } close $target_file; }
1) I was under the impression that autoflushing on the currently open filehandle can be set using the $| variable, but if that's the case, why doesn't this have any affect?
2) Is refactoring this subroutine so that it loops over the rows of the tables first, and then over each file in the inner loop a better solution? It seems like a lot more IO to be opening and closing every file over and over for each term, but I'm not a real programmer by any stretch, so I could be way off the mark here.
3) I also tried putting the line $target_file->flush; as the last line of each of the replacement 'if' statements, and that did nothing either.

Please let me know if there is any additional info I can provide that would help. Thank you very much.

Replies are listed 'Best First'.
Re: Immediately writing the results of search-and-replace
by haukex (Archbishop) on Aug 05, 2022 at 18:29 UTC
    I was under the impression that autoflushing on the currently open filehandle can be set using the $| variable, but if that's the case, why doesn't this have any affect?

    No, not quite. The documentation of $| says: "If set to nonzero, forces a flush right away and after every write or print on the currently selected output channel. ... See select on how to select the output channel." So in your code, it's likely just changing the autoflush of the default output channel, STDOUT. Unless you're on an ancient Perl, the easiest way to turn on autoflushing for a handle is $filehandle->autoflush(1); (for the nitpickers out there: the "1" isn't strictly necessary but it makes the statement look more like the other IO::Handle methods).

    However, from your problem description, I doubt your issue has to do with autoflushing at all!

    For example, if my script finds that a file contains the search string 'IP whitelist' on a given line, even if I confirm that I want to replace this with 'IP access list', it will then prompt me to replace just 'whitelist' with one of its various candidates. What I would expect to happen in that case is that the replacement of 'IP whitelist' with 'IP access list' happens before the search for the 'whitelist' key is initiated, preventing it from finding a match there. ... Is refactoring this subroutine so that it loops over the rows of the tables first, and then over each file in the inner loop a better solution? It seems like a lot more IO to be opening and closing every file over and over for each term, but I'm not a real programmer by any stretch, so I could be way off the mark here.

    Unfortunately, you haven't shown us a full example, so for example we can't see how @table is built or what its contents are - see Short, Self-Contained, Correct Example.

    Since you're prompting the user for input, my tutorial Building Regex Alternations Dynamically isn't 100% applicable to your situation. However, I still suggest you read that node as it contains important advice applicable to this situation, for example regarding word boundaries (\b) or sorting your matches to match longer strings before shorter ones, which could be the cause of the issue you describe above. You might consider building a regex from all the potential search strings, and then building a hash where the search strings are the keys and the values are the possible replacements.

    Another thing your code doesn't show is what edit_file is (again, SSCCE), but if it's opening the files and rewriting the files on every call, then yes, that's not particularly efficient. If the files you are editing will always be small enough to fit comfortably in memory, you may want to consider simply loading them as strings into memory and then editing them in memory, writing them back out when done.

    If you were to show your efforts as a complete, runnable example with representative sample input and expected output for that input, I'm sure we could help more.

      Hi haukex; I've created an account so that I can edit things going forward, but since I cannot edit the original post, please forgive posting the full code in a reply:
      use strict; use warnings; use diagnostics; use Text::CSV_XS; use File::Find; use File::Slurp qw(edit_file); use List::Util qw(uniq); my $spreadsheet = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 }); # Consume the first argument (the spreadsheet input) and push its # contents to an array of arrays. open my $fh, "<", $ARGV[0]; my @table; while (my $row = $spreadsheet->getline($fh)) { push @table, $row; } close $fh; # Sort the array-of-arrays by the length of the # first element of each subarray (our 'keys') such # that the longest key comes first, and the # shortest comes last. Then sort alphabetically. @table = sort { length $b->[0] <=> length $a->[0] || $a->[0] cmp $b->[ +0] } @table; # CSVs are received from another department with # three columns, where the second denotes part of # speech. This isn't needed for this operation, so # discard it. for my $row (@table) { splice(@$row,1,1); $row->[0] = quotemeta($row->[0]); } my @rows_to_cull = (); # Pairwise comparison of rows to detect duplicate keys. # Sort having been performed earlier, only pairwise comparisons # of adjacent rows will result in matches. for my $row_a (0..$#table) { my $key = $table[$row_a]->[0]; my @values = $table[$row_a]->[1]; for my $row_b ($row_a+1..$#table) { if ( $table[$row_a]->[0] eq $table[$row_b]->[0] ) { push @values, $table[$row_b]->[1]; $table[$row_a]->[1] = \@values; push @rows_to_cull, $row_b; } else { last; } } } # Convert the array to a hash to ensure uniqueness, # then back to an array for traversal, sorting the # array in descending numeric order so removal of # earlier rows doesn't interfere with indexing for # later rows. my %cull_hash = map { $_ => 1 } @rows_to_cull; my @cull_array = keys %cull_hash; @cull_array = sort { $b <=> $a } @cull_array; for (@cull_array) { splice(@table,$_,1); } # Loop which ensures the uniqueness of elements in the # replacement candidate arrays. for my $row (0..$#table) { my $replacement = $table[$row]->[1]; if (ref($replacement) eq 'ARRAY' && length $replacement > 1) { my @unique = uniq(@$replacement); $table[$row]->[1] = \@unique; } } # The following takes the second argument from the # command line as the directory to operate on. It # also identifies a filter subroutine to exclude # non-documentation files from the checks, and a # subroutine that will actually do the search-and-replace. find({ preprocess => \&filter, wanted => \&search_and_replace }, $ARGV[1] ); sub filter { return grep { -d or (-f and ( /\.txt$/ or /\.rst$/ or /\.yaml$/))} + @_; } # Main loop sub search_and_replace { open my $target_file, "<", $_; $| = 1; my $filename = $_; while (my $target_string = <$target_file>) { for my $row (@table) { my $search = $row->[0]; my $replacement = $row->[1]; if ((lc $target_string) =~ (lc $search)) { print "Found $search in $filename in the following context +:\n"; print "$target_string\n"; if (ref($replacement) eq 'ARRAY' && length $replacement > +1) { print "Choose a replacement candidate by typing the appr +opriate number, or else type '0' to skip.\n"; my $count = 1; for my $value (@$replacement) { print "$count\) $value\n"; $count++; } my $choice = <STDIN>; if ($choice >= 1 && $choice <= (length $replacement)) { my $replace_choice = $replacement->[$choice]; edit_file { s/$search/$replace_choice/gi} $filename; } else { print "Skipping this occurrence without replacing.\n"; } } else { print "Confirm replacement with $replacement by typing y +, or skip by typing any other key.\n"; my $choice = <STDIN>; if ($choice eq 'y' || 'Y') { edit_file { s/$search/$replacement/gi } $filename; } else { print "Skipping this occurrence without replacing.\n"; } } } } } close $target_file; }
      As for the input, here are some example rows from the CSV file itself, pre-sorted:
      whitelist entries,Noun,access list entries IP whitelist,Noun,IP access list IP whitelist entries,Noun,IP access list entries whitelist,Adjective,allow whitelist,Noun,access list whitelist,Noun,access-list whitelist,Verb,allow your whitelist,Noun,your access list
      As far as output goes, the sort should (and currently does) result in an entry like 'IP whitelist' being checked for before 'whitelist' itself, and 'IP whitelist entries' before 'IP whitelist; so I would expect that after I confirm wanting to replace 'IP whitelist entries' with 'IP access list entries', I would NOT be subsequently prompted to replace 'IP whitelist' within that instance of 'IP whitelist entries', because 'whitelist' would have already been replaced there. Similarly, I wouldn't want to be prompted to replace 'whitelist' alone in that spot for the same reason. Please let me know what more might be needed. I have started looking at the tutorial you linked and it seems promising.

      EDIT: Corrected formatting errors due to copy-pasting code block with long lines.

        Thanks for providing additional context.

        if my script finds that a file contains the search string 'IP whitelist' on a given line, even if I confirm that I want to replace this with 'IP access list', it will then prompt me to replace just 'whitelist' with one of its various candidates. What I would expect to happen in that case is that the replacement of 'IP whitelist' with 'IP access list' happens before the search for the 'whitelist' key is initiated, preventing it from finding a match there.

        What is actually going on (and what I didn't realize on my first reading) is that you're reading a line from the input file into the $target_string variable, but once that line is in that variable, it never changes, regardless of whether you edit the file it came from, since the line was copied from the file into memory. That's why the regex engine continues to find matches in $target_string even after you've made the edit in the file.

        The common pattern to use in Perl instead is to write an output file with the modified lines while reading the input file. Since in your code, you seem to want to do in-place edits of the file, see my node on that here. The slurp-then-spew approach is fairly easy if your files will always fit comfortably into memory; for line-by-line editing I might recommend my own module File::Replace (for examples of line-by-line editing, see the module's documentation instead of the aforementioned node).

        In other words, instead of your edit_file calls, you'd do something like $target_string =~ s/$search/$replace_choice/gi, and then build your output file by writing out each $target_string line. However, there is still more trickyness here: If you have a line such as "Edit the whitelist by pressing the edit whitelist button." or something like that, then because of the /g modifier on that regex, you'd only be presented with the choice for what term to replace "whilelist" with once, and that replacement would be carried out for the entire line. I kind of doubt that's a desired behavior for you.

        BTW, are you sure your search terms will always be on one line, or could strings like "whitelist entries" be split on two lines? That would also require an advanced approach.

        These are examples for reasons why I asked for representative sample input along with the expected output for that input, and in this case representative also means that it should include all of such "interesting" cases. While I'd like to provide sample code or at least more specific advice on how to fix your code, without knowing the answers to questions like these, I run the risk of writing something that ends up not working on your actual data. Extensive test cases are very useful!

        What I am currently thinking is that it should be possible to do something like s{ ($search) }{ get_replacement($1) }xeg where sub get_replacement does the prompting of the user and returns the replacement term (or the original string if the user doesn't want to replace). $search can even be a regex of all of the search terms, built dynamically.

        A few more things I noticed in your code:

        • Check your opens for errors - "open" Best Practices
        • Even though you're using File::Find, for better modularity I would use regular argument passing to sub search_and_replace, and write e.g. wanted => sub { search_and_replace(\@table, $_) }
        • It seems to me like your @table could be replaced by a hash instead of all the duplicate key detection code you currently have. Keys in Perl hashes are unique, and such tables are easy to build with Perl's autovivification - if I treat a nonexistent hash entry as if it were an array reference, it will automagically become an array reference. Note that since you want to do case-insenstive matching I'm using fc, which is better than lc in case of Unicode. Building a hash of arrays using your sample data:
          use warnings; use strict; use feature qw/fc/; use Text::CSV_XS; use List::Util qw/uniq/; use Data::Dump; my $csvfile = $ARGV[0]; my %table; open my $fh, '<', $csvfile or die "$csvfile: $!"; my $csv = Text::CSV_XS->new({binary=>1, auto_diag=>2}); while ( my $row = $csv->getline($fh) ) { push @{$table{ fc($row->[0]) }}, $row->[2]; } $csv->eof or $csv->error_diag; close $fh; $_ = [uniq @$_] for values %table; dd \%table; __END__ { "ip whitelist" => ["IP access list"], "ip whitelist entries" => ["IP access list entries"], "whitelist" => ["allow", "access list", "access-list"], "whitelist entries" => ["access list entries"], "your whitelist" => ["your access list"], }
        • As before, instead of looping over @tables (or %tables), building a single regex should be more efficient.
        • You may wish to consider modules for prompting instead of rolling your own.
        • (It's Friday evening here so I didn't scan the rest of your code in detail, so other Monks may have further tips.)

        G'day elemkeh,

        Welcome to the Monastery.

        The way you've handled transitioning from anonymous to real user is fine: no need to apologise. There is, however, a small problem.

        Long lines within <code>...</code> tags are wrapped. The continuation is highlighted with a red (default) '+'. You've copied code which has four wrapped lines and these now appear as actual code:

        • +:\n";
        • +1) {
        • +opriate number, or else type '0' to skip.\n";
        • +, or skip by typing any other key.\n";

        For future reference, follow the [download] link to a plain text version; then copy and paste from there.

        As you rightly state, being logged in, you can edit your post and make four minor changes to fix this. Do leave a note indicating that you've made a change — "How do I change/delete my post?" has more about that.

        — Ken

        Some general observations on the code:

        • if (ref($replacement) eq 'ARRAY' && length $replacement > 1) { ... }

          length always operates on a string or stringized expression. A stringized reference looks something like "ARRAY(0x123abc)", so the expression length $replacement > 1 will always be true if $replacement is a reference.

          Win8 Strawberry 5.8.9.5 (32) Fri 08/05/2022 17:16:08 C:\@Work\Perl\monks >perl use strict; use warnings; my $arrayref = [ ]; # ref. to zero length array print "$arrayref \n"; print "length non-zero \n" if length $arrayref > 1; ^Z ARRAY(0x613824) length non-zero
          The number of elements of an array is found by evaluating the array in scalar context (update: see Context and subsequent topics in perldata) or, in newer Perls (5.12+), by the keys built-in evaluated in scalar context (check your documentation).
          Win8 Strawberry 5.30.3.1 (64) Fri 08/05/2022 17:39:54 C:\@Work\Perl\monks >perl use strict; use warnings; my $arrayref = [ 9, 8, ]; # ref. to NON-zero length array printf "number of elements of referenced array: %d \n", scalar @$ +arrayref; printf "number of elements of referenced array: %d \n", 0 + @$ +arrayref; printf "number of elements of referenced array: %d \n", scalar keys @$ +arrayref; print "number of elements > 1 \n" if @$arrayref > 1; print "number of elements > 1 \n" if keys @$arrayref > 1; ^Z number of elements of referenced array: 2 number of elements of referenced array: 2 number of elements of referenced array: 2 number of elements > 1 number of elements > 1
          (Update: Just to be clear, a proper re-statement of the condition expression would be
              if (ref($replacement) eq 'ARRAY' && @$replacement > 1) { ... }
          The > numeric comparison imposes scalar context.)
        • if ($choice eq 'y' || 'Y') { ... }

          This condition evaluates as "if $choice is 'y' or if the string 'Y' is true", i.e., it's always true because 'Y' is always true. See What is true and false in Perl?

          An effective case-insensitive conditional test would be
              if (lc($choice) eq 'y') { ... }


        Give a man a fish:  <%-{-{-{-<

        Ok, I'm confused. Do you want to rewrite the entire file after each change (of course with the changed text) and then reload the file to do the next replace?
        Or do you just want to do a replace at a time (holding the text in memory) and only rewrite the entire file when you are done?

Re: Immediately writing the results of search-and-replace
by kcott (Archbishop) on Aug 06, 2022 at 04:30 UTC

    This looked like an interesting problem to get the mental juices flowing on a Saturday morning. I see you've already received much help so, rather than trying to modify your code, I decided to write it from scratch. A few notes on some the things I did follow; ask if you want to know about anything else.

    • The autodie pragma handles I/O exceptions so I don't have to worry about doing that. I note that you're not checking any I/O operations.
    • I used constants to specify the indices of the search and replace elements in the CSV file. If the CSV format ever changes, you'll probably only need to change one or two numbers; as opposed to recoding splice() statements in a couple of places.
    • I used IO::Prompter for the choice of replacement text where more than one option existed. That used a lot less code than what's curently in search_and_replace() and has many more features. Options are presented in the form of a menu and you only need to hit one key (i.e. no newline) to make a selection. I've also used colour: if you want to use that you may need to fiddle with it a bit (it looks good on my black background). That uses Term::ANSIColor; I've added extra colour in a few places, such as highlighting the text to be replaced.
    • Text::CSV will use Text::CSV_XS if it's installed; but the code will still work if it's not.
    • The code is modular and pretty much all the work is done in three, short subroutines: _process_csv(), _process_text() and _replace(). Also note that those routines only use variables that are passed to them: there's no action-at-a-distance, or similar problems, from global or file-scoped variables.

    Here's whitelist_sr.pl:

    #!/usr/bin/env perl use strict; use warnings; use autodie; use constant { SEARCH => 0, REPLACE => 2, }; use File::Copy; use List::Util 'uniq'; use IO::Prompter [ -single, -style => 'bold blue', -echostyle => 'bold magenta', ]; use Term::ANSIColor; use Text::CSV; # For testing: BEGIN { copy('whitelist_sr_original.txt', 'whitelist_sr.txt'); } my $csv_file = 'whitelist_sr.csv'; my $source_file = 'whitelist_sr_src.txt'; my $text_file = 'whitelist_sr.txt'; copy($text_file, $source_file); my ($sr, $re) = _process_csv($csv_file); _process_text($source_file, $text_file, $sr, $re); unlink $source_file; sub _process_text { my ($source_file, $text_file, $sr, $re) = @_; open my $src_fh, '<', $source_file; open my $txt_fh, '>', $text_file; while (my $line = <$src_fh>) { while ($line =~ /$re/g) { my $frag = $1; $line =~ s/$frag/_replace($line, $frag, $sr, $re)/e; } print $txt_fh $line; } return; } sub _replace { my ($line, $frag, $sr, $re) = @_; my $replace; if (@{$sr->{$frag}} == 1) { $replace = $sr->{$frag}[0]; } else { $line =~ /$re/; $line = colored(substr($line, 0, $-[0]), 'dark yellow') . colored($1, 'bold yellow') . colored(substr($line, $+[0]), 'dark yellow'); print colored("Current line:", 'bold white'), ' ', $line; my $index = prompt "Replace '$frag' with:", -number, -menu => { map +($sr->{$frag}[$_] => $_), 0 .. $#{$sr->{$frag} +} }, '> '; $replace = $sr->{$frag}[$index]; } return $replace; } sub _process_csv { my ($csv_file) = @_; my $sr = {}; my $csv = Text::CSV::->new(); open my $fh, '<', $csv_file; while (my $row = $csv->getline($fh)) { push @{$sr->{$row->[SEARCH]}}, $row->[REPLACE]; } $sr->{$_} = [ uniq @{$sr->{$_}} ] for keys %$sr; my $alt = join '|', map "\Q$_", sort { length $b <=> length $a } keys %$sr; my $re = qr{\b($alt)\b}; return ($sr, $re); }

    I extended your CSV file to enable more testing. It's in the spoiler:

    I also created my own text file for testing. Again, in the spoiler:

    Here's a rough representation of part of a run, showing dynamic changes to a line as a number of matches are found and the selected substitutions are made:

    Current line: whitelist whitelist whitelist whitelist
    Replace 'whitelist' with:
       1. access list
       2. access-list
       3. allow
    
    > 2
    Current line: access-list whitelist whitelist whitelist
    Replace 'whitelist' with:
       1. access list
       2. access-list
       3. allow
    
    > 3
    Current line: access-list allow whitelist whitelist
    Replace 'whitelist' with:
       1. access list
       2. access-list
       3. allow
    
    > 1
    Current line: access-list allow access list whitelist
    Replace 'whitelist' with:
       1. access list
       2. access-list
       3. allow
    
    > 2
    

    Here's the final result after that full run. Spoiler again:

    — Ken

      Your code is far more elegant than what I had written originally; I will be reading it carefully to add it to my understanding and I'm sure it will improve my Perl efforts going forward. Thank you.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://11145971]
Approved by haukex
Front-paged by kcott
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others scrutinizing the Monastery: (4)
As of 2024-03-19 05:29 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found