Re^2: Immediately writing the results of search-and-replace

Hi haukex; I've created an account so that I can edit things going forward, but since I cannot edit the original post, please forgive posting the full code in a reply:

use strict;
use warnings;
use diagnostics;

use Text::CSV_XS;
use File::Find;
use File::Slurp qw(edit_file);
use List::Util qw(uniq);

my $spreadsheet = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });

# Consume the first argument (the spreadsheet input) and push its
# contents to an array of arrays.
open my $fh, "<", $ARGV[0];
my @table;
while (my $row = $spreadsheet->getline($fh)) {
    push @table, $row;
}
close $fh;

# Sort the array-of-arrays by the length of the 
# first element of each subarray (our 'keys') such 
# that the longest key comes first, and the
# shortest comes last. Then sort alphabetically.
@table = sort { length $b->[0] <=> length $a->[0] || $a->[0] cmp $b->[
+0] } @table; 

# CSVs are received from another department with 
# three columns, where the second denotes part of 
# speech. This isn't needed for this operation, so 
# discard it.
for my $row (@table) {
    splice(@$row,1,1);
    $row->[0] = quotemeta($row->[0]);
}

my @rows_to_cull = ();

# Pairwise comparison of rows to detect duplicate keys. 
# Sort having been performed earlier, only pairwise comparisons
# of adjacent rows will result in matches.
for my $row_a (0..$#table) {
    my $key = $table[$row_a]->[0];
    my @values = $table[$row_a]->[1];

    for my $row_b ($row_a+1..$#table) {
        if ( $table[$row_a]->[0] eq $table[$row_b]->[0] ) {
            push @values, $table[$row_b]->[1];
            $table[$row_a]->[1] = \@values;
            push @rows_to_cull, $row_b;
        }
        else {
            last;
        }
    }
}

# Convert the array to a hash to ensure uniqueness, 
# then back to an array for traversal, sorting the 
# array in descending numeric order so removal of 
# earlier rows doesn't interfere with indexing for 
# later rows.
my %cull_hash = map { $_ => 1 } @rows_to_cull;
my @cull_array = keys %cull_hash;
@cull_array = sort { $b <=> $a } @cull_array;

for (@cull_array) {
splice(@table,$_,1);
}

# Loop which ensures the uniqueness of elements in the 
# replacement candidate arrays.
for my $row (0..$#table) {
    my $replacement = $table[$row]->[1];
    if (ref($replacement) eq 'ARRAY' && length $replacement > 1) {
        my @unique = uniq(@$replacement);
        $table[$row]->[1] = \@unique;
    }
}

# The following takes the second argument from the 
# command line as the directory to operate on. It 
# also identifies a filter subroutine to exclude 
# non-documentation files from the checks, and a 
# subroutine that will actually do the search-and-replace.
find({ 
    preprocess => \&filter, 
    wanted => \&search_and_replace 
    }, 
    $ARGV[1]
);

sub filter {
    return grep { -d or (-f and ( /\.txt$/ or /\.rst$/ or /\.yaml$/))}
+ @_;
}

# Main loop

sub search_and_replace {
      open my $target_file, "<", $_;
      $| = 1;
      my $filename = $_;
      while (my $target_string = <$target_file>) {
        for my $row (@table) {
          my $search = $row->[0];
          my $replacement = $row->[1];
               
          if ((lc $target_string) =~ (lc $search)) {
            print "Found $search in $filename in the following context
+:\n";
            print "$target_string\n";
            if (ref($replacement) eq 'ARRAY' && length $replacement > 
+1) {
              print "Choose a replacement candidate by typing the appr
+opriate number, or else type '0' to skip.\n";
              my $count = 1;
              for my $value (@$replacement) {
                print "$count\) $value\n";
                  $count++;
              }
              my $choice = <STDIN>;
              if ($choice >= 1 && $choice <= (length $replacement)) {
                my $replace_choice = $replacement->[$choice];
                edit_file { s/$search/$replace_choice/gi} $filename;
              }
              else {
                print "Skipping this occurrence without replacing.\n";
              }
            }
            else {
              print "Confirm replacement with $replacement by typing y
+, or skip by typing any other key.\n";
              my $choice = <STDIN>;
              if ($choice eq 'y' || 'Y') {
                edit_file { s/$search/$replacement/gi } $filename;
              }
              else {
                print "Skipping this occurrence without replacing.\n";
              }
            }
          }
        }
      }
      close $target_file;
}
[download]

As for the input, here are some example rows from the CSV file itself, pre-sorted:

whitelist entries,Noun,access list entries
IP whitelist,Noun,IP access list
IP whitelist entries,Noun,IP access list entries
whitelist,Adjective,allow
whitelist,Noun,access list
whitelist,Noun,access-list
whitelist,Verb,allow
your whitelist,Noun,your access list
[download]

As far as output goes, the sort should (and currently does) result in an entry like 'IP whitelist' being checked for before 'whitelist' itself, and 'IP whitelist entries' before 'IP whitelist; so I would expect that after I confirm wanting to replace 'IP whitelist entries' with 'IP access list entries', I would NOT be subsequently prompted to replace 'IP whitelist' within that instance of 'IP whitelist entries', because 'whitelist' would have already been replaced there. Similarly, I wouldn't want to be prompted to replace 'whitelist' alone in that spot for the same reason. Please let me know what more might be needed. I have started looking at the tutorial you linked and it seems promising.

EDIT: Corrected formatting errors due to copy-pasting code block with long lines.

Comment on Re^2: Immediately writing the results of search-and-replace Select or Download Code

Replies are listed 'Best First'.
Re^3: Immediately writing the results of search-and-replace by haukex (Archbishop) on Aug 05, 2022 at 20:20 UTC
Thanks for providing additional context. if my script finds that a file contains the search string 'IP whitelist' on a given line, even if I confirm that I want to replace this with 'IP access list', it will then prompt me to replace just 'whitelist' with one of its various candidates. What I would expect to happen in that case is that the replacement of 'IP whitelist' with 'IP access list' happens before the search for the 'whitelist' key is initiated, preventing it from finding a match there. What is actually going on (and what I didn't realize on my first reading) is that you're reading a line from the input file into the `$target_string` variable, but once that line is in that variable, it never changes, regardless of whether you edit the file it came from, since the line was copied from the file into memory. That's why the regex engine continues to find matches in `$target_string` even after you've made the edit in the file. The common pattern to use in Perl instead is to write an output file with the modified lines while reading the input file. Since in your code, you seem to want to do in-place edits of the file, see my node on that here. The slurp-then-spew approach is fairly easy if your files will always fit comfortably into memory; for line-by-line editing I might recommend my own module File::Replace (for examples of line-by-line editing, see the module's documentation instead of the aforementioned node). In other words, instead of your `edit_file` calls, you'd do something like `$target_string =~ s/$search/$replace_choice/gi`, and then build your output file by writing out each `$target_string` line. However, there is still more trickyness here: If you have a line such as `"Edit the whitelist by pressing the edit whitelist button."` or something like that, then because of the `/g` modifier on that regex, you'd only be presented with the choice for what term to replace `"whilelist"` with once, and that replacement would be carried out for the entire line. I kind of doubt that's a desired behavior for you. BTW, are you sure your search terms will always be on one line, or could strings like `"whitelist entries"` be split on two lines? That would also require an advanced approach. These are examples for reasons why I asked for representative sample input along with the expected output for that input, and in this case representative also means that it should include all of such "interesting" cases. While I'd like to provide sample code or at least more specific advice on how to fix your code, without knowing the answers to questions like these, I run the risk of writing something that ends up not working on your actual data. Extensive test cases are very useful! What I am currently thinking is that it should be possible to do something like `s{ ($search) }{ get_replacement($1) }xeg` where `sub get_replacement` does the prompting of the user and returns the replacement term (or the original string if the user doesn't want to replace). `$search` can even be a regex of all of the search terms, built dynamically. A few more things I noticed in your code: Check your opens for errors - "open" Best Practices Even though you're using File::Find, for better modularity I would use regular argument passing to `sub search_and_replace`, and write e.g. `wanted => sub { search_and_replace(\@table, $_) }` It seems to me like your `@table` could be replaced by a hash instead of all the duplicate key detection code you currently have. Keys in Perl hashes are unique, and such tables are easy to build with Perl's autovivification - if I treat a nonexistent hash entry as if it were an array reference, it will automagically become an array reference. Note that since you want to do case-insenstive matching I'm using fc, which is better than lc in case of Unicode. Building a hash of arrays using your sample data: use warnings; use strict; use feature qw/fc/; use Text::CSV_XS; use List::Util qw/uniq/; use Data::Dump; my $csvfile = $ARGV[0]; my %table; open my $fh, '<', $csvfile or die "$csvfile: $!"; my $csv = Text::CSV_XS->new({binary=>1, auto_diag=>2}); while ( my $row = $csv->getline($fh) ) { push @{$table{ fc($row->[0]) }}, $row->[2]; } $csv->eof or $csv->error_diag; close $fh; $_ = [uniq @$_] for values %table; dd \%table; __END__ { "ip whitelist" => ["IP access list"], "ip whitelist entries" => ["IP access list entries"], "whitelist" => ["allow", "access list", "access-list"], "whitelist entries" => ["access list entries"], "your whitelist" => ["your access list"], } [download] As before, instead of looping over `@tables` (or `%tables`), building a single regex should be more efficient. You may wish to consider modules for prompting instead of rolling your own. (It's Friday evening here so I didn't scan the rest of your code in detail, so other Monks may have further tips.)	[reply] [d/l] [select]
Re^4: Immediately writing the results of search-and-replace [OT - FYI] by kcott (Archbishop) on Aug 05, 2022 at 22:12 UTC
G'day haukex, I hadn't used File::Replace previously, so I thought I'd give it a go; unfortunately it failed during testing. I raised an issue: "t/11_chmod.t fails on Cygwin v5.36.0". — Ken	[reply]
Re^5: Immediately writing the results of search-and-replace [OT - FYI] by haukex (Archbishop) on Aug 07, 2022 at 19:41 UTC
Thanks very much, I'll look into this as soon as I can! The module used to have perfect tests on CPAN Testers, but unfortunately ever since I introduced File::Replace::Inplace, there have been spurious test failures. I hope this is one of those cases, and I think you'd be ok to force install the module anyway, the core functionality is quite well tested.	[reply]
Re^6: Immediately writing the results of search-and-replace [OT - FYI] by kcott (Archbishop) on Aug 08, 2022 at 02:12 UTC
Re^5: Immediately writing the results of search-and-replace [OT - FYI] by haukex (Archbishop) on Jan 22, 2023 at 13:03 UTC
I hadn't used File::Replace previously, so I thought I'd give it a go; unfortunately it failed during testing. I raised an issue: "t/11_chmod.t fails on Cygwin v5.36.0". I've finally fixed this issue! v0.18 of the module should be going up on CPAN in the next hour or so.	[reply]
Re^6: Immediately writing the results of search-and-replace [OT - FYI] by kcott (Archbishop) on Jan 22, 2023 at 20:06 UTC
Re^4: Immediately writing the results of search-and-replace by elemkeh (Acolyte) on Aug 09, 2022 at 14:34 UTC
Just wanted to make sure I took the time to thank you for your help. You've given me plenty to chew on and improve my Perl with. Cheers!	[reply]
Re^3: Immediately writing the results of search-and-replace by kcott (Archbishop) on Aug 05, 2022 at 19:52 UTC
G'day elemkeh, Welcome to the Monastery. The way you've handled transitioning from anonymous to real user is fine: no need to apologise. There is, however, a small problem. Long lines within `<code>...</code>` tags are wrapped. The continuation is highlighted with a red (default) '+'. You've copied code which has four wrapped lines and these now appear as actual code: `+:\n";` `+1) {` `+opriate number, or else type '0' to skip.\n";` `+, or skip by typing any other key.\n";` For future reference, follow the [download] link to a plain text version; then copy and paste from there. As you rightly state, being logged in, you can edit your post and make four minor changes to fix this. Do leave a note indicating that you've made a change — "How do I change/delete my post?" has more about that. — Ken	[reply] [d/l] [select]
Re^3: Immediately writing the results of search-and-replace by AnomalousMonk (Archbishop) on Aug 05, 2022 at 22:18 UTC
Some general observations on the code: `if (ref($replacement) eq 'ARRAY' && length $replacement > 1) { ... }` length always operates on a string or stringized expression. A stringized reference looks something like `"ARRAY(0x123abc)"`, so the expression `length $replacement > 1` will always be true if `$replacement` is a reference. `Win8 Strawberry 5.8.9.5 (32) Fri 08/05/2022 17:16:08 C:\@Work\Perl\monks >perl use strict; use warnings; my $arrayref = [ ]; # ref. to zero length array print "$arrayref \n"; print "length non-zero \n" if length $arrayref > 1; ^Z ARRAY(0x613824) length non-zero` [download] The number of elements of an array is found by evaluating the array in scalar context (update: see Context and subsequent topics in perldata) or, in newer Perls (5.12+), by the keys built-in evaluated in scalar context (check your documentation). Win8 Strawberry 5.30.3.1 (64) Fri 08/05/2022 17:39:54 C:\@Work\Perl\monks >perl use strict; use warnings; my $arrayref = [ 9, 8, ]; # ref. to NON-zero length array printf "number of elements of referenced array: %d \n", scalar @$ +arrayref; printf "number of elements of referenced array: %d \n", 0 + @$ +arrayref; printf "number of elements of referenced array: %d \n", scalar keys @$ +arrayref; print "number of elements > 1 \n" if @$arrayref > 1; print "number of elements > 1 \n" if keys @$arrayref > 1; ^Z number of elements of referenced array: 2 number of elements of referenced array: 2 number of elements of referenced array: 2 number of elements > 1 number of elements > 1 [download] (Update: Just to be clear, a proper re-statement of the condition expression would be `if (ref($replacement) eq 'ARRAY' && @$replacement > 1) { ... }` The `>` numeric comparison imposes scalar context.) `if ($choice eq 'y' \|\| 'Y') { ... }` This condition evaluates as "if `$choice` is `'y'` or if the string `'Y'` is true", i.e., it's always true because `'Y'` is always true. See What is true and false in Perl? An effective case-insensitive conditional test would be `if (lc($choice) eq 'y') { ... }` Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]
Re^4: Immediately writing the results of search-and-replace by elemkeh (Acolyte) on Aug 09, 2022 at 14:35 UTC
Thank you for the feedback. I will do my best to internalize it and write better Perl going forward.	[reply]
Re^3: Immediately writing the results of search-and-replace by tybalt89 (Monsignor) on Aug 08, 2022 at 09:40 UTC
Ok, I'm confused. Do you want to rewrite the entire file after each change (of course with the changed text) and then reload the file to do the next replace? Or do you just want to do a replace at a time (holding the text in memory) and only rewrite the entire file when you are done?	[reply]
Re^4: Immediately writing the results of search-and-replace by elemkeh (Acolyte) on Aug 08, 2022 at 16:16 UTC
What I am chiefly concerned with is that after matching on and replacing an instance of, say, 'IP whitelist', the user is not prompted to replace that same instance of the word 'whitelist' because it hasn't actually been edited yet. The impression I get from reading some of what's written here is that this might require write-and-reload, but I must admit that I don't know enough about Perl yet to feel confident in my interpretations. Sorry I can't clarify more than that.	[reply]
Re^5: Immediately writing the results of search-and-replace by tybalt89 (Monsignor) on Aug 08, 2022 at 16:48 UTC
Fun with pos() This rewrites the file after each change, then reloads from that file for the next change. The current position in the file is kept in the variable `$pos`. This stops the overlapping problem you were having. I'm using Path::Tiny because I get yelled at when I use File::Slurp. #!/usr/bin/perl use strict; # https://perlmonks.org/?node_id=11145971 use warnings; use List::AllUtils qw( nsort_by uniq ); use Path::Tiny; use Text::CSV qw( csv ); my $csv = <<END; whitelist entries,Noun,access list entries IP whitelist,Noun,IP access list IP whitelist entries,Noun,IP access list entries whitelist,Adjective,allow whitelist,Noun,access list whitelist,Noun,access-list whitelist,Verb,allow your whitelist,Noun,your access list your whitelist,Noun,my permission list your whitelist,Noun,thy permission list your whitelist,Noun,his permission list your whitelist,Noun,her permission list your whitelist,Noun,its permission list your whitelist,Noun,our permission list your whitelist,Noun,your permission list your whitelist,Noun,their permission list END my $target_file = path( '/tmp/d.11145971' ); $target_file->spew( <<END ); # FIXME for testing a. this is some whitelist test b. for IP whitelist testing and your whitelist c. line with no replacement d. with whitelist and IP whitelist entries and IP whitelist entries ag +ain. e. this is another whitelist test END my ($old, $new, $reset) = map "\e[${_}m", 91, 92, 0; my %patterns; push @{ $patterns{ lc $_->[0] } }, $_->[2] for @{ csv( in => \$csv ) } +; my $match = qr/(@{[ join '\|', nsort_by { -length } keys %patterns ]})/ +i; #use Data::Dump 'dd'; dd \%patterns, $match; #$target_file->edit_lines( sub # NOTE line at a time # { # s/$match/$old$&$reset/g ? print "\n" : return; # 1 while s/\Q$old\E$match\Q$reset\E/ print; ask( $1 ) /e; # print "$new$_$reset"; # } ); my $pos = 0; # NOTE replaces whole file for each change my $more = 1; while( $more ) { $target_file->edit( sub { pos($_) = $pos; if( /$match/g ) { my ( $was, $where, $pre, $post ) = ( $1, $-[1], $`, $'); print "\n", $pre =~ s/^.\n(?=.\n)//sr, "$old$was$reset", $post =~ s/\n.?\n\K.//sr, "\n"; my $replace = ask( $was ); $was eq $replace or substr $_, $where, length $was, $replace; $pos = $where + length $replace; } else { $more = 0 } } ); # print "\e[33m", $target_file->slurp, "\e[0m"; # FIXME here for testi +ng } print "\nresult:\n\n", $target_file->slurp; # FIXME here for testing sub ask { my ($was) = @_; my @choices = uniq @{ $patterns{ lc $was } }; local $\| = 1; if( @choices > 1 ) { my $n = 1; printf "%8d. $new%s$reset\n", $n++, $_ for @choices; print " replace '$old$was$reset' with ${new}above pick$reset ( +or 0 to not change) : "; my $pick = <STDIN> =~ tr/0-9//cdr \|\| 0; 0 <= $pick <= @choices or $pick = 0; return ($was, @choices)[$pick]; } else { print " replace '$old$was$reset' with '$new@choices$reset' ? y +es/no : "; return <STDIN> =~ /y/i ? $choices[0] : $was; } } [download] Sample run (in glorious color which of course is not seen here) a. this is some whitelist test b. for IP whitelist testing and your whitelist 1. allow 2. access list 3. access-list replace 'whitelist' with above pick (or 0 to not change) : a. this is some whitelist test b. for IP whitelist testing and your whitelist c. line with no replacement replace 'IP whitelist' with 'IP access list' ? yes/no : a. this is some whitelist test b. for IP whitelist testing and your whitelist c. line with no replacement 1. your access list 2. my permission list 3. thy permission list 4. his permission list 5. her permission list 6. its permission list 7. our permission list 8. your permission list 9. their permission list replace 'your whitelist' with above pick (or 0 to not change) : c. line with no replacement d. with whitelist and IP whitelist entries and IP whitelist entries ag +ain. e. this is another whitelist test 1. allow 2. access list 3. access-list replace 'whitelist' with above pick (or 0 to not change) : c. line with no replacement d. with whitelist and IP whitelist entries and IP whitelist entries ag +ain. e. this is another whitelist test replace 'IP whitelist entries' with 'IP access list entries' ? ye +s/no : c. line with no replacement d. with whitelist and IP whitelist entries and IP whitelist entries ag +ain. e. this is another whitelist test replace 'IP whitelist entries' with 'IP access list entries' ? ye +s/no : d. with whitelist and IP whitelist entries and IP whitelist entries ag +ain. e. this is another whitelist test 1. allow 2. access list 3. access-list replace 'whitelist' with above pick (or 0 to not change) : result: a. this is some whitelist test b. for IP whitelist testing and your whitelist c. line with no replacement d. with whitelist and IP whitelist entries and IP whitelist entries ag +ain. e. this is another whitelist test [download] The above output is in color, using red for the existing string and green for new ones. This should be run in a terminal that takes ANSI color codes. Is this the kind of thing you are looking for, or have I gone overboard ?	[reply] [d/l] [select]
Re^6: Immediately writing the results of search-and-replace by elemkeh (Acolyte) on Aug 08, 2022 at 18:02 UTC
Re^7: Immediately writing the results of search-and-replace by tybalt89 (Monsignor) on Aug 08, 2022 at 18:13 UTC
Some notes below your chosen depth have not been shown here


laziness, impatience, and hubris
	PerlMonks