http://qs321.pair.com?node_id=512731

stevee has asked for the wisdom of the Perl Monks concerning the following question:

Hello Venerable Monks
I have a number of flat text files (500). The text contains narrative which may contain product names (zero, one or more in each narrative), set IN CAPITALS quite helpfully. I would like to anonymise the data a little and replace the current product names with random other names. I have a randomised array of product names which contains far more names than I can use. My question is, is there a shortcut to this through regular expressions or some other means?

As an example:
"The respondent uses the following products XXX, YYYYYYYYY, ZZZZZZZ around the house and they are considering using QQQQQQQ too. They are particularly impressed with ZZZZZZZ."

Which I would like to change to

"The respondent uses the following products AAA, BBBBBB, CCCCCC around the house and they are considering using DDDDDDDD too. They are particularly impressed with CCCCCC."

Any help would be really appreciated as I am slowly getting up to speed with PERL but not fast enough!
Thanks in advance,
Stevee

Replies are listed 'Best First'.
Re: String replace
by ikegami (Patriarch) on Nov 29, 2005 at 18:49 UTC

    davidrw's solution could end up picking the same product multiple times. The code below solves that problem.

    my $text_orig = "The respondent uses the following products XXX, YYYYY +YYYY, ZZZZZZZ around the house and they are considering using QQQQQQQ + too. They are particularly impressed with ZZZZZZZ.\n"; my @products = ( 'AAA', 'BBBBBB', 'CCCCCC', 'DDDDDDDD', 'QQQQQQQ', 'XXX', 'YYYYYYYYY', 'ZZZZZZZ', ); for (1..4) { my @pot; (my $text_new = $text_orig) =~ s/[A-Z]{3,}/ @pot = @products if not @pot; my $pot_idx = int(rand(@pot)); splice(@pot, $pot_idx, 1) /eg; print($text_new); }

    Update: Another option is to use List::Util's shuffle:

    use List::Util qw( shuffle ); my @pot; for (1..4) { (my $text_new = $text_orig) =~ s/[A-Z]{3,}/ @pot = shuffle @products if not @pot; pop(@pot) /eg; print($text_new); }

    Update: Adapted from Nkuvu and Roy Johnson's code on the CB, what follows is good for picking a few products from a large list of products.

    for (1..4) { my %picked; (my $text_new = $text_orig) =~ s/[A-Z]{3,}/ # We need more @products or a better algorithm if this executes. undef %picked if keys(%picked) > int(@products/2); my $pick; do { $pick = int(rand(@products)); } while not $picked{$pick}; $picked{$pick} = 1; $products[$pick] /eg; print($text_new); }

    Update: I thought scalar(keys(%hash)) might be O(N), but it's O(1), so all's good. Benchmarks:

Re: String replace
by tirwhan (Abbot) on Nov 29, 2005 at 18:14 UTC
    I presume you're trying to replace each string with a specific other string, i.e. XXX is always supposed to be replaced by AAA etc.
    my $text="The respondent uses the following products XXX, YYYYYYYYY, Z +ZZZZZZ around the house and they are considering using QQQQQQQ too. T +hey are particularly impressed with ZZZZZZZ."; my %replace=( "XXX" => "AAA", "YYYYYYYYY" => "BBBBBB", "ZZZZZZZ" => "CCCCCC", "QQQQQQQ" => "DDDDDDDD"); my $search=join '|',keys %replace; $text=~s/($search)/$replace{$1}/g; print $text;

    Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. -- Brian W. Kernighan
Re: String replace
by davidrw (Prior) on Nov 29, 2005 at 18:09 UTC
    perhaps some variant on this?
    my @random_names = qw /AAA BBBBB CCCCC DDDDDD/; $s =~ s/[A-Z]{3,}/$random_names[int rand $#random_names]/sg;
    note: i assumed your products would be at least 3 characters and only have A-Z in them (i.e. no dash, underscore, numbers, etc).
    Similarily, you could replace with a hash lookup if you want all of product NNNNN replaced with the same thing.
Re: String replace
by benizi (Hermit) on Nov 29, 2005 at 23:10 UTC

    General idea: Grab the unique IN CAPITALS words from a line. For each IN CAPS, choose a distinct random product. Replace each IN CAPS with its product.

    My choose-random-set-with-no-dupes algorithm:

    For p in products-to-replace: c = int rand(@total - @chosen) For i in the sorted set of already-chosen items: last if c < i c++ Insert c into sorted set new{p} = total[c]

    I'm not sure how this algorithm stacks up against the "keep picking randomly until I get one I haven't already picked" algorithm. I'm guessing if @random_items >> $number_to_choose, it's not going to make much difference.

    Here's the final code. Note that I took your example of "set IN CAPITALS quite helpfully" to mean that products can contain spaces. (See FROSTED FLAKES example).

    my @random = map $_ x (3 + int rand 5), 'A'..'Z'; my $product_re = qr/\b[A-Z]+(?:\s[A-Z]+)*\b/; while (<DATA>) { my %newname = map { $_ => 1 } /($product_re)/g; my @chosen = (); for my $prod (keys %newname) { my $c = int rand @random - @chosen; for (@chosen) { last if $c < $_; $c++; } my $i = 0; $i++ while $i < @chosen and $chosen[$i] < $c; splice @chosen, $i, 0, $c; $newname{$prod} = $random[$c]; } s/($product_re)/$newname{$1}/g; print; } __DATA__ The respondent uses the following products XXX, YYYYY, and ZZZZZZ arou +nd the house and they are considering using QQQQQQ, too. They are par +ticularly impressed with ZZZZZZ. Joe Smith used EGGO WAFFLES, FROSTED FLAKES, and RICE KRISPIES around +the house and he is considering using POST SHREDDED WHEAT too. He is +particularly impressed with RICE KRISPIES. Bob likes CATS, DOGS, and ZEBRAS, but particularly CATS.