http://qs321.pair.com?node_id=574176

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,
I have tried the FAQ section, but don't seem to be getting anywhere.
I have a string that can have words or numbers, each one separated with a tab, for example:
$string='LOCAL Antony 17 Antony 23 1569';
How can I check whether a given string contains a word more than once (I only care about words, not numbers) and print that word?

Re: How do you find duplicates in a string?
by davido (Cardinal) on Sep 21, 2006 at 16:20 UTC

    You can use regular expression capturing, and backreferences.

    if( $string =~ m/
            \b([[:alpha:]]+)\b   # Match and capture word
            .*?                  # Skip what you don't need
            \b\1\b               # Match the captured word
        /x
    ) {
        print $1, "\n";
    }

    Limitation: Words can contain only alpha characters. You could modify the character class [[:alpha:]] to include whatever you consider to be legal word characters, such as ' (apostrophe) and - (hyphen). Note also that the match stops at the first repeated word it finds. I used the /x modifier so that the regular expression's sub-expressions can be grouped into meaningful clusters, which makes it easier to read.
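One way the wider character class could look is sketched below. This is only an illustration, not davido's code: the helper name first_repeated_word and the choice of apostrophe and hyphen as extra word characters are assumptions. Because \b only knows about \w characters, the sketch swaps it for lookarounds built from the same class.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the first word that
# occurs again later in the string, or undef if no word repeats.
sub first_repeated_word {
    my ($string) = @_;

    # Assumed definition of a "word character": letters, apostrophe, hyphen.
    my $w = qr/[[:alpha:]'-]/;

    if ( $string =~ m/
            (?<! $w ) ( $w+ ) (?! $w )   # Match and capture a whole word
            .*?                          # Skip what you don't need
            (?<! $w ) \1 (?! $w )        # Match the captured word as a whole word
        /xs ) {
        return $1;
    }
    return;
}

my $string = "LOCAL\tAntony\t17\tAntony\t23\t1569";
my $word   = first_repeated_word($string);
print "$word\n" if defined $word;        # prints "Antony"
```

The lookarounds keep "ant" from matching inside "antony", which is the job \b did in the original expression.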

    Hope this helps!


    Dave

Re: How do you find duplicates in a string?
by jdporter (Paladin) on Sep 21, 2006 at 16:27 UTC
    use Scalar::Util qw( looks_like_number );
    my %h;
    $h{$_}++ for split /\t/, $string;
    print "$_\n" for grep { $h{$_}>1 and !looks_like_number($_) } sort keys %h;
    We're building the house of the future together.
Re: How do you find duplicates in a string?
by ptum (Priest) on Sep 21, 2006 at 16:30 UTC

    In this particular example, since the words are tab-separated, I'd probably use a brute-force approach and split the string, then step through the resulting array, incrementing a hash. Admittedly, this is not a very efficient solution for any significant amount of data.

    use strict;
    use warnings;

    # Tabs written as \t so the separators survive copy and paste
    my $string = "LOCAL\tAntony\t17\tAntony\t23\t1569";
    my @tokens = split /\t/, $string;
    my %duphash = ();
    foreach (@tokens) {
        $duphash{$_}++;    # count each token
    }

    Now %duphash has a 1 for each unique value and something greater than 1 for any duplicates.

    Update: I ignored your requirement for skipping over numbers -- adjust accordingly.
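A minimal sketch of what "adjust accordingly" might mean here, assuming that "number" means anything Scalar::Util's looks_like_number accepts. The helper name word_counts is made up for the illustration.

```perl
use strict;
use warnings;
use Scalar::Util qw( looks_like_number );

# Hypothetical helper: count the tab-separated tokens, ignoring numbers.
sub word_counts {
    my ($string) = @_;
    my %count;
    for my $token ( split /\t/, $string ) {
        next if looks_like_number($token);   # skip 17, 23, 1569, ...
        $count{$token}++;
    }
    return \%count;
}

my $counts = word_counts("LOCAL\tAntony\t17\tAntony\t23\t1569");
for my $word ( sort keys %$counts ) {
    print "$word\n" if $counts->{$word} > 1; # prints "Antony"
}
```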

      Admittedly, this is not a very efficient solution for any significant amount of data.

      For large, and increasingly large, amounts of data, it's better than the regex with backref solution (e.g. as presented by davido). Put simply, the split-hash approach is O(n), whereas the regex-backref approach is O(n^2).

      We're building the house of the future together.
        Sorry guys, my mistake.
        All I want to know is whether the string contains anything more than once.
        I understand that you split the string using tab as the delimiter, but what must I do to check whether the resulting array, which contains all the elements of the string, has any duplicates in it? Just that: I don't want to know what the duplicates are, I only want to know if there are any...
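For the yes/no version of the question, one possible sketch builds on the same split-and-hash idea and stops at the first repeat. The helper name has_duplicates is hypothetical, and numbers are treated like any other field, as the follow-up seems to ask.

```perl
use strict;
use warnings;

# Hypothetical helper: true as soon as any tab-separated field repeats.
sub has_duplicates {
    my ($string) = @_;
    my %seen;
    for my $token ( split /\t/, $string ) {
        return 1 if $seen{$token}++;   # second sighting: stop early
    }
    return 0;
}

my $string = "LOCAL\tAntony\t17\tAntony\t23\t1569";
print has_duplicates($string) ? "has duplicates\n" : "no duplicates\n";
```

An alternative with the same result is to fill the hash completely and compare the number of keys with the number of array elements; the early return simply avoids scanning the rest of the string once a repeat has turned up.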