http://qs321.pair.com?node_id=1036464


in reply to Re: Remove duplicate from the same line..
in thread Remove duplicate from the same line..

Thanks, Athanasius. That was good.

But now I have one more problem: I have a company name like "Goldman Sachs Group, ... Goldman Sachs Group, Inc." and I want only 'Goldman Sachs Group'. Is there any way to do that?

Basically, if a word appears a second time in a single line, delete the rest of the line starting from that second occurrence, and trim any trailing , or . that is left over. Is that possible?

And thanks, rpnoble, I will do it like that. :)

Thx.


Re^3: Remove duplicate from the same line..
by gam3 (Curate) on Jun 01, 2013 at 18:51 UTC
    As a counter example I submit: 'Smith Smith & Feeley LLP'
    -- gam3
    A picture is worth a thousand words, but takes 200K.

      This is a very good objection to the whole exercise. There is not really any way to know whether Smith Smith, Inc is a duplication or a valid company name. Without some real world knowledge I cannot see a way to distinguish between the two. As a remediation one could write all replacements into a log file for review and build a list of exceptions.
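
A rough sketch of the "log every replacement and keep a list of exceptions" idea, for illustration only. The `%keep` exception list, the `dedupe_with_log` helper, and the `REVIEW:` log format are all made up here; the regex is the one posted elsewhere in this thread, used without /x.

```perl
use strict;
use warnings;

# Hypothetical exception list: names whose repeated word is legitimate.
my %keep = map { $_ => 1 } ('Smith Smith & Feeley LLP');

# Returns the cleaned name and a log entry (undef if nothing was changed).
sub dedupe_with_log {
    my ($name) = @_;
    return ($name, undef) if $keep{$name};

    # Cut the line at the second occurrence of a repeated run of text.
    (my $clean = $name) =~ s/\b(.+)\b[,.\s]*\1.*/$1/;

    return ($clean, undef) if $clean eq $name;
    return ($clean, "$name => $clean");
}

for my $name ('Goldman Sachs Group, ... Goldman Sachs Group, Inc.',
              'Smith Smith & Feeley LLP',
              'Acme Corp') {
    my ($clean, $log) = dedupe_with_log($name);
    print STDERR "REVIEW: $log\n" if defined $log;   # goes to the review log
    print "$clean\n";
}
```

Anything that shows up in the review log and turns out to be a valid name gets added to the exception list by hand, and the next run leaves it alone.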

        By suggesting a “separate file with a list of replacements,” I think that you just hit the nail on the head.   This is obviously a human-generated list, with variations in names that (humans know ...) refer to the same legal entity.   It would be quite difficult to write a completely satisfactory algorithm to “conclude that” some particular replacement should be done.   But, if you could provide a (human-generated and human-maintained) list of the replacements, then you could not only sanitize the list effectively, but you could also control and guide its operation.

        For example, let’s say that you have a data-file containing records such as:

        Goldman Sachs, LLC => Goldman Sachs

        A Perl program could now read that file, split()ting each line of course on /\s*\=\>\s*/, and thereby obtain a hash that maps each “string to be substituted” to its “substitution string.”   An input-record is interesting if it contains any string that calls for substitution, and also if it contains more than one occurrence of an interesting string (which is taken to mean that the subsequent occurrences should be removed).   The algorithm can be diddled as needed ... it is now human-controlled.
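
A minimal sketch of the lookup part of that idea, under some assumptions: the rules live in a here-doc instead of a real file, the second rule is invented for the example, and the `canonicalize` helper is a hypothetical name. Removing repeated occurrences is not handled here.

```perl
use strict;
use warnings;

# Stand-in for the human-maintained substitutions file (contents are made up).
my $rules_text = <<'END';
Goldman Sachs Group, Inc. => Goldman Sachs Group
Goldman Sachs, LLC => Goldman Sachs
END

# Build the mapping: variant spelling => canonical name.
my %subst;
for my $line (split /\n/, $rules_text) {
    next unless $line =~ /\S/;                  # skip blank lines
    my ($from, $to) = split /\s*=>\s*/, $line, 2;
    $subst{$from} = $to;
}

# Try longer variants first so a short key cannot pre-empt a longer one.
my @variants   = sort { length($b) <=> length($a) } keys %subst;
my $variant_re = join '|', map { quotemeta($_) } @variants;

# Replace every known variant in a record with its canonical form.
sub canonicalize {
    my ($record) = @_;
    $record =~ s/($variant_re)/$subst{$1}/g;
    return $record;
}

print canonicalize('Goldman Sachs, LLC'), "\n";
```

quotemeta() matters here because the keys contain characters such as . and , that would otherwise be read as regex syntax.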

        Finally, a filter-program could be constructed which scans the file for strings which contain more-than-one occurrence of the same alphanumeric token, e.g. Goldman.   A human would eyeball that list and add to the substitutions-file as he or she deems fit.
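
Such a filter might look roughly like this. It is only a sketch: the `repeated_tokens` helper is a hypothetical name, the comparison is case-sensitive, and the sample lines are taken from this thread plus one invented clean name.

```perl
use strict;
use warnings;

# Returns the alphanumeric tokens that occur more than once in a line,
# in order of first appearance.
sub repeated_tokens {
    my ($line) = @_;
    my (%count, @order);
    for my $tok ($line =~ /[A-Za-z0-9]+/g) {
        push @order, $tok unless $count{$tok}++;   # remember first sighting
    }
    return grep { $count{$_} > 1 } @order;
}

# Print only the lines a human should look at, with the repeated tokens.
for my $line ('Goldman Sachs Group, ... Goldman Sachs Group, Inc.',
              'Smith Smith & Feeley LLP',
              'Acme Corp') {
    my @dups = repeated_tokens($line);
    print "$line\t[@dups]\n" if @dups;
}
```

Note that this flags both the Goldman Sachs line and 'Smith Smith & Feeley LLP'; deciding which of them is a real duplication is still left to the person reviewing the list.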

Re^3: Remove duplicate from the same line..
by hdb (Monsignor) on Jun 01, 2013 at 17:00 UTC

    Adding .* after \1 in Athanasius' solution should do the trick, as it matches everything up to the trailing \n.

      No, it didn't work!

        You are not very open to experimentation, are you?

        use strict;
        use warnings;

        my $str = "Goldman Sachs Group, ... Goldman Sachs Group, Inc.";
        $str =~ s/ \b (.+) \b [,.\s]* \1.* /$1/gx;
        print "$str\n";