http://qs321.pair.com?node_id=1036448

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I have a basic question about removing a duplicated word from a line.

The text is in a string, let's call it $string. The string holds a paragraph, and on one line of it a phrase appears twice. I want to remove the repeat.

Example data
---------------------------
Allscripts LLC - Long Island City, NY
May 31, 2013
Job Summary
Company Allscripts LLC Allscripts LLC
Location Long Island City, NY
Job Type Regular
Job Classification Full Time
Experience not provided
Education not provided
Company Ref # 014625014625
AJE Ref # 561949312
-----------------------------

Here, right next to 'Company', there is 'Allscripts LLC Allscripts LLC'. I need it only once, so it should read 'Allscripts LLC' instead of 'Allscripts LLC Allscripts LLC'. The output I want is:

------------------------------------
Allscripts LLC - Long Island City, NY
May 31, 2013
Job Summary
Company Allscripts LLC # (Changes in this line)
Location Long Island City, NY
Job Type Regular
Job Classification Full Time
Experience not provided
Education not provided
Company Ref # 014625014625
AJE Ref # 561949312
------------------------------------

The name can be anything, not just "Allscripts LLC". It could be some other name such as "TechnoCats" or "GLOBEMASTERS", so I need a general solution.

I can't work out how to do this properly. Can any Monk please suggest an effective way to do it?

Regards,

Galonet

Replies are listed 'Best First'.
Re: Remove duplicate from the same line.
by Athanasius (Archbishop) on Jun 01, 2013 at 15:05 UTC

    The following regex will remove any word or phrase that duplicates its immediate predecessor:

    $string =~ s/ \b (.+) \b \s* \1 /$1/gx;

    But note that an address such as “Long Island City, NY NY” will be reduced to “Long Island City, NY”.
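
    To make the behaviour concrete, here is a minimal sketch that runs the regex above over a few sample lines. The helper name dedupe_adjacent and the sample lines are assumptions for illustration (the lines are adapted from the question's example data, and the "NY NY" line is the hypothetical case from the caveat).

```perl
use strict;
use warnings;

# dedupe_adjacent is a hypothetical helper name, not something from the thread.
# It applies the regex shown above to a copy of its argument.
sub dedupe_adjacent {
    my ($string) = @_;
    $string =~ s/ \b (.+) \b \s* \1 /$1/gx;
    return $string;
}

my @samples = (
    'Company Allscripts LLC Allscripts LLC',   # adjacent duplicate phrase
    'Location Long Island City, NY NY',        # the address caveat
    'Company Ref # 014625014625',              # repeat inside one token
);

print dedupe_adjacent($_), "\n" for @samples;
# Company Allscripts LLC
# Location Long Island City, NY
# Company Ref # 014625014625
```

    The last line is left alone because the \b after the capture cannot match in the middle of the digit run, so "014625014625" is never split into two halves.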

    Hope that helps,

    Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

      Thanks, Athanasius, but I have some more questions. Could you please check the reply I gave to rpnoble?
Re: Remove duplicate from the same line..
by rpnoble419 (Pilgrim) on Jun 01, 2013 at 15:14 UTC

    Are you getting this duplication only on the Company Name line? If so, wrap the solution from Athanasius in an if test when you read the line from your file. Otherwise you can damage any address information, as Athanasius warned. Can you take a look at the system that is causing the problem in the first place? Fixing it there might be your better long-term solution.
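
    A rough sketch of that if test, under two assumptions that are not confirmed in the thread: the record is split into lines, and the duplicated name only ever shows up on the line that starts with the label "Company" (and is not the "Company Ref #" line). The function name clean_record is made up for the example.

```perl
use strict;
use warnings;

# clean_record is a hypothetical name. It applies the de-duplication regex
# only to the company-name line and leaves every other line as it was.
sub clean_record {
    my ($text) = @_;
    my @out;
    for my $line (split /\n/, $text) {
        # Assumption: the name line starts with "Company " and is not the
        # "Company Ref # ..." line.
        if ($line =~ /^Company\s/ && $line !~ /^Company\s+Ref\b/) {
            $line =~ s/ \b (.+) \b \s* \1 /$1/gx;
        }
        push @out, $line;
    }
    return join "\n", @out;
}

my $record = join "\n",
    'Company Allscripts LLC Allscripts LLC',
    'Location Long Island City, NY NY',
    'Company Ref # 014625014625';

print clean_record($record), "\n";
# Company Allscripts LLC
# Location Long Island City, NY NY
# Company Ref # 014625014625
```

    With the guard in place, the "NY NY" address line passes through untouched, which is the damage the if test is meant to prevent.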

      Thanks, Athanasius. That was good.

      But now I have one more problem: I have a company name like "Goldman Sachs Group, ... Goldman Sachs Group, Inc." Here I want only 'Goldman Sachs Group'. Is there any option for that?

      Basically, if a word appears a second time in a single line, delete the rest of the line including that word, and trim any , or . left at the end. Is that possible?

      And thanks, rpnoble, I will do it like that... :)

      Thanks.

        As a counter example I submit: 'Smith Smith & Feeley LLP'
        -- gam3
        A picture is worth a thousand words, but takes 200K.

        Adding .* after \1 in Athanasius' solution should do the trick, as .* matches everything up to the trailing \n.
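
        A sketch of both readings, with hypothetical helper names. The first sub is the regex with .* appended. As far as I can tell it only fires when the repeat follows the first copy immediately, so if the "..." in the Goldman Sachs example stands for real intervening text, that line would be left unchanged. The second sub is one possible way to implement "cut at the second occurrence of any word"; it is an assumption about what was asked, and it shortens gam3's counterexample to a bare 'Smith', so it is probably only safe on lines known to have the problem.

```perl
use strict;
use warnings;

# Both helper names below are hypothetical; neither appears in the thread.

# The original regex with .* appended: keep the first copy, then drop the
# adjacent repeat and everything after it on the same line.
sub cut_after_adjacent_repeat {
    my ($line) = @_;
    $line =~ s/ \b (.+) \b \s* \1 .* /$1/x;
    return $line;
}

# Assumed reading of the follow-up request: cut the line at the second
# occurrence of any whole word, also trimming whitespace, commas and
# periods that sit right before that second occurrence.
sub cut_at_repeated_word {
    my ($line) = @_;
    $line =~ s{
        ^ ( .*? \b (\w+) \b .*? )   # shortest prefix holding a word that recurs
        [\s,.]*                     # separators to trim before the repeat
        \b \2 \b .* $               # the repeated word and the rest of the line
    }{$1}x;
    return $line;
}

my $goldman = 'Goldman Sachs Group, ... Goldman Sachs Group, Inc.';
print cut_after_adjacent_repeat($goldman), "\n";   # unchanged, repeat is not adjacent
print cut_at_repeated_word($goldman), "\n";        # Goldman Sachs Group
print cut_at_repeated_word('Smith Smith & Feeley LLP'), "\n";   # Smith
```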