PerlMonks  

Re^2: general advice finding duplicate code

by Anonymous Monk
on Jun 21, 2011 at 06:49 UTC ( [id://910698]=note )


in reply to Re: general advice finding duplicate code
in thread general advice finding duplicate code

looks like will only identify duplicated but individual lines of code across the scripts

Every approach is this approach :) It's like a search engine.

You iterate over your files, and you index each file.

To index, you pick a unit (e.g. one word, or three adjacent lines of code).

Generate a list of all units for a file

Normalize each unit. For words you would stem (remove prefixes and suffixes) to find the root; for lines you would remove insignificant whitespace and insignificant commas, normalize quoting characters, and so on.

Hash each unit (sha1), and store each hash in a database together with the file and position it came from.

Then, to find duplication, query the database for hashes that occur more than once.
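
The steps above can be sketched roughly as follows. This is only an illustration under several assumptions, not a finished tool: it is written in Python for brevity (the same shape ports to Perl with Digest::SHA), an in-memory dict stands in for the database, the unit is three adjacent non-blank lines, and the names normalize, units, build_index and duplicates are made up for this example. The normalizer is deliberately crude; treating double and single quotes as equal is not safe for real Perl, where they differ in interpolation.

```python
import hashlib
import re
from collections import defaultdict


def normalize(line):
    """Collapse runs of whitespace and unify quote characters (crude on purpose)."""
    line = re.sub(r"\s+", " ", line.strip())
    return line.replace('"', "'")


def units(lines, size=3):
    """Yield (start_line, unit_text) for each window of `size` adjacent non-blank lines."""
    kept = [(n, normalize(text)) for n, text in enumerate(lines, start=1) if text.strip()]
    for i in range(len(kept) - size + 1):
        window = kept[i:i + size]
        yield window[0][0], "\n".join(text for _, text in window)


def build_index(files, size=3):
    """Map sha1(unit) -> list of (file_name, start_line); the dict stands in for a database."""
    index = defaultdict(list)
    for name, source in files.items():
        for start, unit in units(source.splitlines(), size):
            digest = hashlib.sha1(unit.encode("utf-8")).hexdigest()
            index[digest].append((name, start))
    return index


def duplicates(index):
    """Keep only the hashes that were seen more than once."""
    return {h: locs for h, locs in index.items() if len(locs) > 1}


# Hypothetical sample input: two tiny scripts sharing a three-line run.
files = {
    "a.pl": 'my $x = 1;\nprint "hi";\nfoo($x);\nbar();\n',
    "b.pl": "baz();\nmy  $x = 1;\nprint 'hi';\n\nfoo($x);\n",
}

for digest, locations in duplicates(build_index(files)).items():
    print(digest[:8], locations)
```

Because the units overlap (a sliding window), a longer duplicated run shows up as several consecutive hits that you can merge afterwards. A bigger unit size gives fewer false positives but misses short duplicates.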

This is not unlike what git (git gc) does, so I wouldn't be surprised if git provided a tool to help you visualize these duplications, although I don't know of one.

It goes without saying that before making code changes, you need a comprehensive test suite :)
