Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

Interesting, I started solving a very similar problem for planetscape some time ago. In her case she was wanting to refactor a web site where there were large chunks of duplicated HTML. The general approach was to normalise the HTML then extract chunks of some minimum size and populate a hash using the chunks as a key and adding the file location of each chunk to a list stored in the hash element. The interesting part is to then use the matched chunks as seed points and grow the match area to encompass as large a common region as sensible.

With the HTML matching choosing "sensible" was something of a trade off. As the common region was increased the number of places that matched the region tended to reduce. It may be that in your case if the code rally has just been copied around without change that the regions are pretty well defined.

My guess is that for something small like 55K LOC the technique would work quite well and in a timely fashion. Probably you don't need to worry so much about the normalise step.

True laziness is hard work

In reply to Re: general advice finding duplicate code by GrandFather
in thread general advice finding duplicate code by aquarium

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others examining the Monastery: (7)
As of 2024-03-29 08:29 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found