PerlMonks

Mangling HTML to protect content, and finding stolen HTML content

by nop (Hermit)
on Nov 08, 2002 at 16:30 UTC

nop has asked for the wisdom of the Perl Monks concerning the following question:

Hi.

I've a question about finding and mangling text across the web. The solution will involve perl, LWP, etc. Before diving into the perl, I'm seeking advice on strategy.

I work for a company with a very content-rich website that contains detailed product information. Many fly-by-night small operators steal this text to describe the same products they're selling on Yahoo stores. This intellectual property theft is so egregious that Yahoo quickly shuts down these small sites when presented with evidence. (Usually they aren't even sophisticated enough to remove our brand name from their copy.) But they quickly pop back up again under new names.

  • Can anyone suggest good methods to mangle our text via HTML tags, entities, CSS, etc. so that it looks normal to a human browser, but foils the spiders and robots who steal it verbatim? The mangling would have to have some randomness to it, so that a simple script on their end couldn't unmangle it. (And if such mangling existed, would it stop a person from manually cutting and pasting from the browser? I know we can't stop the cut-and-paste, but would the mangled stuff then require laborious hand editing to clean up? That'd be disincentive enough...)
  • Can anyone suggest a good algorithm to automate the detection of our stolen content? Our current method: we run certain phrases against Yahoo or Google to pick up candidate sites, then we look at each one to see if its content is sufficiently close to ours. We can automate the search and the scan; what I'm looking for is a means to take two pages (with the HTML tags stripped) and say statistically that they contain paragraphs or bulleted lists that are essentially the same (e.g., the chance of two pages on different sites matching a paragraph that closely by chance is effectively zero).
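For concreteness, here's a rough sketch in Perl of the kind of comparison I mean: strip the tags crudely, break each page into overlapping five-word shingles, and measure how much the two shingle sets overlap. (The two URLs come from the command line; the five-word shingle size and the regex-based tag stripping are arbitrary choices for illustration.)

    #!/usr/bin/perl
    # Fetch two pages, strip the tags crudely, break each into overlapping
    # 5-word "shingles", and report how much the two shingle sets overlap
    # (Jaccard similarity).  Near zero means unrelated text; anything
    # substantial deserves a human look.
    use strict;
    use warnings;
    use LWP::Simple qw(get);

    sub page_shingles {
        my ($url, $size) = @_;
        my $html = get($url) or die "couldn't fetch $url\n";
        $html =~ s/<script.*?<\/script>//gis;   # drop scripts
        $html =~ s/<[^>]+>/ /g;                 # crude tag strip; a real parser is better
        my @words = grep { length } map { lc } split /[^A-Za-z0-9']+/, $html;
        my %shingles;
        $shingles{ join ' ', @words[$_ .. $_ + $size - 1] } = 1
            for 0 .. $#words - $size + 1;
        return \%shingles;
    }

    my ($ours, $theirs) = @ARGV;
    my $a_ref = page_shingles($ours,   5);
    my $b_ref = page_shingles($theirs, 5);

    my $common = grep { exists $b_ref->{$_} } keys %$a_ref;
    my %union  = (%$a_ref, %$b_ref);
    my $jaccard = keys(%union) ? $common / keys(%union) : 0;

    printf "%d shared 5-word shingles, Jaccard similarity %.3f\n", $common, $jaccard;

Where to draw the "essentially the same" line is something we'd have to calibrate against known-clean page pairs, but legitimately unrelated descriptions should share almost no five-word shingles.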
All suggestions most welcome --

nop

Re: Mangling HTML to protect content, and finding stolen HTML content
by Callum (Chaplain) on Nov 08, 2002 at 17:13 UTC
    Measuring the frequency of and distance between keywords in particular contexts is widely used in detecting plagiarism, and that may be the way forward for you, coupled with some fuzzy word matching to pick out appropriation of certain keywords or stolen factual information.

    You may wish to look at plagiarism.org, a paper at Georgetown on concordances used for text comparison, and Christian Queinnec's plagiarism-detection script, plagiat.
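    A crude illustration of the keyword-frequency idea (the sample strings and the four-letter cutoff below are arbitrary): build a word-frequency profile for each tag-stripped document and compare the two profiles with cosine similarity.

        use strict;
        use warnings;

        # Word-frequency profile of a (tag-stripped) document: counts of
        # every word longer than three letters, lowercased.
        sub profile {
            my ($text) = @_;
            my %freq;
            $freq{ lc $_ }++ for grep { length > 3 } $text =~ /([A-Za-z']+)/g;
            return \%freq;
        }

        # Cosine similarity between two profiles: 1.0 means an identical
        # keyword mix, 0 means no keywords in common.
        sub cosine {
            my ($p, $q) = @_;
            my ($dot, $np, $nq) = (0, 0, 0);
            $dot += $p->{$_} * ($q->{$_} || 0) for keys %$p;
            $np  += $_ ** 2 for values %$p;
            $nq  += $_ ** 2 for values %$q;
            return ($np && $nq) ? $dot / sqrt($np * $nq) : 0;
        }

        my $ours   = profile('Brushed moleskin knee length skirt with patch pockets ...');
        my $theirs = profile('Knee length skirt in brushed moleskin, patch pockets ...');
        printf "keyword-profile similarity: %.2f\n", cosine($ours, $theirs);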

Re: (nrd) Mangling HTML to protect content, and finding stolen HTML content
by newrisedesigns (Curate) on Nov 08, 2002 at 16:48 UTC

    What makes you think it's something so sophisticated as a robot? It's probably someone with a browser cutting and pasting your copy. CSS and HTML entities won't help you there.

    Use Server-Side Includes to #include virtual a Perl script that will log visits to the page. Log the frequency, the IP address (for comparing net-blocks), and the User-Agent. Also include, both in a visible text footer and in a comment in the source of each page, a disclaimer to the effect that "all copyright violators will be prosecuted", plus other relevant legalese.
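    A minimal sketch of that setup, assuming Apache-style SSI and a cgi-bin (the log path and field layout are placeholders): each page carries <!--#include virtual="/cgi-bin/visitlog.pl" -->, and the included script just appends one line per hit.

        #!/usr/bin/perl
        # visitlog.pl -- pulled into every content page via
        #   <!--#include virtual="/cgi-bin/visitlog.pl" -->
        # Appends one line per hit: timestamp, client IP, User-Agent, page.
        use strict;
        use warnings;
        use Fcntl qw(:flock);

        my $log = '/var/log/httpd/content-visits.log';   # adjust to taste

        if (open my $fh, '>>', $log) {    # never break the page if logging fails
            flock $fh, LOCK_EX;
            printf {$fh} "%s\t%s\t%s\t%s\n",
                scalar localtime,
                $ENV{REMOTE_ADDR}     || '-',
                $ENV{HTTP_USER_AGENT} || '-',
                $ENV{DOCUMENT_URI}    || $ENV{REQUEST_URI} || '-';
            close $fh;
        }

        print "Content-type: text/html\n\n";   # the included output can be empty

    A single IP or net-block walking every product page in rapid succession, or an odd User-Agent doing the same, is usually all you need to see in that log to spot a scraper.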

    If your client base is a select few, use authentication to prevent the general populace from viewing the content.

    Post a follow up if this doesn't cover what you want.

    John J Reiser
    newrisedesigns.com

      It's probably someone with a browser cutting and pasting your copy.

      I agree. We call these types of distributors 'trunk slammers' (mostly because, before the advent of the web, they sold their products from the trunks of their cars and offered zero after-market support). Most of them are not too bright and would view automated copy theft as something akin to reading ancient Greek.

      One of the strategies we've adopted to thwart unwanted viewing of our product info is to offer preferred-customer discounts and require login before we serve up the goodies. On the stuff we do allow the general public to view, we pepper the HTML with custom tags and CSS class ids. You'd be surprised how infrequently the thieves bother to remove something like <p class="DD15893wankerbeans"> text </p> -- more proof in my mind that they are not too sophisticated in, or concerned about, their thievery. Hunting down stolen text is simply a matter of creating our own robots to search out these custom class names.
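      For instance, a crude version of such a robot (the second class name and the user-agent string below are made up for the example) just reads a list of candidate URLs and flags any page whose source still carries one of the custom class names:

          #!/usr/bin/perl
          # Feed it a file of candidate URLs, one per line (e.g. collected
          # from search-engine results); it flags any page containing one
          # of our watermark class names.
          use strict;
          use warnings;
          use LWP::UserAgent;

          my @watermarks = qw(DD15893wankerbeans DD15894otherclass);
          my $ua = LWP::UserAgent->new(timeout => 20, agent => 'content-police/0.1');

          while (my $url = <>) {
              chomp $url;
              next unless $url =~ m{^https?://};
              my $res = $ua->get($url);
              next unless $res->is_success;
              my $html = $res->content;
              for my $mark (@watermarks) {
                  print "$url contains watermark '$mark'\n" if index($html, $mark) >= 0;
              }
          }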

        earthboundmisfit++

        The CSS would work if they copied your source, which I doubt the real idiots would do. Other than that, it's a great idea. You could go so far as to include a
        <div style="display: none;">Don't be an idiot and steal this page. randomtexteasilyfoundviasearchengine </div>

        Good stuff. You don't even need a "discount" to compel someone to sign in. In my experience, most web users will sign up for anything, as long as the process isn't too complicated. And if the copy thief signs in/makes an account, you have his or her personal information. Crafty.

        Of course, you (generally speaking) shouldn't do anything more than use this to counteract theft; if you do, outline it in the company's privacy policy, so users know exactly what's going on. I doubt your business wants a PR black eye for "stealing user information." </disclaimer>

        John J Reiser
        newrisedesigns.com

Re: Mangling HTML to protect content, and finding stolen HTML content
by kshay (Beadle) on Nov 08, 2002 at 17:46 UTC
    And if such mangling existed, would it stop a person from manually cutting and pasting from the browser? I know we can't stop the cut-and-paste, but would the mangled stuff then require laborious hand editing to clean up?
    No, probably not. No matter how much mangledness you have in there, if it looks normal to a human looking at the page, then the text, when you copy and paste it or "Save As.../Plain text," will be normal.

    Well, I guess as a radical approach you could do something like replace every other space with an "i" or some other narrow character, but put it in a font color that's the same as the background color. (You can't do every space, obviously, because then it won't word wrap.) So it'll look normal, but when you try to copy and paste it, you'll get something like this:

    Thisiis aiwonderful product.iYou shouldibuy itiimmediately.

    Of course, an actual customer who tried to copy and paste the text (say, to email it to a friend who might be interested in the product) would probably get annoyed by this.
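    For what it's worth, the substitution itself is trivial; here is a sketch, assuming a white page background (and see the follow-ups below for why you probably shouldn't actually do this):

        use strict;
        use warnings;

        # Replace every other space with an invisible (background-colored)
        # narrow character.  Renders normally on a white page, but pastes
        # as "Thisiis aiwonderful product.iYou shouldibuy itiimmediately."
        sub mangle_spaces {
            my ($text) = @_;
            my $hidden = '<font color="#ffffff">i</font>';
            my $n = 0;
            $text =~ s/ / $n++ % 2 ? ' ' : $hidden /ge;
            return $text;
        }

        print mangle_spaces("This is a wonderful product. You should buy it immediately."), "\n";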

    As for detection, what about writing a script to discover some unique "watermark" phrases in your descriptions? Here's what I mean. Let's say your product description is this (ironically, I just grabbed it from a random Yahoo store):

    Brushed moleskin knee length skirt. Patch pockets front with flirty 9" back center slit. Coco exposed stitching. Zip fly, belt loops. Entire length of size Medium: 24". Stretchy, light-weight 96% Cotton, 4% Spandex. Hand wash cold, hang dry. Made in the USA.

    Use LWP (actually, Google frowns on you doing this sort of thing programmatically, so let's assume you get a Google API key and do it all nice and proper) to search Google for each three-word phrase in succession: "brushed moleskin knee", "moleskin knee length", "knee length skirt", "length skirt patch", etc. You'd probably want to skip over any words shorter than 4 letters, because they're less likely to be part of unique phrases.

    Keep track of which phrases return zero results (use -site:mysite.com in the query to omit pages from your own site). Then a few weeks later, search for those phrases again. If you find any results, maybe you've got your plagiarist...
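    Generating the phrases is the easy part; something like this would do it (count_hits is just a stub for whichever search interface you end up using, and mysite.com is a placeholder):

        #!/usr/bin/perl
        # Build candidate "watermark" phrases from a product description:
        # drop words shorter than 4 letters, slide a 3-word window across
        # what's left, and note the phrases that get zero hits today.
        use strict;
        use warnings;

        my $desc = 'Brushed moleskin knee length skirt. Patch pockets front with '
                 . 'flirty 9" back center slit. Coco exposed stitching.';

        my @words   = grep { length >= 4 } map { lc } $desc =~ /([a-z]+)/gi;
        my @phrases = map { "@words[$_ .. $_ + 2]" } 0 .. $#words - 2;

        for my $phrase (@phrases) {
            my $hits = count_hits(qq{"$phrase" -site:mysite.com});
            print qq{no hits today for "$phrase" -- recheck in a few weeks\n}
                if defined $hits && $hits == 0;
        }

        sub count_hits {
            my ($query) = @_;
            # ... run $query against the search engine of your choice and
            # return the result count; deliberately left unimplemented ...
            return undef;
        }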

    Cheers,
    --Kevin

      I'd strongly recommend against putting background-colored characters in text as a substitute for spaces. That will mess up external search engines and probably your own internal search engine. It's also a pretty huge accessibility-guidelines violation -- anyone reading the page with different colors, via a text-only browser, etc., will have a badly degraded experience. What happens if you print the page and the background color drops out, as is often the case? Suddenly you have spurious letters appearing in your text...

              $perlmonks{seattlejohn} = 'John Clyman';

        Yes, I certainly don't think it's a good idea. It just came to mind as one of the few ways you might be able to munge text on a web page so that it "looks normal" but can't be copied and pasted.

        --Kevin

      Well, I guess as a radical approach you could do something like replace every other space with an "i" or some other narrow character, but put it in a font color that's the same as the background color. (You can't do every space, obviously, because then it won't word wrap.) So it'll look normal, but when you try to copy and paste it, you'll get something like this:

      Thisiis aiwonderful product.iYou shouldibuy itiimmediately.

      Of course, an actual customer who tried to copy and paste the text (say, to email it to a friend who might be interested in the product) would probably get annoyed by this.

      Don't expose the above to search engines...unless you want to be de-indexed for decades.

Re: Mangling HTML to protect content, and finding stolen HTML content
by traveler (Parson) on Nov 08, 2002 at 18:05 UTC
    One solution may be to reduce the content in the HTML. That can be done by using graphics for much of the content. This seems to be a growing trend on sites I've visited recently. It requires some creative work for search-engine submission, but if the important text is in a graphic, it is possibly less vulnerable to theft, particularly if the graphic contains a proper copyright notice. You can also encode data in the graphic using a process called steganography. See this site for some tools to help you out. I could not find a CPAN module for this.

    If you have encoded data in the image, and it is stolen, you should be able to use a decoding tool to show that it is indeed your image. Combined with a copyright embossed on the image, you are probably much safer.
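    As a toy illustration of the least-significant-bit idea (this is not one of the tools linked above; Imager is used here only for pixel access, and the file names and tag are placeholders):

        #!/usr/bin/perl
        # Hide a short ASCII tag in the blue-channel LSBs of an image.
        # Crude on purpose: the mark will not survive recompression,
        # resizing, or palette reduction.
        use strict;
        use warnings;
        use Imager;

        my ($in, $out, $tag) = ('product.png', 'product-marked.png', 'Copyright MyCo 2002');

        my $img = Imager->new;
        $img->read(file => $in) or die $img->errstr;

        my @bits = split //, unpack 'B*', $tag;       # message as individual bits
        my ($w, $h) = ($img->getwidth, $img->getheight);
        die "image too small for tag\n" if @bits > $w * $h;

        my $i = 0;
        PIXEL:
        for my $y (0 .. $h - 1) {
            for my $x (0 .. $w - 1) {
                last PIXEL if $i >= @bits;
                my ($r, $g, $b) = $img->getpixel(x => $x, y => $y)->rgba;
                $b = ($b & ~1) | $bits[$i++];         # overwrite the blue LSB
                $img->setpixel(x => $x, y => $y, color => Imager::Color->new($r, $g, $b));
            }
        }

        $img->write(file => $out) or die $img->errstr;
        print "embedded ", scalar @bits, " bits into $out\n";

    Extraction is the same walk in reverse: read the blue LSBs back, pack them into bytes, and compare the result with the tag you embedded.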

    (You could also display your catalog as PDF, but there may be issues regarding plug-ins, load time, etc.)

    HTH, --traveler

      I'm fortunate enough to have DSL, but for the majority of Americans (dunno about the rest of the world), waiting 5 minutes for all the 'text' graphics to load over their good old analog modems might make them think twice about shopping there. Broadband is nowhere near universal in the US.

      --isotope
      http://www.skylab.org/~isotope/
        I don't have DSL, either. I know it's an issue, but if the graphics have low enough resolution, they can be pretty fast. In fact, I have seen some graphics-heavy sites load faster than some HTML pages when the HTML has lots of complex rendering to do. It may take some experimentation to find the best mix of graphics and HTML, but these days some graphics seem to load very fast, even over slow links.

        --traveler

Re: Mangling HTML to protect content, and finding stolen HTML content
by SpaceAce (Beadle) on Nov 08, 2002 at 17:49 UTC
    Whether the text is being stolen by robots or humans is pretty much moot, anyway. If the page source can be fetched by a browser, it can be fetched by a robot or a spider. Probably the best thing for thwarting thieves would be to password-protect the website, but that isn't always what you want. The web being what it is, there are not a lot of effective ways to protect your source except to obfuscate it as much as you can with ugly HTML and excessive JavaScript.

    SpaceAce
Re: Mangling HTML to protect content, and finding stolen HTML content
by John M. Dlugosz (Monsignor) on Nov 08, 2002 at 22:51 UTC
    If the actual HTML source is stolen, some odd comments or affectations would be enough to detect it with your own web crawler, and to prove the copy is not original.

    With a cut-and-paste of a paragraph from the browser window, though, all of that is probably lost.

    You might also embed a code via steganographic techniques, using only the content that isn't affected by formatting (so, extra spaces are out, etc.). I played around with that here.
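    A toy version of that idea, for what it's worth (the {a|b} template syntax and the bit pattern are invented here): each slot in the copy encodes one bit by the choice between two equivalent wordings, so the mark survives cut-and-paste and any reformatting.

        use strict;
        use warnings;

        # Pick one alternative per {a|b} slot according to the next bit,
        # so the wording itself carries the watermark.
        sub watermark {
            my ($template, @bits) = @_;
            $template =~ s/\{([^|}]*)\|([^}]*)\}/ (shift @bits) ? $2 : $1 /ge;
            return $template;
        }

        my $template = 'This {skirt|piece} is made {in the USA|domestically} '
                     . 'from {light-weight|lightweight} stretch cotton.';

        print watermark($template, 1, 0, 1), "\n";
        # prints: This piece is made in the USA from lightweight stretch cotton.

    Checking a suspect copy is then just a matter of reading the wording choices back and comparing them with the bit pattern you served.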

    —John

Crude javascript solution
by nop (Hermit) on Nov 13, 2002 at 19:20 UTC
    This is nop again. OK, this is really simple and crude, but this small JavaScript hack does slow down the basic click-and-save image grabbers. Yes, they can view source and get the image URL there, but that's more inconvenient, plus (as our pages are graphics-rich with images, banners, logos, nav elements, etc.) it takes a few moments to find the right IMG tag in the source. Here's a site using it: http://www.marvelcreations.com/priv15.html -- right-click on their images and try to save them.

      I hate to be harsh, nop, but it is very crude indeed, as it doesn't work with all browsers. I viewed that page with Phoenix 0.4 and was able to right-click all I wanted. It doesn't stop someone from saving the page and its images (most modern browsers can do that). It also doesn't stop someone from using software other than a browser to download everything.

        Yep. Basically stops the IE crowd. As I said, it isn't a "solution" by any means.
