PerlMonks

Mangling HTML to protect content, and finding stolen HTML content

by nop (Hermit)
on Nov 08, 2002 at 16:30 UTC

nop has asked for the wisdom of the Perl Monks concerning the following question:

Hi.

I've a question about finding and mangling text across the web. The solution will involve perl, LWP, etc. Before diving into the perl, I'm seeking advice on strategy.

I work for a company with a very content-rich website that contains detailed product information. Many fly-by-night small operators steal this text to describe the same products they're selling on Yahoo stores. This intellectual property theft is so egregious that Yahoo quickly shuts down these small sites when presented with evidence. (Usually they aren't even sophisticated enough to remove our brand name from their copy.) But they quickly pop back up again under new names.

  • Can anyone suggest good methods to mangle our text via HTML tags, entities, CSS, etc. so that it looks normal to a human browser, but foils the spiders and robots who steal it verbatim? The mangling would have to have some randomness to it, so that a simple script on their end couldn't unmangle it. (And if such mangling existed, would it stop a person from manually cutting and pasting from the browser? I know we can't stop the cut-and-paste, but would the mangled stuff then require laborious hand editing to clean up? That'd be disincentive enough...)
  • Can anyone suggest a good algorithm to automate the detection of our stolen content? Our current method: we run certain phrases against Yahoo or Google to pick up candidate sites, then we look at each one to see if its content is sufficiently close to ours. We can automate the search and the scan; what I'm looking for is a means to take two pages (with the HTML tags stripped) and say statistically that they contain paragraphs or bulleted lists that are essentially the same (e.g., the chance of two pages on different sites matching a paragraph that closely by chance is effectively zero).
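For concreteness, here's a rough sketch in Perl of the kind of comparison I mean: strip the tags crudely, break each page into overlapping five-word shingles, and measure how much the two shingle sets overlap. (The two URLs come from the command line; the five-word shingle size and the regex-based tag stripping are arbitrary choices for illustration.)

    #!/usr/bin/perl
    # Fetch two pages, strip the tags crudely, break each into overlapping
    # 5-word "shingles", and report how much the two shingle sets overlap
    # (Jaccard similarity).  Near zero means unrelated text; anything
    # substantial deserves a human look.
    use strict;
    use warnings;
    use LWP::Simple qw(get);

    sub page_shingles {
        my ($url, $size) = @_;
        my $html = get($url) or die "couldn't fetch $url\n";
        $html =~ s/<script.*?<\/script>//gis;   # drop scripts
        $html =~ s/<[^>]+>/ /g;                 # crude tag strip; a real parser is better
        my @words = grep { length } map { lc } split /[^A-Za-z0-9']+/, $html;
        my %shingles;
        $shingles{ join ' ', @words[$_ .. $_ + $size - 1] } = 1
            for 0 .. $#words - $size + 1;
        return \%shingles;
    }

    my ($ours, $theirs) = @ARGV;
    my $a_ref = page_shingles($ours,   5);
    my $b_ref = page_shingles($theirs, 5);

    my $common = grep { exists $b_ref->{$_} } keys %$a_ref;
    my %union  = (%$a_ref, %$b_ref);
    my $jaccard = keys(%union) ? $common / keys(%union) : 0;

    printf "%d shared 5-word shingles, Jaccard similarity %.3f\n", $common, $jaccard;

Where to draw the "essentially the same" line is something we'd have to calibrate against known-clean page pairs, but legitimately unrelated descriptions should share almost no five-word shingles.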
All suggestions most welcome --

nop

Re: Mangling HTML to protect content, and finding stolen HTML content
by Callum (Chaplain) on Nov 08, 2002 at 17:13 UTC
    Measuring the frequency of and distance between keywords in particular contexts is widely used in detecting plagiarism, and that may be the way forward for you, coupled with some fuzzy word matching to pick out appropriation of certain keywords or stolen factual information.

    You may wish to look at plagiarism.org, a paper at Georgetown on concordances used for text comparison, and Christian Queinnec's plagiarism-detection script, plagiat.
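    A crude illustration of the keyword-frequency idea (the sample strings and the four-letter cutoff below are arbitrary): build a word-frequency profile for each tag-stripped document and compare the two profiles with cosine similarity.

        use strict;
        use warnings;

        # Word-frequency profile of a (tag-stripped) document: counts of
        # every word longer than three letters, lowercased.
        sub profile {
            my ($text) = @_;
            my %freq;
            $freq{ lc $_ }++ for grep { length > 3 } $text =~ /([A-Za-z']+)/g;
            return \%freq;
        }

        # Cosine similarity between two profiles: 1.0 means an identical
        # keyword mix, 0 means no keywords in common.
        sub cosine {
            my ($p, $q) = @_;
            my ($dot, $np, $nq) = (0, 0, 0);
            $dot += $p->{$_} * ($q->{$_} || 0) for keys %$p;
            $np  += $_ ** 2 for values %$p;
            $nq  += $_ ** 2 for values %$q;
            return ($np && $nq) ? $dot / sqrt($np * $nq) : 0;
        }

        my $ours   = profile('Brushed moleskin knee length skirt with patch pockets ...');
        my $theirs = profile('Knee length skirt in brushed moleskin, patch pockets ...');
        printf "keyword-profile similarity: %.2f\n", cosine($ours, $theirs);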

Re: (nrd) Mangling HTML to protect content, and finding stolen HTML content
by newrisedesigns (Curate) on Nov 08, 2002 at 16:48 UTC

    What makes you think it's something so sophisticated as a robot? It's probably someone with a browser cutting and pasting your copy. CSS and HTML entities won't help you there.

    Use Server-Side Includes to #include virtual a Perl script that will log visits to the page. Log the frequency, the IP address (for comparing net-blocks), and the User-Agent. Also include, both in a visible text footer and in a comment in the source of each page, a disclaimer to the effect that "all copyright violators will be prosecuted", plus other relevant legalese.
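    A minimal sketch of that setup, assuming Apache-style SSI and a cgi-bin (the log path and field layout are placeholders): each page carries <!--#include virtual="/cgi-bin/visitlog.pl" -->, and the included script just appends one line per hit.

        #!/usr/bin/perl
        # visitlog.pl -- pulled into every content page via
        #   <!--#include virtual="/cgi-bin/visitlog.pl" -->
        # Appends one line per hit: timestamp, client IP, User-Agent, page.
        use strict;
        use warnings;
        use Fcntl qw(:flock);

        my $log = '/var/log/httpd/content-visits.log';   # adjust to taste

        if (open my $fh, '>>', $log) {    # never break the page if logging fails
            flock $fh, LOCK_EX;
            printf {$fh} "%s\t%s\t%s\t%s\n",
                scalar localtime,
                $ENV{REMOTE_ADDR}     || '-',
                $ENV{HTTP_USER_AGENT} || '-',
                $ENV{DOCUMENT_URI}    || $ENV{REQUEST_URI} || '-';
            close $fh;
        }

        print "Content-type: text/html\n\n";   # the included output can be empty

    A single IP or net-block walking every product page in rapid succession, or an odd User-Agent doing the same, is usually all you need to see in that log to spot a scraper.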

    If your client base is a select few, use authentication to prevent the general populace from viewing the content.

    Post a follow up if this doesn't cover what you want.

    John J Reiser
    newrisedesigns.com

      It's probably someone with a browser cutting and pasting your copy.

      I agree. We call these types of distributors 'trunk slammers' (mostly because, before the advent of the web, they sold their products from the trunks of their cars and offered zero after-market support). Most of them are not too bright and would view automated copy theft as something akin to reading ancient Greek.

      One of the strategies we've adopted to thwart unwanted viewing of our product info is to offer preferred-customer discounts and require login before we serve up the goodies. On the stuff we do allow the general public to view, we pepper the HTML with custom tags and CSS class ids. You'd be surprised how infrequently the thieves bother to remove something like <p class="DD15893wankerbeans"> text </p> -- more proof in my mind that they are not too sophisticated in, or concerned about, their thievery. Hunting down stolen text is simply a matter of creating our own robots to search out these custom class names.
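      For instance, a crude version of such a robot (the second class name and the user-agent string below are made up for the example) just reads a list of candidate URLs and flags any page whose source still carries one of the custom class names:

          #!/usr/bin/perl
          # Feed it a file of candidate URLs, one per line (e.g. collected
          # from search-engine results); it flags any page containing one
          # of our watermark class names.
          use strict;
          use warnings;
          use LWP::UserAgent;

          my @watermarks = qw(DD15893wankerbeans DD15894otherclass);
          my $ua = LWP::UserAgent->new(timeout => 20, agent => 'content-police/0.1');

          while (my $url = <>) {
              chomp $url;
              next unless $url =~ m{^https?://};
              my $res = $ua->get($url);
              next unless $res->is_success;
              my $html = $res->content;
              for my $mark (@watermarks) {
                  print "$url contains watermark '$mark'\n" if index($html, $mark) >= 0;
              }
          }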

        earthboundmisfit++

        The CSS would work if they copied your source, which I doubt the real idiots would do. Other than that, it's a great idea. You could go so far as to include a
        <div style="display: none;">Don't be an idiot and steal this page. randomtexteasilyfoundviasearchengine </div>

        Good stuff. You don't even need a "discount" to compel someone to sign in. In my experience, most web users will sign up for anything, as long as the process isn't too complicated. And if the copy thief signs in/makes an account, you have his or her personal information. Crafty.

        Of course, you (generally speaking) shouldn't do anything more than use this to counteract theft; if you do, outline it in the company's privacy policy, so users know exactly what's going on. I doubt your business wants a PR black eye for "stealing user information." </disclaimer>

        John J Reiser
        newrisedesigns.com

Re: Mangling HTML to protect content, and finding stolen HTML content
by kshay (Beadle) on Nov 08, 2002 at 17:46 UTC
    And if such mangling existed, would it stop a person from manually cutting and pasting from the browser? I know we can't stop the cut-and-paste, but would the mangled stuff then require laborious hand editing to clean up?
    No, probably not. No matter how much mangledness you have in there, if it looks normal to a human looking at the page, then the text, when you copy and paste it or "Save As.../Plain text," will be normal.

    Well, I guess as a radical approach you could do something like replace every other space with an "i" or some other narrow character, but put it in a font color that's the same as the background color. (You can't do every space, obviously, because then it won't word wrap.) So it'll look normal, but when you try to copy and paste it, you'll get something like this:

    Thisiis aiwonderful product.iYou shouldibuy itiimmediately.

    Of course, an actual customer who tried to copy and paste the text (say, to email it to a friend who might be interested in the product) would probably get annoyed by this.
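    For what it's worth, the substitution itself is trivial; here is a sketch, assuming a white page background (and see the follow-ups below for why you probably shouldn't actually do this):

        use strict;
        use warnings;

        # Replace every other space with an invisible (background-colored)
        # narrow character.  Renders normally on a white page, but pastes
        # as "Thisiis aiwonderful product.iYou shouldibuy itiimmediately."
        sub mangle_spaces {
            my ($text) = @_;
            my $hidden = '<font color="#ffffff">i</font>';
            my $n = 0;
            $text =~ s/ / $n++ % 2 ? ' ' : $hidden /ge;
            return $text;
        }

        print mangle_spaces("This is a wonderful product. You should buy it immediately."), "\n";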

    As for detection, what about writing a script to discover some unique "watermark" phrases in your descriptions? Here's what I mean. Let's say your product description is this (ironically, I just grabbed it from a random Yahoo store):

    Brushed moleskin knee length skirt. Patch pockets front with flirty 9" back center slit. Coco exposed stitching. Zip fly, belt loops. Entire length of size Medium: 24". Stretchy, light-weight 96% Cotton, 4% Spandex. Hand wash cold, hang dry. Made in the USA.

    Use LWP (actually, Google frowns on you doing this sort of thing programmatically, so let's assume you get a Google API key and do it all nice and proper) to search Google for each three-word phrase in succession: "brushed moleskin knee", "moleskin knee length", "knee length skirt", "length skirt patch", etc. You'd probably want to skip over any words shorter than 4 letters, because they're less likely to be part of unique phrases.

    Keep track of which phrases return zero results (use -site:mysite.com in the query to omit pages from your own site). Then a few weeks later, search for those phrases again. If you find any results, maybe you've got your plagiarist...
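    Generating the phrases is the easy part; something like this would do it (count_hits is just a stub for whichever search interface you end up using, and mysite.com is a placeholder):

        #!/usr/bin/perl
        # Build candidate "watermark" phrases from a product description:
        # drop words shorter than 4 letters, slide a 3-word window across
        # what's left, and note the phrases that get zero hits today.
        use strict;
        use warnings;

        my $desc = 'Brushed moleskin knee length skirt. Patch pockets front with '
                 . 'flirty 9" back center slit. Coco exposed stitching.';

        my @words   = grep { length >= 4 } map { lc } $desc =~ /([a-z]+)/gi;
        my @phrases = map { "@words[$_ .. $_ + 2]" } 0 .. $#words - 2;

        for my $phrase (@phrases) {
            my $hits = count_hits(qq{"$phrase" -site:mysite.com});
            print qq{no hits today for "$phrase" -- recheck in a few weeks\n}
                if defined $hits && $hits == 0;
        }

        sub count_hits {
            my ($query) = @_;
            # ... run $query against the search engine of your choice and
            # return the result count; deliberately left unimplemented ...
            return undef;
        }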

    Cheers,
    --Kevin

      I'd strongly recommend against putting background-colored characters in text as a substitute for spaces. That will mess up external search engines and probably your own internal search engine. It's also a pretty huge accessibility-guidelines violation -- anyone reading the page with different colors, via a text-only browser, etc., will have a badly degraded experience. What happens if you print the page and the background color drops out, as is often the case? Suddenly you have spurious letters appearing in your text...

              $perlmonks{seattlejohn} = 'John Clyman';

        Yes, I certainly don't think it's a good idea. It just came to mind as one of the few ways you might be able to munge text on a web page so that it "looks normal" but can't be copied and pasted.

        --Kevin

      Well, I guess as a radical approach you could do something like replace every other space with an "i" or some other narrow character, but put it in a font color that's the same as the background color. (You can't do every space, obviously, because then it won't word wrap.) So it'll look normal, but when you try to copy and paste it, you'll get something like this:

      Thisiis aiwonderful product.iYou shouldibuy itiimmediately.

      Of course, an actual customer who tried to copy and paste the text (say, to email it to a friend who might be interested in the product) would probably get annoyed by this.

      Don't expose the above to search engines...unless you want to be de-indexed for decades.

Re: Mangling HTML to protect content, and finding stolen HTML content
by traveler (Parson) on Nov 08, 2002 at 18:05 UTC
    One solution may be to reduce the content in the HTML. That can be done by using graphics for much of the content. This seems to be a growing trend on sites I've visited recently. It requires some creative work for search-engine submission, but if the important text is in a graphic, it is possibly less vulnerable to theft, particularly if the graphic contains a proper copyright notice. You can also encode data in the graphic using a process called steganography. See this site for some tools to help you out. I could not find a CPAN module for this.

    If you have encoded data in the image, and it is stolen, you should be able to use a decoding tool to show that it is indeed your image. Combined with a copyright embossed on the image, you are probably much safer.
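    As a toy illustration of the least-significant-bit idea (this is not one of the tools linked above; Imager is used here only for pixel access, and the file names and tag are placeholders):

        #!/usr/bin/perl
        # Hide a short ASCII tag in the blue-channel LSBs of an image.
        # Crude on purpose: the mark will not survive recompression,
        # resizing, or palette reduction.
        use strict;
        use warnings;
        use Imager;

        my ($in, $out, $tag) = ('product.png', 'product-marked.png', 'Copyright MyCo 2002');

        my $img = Imager->new;
        $img->read(file => $in) or die $img->errstr;

        my @bits = split //, unpack 'B*', $tag;       # message as individual bits
        my ($w, $h) = ($img->getwidth, $img->getheight);
        die "image too small for tag\n" if @bits > $w * $h;

        my $i = 0;
        PIXEL:
        for my $y (0 .. $h - 1) {
            for my $x (0 .. $w - 1) {
                last PIXEL if $i >= @bits;
                my ($r, $g, $b) = $img->getpixel(x => $x, y => $y)->rgba;
                $b = ($b & ~1) | $bits[$i++];         # overwrite the blue LSB
                $img->setpixel(x => $x, y => $y, color => Imager::Color->new($r, $g, $b));
            }
        }

        $img->write(file => $out) or die $img->errstr;
        print "embedded ", scalar @bits, " bits into $out\n";

    Extraction is the same walk in reverse: read the blue LSBs back, pack them into bytes, and compare the result with the tag you embedded.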

    (You could also display your catalog as PDF, but there may be issues regarding plug-ins, load time, etc.)

    HTH, --traveler

      I'm fortunate enough to have DSL, but for the majority of Americans (dunno about the rest of the world), waiting 5 minutes for all the 'text' graphics to load over their good old analog modems might make them think twice about shopping there. Broadband is nowhere near universal in the US.

      --isotope
      http://www.skylab.org/~isotope/
        I don't have DSL, either. I know it's an issue, but if the graphics have low enough resolution, they can be pretty fast. In fact, I have seen some graphics-heavy sites load faster than some HTML pages when the HTML has lots of complex rendering to do. It may take some experimentation to find the best mix of graphics and HTML, but these days some graphics seem to load very fast, even over slow links.

        --traveler

Re: Mangling HTML to protect content, and finding stolen HTML content
by SpaceAce (Beadle) on Nov 08, 2002 at 17:49 UTC
    Whether the text is being stolen by robots or humans is pretty much moot, anyway. If the page source can be fetched by a browser, it can be fetched by a robot or a spider. Probably the best thing for thwarting thieves would be to password-protect the website, but that isn't always what you want. The web being what it is, there are not a lot of effective ways to protect your source except to obfuscate it as much as you can with ugly HTML and excessive JavaScript.

    SpaceAce
Re: Mangling HTML to protect content, and finding stolen HTML content
by John M. Dlugosz (Monsignor) on Nov 08, 2002 at 22:51 UTC
    If the actual HTML source is stolen, some odd comments or affectations would be enough to detect it with your own web crawler, and to prove the copy is not original.

    With a cut-and-paste of a paragraph from the browser window, though, all of that is probably lost.

    You might also embed a code via steganographic techniques, using only the content that isn't affected by formatting (so, extra spaces are out, etc.). I played around with that here.
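    A toy version of that idea, for what it's worth (the {a|b} template syntax and the bit pattern are invented here): each slot in the copy encodes one bit by the choice between two equivalent wordings, so the mark survives cut-and-paste and any reformatting.

        use strict;
        use warnings;

        # Pick one alternative per {a|b} slot according to the next bit,
        # so the wording itself carries the watermark.
        sub watermark {
            my ($template, @bits) = @_;
            $template =~ s/\{([^|}]*)\|([^}]*)\}/ (shift @bits) ? $2 : $1 /ge;
            return $template;
        }

        my $template = 'This {skirt|piece} is made {in the USA|domestically} '
                     . 'from {light-weight|lightweight} stretch cotton.';

        print watermark($template, 1, 0, 1), "\n";
        # prints: This piece is made in the USA from lightweight stretch cotton.

    Checking a suspect copy is then just a matter of reading the wording choices back and comparing them with the bit pattern you served.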

    —John

Crude javascript solution
by nop (Hermit) on Nov 13, 2002 at 19:20 UTC
    This is nop again. OK, this is really simple and crude, but this small JavaScript hack does slow down the basic click-and-save image grabbers. Yes, they can view source and get the image URL there, but that's more inconvenient, plus (as our pages are graphics-rich with images, banners, logos, nav elements, etc.) it takes a few moments to find the right IMG tag in the source. Here's a site using it: http://www.marvelcreations.com/priv15.html -- right-click on their images and try to save them.

      I hate to be harsh, nop, but it is very crude indeed, as it doesn't work with all browsers. I viewed that page with Phoenix 0.4 and was able to right-click all I wanted. It doesn't stop someone from saving the page and its images (most modern browsers can do that). It also doesn't stop someone from using software other than a browser to download everything.

        Yep. Basically stops the IE crowd. As I said, it isn't a "solution" by any means.
