
And if such mangling existed, would it stop a person from manually cutting and pasting from the browser? I know we can't stop the cut-and-paste, but would the mangled stuff then require laborious hand editing to clean up?

No, probably not. No matter how much mangling there is in the markup, if the page looks normal to a human reading it, then the text will come out normal when you copy and paste it or use "Save As.../Plain text."

Well, I guess as a radical approach you could do something like replace every other space with an "i" or some other narrow character, but put it in a font color that's the same as the background color. (You can't do every space, obviously, because then it won't word wrap.) So it'll look normal, but when you try to copy and paste it, you'll get something like this:

Thisiis aiwonderful product.iYou shouldibuy itiimmediately.

Of course, an actual customer who tried to copy and paste the text (say, to email it to a friend who might be interested in the product) would probably get annoyed by this.
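For what it's worth, here is a rough sketch of that every-other-space trick. It's in Python rather than Perl, and the names (mangle_spaces, as_copied), the white background color, and the tag-stripping regex are all my own assumptions for illustration. It also assumes the text is plain, with no HTML special characters that would need escaping.

```python
import re

def mangle_spaces(text, filler="i", color="#ffffff"):
    """Replace every other space with a filler character hidden via font color.

    Assumption: `color` matches the page background, so the filler is invisible.
    """
    hidden = f'<span style="color:{color}">{filler}</span>'
    parts = text.split(" ")
    out = [parts[0]]
    for n, word in enumerate(parts[1:], start=1):
        # Odd-numbered gaps get the hidden filler; even-numbered gaps stay
        # real spaces so the browser can still word wrap.
        out.append(hidden if n % 2 else " ")
        out.append(word)
    return "".join(out)

def as_copied(html):
    """Roughly what copy-and-paste yields: the text with the tags stripped."""
    return re.sub(r"<[^>]+>", "", html)

if __name__ == "__main__":
    html = mangle_spaces("This is a wonderful product. You should buy it immediately.")
    print(as_copied(html))
```

Running it on the sample sentence should reproduce the garbled line shown above, which is also what your real customers would end up pasting.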

As for detection, what about writing a script to discover some unique "watermark" phrases in your descriptions? Here's what I mean. Let's say your product description is this (ironically, I just grabbed it from a random Yahoo store):

Brushed moleskin knee length skirt. Patch pockets front with flirty 9" back center slit. Coco exposed stitching. Zip fly, belt loops. Enitre length of size Medium: 24". Stretchy, light-weight 96% Cotton, 4% Spandex. Hand wash cold, hang dry. Made in the USA.

Use LWP (actually, Google frowns on you doing this sort of thing programmatically, so let's assume you get a Google API key and do it all nice and proper) to search Google for each three-word phrase in succession: "brushed moleskin knee", "moleskin knee length", "knee length skirt", "length skirt patch", etc. You'd probably want to skip over any words shorter than 4 letters, because they're less likely to be part of unique phrases.
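The phrase-generation step might look something like the sketch below (Python here rather than Perl; candidate_phrases and its parameters are made-up names). One assumption to flag: I read "skip over" as dropping any three-word window that contains a word shorter than 4 letters, rather than deleting the short words and joining what's left, so that every phrase is still an exact run of words from the description. Tokenizing on letters and digits is also just a guess at what's reasonable.

```python
import re

def candidate_phrases(description, size=3, min_len=4):
    """Return consecutive `size`-word phrases, skipping windows with short words."""
    # Lowercase and split on anything that isn't a letter or digit.
    words = re.findall(r"[a-z0-9]+", description.lower())
    phrases = []
    for i in range(len(words) - size + 1):
        window = words[i:i + size]
        # Assumption: a window is dropped if any of its words is too short.
        if all(len(w) >= min_len for w in window):
            phrases.append(" ".join(window))
    return phrases

if __name__ == "__main__":
    text = "Brushed moleskin knee length skirt. Patch pockets front."
    for phrase in candidate_phrases(text):
        print(phrase)
```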

Keep track of which phrases return zero results (use -site:mysite.com in the query to omit pages from your own site). Then a few weeks later, search for those phrases again. If you find any results, maybe you've got your plagiarist...
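The bookkeeping could be sketched like this (again in Python, not Perl). The count_results argument is a hypothetical callable that you'd write yourself around whatever search API you're allowed to use; it takes a query string and returns the number of hits. I haven't tied it to any real API here. The function names and the mysite.com default are placeholders too.

```python
def build_query(phrase, own_site="mysite.com"):
    """Exact-phrase query that leaves out pages from your own site."""
    return f'"{phrase}" -site:{own_site}'

def find_watermarks(phrases, count_results, own_site="mysite.com"):
    """First pass: keep the phrases that currently return zero results."""
    return [p for p in phrases
            if count_results(build_query(p, own_site)) == 0]

def recheck(watermarks, count_results, own_site="mysite.com"):
    """Later pass: report watermark phrases that now return results."""
    return [p for p in watermarks
            if count_results(build_query(p, own_site)) > 0]

if __name__ == "__main__":
    # Stand-in for a real search call: pretend nothing is indexed yet.
    def fake_count(query):
        return 0

    marks = find_watermarks(["brushed moleskin knee"], fake_count)
    print(marks)
```

You'd save the watermark list somewhere between runs (a flat file would do) and feed it to recheck a few weeks later.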

Cheers,
--Kevin

In reply to Re: Mangling HTML to protect content, and finding stolen HTML content by kshay
in thread Mangling HTML to protect content, and finding stolen HTML content by nop
