And if such mangling existed, would it stop a person from manually cutting and pasting from the browser? I know we can't stop the cut-and-paste, but would the mangled stuff then require laborious hand editing to clean up?
No, probably not. No matter how much mangledness you have in there, if it looks normal to a human looking at the page, then the text, when you copy and paste it or "Save As.../Plain text," will be normal.

Well, I guess as a radical approach you could do something like replace every other space with an "i" or some other narrow character, but put it in a font color that's the same as the background color. (You can't do every space, obviously, because then it won't word wrap.) So it'll look normal, but when you try to copy and paste it, you'll get something like this:

Thisiis aiwonderful product.iYou shouldibuy itiimmediately.

Of course, an actual customer who tried to copy and paste the text (say, to email it to a friend who might be interested in the product) would probably get annoyed by this.
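
If you wanted to try the hidden-filler idea anyway, here is a minimal sketch of it in Python. The function names, the inline `style` attribute, and the white background are my own assumptions, not anything from a real store template. `copied_text` just mimics what a plain-text copy would contain.

```python
import html

def _interleave(words, odd_gap, even_gap=" "):
    """Join words, using odd_gap for the 1st, 3rd, 5th... gap and even_gap for the rest."""
    out = [words[0]]
    for n, word in enumerate(words[1:], start=1):
        out.append(odd_gap if n % 2 == 1 else even_gap)
        out.append(word)
    return "".join(out)

def mangle_spaces(text, filler="i", background="#ffffff"):
    """Return HTML where every other space is a filler character in the background color.

    The remaining gaps stay as real spaces so the browser can still word wrap.
    """
    hidden = f'<span style="color:{background}">{html.escape(filler)}</span>'
    words = [html.escape(w) for w in text.split(" ")]
    return _interleave(words, hidden)

def copied_text(text, filler="i"):
    """Approximate what copy-and-paste yields: the filler is there, the markup is not."""
    return _interleave(text.split(" "), filler)

if __name__ == "__main__":
    sample = "This is a wonderful product. You should buy it immediately."
    print(mangle_spaces(sample))
    print(copied_text(sample))
```

Running `copied_text` on the sample sentence reproduces the mangled line shown above. The filler is only hidden while the page background really is the assumed color.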

As for detection, what about writing a script to discover some unique "watermark" phrases in your descriptions? Here's what I mean. Let's say your product description is this (ironically, I just grabbed it from a random Yahoo store):

Brushed moleskin knee length skirt. Patch pockets front with flirty 9" back center slit. Coco exposed stitching. Zip fly, belt loops. Enitre length of size Medium: 24". Stretchy, light-weight 96% Cotton, 4% Spandex. Hand wash cold, hang dry. Made in the USA.

Use LWP (actually, Google frowns on you doing this sort of thing programmatically, so let's assume you get a Google API key and do it all nice and proper) to search Google for each three-word phrase in succession: "brushed moleskin knee", "moleskin knee length", "knee length skirt", "length skirt patch", etc. You'd probably want to skip over any words shorter than 4 letters, because they're less likely to be part of unique phrases.
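
The phrase-generation step could look roughly like this. It's a sketch in Python rather than Perl, and `watermark_phrases` is a name I made up. I'm reading "skip over short words" as "drop any three-word window that contains a short or non-alphabetic token", so that every phrase is still a contiguous run of the original text and can be searched as an exact phrase. That reading is an assumption; you could just as well drop the short words first.

```python
import string

def watermark_phrases(description, size=3, min_len=4):
    """Return every run of `size` consecutive words usable as a search phrase.

    A window is discarded if any of its tokens is shorter than `min_len`
    letters or contains non-letters (numbers, hyphenated words, etc.).
    """
    # Split on whitespace, trim surrounding punctuation, and lowercase.
    tokens = [t.strip(string.punctuation).lower() for t in description.split()]
    phrases = []
    for i in range(len(tokens) - size + 1):
        window = tokens[i:i + size]
        if all(w.isalpha() and len(w) >= min_len for w in window):
            phrases.append(" ".join(window))
    return phrases

if __name__ == "__main__":
    text = ('Brushed moleskin knee length skirt. Patch pockets front '
            'with flirty 9" back center slit.')
    for phrase in watermark_phrases(text):
        print(phrase)
```

On the start of the skirt description this yields "brushed moleskin knee", "moleskin knee length", "knee length skirt", "length skirt patch" and so on, and it leaves out anything touching the 9" measurement.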

Keep track of which phrases return zero results (use -site:mysite.com in the query to omit pages from your own site). Then a few weeks later, search for those phrases again. If you find any results, maybe you've got your plagiarist...
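
Here is a rough sketch of that bookkeeping in Python. The search itself is left out on purpose: `count_hits` is a hypothetical callable that you would supply yourself (for example, a wrapper around whatever search API you have a key for), taking a query string and returning a result count. The other function names are made up for illustration too.

```python
def build_query(phrase, own_site):
    """Exact-phrase query that excludes pages from your own site."""
    return f'"{phrase}" -site:{own_site}'

def find_unique(phrases, count_hits, own_site):
    """First pass: keep the phrases that currently return zero results."""
    return [p for p in phrases if count_hits(build_query(p, own_site)) == 0]

def find_suspects(unique_phrases, count_hits, own_site):
    """Later pass: phrases that used to return nothing but now have hits."""
    return [p for p in unique_phrases if count_hits(build_query(p, own_site)) > 0]

if __name__ == "__main__":
    # Stand-in for a real search backend, for demonstration only.
    fake_web = {}
    count = lambda query: fake_web.get(query, 0)

    baseline = find_unique(["brushed moleskin knee"], count, "mysite.com")
    fake_web[build_query("brushed moleskin knee", "mysite.com")] = 1  # a copy appears
    print(find_suspects(baseline, count, "mysite.com"))
```

You would need to save the list from `find_unique` somewhere (a file, a database) so it survives the few weeks between the two passes.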

Cheers,
--Kevin


In reply to Re: Mangling HTML to protect content, and finding stolen HTML content by kshay
in thread Mangling HTML to protect content, and finding stolen HTML content by nop
