PerlMonks  

Re: Mangling HTML to protect content, and finding stolen HTML content

by kshay (Beadle)
on Nov 08, 2002 at 17:46 UTC (#211488)


in reply to Mangling HTML to protect content, and finding stolen HTML content

And if such mangling existed, would it stop a person from manually cutting and pasting from the browser? I know we can't stop the cut-and-paste, but would the mangled stuff then require laborious hand editing to clean up?
No, probably not. No matter how much mangledness you have in there, if it looks normal to a human looking at the page, then the text, when you copy and paste it or "Save As.../Plain text," will be normal.

Well, I guess as a radical approach you could do something like replace every other space with an "i" or some other narrow character, but put it in a font color that's the same as the background color. (You can't do every space, obviously, because then it won't word wrap.) So it'll look normal, but when you try to copy and paste it, you'll get something like this:

Thisiis aiwonderful product.iYou shouldibuy itiimmediately.

Of course, an actual customer who tried to copy and paste the text (say, to email it to a friend who might be interested in the product) would probably get annoyed by this.
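To make the trick concrete, here is a minimal sketch in Python (not the Perl you'd expect around here, purely for illustration). The function names `mangle` and `as_copied`, the filler character, and the white background color are all assumptions of mine, and `as_copied` only approximates what a browser puts on the clipboard by stripping tags.

```python
import html
import re


def mangle(text, filler="i", color="#ffffff"):
    """Replace every other space with a filler character that is
    colored like the (assumed white) page background."""
    hidden = f'<span style="color:{color}">{filler}</span>'
    parts = text.split(" ")
    out = [html.escape(parts[0])]
    for n, word in enumerate(parts[1:], start=1):
        # Odd-numbered gaps get the hidden filler; even ones stay real
        # spaces so the browser can still wrap the line.
        sep = hidden if n % 2 == 1 else " "
        out.append(sep + html.escape(word))
    return "".join(out)


def as_copied(mangled_html):
    """Rough stand-in for copy-and-paste: drop the tags, keep the text."""
    return html.unescape(re.sub(r"<[^>]+>", "", mangled_html))


page = mangle("This is a wonderful product. You should buy it immediately.")
print(as_copied(page))
```

Run on the sentence above, the stripped text comes out with the filler letters in place of half the spaces, which is exactly the annoyance a legitimate customer would hit too.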

As for detection, what about writing a script to discover some unique "watermark" phrases in your descriptions? Here's what I mean. Let's say your product description is this (ironically, I just grabbed it from a random Yahoo store):

Brushed moleskin knee length skirt. Patch pockets front with flirty 9" back center slit. Coco exposed stitching. Zip fly, belt loops. Enitre length of size Medium: 24". Stretchy, light-weight 96% Cotton, 4% Spandex. Hand wash cold, hang dry. Made in the USA.

Use LWP (actually, Google frowns on you doing this sort of thing programmatically, so let's assume you get a Google API key and do it all nice and proper) to search Google for each three-word phrase in succession: "brushed moleskin knee", "moleskin knee length", "knee length skirt", "length skirt patch", etc. You'd probably want to skip over any words shorter than 4 letters, because they're less likely to be part of unique phrases.
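The phrase generation could look something like the following Python sketch. The helper name `candidate_phrases` and the tokenizing rule (runs of ASCII letters only, so numbers and punctuation are dropped and "light-weight" splits in two) are my assumptions; the three-word window and the 4-letter cutoff come from the description above.

```python
import re


def candidate_phrases(text, size=3, min_len=4):
    """Return every run of `size` consecutive words, after lowercasing
    and discarding words shorter than `min_len` letters."""
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", text)]
    words = [w for w in words if len(w) >= min_len]
    return [" ".join(words[i:i + size])
            for i in range(len(words) - size + 1)]


description = ('Brushed moleskin knee length skirt. '
               'Patch pockets front with flirty 9" back center slit.')
for phrase in candidate_phrases(description)[:4]:
    print(phrase)
```

Note that the window slides straight across sentence boundaries, which is how "length skirt patch" ends up in the list.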

Keep track of which phrases return zero results (use -site:mysite.com in the query to omit pages from your own site). Then a few weeks later, search for those phrases again. If you find any results, maybe you've got your plagiarist...
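The bookkeeping might be sketched like this in Python. Everything about the search itself is left out on purpose: `count_hits` is a hypothetical callable you would supply (for example, a wrapper around whatever search API you have a key for) that takes a query string and returns a result count. The function names are mine; only the quoted-phrase query with `-site:mysite.com` comes from the description above.

```python
def build_query(phrase, own_site="mysite.com"):
    """Exact-phrase query that excludes pages from your own site."""
    return f'"{phrase}" -site:{own_site}'


def find_watermarks(phrases, count_hits, own_site="mysite.com"):
    """First pass: keep the phrases that currently return zero results."""
    return [p for p in phrases
            if count_hits(build_query(p, own_site)) == 0]


def find_copies(watermarks, count_hits, own_site="mysite.com"):
    """Later pass: report watermark phrases that now return results."""
    return [p for p in watermarks
            if count_hits(build_query(p, own_site)) > 0]


# Usage with a stand-in for the real search (made-up counts for illustration).
fake_counts = {'"brushed moleskin knee" -site:mysite.com': 0,
               '"knee length skirt" -site:mysite.com': 5400}
marks = find_watermarks(["brushed moleskin knee", "knee length skirt"],
                        lambda q: fake_counts[q])
print(marks)
```

You'd save the watermark list somewhere between runs and feed it to `find_copies` a few weeks later.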

Cheers,
--Kevin


Replies are listed 'Best First'.
Re: Re: Mangling HTML to protect content, and finding stolen HTML content
by seattlejohn (Deacon) on Nov 08, 2002 at 18:44 UTC
    I'd strongly recommend against putting background-colored characters in text as a substitute for spaces. That will mess up external search engines and probably your own internal search engine. It's also a pretty huge accessibility-guidelines violation -- anyone reading the page with different colors, via a text-only browser, etc., will have a badly degraded experience. What happens if you print the page and the background color drops out, as is often the case? Suddenly you have spurious letters appearing in your text...

            $perlmonks{seattlejohn} = 'John Clyman';

      Yes, I certainly don't think it's a good idea. It just came to mind as one of the few ways you might be able to munge text on a web page so that it "looks normal" but can't be copied and pasted.

      --Kevin

Re: Re: Mangling HTML to protect content, and finding stolen HTML content
by zaimoni (Beadle) on Nov 09, 2002 at 04:30 UTC

    Well, I guess as a radical approach you could do something like replace every other space with an "i" or some other narrow character, but put it in a font color that's the same as the background color. (You can't do every space, obviously, because then it won't word wrap.) So it'll look normal, but when you try to copy and paste it, you'll get something like this:

    Thisiis aiwonderful product.iYou shouldibuy itiimmediately.

    Of course, an actual customer who tried to copy and paste the text (say, to email it to a friend who might be interested in the product) would probably get annoyed by this.

    Don't expose the above to search engines...unless you want to be de-indexed for decades.
