in reply to GREP Question: Filtering out third-party images with Privoxy

G'day karld12,

Welcome to the monastery.

I'm not familiar with Privoxy; however, looking at its User Manual, I think this regex will probably do what you want:

s/\s*<img.*src="[^"]*(?<!\/someimage)\.jpg".*>//gm
s/\s*<img.*src="http:\/\/(?!images\.google\.com\/)[^>]+>//gm

Update: My apologies. I originally focussed on the image name but, on rereading your question, I see you want to exclude domains. The first regex above is my original solution (it was in the spoiler); the second is the more appropriate one, and it's the one used in the test below.

Here's my test:

#!/usr/bin/env perl

use strict;
use warnings;

my $html_fragment = <<'END_HTML';
<img src="http://images.google.com/someimage.jpg" />
<img src="http://images.google.com/NOTsomeimage.jpg" />
<img src="http://google.somesite.org/image.jpg" />
<img src="http://somesite.net/google/image.jpg" />
<img src="http://anythingelse.com/etc.jpg" />
END_HTML

print "Initial markup:\n";
print $html_fragment;

$html_fragment =~ s/\s*<img.*src="http:\/\/(?!images\.google\.com\/)[^>]+>//gm;

print "Modified markup:\n";
print $html_fragment;

Output:

Initial markup:
<img src="http://images.google.com/someimage.jpg" />
<img src="http://images.google.com/NOTsomeimage.jpg" />
<img src="http://google.somesite.org/image.jpg" />
<img src="http://somesite.net/google/image.jpg" />
<img src="http://anythingelse.com/etc.jpg" />
Modified markup:
<img src="http://images.google.com/someimage.jpg" />
<img src="http://images.google.com/NOTsomeimage.jpg" />
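To show what the negative lookahead is doing on its own, here is a minimal sketch that applies just that part of the pattern to bare URLs. The helper name is_third_party is hypothetical and exists only for this illustration; the hardcoded host is the one from the test above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not part of Privoxy or of the test above):
# returns 1 when the URL does NOT start with http://images.google.com/
sub is_third_party {
    my ($url) = @_;
    # (?!...) is a zero-width negative lookahead: the match succeeds only
    # if the text right after "http://" is not "images.google.com/".
    return $url =~ m{^http://(?!images\.google\.com/)} ? 1 : 0;
}

for my $url (
    'http://images.google.com/someimage.jpg',
    'http://google.somesite.org/image.jpg',
    'http://somesite.net/google/image.jpg',
) {
    printf "%-45s %s\n", $url, is_third_party($url) ? 'remove' : 'keep';
}
```

Because the lookahead consumes no characters, the rest of the pattern ([^>]+> in the substitution) still starts matching immediately after "http://".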

If this doesn't work for you, please provide an example of the HTML and indicate actual and expected output.

-- Ken

Re^2: GREP Question: Filtering out third-party images with Privoxy
by karld12 (Initiate) on Jan 22, 2014 at 13:51 UTC

    @corion: It is Perl compatible, but I don't know if it loads Perl modules. Thanks for the link to jsUri. It seems way above my head, but I'll study it.

    @Ken: Thanks for that, you've given me something to get me started. I should have been clearer, though. The requirement isn't as specific as the examples I gave. I want to be able to go to any website and strip out images that are not being served from the same domain as the page. So in the code you provided I would be looking for "google.com", the domain name, not "someimage". In addition, the domain name should be dynamically obtained somehow, not hardcoded. But to begin with, a hardcoded solution would do.

    I think what you've provided can be used, I just need to figure out how to seek the domain, not the image name.

    s/\s*<img.*src="[^"]*(?<!\/google\.com).*\.jpg".*>//gm

    Would that do it? (Apologies, it's been 10 years since I used GREP much.)

    Thanks again,

    Karl
      "I should have been clearer, ..."

      No, you were clear enough; I should have been more thorough in my reading of your question. Anyway, I've already picked up on that and updated my response.

      Here's how you'd go about using a variable domain name in Perl; I'll leave you to figure out how to implement that in Privoxy. Note: I've added a few more tests.

      #!/usr/bin/env perl

      use strict;
      use warnings;

      my $html_fragment = <<'END_HTML';
      <img src="http://images.google.com/someimage.jpg" />
      <img src="http://images.google.com/NOTsomeimage.jpg" />
      <img src="http://google.somesite.org/image.jpg" />
      <img src="http://somesite.net/google/image.jpg" />
      <img src="http://anythingelse.com/etc.jpg" />
      <img src="http://pictures.google.com/someimage.jpg" />
      <img src="http://google.com/someimage.jpg" />
      END_HTML

      my $domain_to_keep = 'google.com';

      print "Initial markup:\n";
      print $html_fragment;

      $html_fragment =~ s/\s*<img.*src="http:\/\/(?!.*\Q$domain_to_keep\E\/)[^>]+>//gm;

      print "Modified markup:\n";
      print $html_fragment;

      Output:

      Initial markup:
      <img src="http://images.google.com/someimage.jpg" />
      <img src="http://images.google.com/NOTsomeimage.jpg" />
      <img src="http://google.somesite.org/image.jpg" />
      <img src="http://somesite.net/google/image.jpg" />
      <img src="http://anythingelse.com/etc.jpg" />
      <img src="http://pictures.google.com/someimage.jpg" />
      <img src="http://google.com/someimage.jpg" />
      Modified markup:
      <img src="http://images.google.com/someimage.jpg" />
      <img src="http://images.google.com/NOTsomeimage.jpg" />
      <img src="http://pictures.google.com/someimage.jpg" />
      <img src="http://google.com/someimage.jpg" />
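As a side note on the lookbehind attempt posted earlier in this thread, the sketch below shows why it removes every .jpg image, and tries a stricter variant of the lookahead that only inspects the host part of the URL. Both sub names are hypothetical, the extra test URLs are made up for illustration, and the stricter pattern is an assumption about what "same domain" should mean (the domain itself or any subdomain of it), not something taken from the Privoxy manual.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper names; neither sub is part of Privoxy.

# The lookbehind attempt from the question, applied unchanged.
sub strip_with_lookbehind {
    my ($html) = @_;
    # The lookbehind only checks the 11 characters before wherever [^"]*
    # stops, and the engine is free to stop somewhere else, so the
    # condition never blocks a match: every .jpg image is removed.
    $html =~ s/\s*<img.*src="[^"]*(?<!\/google\.com).*\.jpg".*>//gm;
    return $html;
}

# Negative lookahead restricted to the host part of the URL.
sub strip_third_party {
    my ($html, $domain) = @_;
    # Keep the image only if the host is $domain or a subdomain of it;
    # [^/"]+ cannot cross a "/" so the path is never inspected.
    $html =~ s{\s*<img[^>]*\bsrc="http://(?!(?:[^/"]+\.)?\Q$domain\E[/"])[^>]*>}{}g;
    return $html;
}

my $html = <<'END_HTML';
<img src="http://images.google.com/a.jpg" />
<img src="http://google.com/b.jpg" />
<img src="http://notgoogle.com/c.jpg" />
<img src="http://somesite.net/google.com/d.jpg" />
END_HTML

print "Lookbehind attempt:\n", strip_with_lookbehind($html), "\n";
print "Host-anchored lookahead:\n", strip_third_party($html, 'google.com');
```

With this input the lookbehind version strips all four images, while the host-anchored version keeps only the first two. The simpler (?!.*\Q$domain_to_keep\E\/) form would also keep the last two, since "google.com/" appears somewhere after "http://" in both.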

      -- Ken

        Thanks again, Ken. After more research and learning I managed to make it work. Though in the end, I settled on simply deleting all .gif images instead of looking for third-party serving because in practice they tend to be hosted locally, even if the link is pointing elsewhere. (But the third-party search can be done: Privoxy provides a variable $host and a custom option 'D' to use it.)

        As an aside, I must say it's not easy finding info on Perl-style regex/grep if you're not actually coding in Perl. For the life of me I haven't been able to find authoritative info on setting delimiters, and was thrown by Privoxy's liberal use of things other than "/". And so on. Will keep learning!

        Karl
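On the delimiter question: in Perl itself, the character that follows the s of a substitution becomes the delimiter, which perlop documents under its quote-like operators. The sketch below shows four equivalent spellings of one substitution; the sample tag and variable names are made up for illustration. Privoxy's filter syntax is described as Perl-like, so the same idea presumably carries over, but that is an assumption to check against its manual.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $tag = q{<img src="http://anythingelse.com/etc.gif" />};

# With "/" as the delimiter, every literal "/" in the pattern is escaped.
(my $slash = $tag) =~ s/<img[^>]*src="http:\/\/[^"]*\.gif"[^>]*>//;

# Choosing a delimiter that does not occur in the pattern removes the
# need for that escaping.
(my $at   = $tag) =~ s@<img[^>]*src="http://[^"]*\.gif"[^>]*>@@;
(my $pipe = $tag) =~ s|<img[^>]*src="http://[^"]*\.gif"[^>]*>||;

# Bracketing delimiters come in pairs: s{pattern}{replacement}.
(my $brace = $tag) =~ s{<img[^>]*src="http://[^"]*\.gif"[^>]*>}{};

print "all four forms removed the tag\n"
    if $slash eq '' && $at eq '' && $pipe eq '' && $brace eq '';
```

The trade-off is that the chosen delimiter can no longer appear unescaped inside the pattern, so "|" is a poor choice when the pattern needs alternation.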