karld12 has asked for the wisdom of the Perl Monks concerning the following question:

Hello all! I have a question regarding GREP filtering in Privoxy. I have posted this question on the Privoxy mailing list, but I don't hold out much hope as the list is mostly for bug reports and support, not GREP solutions. So I've come here as well.

Privoxy allows the user to define filters to strip unwanted content from HTML pages. The filters are said to be Perl style GREP. Here's an example:

FILTER: webbugs Squish WebBugs (1x1 invisible GIFs used for user tracking).
s@<img[^>]*\s(?:width|height)\s*=\s*['"]?[01](?=\D)[^>]*\s(?:width|height)\s*=\s*['"]?[01](?=\D)[^>]*?>@@siUg

What I would like to do is create a filter that would remove images which are served from third-party domains. If I'm looking at http://google.com, then the following would be displayed...

http://images.google.com/someimage.jpg

...but the following would be blocked or filtered out/replaced with a blank:

http://google.somesite.org/image.jpg
http://somesite.net/google/image.jpg
http://anythingelse.com/etc.jpg

My problem is that I struggle with GREP and don't know where to start. How would I reliably establish the domain of the current page? How would I then filter the page for third-party images?

I wonder if any of you have already created such a filter and would be willing to share it. Otherwise I think I'm stuck.

Many thanks!

Karl

Replies are listed 'Best First'.
Re: GREP Question: Filtering out third-party images with Privoxy
by kcott (Archbishop) on Jan 22, 2014 at 13:17 UTC

    G'day karld12,

    Welcome to the monastery.

    I'm not familiar with Privoxy; however, looking at its User Manual, I think this regex will probably do what you want:

    s/\s*<img.*src="[^"]*(?<!\/someimage)\.jpg".*>//gm
    s/\s*<img.*src="http:\/\/(?!images\.google\.com\/)[^>]+>//gm

    Update: My apologies. I originally focussed on the image name but, on rereading your question, I see you want to exclude domains. The first regex above is my original solution; the second is the more appropriate one. Both are tested below, in that order.

    Here's my test:

    #!/usr/bin/env perl
    use strict;
    use warnings;

    my $html_fragment = <<'END_HTML';
    <img src="http://images.google.com/someimage.jpg" />
    <img src="http://images.google.com/NOTsomeimage.jpg" />
    <img src="http://google.somesite.org/image.jpg" />
    <img src="http://somesite.net/google/image.jpg" />
    <img src="http://anythingelse.com/etc.jpg" />
    END_HTML

    print "Initial markup:\n";
    print $html_fragment;

    $html_fragment =~ s/\s*<img.*src="[^"]*(?<!\/someimage)\.jpg".*>//gm;

    print "Modified markup:\n";
    print $html_fragment;

    Output:

    Initial markup:
    <img src="http://images.google.com/someimage.jpg" />
    <img src="http://images.google.com/NOTsomeimage.jpg" />
    <img src="http://google.somesite.org/image.jpg" />
    <img src="http://somesite.net/google/image.jpg" />
    <img src="http://anythingelse.com/etc.jpg" />
    Modified markup:
    <img src="http://images.google.com/someimage.jpg" />

    Here's my test:

    #!/usr/bin/env perl
    use strict;
    use warnings;

    my $html_fragment = <<'END_HTML';
    <img src="http://images.google.com/someimage.jpg" />
    <img src="http://images.google.com/NOTsomeimage.jpg" />
    <img src="http://google.somesite.org/image.jpg" />
    <img src="http://somesite.net/google/image.jpg" />
    <img src="http://anythingelse.com/etc.jpg" />
    END_HTML

    print "Initial markup:\n";
    print $html_fragment;

    $html_fragment =~ s/\s*<img.*src="http:\/\/(?!images\.google\.com\/)[^>]+>//gm;

    print "Modified markup:\n";
    print $html_fragment;

    Output:

    Initial markup:
    <img src="http://images.google.com/someimage.jpg" />
    <img src="http://images.google.com/NOTsomeimage.jpg" />
    <img src="http://google.somesite.org/image.jpg" />
    <img src="http://somesite.net/google/image.jpg" />
    <img src="http://anythingelse.com/etc.jpg" />
    Modified markup:
    <img src="http://images.google.com/someimage.jpg" />
    <img src="http://images.google.com/NOTsomeimage.jpg" />

    If this doesn't work for you, please provide an example of the HTML and indicate actual and expected output.

    -- Ken

      @corion: It is Perl compatible, but I don't know if it loads Perl modules. Thanks for the link to jsUri. It seems way above my head, but I'll study it.

      @Ken: Thanks for that, you've given me something to get me started. I should have been clearer, though. The requirement isn't as specific as the examples I gave. I want to be able to go to any website and strip out images that are not being served from the same domain as the page. So in the code you provided I would be looking for "google.com", the domain name, not "someimage". In addition, the domain name should be dynamically obtained somehow, not hardcoded. But to begin with, a hardcoded solution would do.

      I think what you've provided can be used, I just need to figure out how to seek the domain, not the image name.

      s/\s*<img.*src="[^"]*(?<!\/google\.com).*\.jpg".*>//gm

      Would that do it? (Apologies, it's been 10 years since I used GREP much.)

      Thanks again,

      Karl
        "I should have been clearer, ..."

        No, you were clear enough; I should have been more thorough in my reading of your question. Anyway, I've already picked up on that and updated my response.

        Here's how you'd go about using a variable domain name in Perl; I'll leave you to figure out how to implement that in Privoxy. Note: I've added a few more tests.

        #!/usr/bin/env perl
        use strict;
        use warnings;

        my $html_fragment = <<'END_HTML';
        <img src="http://images.google.com/someimage.jpg" />
        <img src="http://images.google.com/NOTsomeimage.jpg" />
        <img src="http://google.somesite.org/image.jpg" />
        <img src="http://somesite.net/google/image.jpg" />
        <img src="http://anythingelse.com/etc.jpg" />
        <img src="http://pictures.google.com/someimage.jpg" />
        <img src="http://google.com/someimage.jpg" />
        END_HTML

        my $domain_to_keep = 'google.com';

        print "Initial markup:\n";
        print $html_fragment;

        $html_fragment =~ s/\s*<img.*src="http:\/\/(?!.*\Q$domain_to_keep\E\/)[^>]+>//gm;

        print "Modified markup:\n";
        print $html_fragment;

        Output:

        Initial markup:
        <img src="http://images.google.com/someimage.jpg" />
        <img src="http://images.google.com/NOTsomeimage.jpg" />
        <img src="http://google.somesite.org/image.jpg" />
        <img src="http://somesite.net/google/image.jpg" />
        <img src="http://anythingelse.com/etc.jpg" />
        <img src="http://pictures.google.com/someimage.jpg" />
        <img src="http://google.com/someimage.jpg" />
        Modified markup:
        <img src="http://images.google.com/someimage.jpg" />
        <img src="http://images.google.com/NOTsomeimage.jpg" />
        <img src="http://pictures.google.com/someimage.jpg" />
        <img src="http://google.com/someimage.jpg" />

        -- Ken
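
One caveat with the variable-domain regex: its lookahead scans the rest of the line, so a URL that merely contains "google.com/" somewhere (in the path, or as the tail of a longer host name) is also kept. A possible tightening, sketched below, anchors the check to the host part only. The helper name strip_third_party_images, the https? alternative, and the sample URLs are assumptions for illustration; this is plain Perl and has not been tried inside Privoxy.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: remove <img> tags whose src host is neither
# the given domain nor a subdomain of it.
sub strip_third_party_images {
    my ($html, $domain) = @_;
    $html =~ s{
        \s*<img\b[^>]*?\bsrc="https?://              # tag start, up to the URL scheme
        (?!(?:[a-z0-9-]+\.)*\Q$domain\E[/":?\#])     # host is not the domain or a subdomain
        [^>]*>                                       # rest of the tag
    }{}gix;
    return $html;
}

my $html_fragment = <<'END_HTML';
<img src="http://images.google.com/a.jpg" />
<img src="http://google.com/b.jpg" />
<img src="http://somesite.net/google.com/c.jpg" />
<img src="http://notgoogle.com/d.jpg" />
END_HTML

print strip_third_party_images($html_fragment, 'google.com');
```

With this input only the images.google.com and google.com tags should survive; the other two are kept by the looser pattern because the text "google.com/" appears later in the URL.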

Re: GREP Question: Filtering out third-party images with Privoxy
by Corion (Patriarch) on Jan 22, 2014 at 12:55 UTC

    The easy way to split a URL into its constituent parts is to use the URI module, which returns an object with convenient accessors for the protocol, host, port, path and query.
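
A minimal sketch of that approach, assuming the URI module from CPAN is installed (it is not part of core Perl). The helper is_first_party and the sample URL are made up for illustration; the accessors scheme, host, port, path and query are the module's own.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use URI;    # CPAN module, not part of core Perl

# Hypothetical helper: true if the URL's host is the page domain
# or a subdomain of it.
sub is_first_party {
    my ($url, $page_domain) = @_;
    my $uri = URI->new($url);
    return 0 unless $uri->can('host');    # e.g. mailto: URIs have no host
    my $host = lc($uri->host // '');
    return $host eq lc($page_domain) || $host =~ /\.\Q$page_domain\E\z/i ? 1 : 0;
}

my $uri = URI->new('http://images.google.com:8080/path/someimage.jpg?size=large');

print 'scheme: ', $uri->scheme, "\n";
print 'host:   ', $uri->host,   "\n";
print 'port:   ', $uri->port,   "\n";
print 'path:   ', $uri->path,   "\n";
print 'query:  ', $uri->query,  "\n";

print is_first_party('http://images.google.com/someimage.jpg', 'google.com'), "\n";
print is_first_party('http://somesite.net/google/image.jpg',   'google.com'), "\n";
```

Comparing parsed host names rather than raw URL text avoids being fooled by a domain name that only appears in the path.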

    If GREP filtering in Privoxy uses (Perl) regular expressions, but does not allow you to load Perl modules, maybe you can use the regular expression from jsUri.js and do further matches appropriately.
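
As an illustration of the regex-only route, here is a sketch that splits a URL without loading any modules. It does not use the jsUri expression; it swaps in the generic URI regex given in RFC 3986, Appendix B. The helper split_url and the host extraction step are assumptions for illustration, and IPv6 literals are not handled.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Generic URI-splitting regex from RFC 3986, Appendix B.
my $URI_RE = qr{^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?};

# Hypothetical helper: return the URL parts as a hash reference.
sub split_url {
    my ($url) = @_;
    return unless $url =~ $URI_RE;
    my ($scheme, $authority, $path, $query, $fragment) = ($2, $4, $5, $7, $9);

    # Drop optional userinfo and port to leave the bare host name.
    my ($host) = defined $authority ? $authority =~ /^(?:[^@]*@)?([^:]*)/ : ();

    return {
        scheme   => $scheme,
        host     => $host,
        path     => $path,
        query    => $query,
        fragment => $fragment,
    };
}

my $parts = split_url('http://images.google.com/someimage.jpg');
print "scheme: $parts->{scheme}\n";
print "host:   $parts->{host}\n";
print "path:   $parts->{path}\n";
```

Once the host is isolated, the "further matches" can be as simple as checking whether it equals the page's domain or ends with a dot followed by that domain.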

    If GREP filtering in Privoxy is not based on Perl and does not allow you to load modules and/or uses a regular expression syntax different from Perl's, I'm not sure how it really relates to Perl.