in reply to Re: GREP Question: Filtering out third-party images with Privoxy
in thread GREP Question: Filtering out third-party images with Privoxy

@corion: It is Perl compatible, but I don't know if it loads Perl modules. Thanks for the link to jsUri. It seems way above my head, but I'll study it.

@Ken: Thanks for that, you've given me something to get me started. I should have been clearer, though. The requirement isn't as specific as the examples I gave. I want to be able to go to any website and strip out images that are not being served from the same domain as the page. So in the code you provided I would be looking for "google.com", the domain name, not "someimage". In addition, the domain name should be dynamically obtained somehow, not hardcoded. But to begin with, a hardcoded solution would do.

I think what you've provided can be used, I just need to figure out how to seek the domain, not the image name.

s/\s*<img.*src="[^"]*(?<!\/google\.com).*\.jpg".*>//gm

Would that do it? (Apologies, it's been 10 years since I used GREP much.)

Thanks again,

Karl

Re^3: GREP Question: Filtering out third-party images with Privoxy
by kcott (Archbishop) on Jan 22, 2014 at 14:15 UTC
    "I should have been clearer, ..."

    No, you were clear enough; I should have been more thorough in my reading of your question. Anyway, I've already picked up on that and updated my response.

    Here's how you'd go about using a variable domain name in Perl; I'll leave you to figure out how to implement that in Privoxy. Note: I've added a few more tests.

    #!/usr/bin/env perl

    use strict;
    use warnings;

    my $html_fragment = <<'END_HTML';
    <img src="http://images.google.com/someimage.jpg" />
    <img src="http://images.google.com/NOTsomeimage.jpg" />
    <img src="http://google.somesite.org/image.jpg" />
    <img src="http://somesite.net/google/image.jpg" />
    <img src="http://anythingelse.com/etc.jpg" />
    <img src="http://pictures.google.com/someimage.jpg" />
    <img src="http://google.com/someimage.jpg" />
    END_HTML

    my $domain_to_keep = 'google.com';

    print "Initial markup:\n";
    print $html_fragment;

    $html_fragment =~ s/\s*<img.*src="http:\/\/(?!.*\Q$domain_to_keep\E\/)[^>]+>//gm;

    print "Modified markup:\n";
    print $html_fragment;

    Output:

    Initial markup:
    <img src="http://images.google.com/someimage.jpg" />
    <img src="http://images.google.com/NOTsomeimage.jpg" />
    <img src="http://google.somesite.org/image.jpg" />
    <img src="http://somesite.net/google/image.jpg" />
    <img src="http://anythingelse.com/etc.jpg" />
    <img src="http://pictures.google.com/someimage.jpg" />
    <img src="http://google.com/someimage.jpg" />
    Modified markup:
    <img src="http://images.google.com/someimage.jpg" />
    <img src="http://images.google.com/NOTsomeimage.jpg" />
    <img src="http://pictures.google.com/someimage.jpg" />
    <img src="http://google.com/someimage.jpg" />
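
    One thing to watch with the lookahead above: it scans the whole rest of the line for the domain followed by a slash, so a URL such as http://evil.example/google.com/fake.jpg or http://notgoogle.com/image.jpg would also be kept. A possible tightening is to test only the host part of the URL. The following is a minimal sketch under that assumption, not part of the original answer; the function name strip_third_party_images and the extra test URLs are made up for illustration, and it only handles double-quoted absolute http/https URLs.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): remove <img> tags whose src host
# is neither the given domain nor one of its subdomains.
sub strip_third_party_images {
    my ($html, $domain) = @_;

    $html =~ s{
        \s* <img\b [^>]*? \bsrc="https?://      # start of tag, up to the URL scheme
        (?! (?:[^/"]+\.)? \Q$domain\E [/":] )   # host is not the domain or a subdomain of it
        [^>]* >                                 # remainder of the tag
    }{}gix;

    return $html;
}

my $html = <<'END_HTML';
<img src="http://images.google.com/someimage.jpg" />
<img src="http://google.somesite.org/image.jpg" />
<img src="http://evil.example/google.com/fake.jpg" />
<img src="http://notgoogle.com/image.jpg" />
<img src="http://google.com/someimage.jpg" />
END_HTML

print strip_third_party_images($html, 'google.com');
```

    With this input only the images.google.com and google.com lines should remain. Relative URLs (no scheme) are left alone, which fits the goal because they are served by the page's own host.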

    -- Ken

      Thanks again, Ken. After more research and learning I managed to make it work. Though in the end, I settled on simply deleting all .gif images instead of looking for third-party serving because in practice they tend to be hosted locally, even if the link is pointing elsewhere. (But the third-party search can be done, Privoxy provides a variable $host and custom option 'D' to use it.)

      As an aside, I must say it's not easy finding info on Perl-style regex/grep if you're not actually coding in Perl. For the life of me I haven't been able to find authoritative info on setting delimiters, and was thrown by Privoxy's liberal use of things other than "/". And so on. Will keep learning!

      Karl
        "As an aside, I must say it's not easy finding info on Perl-style regex/grep if you're not actually coding in Perl. For the life of me I haven't been able to find authoritative info on setting delimiters, and was thrown by Privoxy's liberal use of things other than "/". And so on. Will keep learning!"

        A good place to start would be "perlretut - Perl regular expressions tutorial". This has links to further, relevant information (including more detailed descriptions of the topics covered in the tutorial).
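
        On the delimiter point specifically: in Perl the substitution operator accepts almost any punctuation character as its delimiter, or a bracketing pair such as {}. Picking one that does not occur in the pattern saves escaping. Here is a small sketch of the same substitution written three ways; the sample markup and variable names are invented for illustration, and the pattern (delete .gif images) is only a guess at the kind of rule described, not the actual Privoxy filter.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $html = <<'END_HTML';
<img src="http://example.com/photo.jpg" />
<img src="http://example.com/banner.gif" />
<img src="http://example.com/LOGO.GIF" />
END_HTML

# Default "/" delimiter: every literal slash in the pattern must be escaped.
(my $with_slashes = $html) =~ s/\s*<img[^>]+src="http:\/\/[^"]+\.gif"[^>]*>//gi;

# Bracketing delimiters: pattern and replacement each get their own pair.
(my $with_braces = $html) =~ s{\s*<img[^>]+src="http://[^"]+\.gif"[^>]*>}{}gi;

# Any other punctuation character works too, as long as it is not in the pattern.
(my $with_pipes = $html) =~ s|\s*<img[^>]+src="http://[^"]+\.gif"[^>]*>||gi;

print $with_braces;
print "All three forms give the same result\n"
    if $with_slashes eq $with_braces and $with_braces eq $with_pipes;
```

        The three statements are equivalent; only the amount of backslash escaping differs.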

        Here's the documentation for grep.

        If you have questions arising from any of that documentation, feel free to ask but it would probably be better to raise them in a new thread. Also, the guidelines in "How do I post a question effectively?" will help in getting the best answers.

        -- Ken