AlanOfDale has asked for the wisdom of the Perl Monks concerning the following question:

Hi, my name is Alan and I have some very stupid questions. I am a total newbie as far as Perl goes and I need some help.

Now that that is out of the way, let me explain what I am trying to do. I have a small project called host file from hell. It does two things: it blocks ad servers and it blocks adult-type servers. To date I have maintained it manually with WordPad and Excel. It currently blocks over 62,000 different servers (www.alanofdale.net/download/hosts.zip). I do this to try to help parents keep their kids away from those sites and to keep ads to a minimum.

A friend's kids just found a rather large kink in my program: a new site that does some rather nasty things to winders. I have used a program to get all the text files on that server and all text files linked from that server (30,000 in total, or 4 meg). Since it would not be practical for me to go through each file manually to find all of the web addresses, I want to automate the process.

I have been told that Perl can do this, but I don't even know where to start. I have 3 Perl books (Perl Cookbook, Learning Perl 3rd edition, and Programming Perl 3rd edition).

Where do I start?

update (broquaint): title change (was Total newbie asking stupid questions)

Replies are listed 'Best First'.
Re: use Perl to maintain blocked host file
by dragonchild (Archbishop) on Jan 20, 2004 at 15:49 UTC
    What's winders? Windows? If you're asking a question, don't throw in stupidisms. The harder it is for me to parse your question, the less likely I am to answer it.

    Assuming you want to parse each text file for things that look like sitenames ... this is an easy (if time-consuming) task in Perl. Try something like:

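    # Reads a list of file names (one per line) from STDIN or from the files named on the command line
    while (<>) {
        chomp;    # strip the newline, or open() looks for the wrong file name
        open IN, '<', $_ or die "$_: $!";    # 'or', not '||', so die fires when open fails
        while (<IN>) {
            # capture the host part of the first http/https URL on the line
            if (m!https?://([^/"'\s]+)!) {
                print "'$1'\n";
            }
        }
        close IN;
    }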

    This code is untested. Please try and figure out what it's doing before you tell others it works. It also isn't complete and is easily fooled. It's an 80/20 solution, useful only if 80/20 solutions are acceptable. Much better would be to employ something like URI or HTML::Parser.
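
    As a rough sketch of how the extracted names could be turned into entries for the hosts list, the snippet below collects every host from http/https URLs, lowercases it, and removes duplicates. It uses only core Perl. The helper name extract_hosts and the sample lines are made up for illustration, and the regex is still an 80/20 match, not a real URL parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect the unique host names from http/https URLs found in a list of lines.
# extract_hosts() is a hypothetical helper name, not part of any module.
sub extract_hosts {
    my @lines = @_;
    my %seen;
    for my $line (@lines) {
        # /g finds every URL on the line, not just the first one;
        # the ':' in the character class stops the match before a port number
        while ($line =~ m{https?://([^/"'\s<>:]+)}gi) {
            $seen{ lc $1 } = 1;    # host names are case-insensitive
        }
    }
    return sort keys %seen;
}

# Made-up sample input; real input would be the lines of the downloaded files.
my @sample = (
    '<a href="http://Ads.Example.com/banner.gif">click</a>',
    'see https://ads.example.com:8080/ and http://bad.example.net',
);

# Print one entry per host in the 0.0.0.0 style of the existing list.
print "0.0.0.0\t$_\n" for extract_hosts(@sample);
```

    To run it over real files, replace @sample with <> and pass the file names on the command line. URLs of the form user:password@host would fool this pattern, which is one reason a module such as URI is the safer choice.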

    ------
    We are the carpenters and bricklayers of the Information Age.

    Please remember that I'm crufty and crotchety. All opinions are purely mine and all code is untested, unless otherwise specified.

Re: use Perl to maintain blocked host file
by Abigail-II (Bishop) on Jan 20, 2004 at 15:39 UTC
    Perl can do this, and so can almost any other general purpose language. You'd start by learning Perl (take a course, read the manual, study one or more books that teach Perl). Alternatively, you'd program this exercise in a language you do know.

    Abigail

Re: use Perl to maintain blocked host file
by inman (Curate) on Jan 20, 2004 at 16:27 UTC
    The code below simply parses your hosts file and turns it into a web page full of links which you can then check manually. You could update the code to actually 'get' the URL and test it. (I won't be doing that from behind my corporate proxy!). There are also freely available applications that will check whether links are active.

    One potential problem is that the dubious web sites (particularly advertising ones) do not always have a homepage. If you try and access http://mysite.com/ they can just return a 404 error which may lead you to assume that the site cannot be found or served etc.

    #! /usr/bin/perl -w
    use strict;

    print "<html><body>\n";
    while (<DATA>) {
        #remove leading and trailing whitespace
        s/^\s+//;
        s/\s+$//;
        #split into an array
        my ($dummy, $url) = split /\t/, $_;
        #do something with the url
        print "<a href=\"http://$url\">$url</a><br>\n" if $url;
    }
    print "</body></html>\n";
    __DATA__
    ###################################################################
    #                                                                 #
    #   Created by Alan Bradley on 1-03-04                            #
    #                                                                 #
    #   this list currently has 61882 different blocked servers       #
    #                                                                 #
    #                                                                 #
    #                                                                 #
    ###################################################################
    0.0.0.0	00.goodoo.ru
    0.0.0.0	00.smi.ru
    0.0.0.0	000.2.links4trade.com
    0.0.0.0	0000hits.net
    etc.

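
    Going the other way, newly found host names have to be added to the list without creating duplicates. The following is a minimal sketch of that merge step, assuming the list uses "address, whitespace, host name" lines with # comments as in the sample above. The helper names parse_hosts_line and merge_hosts and the sample data are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the host name from one hosts-file line, or undef for comments,
# blank lines, and anything that is not "address host".
# parse_hosts_line() and merge_hosts() are hypothetical helper names.
sub parse_hosts_line {
    my ($line) = @_;
    $line =~ s/#.*//;                        # drop comments
    my ($addr, $host) = split ' ', $line;    # any run of spaces or tabs
    return defined $host ? lc $host : undef;
}

# Return hosts-file lines for the new hosts that are not blocked yet.
sub merge_hosts {
    my ($existing_lines, $new_hosts) = @_;
    my %blocked;
    for my $line (@$existing_lines) {
        my $host = parse_hosts_line($line);
        $blocked{$host} = 1 if defined $host;
    }
    my @entries;
    for my $host (sort map { lc($_) } @$new_hosts) {
        next if $blocked{$host}++;           # already listed, or a repeat
        push @entries, "0.0.0.0\t$host";
    }
    return @entries;
}

# Made-up sample data: two existing entries and three candidate hosts.
my @current = ("# Created by ...", "0.0.0.0\t00.goodoo.ru", "0.0.0.0 00.smi.ru");
my @found   = ('00.smi.ru', 'new.example.com', 'NEW.example.com');
print "$_\n" for merge_hosts(\@current, \@found);
```

    Splitting on ' ' instead of a literal tab means the merge works whether the file was saved with tabs or spaces, which may matter for a list that has been edited by hand.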
Re: use Perl to maintain blocked host file
by l3nz (Friar) on Jan 20, 2004 at 17:54 UTC
    I'd start by reading the Camel book; even if you only read the first 30 pages or so, you'll have an empowering read that will allow you to start on your first scripts, and that might be enough for the moment if all you want to do is process a simple text file. Alternatively, the internet is full of Perl tutorials; see Where and how to start learning Perl or The basics on this site.

    I'd be more precise if I had understood what you were trying to do with those files, but frankly, from what you write, I don't understand much...