f77coder has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I'm attempting to parse this monster of a hosts file: a thoroughly unformatted file with IPv4 and IPv6 addresses and comments scattered everywhere. Sometimes there are two columns separated by spaces, sometimes three, sometimes four.

127.0.0.1 c2.gostats.com #SpySweeper.Spy.Cookie

127.0.0.1 ads.goyk.com

# 1-800-hostingAS3321069.41.160.0 - 69.41.191.255

127.0.0.1 2a02:598:2::1095

so i want to clean the old file by removing comments, and duplicates, so far i have
my @array = ();
$#array = -1;
my @tmp_array = ();
$#tmp_array = -1;
my @uniq = ();
$#uniq = -1;
my $i = 0;

open(HOST_ORIG,'<', $file_read ) or die "Can't open $file_read: $!";
chomp(@array = <HOST_ORIG>);
foreach $i (4...scalar(@array)-1) {
    (my $local_127, $tmp_array[$i] )=split(" ",$array[$i]);
};
close(HOST_ORIG);

my %seen;
my @uniq = grep {! $seen{$_}++} sort(@tmp_array);

open(TEMP, '>', $file_write)|| die "\n error opening file $file_write \n";
print TEMP "#Hosts file\n";
print TEMP "#Last Modified -> ". localtime() . "\n";
print TEMP "# \n";
print TEMP "# localhost: Needs to stay like this to work\n";
print TEMP "127.0.0.1\t localhost\n";
print TEMP "# \n";
foreach $i (1...scalar(@uniq)-1) {
    print TEMP "127.0.0.1\t $uniq[$i]\n";
}
close(TEMP);

It works, except that when there are 3 or more columns, the 3rd and 4th columns get wrapped around to a new line, like this:

127.0.0.1 c2.gostats.com

#SpySweeper.Spy.Cookie

How do I throw away the rest of the line, if there is one?

Thanks!

Replies are listed 'Best First'.
Re: parsing a terrible /etc/hosts
by kcott (Archbishop) on Mar 24, 2017 at 08:25 UTC

    G'day f77coder,

    [Please put your data, as well as your code, within <code>...</code> tags. Parts of your data have been rendered as links due to the presence of square brackets. See "Writeup Formatting Tips" for more on this.]

    "so i want to clean the old file by removing comments, and duplicates, so far i have"

    Here are some comments on what you have "so far":

    • You mostly seem to be on the right track with %seen, split and grep.
    • I think you may have become bogged down with too many arrays; reading /etc/hosts into an array was probably a mistake (line-by-line would have been better); and your use of '...', instead of '..', suggests rereading perlop: Range Operators would be useful for you.
    • I'd strongly recommend lexical variables, in limited scope, for filehandles: using globally-scoped package variables with nondescript names such as TEMP is likely to cause problems in anything but the most trivial scripts (see open).
    • Note that sort, by default, uses a string comparison. It will order '.' before digits; digits before ':'; and ':' before letters. Numbers starting with 1 (e.g. 1,000,000), will be ordered before numbers starting with 2 (e.g. 2).
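To see the string-versus-numeric ordering described in the last point, here is a minimal sketch. The sample values are made up for illustration and are not taken from the hosts file.

```perl
use strict;
use warnings;

# Sample values chosen only for illustration.
my @values = (2, 1_000_000, 10, 9);

# The default sort compares as strings, so "1000000" lands before "2".
my @as_strings = sort @values;

# An explicit numeric comparison gives the numeric order.
my @as_numbers = sort { $a <=> $b } @values;

print "string:  @as_strings\n";   # string:  10 1000000 2 9
print "numeric: @as_numbers\n";   # numeric: 2 9 10 1000000
```

For host names and addresses the string comparison is usually acceptable; it only matters if the sorted output is expected to be in numeric address order.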

    Here's how I might have approached this. I've added some additional test data.

    #!/usr/bin/env perl -l

    use strict;
    use warnings;

    my (%data, %seen, @order);

    while (<DATA>) {
        s/\s*#.*$//;
        next if /^\s*$/;
        my ($key, @rest) = split;
        push @order, $key unless $seen{$key}++;
        push @{$data{$key}}, grep { ! $seen{$_}++ } @rest;
    }

    print join ' ', $_, @{$data{$_}} for @order;

    __DATA__
    127.0.0.1 c2.gostats.com #SpySweeper.Spy.Cookie
    127.0.0.1 ads.goyk.com
    # 1-800-hostingAS3321069.41.160.0 - 69.41.191.255
    127.0.0.1 2a02:598:2::1095
    127.0.0.2 2a02:598:2::1096
    127.0.0.2 2a02:598:2::1096
    127.0.0.2 2a02:598:2::1096
    127.0.0.2

    Output:

    127.0.0.1 c2.gostats.com ads.goyk.com 2a02:598:2::1095
    127.0.0.2 2a02:598:2::1096

    — Ken

      Hi Ken, thanks for the help and critique.
Re: parsing a terrible /etc/hosts
by huck (Prior) on Mar 24, 2017 at 06:49 UTC

    Whew, I'm not sure where to begin.

    https://en.wikipedia.org/wiki/Hosts_(file)
    The hosts file contains lines of text consisting of an IP address in the first text field followed by one or more host names. Each field is separated by white space – tabs are often preferred for historical reasons, but spaces are also used. Comment lines may be included; they are indicated by a hash character (#) in the first position of such lines. Entirely blank lines in the file are ignored.

    First, there is no rule that the first four lines of a hosts file will be comments.

    Hosts files may have blank lines.

    Any line may have a comment; any text after the # is taken to be a comment. A comment may leave the line otherwise blank if there is no IP or name before it.

    Not every ip in the hosts file HAS to be 127.0.0.1. Mine has lines like

    192.168.2.1 wifi.mylan
    So I can access things on my local net by name.

    Notice "followed by one or more host names": multiple names are allowed on one line for the same IP address.

    I think this will correctly read a hosts file, and do what you are after.

    use strict;
    use warnings;

    my $file_read='C:/WINDOWS/system32/drivers/etc/hosts';
    my $ha=[];
    my $names={};
    my $ips={};
    #open (my $hf,'<',$file_read) or die "Can't open $file_read: $!";;
    my $hf=\*DATA;
    while (my $line=<$hf>) {
        chomp $line;
        # print $line."\n";
        my $h={};
        $h->{line}=$line;
        my ($pre,$comment)=split('#',$line,2);
        $h->{comment}=$comment if ($comment);
        if ($pre) {
            my @parts=split(/\s+/,$pre);
            if (scalar(@parts)>1) {
                my $ip=shift @parts;
                $h->{ip}=$ip;
                push @{$ips->{$ip}},@parts;
                $h->{names}=[@parts];
                for my $name (@parts) {$names->{$name}=$ip; }
            } # parts
        } # pre
        push @$ha,$h;
    } #line

    use Data::Dumper;
    print Dumper($ha);
    print Dumper($ips);
    print Dumper($names);

    #open($out, '>', $file_write)|| die "\n error opening file $file_write \n";
    my $out=\*STDOUT;
    print $out "#Hosts file\n";
    print $out "#Last Modified -> ". localtime() . "\n";
    print $out "# \n";
    print $out "# localhost: Needs to stay like this to work\n";
    print $out "127.0.0.1\t localhost\n";
    print $out "# \n";
    delete $names->{localhost} if ($names->{localhost});
    my @ksort=sort {my $r1=$names->{$a} cmp $names->{$b}; return $r1 if($r1); $a cmp $b} keys(%$names);
    for my $name (@ksort) {
        print $out $names->{$name}."\t".$name."\n";
    }
    __DATA__
    # Copyright (c) 1993-1999 Microsoft Corp.
    #
    # This is a sample HOSTS file used by Microsoft TCP/IP for Windows.
    #
    # This file contains the mappings of IP addresses to host names. Each
    # entry should be kept on an individual line. The IP address should
    # be placed in the first column followed by the corresponding host name.
    # The IP address and the host name should be separated by at least one
    # space.
    #
    # Additionally, comments (such as these) may be inserted on individual
    # lines or following the machine name denoted by a '#' symbol.
    #
    # For example:
    #
    # 102.54.94.97 rhino.acme.com # source server
    # 38.25.63.10 x.acme.com # x client host

    127.0.0.1 localhost ads.pointroll.com scanner2.malware-scan.com localhost adsys.townnews.com adimages.townnews.com ad.doubleclick.net pagead2.googlesyndication.com ad.yieldmanager.com view.atdmt.com ads.revsci.net servedby.advertising.com jeffcity30.autochooser.com performanceoptimizer.com cache.fimservecdn.com pixel.quantserve.com ads.yimg.com this.content.served.by.adshuffle.com img-cdn.mediaplex.com cache.fimservecdn.com adserving.cpxinteractive.com pixel.quantserve.com s0.2mdn.net
    127.0.0.1 www.zip2save.com d1.openx.org c3.openx.org partner.googleadservices.com media.ljworld.com everythingmidmo.com www.everythingmidmo.com edge.quantserve.com pixel.quantserve.com ad-g.doubleclick.net ads.yimg.com ad.wsod.com s0.2mdn.net s0.2mdn.net
    192.168.1.1 nat.mylan
    192.168.1.100 dhcp1.nat.mylan
    192.168.2.1 wifi.mylan
    192.168.2.100 dhcp1.wifi.mylan
    192.168.1.234 lxle0 lxle0.mylan
    192.168.1.200 me me.mylan
    192.168.1.200 me me.mylan
    192.168.254.251 wan.mylan
    last part of result
    #Hosts file
    #Last Modified -> Fri Mar 24 01:39:59 2017
    #
    # localhost: Needs to stay like this to work
    127.0.0.1 localhost
    #
    127.0.0.1 ad-g.doubleclick.net
    127.0.0.1 ad.doubleclick.net
    127.0.0.1 ad.wsod.com
    127.0.0.1 ad.yieldmanager.com
    127.0.0.1 adimages.townnews.com
    127.0.0.1 ads.pointroll.com
    127.0.0.1 ads.revsci.net
    127.0.0.1 ads.yimg.com
    127.0.0.1 adserving.cpxinteractive.com
    127.0.0.1 adsys.townnews.com
    127.0.0.1 c3.openx.org
    127.0.0.1 cache.fimservecdn.com
    127.0.0.1 d1.openx.org
    127.0.0.1 edge.quantserve.com
    127.0.0.1 everythingmidmo.com
    127.0.0.1 img-cdn.mediaplex.com
    127.0.0.1 jeffcity30.autochooser.com
    127.0.0.1 media.ljworld.com
    127.0.0.1 pagead2.googlesyndication.com
    127.0.0.1 partner.googleadservices.com
    127.0.0.1 performanceoptimizer.com
    127.0.0.1 pixel.quantserve.com
    127.0.0.1 s0.2mdn.net
    127.0.0.1 scanner2.malware-scan.com
    127.0.0.1 servedby.advertising.com
    127.0.0.1 this.content.served.by.adshuffle.com
    127.0.0.1 view.atdmt.com
    127.0.0.1 www.everythingmidmo.com
    127.0.0.1 www.zip2save.com
    192.168.1.1 nat.mylan
    192.168.1.100 dhcp1.nat.mylan
    192.168.1.200 me
    192.168.1.200 me.mylan
    192.168.1.234 lxle0
    192.168.1.234 lxle0.mylan
    192.168.2.1 wifi.mylan
    192.168.2.100 dhcp1.wifi.mylan
    192.168.254.251 wan.mylan
    YMMV

      Not every ip in the hosts file HAS to be 127.0.0.1.

      And not every ip in the hosts file has to be an IPv4 address. It has become quite common to see something like "::1 this-machine.lan this-machine" in /etc/hosts - IPv6.

      Parsing any hosts file is easy: Read line by line, strip comments (s/#.*//), remove trailing and leading whitespace (s/^\s+//; s/\s+$//;), skip empty lines (length or next), split at whitespace (@tmp=split /\s+/). Splitting must return at least two elements. First element must match an IPv4 or IPv6 address (see Regexp::Common::net), all other elements must be valid host names (again, see Regexp::Common::net).
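The steps listed above can be sketched roughly as follows. This is only an illustration: `parse_hosts_line` is a hypothetical helper name, the sample lines are made up, and the address and host-name validation mentioned above is deliberately left out so the sketch needs no non-core modules.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one hosts line into (address, names...),
# or return an empty list when the line holds nothing usable.
sub parse_hosts_line {
    my ($line) = @_;
    $line =~ s/#.*//;               # strip the comment, if any
    $line =~ s/^\s+//;              # remove leading whitespace
    $line =~ s/\s+$//;              # remove trailing whitespace and newline
    return unless length $line;     # skip empty lines
    my @fields = split /\s+/, $line;
    return unless @fields >= 2;     # need an address plus at least one name
    return @fields;                 # validation of the fields is NOT done here
}

# Made-up sample input covering the common cases.
my @sample = (
    "127.0.0.1 c2.gostats.com #SpySweeper.Spy.Cookie\n",
    "# only a comment\n",
    "\n",
    "::1   this-machine.lan this-machine\n",
);

for my $line (@sample) {
    my ($ip, @names) = parse_hosts_line($line) or next;
    print "$ip => @names\n";
}
```

Returning an empty list for unusable lines keeps the calling loop down to a single `or next`.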

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
      Cpan? Let me try hosts ->

      Config::Hosts Interface to /etc/hosts file

      Parse::Hosts Parse /etc/hosts

      App::ParseHosts Parse /etc/hosts (CLI)

      Looks promising

        Config::Hosts Interface to /etc/hosts file

        if ($hosts->{$ip}) {
            print STDERR "Line $l: Warning: duplicate IP entry $ip, the last one will be used\n";
        }
        As far as I remember (AIX, Ubuntu, Windows), you may have duplicate lines with the same IP, and they are "joined".

        Also, "output" mixes up names and IPs in the same array. While strange, this is a valid line:
        127.0.0.1 192.168.0.1
        It makes the host name 192.168.0.1 map to the IP 127.0.0.1; the common use is by script kiddies, though.
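As a rough illustration of the "joined" behaviour described above, the sketch below merges the names from repeated lines for the same IP instead of letting the last line win. It is not how any of the listed modules work; the variable names and sample lines are assumptions made for the example.

```perl
use strict;
use warnings;

# Made-up sample lines: one IP appears twice, and one line uses an
# address-like string as a host name, which is unusual but valid.
my @lines = (
    '192.168.1.200 me',
    '192.168.1.200 me.mylan me',
    '127.0.0.1 192.168.0.1',
);

my (%names_for, @ip_order, %seen_pair);

for my $line (@lines) {
    my ($ip, @names) = split ' ', $line;
    push @ip_order, $ip unless exists $names_for{$ip};   # remember first-seen order
    $names_for{$ip} //= [];
    # Append only names not already recorded for this IP.
    push @{ $names_for{$ip} }, grep { !$seen_pair{"$ip $_"}++ } @names;
}

print "$_ @{ $names_for{$_} }\n" for @ip_order;
# 192.168.1.200 me me.mylan
# 127.0.0.1 192.168.0.1
```

Keeping the IP as the hash key and the names in a separate list means the first field is never confused with the names that follow it.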

        Parse::Hosts Parse /etc/hosts

        unless (defined $content) {
            open my($fh), "<", "/etc/hosts"
                or return [500, "Can't read /etc/hosts: $!"];
            local $/;
            $content = <$fh>;
        }
        It only reads /etc/hosts, or you have to read the file yourself and pass it in. "output" is an array of hashes {ip => $ip, hosts => \@hosts}.

        App::ParseHosts Parse /etc/hosts (CLI)

        Just a wrapper around Parse::Hosts to allow you to read any file.

        Mine parses the lines the same way they do, and has more usable data structures for the task and example. Mine doesn't check for valid IPv4 or IPv6 format, though, as Config::Hosts almost does. I looked at Config::Hosts first and didn't like it.

        add: and none are core either

        @Anonymous-Monk

        Thanks for the links.

        Well, my first inclination is to write code not search, I like to know what 'stuff' is doing. To each their own. I've been burned before by relying on modules.

      Huck,

      Thank you for the help. I threw a big string of IPv6 addresses at it

      127.0.0.1 2606:f180:1:2e8:2e8:1fbd:8257:d7c1 2600:3c03::f03c:91ff:fee5:3474 2a02:a450:9137:1:c8c:974c:2b65:7fec 2a03:4a80:2:2d6:2d6:853e:e533:bfa7 2600:3c03::f03c:91ff:fee5:3474

      and no problems.

Re: parsing a terrible /etc/hosts
by hdb (Monsignor) on Mar 24, 2017 at 07:28 UTC

    On ignoring comments: one way would be to bin them altogether. If $line contains your line from the hosts file, you could use

    $line =~ s/#.*//;

    to remove the hash and every character following it.
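Combined with a split that keeps only the first two fields, the substitution above also discards any extra columns. A minimal sketch, using a sample line from the question; the variable names are illustrative only:

```perl
use strict;
use warnings;

my $line = "127.0.0.1 c2.gostats.com #SpySweeper.Spy.Cookie";

$line =~ s/#.*//;                     # drop the hash and everything after it

# split ' ' ignores leading whitespace; assigning to two scalars keeps
# the first two columns and silently discards any others.
my ($ip, $host) = split ' ', $line;

# A comment-only or blank line leaves $host undefined, so guard the print.
print "$ip\t$host\n" if defined $host;
```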