jaypal has asked for the wisdom of the Perl Monks concerning the following question:

Hello Perl Monks, I have the following data set, from which I would like to print the top 4 most populated cities for each country. The data is not sorted, so I'll have to sort it first and then grab the top 4 cities for each country. The first column is population, the second is country, the third is city and the fourth is the continent.

20470:ZM:Samfya:Africa
20149:ZM:Sesheke:Africa
18638:ZM:Siavonga:Africa
26459:ZW:Beitbridge:Africa
37423:ZW:Bindura:Africa
699385:ZW:Bulawayo:Africa
47294:ZW:Chegutu:Africa
61739:ZW:Chinhoyi:Africa
18860:ZW:Chipinge:Africa
28205:ZW:Chiredzi:Africa

So my output from the above data set would be:

20470:ZM:Samfya:Africa
20149:ZM:Sesheke:Africa
18638:ZM:Siavonga:Africa
699385:ZW:Bulawayo:Africa
61739:ZW:Chinhoyi:Africa
47294:ZW:Chegutu:Africa
37423:ZW:Bindura:Africa

I was able to write a perl script to get me my desired output. Here is my attempt at the perl script.

#!/usr/local/bin/perl
use strict;
use warnings;
use Data::Dumper;

my (%HoA, %lines);

while (my $line = <DATA>) {
    my ($value, $key) = split /:/, $line, 3;
    push @{$HoA{$key}}, $value;
    $lines{"$key $value"} = $line    # This could be done better
}

for my $country (keys %HoA) {
    my @list = sort { $b <=> $a } @{$HoA{$country}};    # This could be done better
    for my $ind (0 .. 3) {    # This could be done better
        my $popu = $list[$ind] or next;
        print $lines{"$country $popu"};
    }
}

__DATA__
20470:ZM:Samfya:Africa
20149:ZM:Sesheke:Africa
18638:ZM:Siavonga:Africa
26459:ZW:Beitbridge:Africa
37423:ZW:Bindura:Africa
699385:ZW:Bulawayo:Africa
47294:ZW:Chegutu:Africa
61739:ZW:Chinhoyi:Africa
18860:ZW:Chipinge:Africa
28205:ZW:Chiredzi:Africa

My question is based on my attempt to write a one-liner equivalent of the above script.

perl -F":" -lane '
    BEGIN { $"=":" }
    push @{$h{$F[1]}}, $F[0];
    $line{$F[1],$F[0]} = "@F";
    }{
    for $k (keys %h) {
        print @$_ for map [ $line{$k,$_} ], sort { $b <=> $a } @{$h{$k}}
    }' file

20470:ZM:Samfya:Africa
20149:ZM:Sesheke:Africa
18638:ZM:Siavonga:Africa
699385:ZW:Bulawayo:Africa
61739:ZW:Chinhoyi:Africa
47294:ZW:Chegutu:Africa
37423:ZW:Bindura:Africa
28205:ZW:Chiredzi:Africa
26459:ZW:Beitbridge:Africa
18860:ZW:Chipinge:Africa

I am stuck on printing just the top 4 entries using the map function. The above prints the entire file in sorted order.

I was hoping to get some advice from the monks here on both my Perl script (for anything I could have done better) and on solving my one-liner attempt. I have added comments where I felt I could have written it more idiomatically.

The above data was taken from a question posted on StackOverflow. I know one-liners are not the best way to approach a problem, but I am still learning Perl and feel that writing one-liners can help me clarify how Perl's functions work. Also, since I have written a lot of awk one-liners, I feel Perl can do this as well.

Looking forward to your comments and suggestions.

Regards
Jaypal

Update: I was able to get the desired output using a splice.

perl -F":" -lane '
    BEGIN { $"=":" }
    push @{$h{$F[1]}}, $F[0];
    $line{$F[1],$F[0]} = "@F";
    }{
    for $k (keys %h) {
        print $line{$k,$_} for splice [sort { $b <=> $a } @{$h{$k}}] , 0, 4
    }' file

20470:ZM:Samfya:Africa
20149:ZM:Sesheke:Africa
18638:ZM:Siavonga:Africa
699385:ZW:Bulawayo:Africa
61739:ZW:Chinhoyi:Africa
47294:ZW:Chegutu:Africa
37423:ZW:Bindura:Africa

However, I would appreciate it if anyone could suggest a better approach.

Replies are listed 'Best First'.
Re: Using map function to print few elements of list returned by sort function
by smls (Friar) on May 25, 2014 at 12:13 UTC

    While I agree with boftx that a real script would be more appropriate for something like this in a production setting, I appreciate that pushing one-liners to their limits can make for good learning exercises.

    As for your splice based solution, I think that's actually pretty good. Here are two small suggestions to tweak it further:

    1. You can get rid of the  BEGIN { $"=":" }  by using  $_  instead of  "@F"  to refer to the original line.

    2. You can get rid of the separate  %line  hash by adding that information directly to  %h  whose contents would then look like:
      (
        ZM => [ [20470, "20470:ZM:Samfya:Africa"],
                [20149, "20149:ZM:Sesheke:Africa"],
                [18638, "18638:ZM:Siavonga:Africa"] ],
        ZW => [ ... ],
        ...
      )

    Here's the one-liner with those changes, in a "scriptified" representation (which I find easier to work with; it's trivial to convert it back to the one-liner format by removing the lines with comments after them):

    use warnings;               # just for debugging
    use strict;                 # just for debugging
    my (%h, $k, @F);            # just for debugging

    while (<>) {                # -n
        chomp;                  # -l
        $\ = "\n";              # -l
        @F = split(':');        # -F":" -a

        push @{$h{ $F[1] }}, [$F[0], $_];
    }
    for $k (sort keys %h) {
        print $_->[1] for splice [sort {$b->[0]<=>$a->[0]} @{$h{$k}}], 0, 4
    }                           # -n
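One caveat for readers on newer perls: passing an array reference straight to splice relied on an experimental feature that was available from Perl 5.14 and removed in Perl 5.24. The following is a minimal sketch of the same array-of-pairs idea that avoids it by sorting into a named array first. The helper name `top_per_country` and the `@sample` list are illustrative assumptions, not part of the original posts.

```perl
use strict;
use warnings;

# Group lines by country as [population, line] pairs, then keep the
# $n largest per country. top_per_country() is a hypothetical helper name.
sub top_per_country {
    my ($n, @lines) = @_;
    my %h;
    for my $line (@lines) {
        my ($pop, $country) = split /:/, $line;
        push @{ $h{$country} }, [ $pop, $line ];
    }
    my @out;
    for my $k (sort keys %h) {
        # Sort into a named array so splice gets a real array, not a reference.
        my @sorted = sort { $b->[0] <=> $a->[0] } @{ $h{$k} };
        # splice returns at most $n elements, fewer if the country has fewer.
        push @out, map { $_->[1] } splice @sorted, 0, $n;
    }
    return @out;
}

# Sample data copied from the question (already chomped).
my @sample = qw(
    20470:ZM:Samfya:Africa
    20149:ZM:Sesheke:Africa
    18638:ZM:Siavonga:Africa
    26459:ZW:Beitbridge:Africa
    37423:ZW:Bindura:Africa
    699385:ZW:Bulawayo:Africa
    47294:ZW:Chegutu:Africa
    61739:ZW:Chinhoyi:Africa
    18860:ZW:Chipinge:Africa
    28205:ZW:Chiredzi:Africa
);
print "$_\n" for top_per_country(4, @sample);
```

Because splice stops at the end of the array, a country with fewer than four cities (ZM here) simply yields all of its lines.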

      Thanks smls. That is indeed a very clever approach. My intention isn't to write the shortest code possible. The one-liner approach was just so that I can get a better understanding of Perl's built-in functions.

      Your approach has just taught me that. Thank you for the detailed explanation.

      Sorry to bug you with a follow-up question. Is there a way to nest two for loops in one line? For example:

      for $k (sort keys %h) { print $_->[1] for splice [sort {$b->[0]<=>$a->[0]} @{$h{$k}}], 0, 4 }

      could be written something like:

      print $_->[1] for splice [sort {$b->[0]<=>$a->[0]} @{$h{$k}}], 0, 4 for keys %h
        No, but you can chain maps.

        Cheers Rolf

        ( addicted to the Perl Programming Language)
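As a rough illustration of what chaining maps could look like here: the outer map (read bottom-up) walks the countries and the inner expression returns the top pairs, which the final map unwraps into lines. This is only a sketch. `%h` is assumed to have the array-of-pairs shape from smls's reply, `top_lines` is a made-up name, and the `@{[ ... ]}` wrapper is used so that splice receives a real array.

```perl
use strict;
use warnings;

# $h is assumed to look like: { country => [ [population, line], ... ], ... }
# top_lines() is a hypothetical helper, not something from the thread.
sub top_lines {
    my ($n, $h) = @_;
    return
        map { $_->[1] }                                   # 3. keep only the line
        map { splice @{ [ sort { $b->[0] <=> $a->[0] }    # 2. top $n pairs per country
                          @{ $h->{$_} } ] }, 0, $n }
        sort keys %$h;                                    # 1. countries in order
}

my %h = (
    ZM => [ [20470, '20470:ZM:Samfya:Africa'],
            [18638, '18638:ZM:Siavonga:Africa'],
            [20149, '20149:ZM:Sesheke:Africa'] ],
    ZW => [ [26459,  '26459:ZW:Beitbridge:Africa'],
            [699385, '699385:ZW:Bulawayo:Africa'],
            [47294,  '47294:ZW:Chegutu:Africa'],
            [61739,  '61739:ZW:Chinhoyi:Africa'],
            [37423,  '37423:ZW:Bindura:Africa'] ],
);
print "$_\n" for top_lines(4, \%h);
```

A statement-modifier `for` cannot be stacked on another one, which is why the nesting has to move into list operators like map.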

Re: Using map function to print few elements of list returned by sort function
by boftx (Deacon) on May 25, 2014 at 04:06 UTC

    I want to touch on a tangent to what ww said.

    There is no doubt that map (and the various regex tools) can be very powerful. That said, it is sometimes better to forgo the temptation to boil an operation down to the minimum number of lines and instead make it easy for the next person (or yourself) who has to read that code 6 months or more down the road.

    This is even more important if you are doing this for a living. It is better to make code clear and easy to read than to try to be "elegant" and spend far more time in the process and cause others to spend more time than needed when it needs to be changed.

    In all honesty, I think you are trying to use a sledgehammer to drive a 4p nail. A simple loop and counter would probably be just as efficient and much easier to understand.

    It helps to remember that the primary goal is to drain the swamp even when you are hip-deep in alligators.
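For comparison, here is one way the "simple loop and counter" idea might be written. It is a sketch under a few assumptions: lines are already chomped, they follow the population:country:city:continent layout from the question, and `top_with_counter` is a name invented for this example.

```perl
use strict;
use warnings;

# top_with_counter() is a hypothetical name; input lines are assumed to be
# chomped and shaped like population:country:city:continent.
sub top_with_counter {
    my ($n, @lines) = @_;

    # Parse each line once so the sort below does not have to re-split.
    my @rows;
    for my $line (@lines) {
        my ($population, $country) = split /:/, $line;
        push @rows, { population => $population, country => $country, line => $line };
    }

    # Walk the rows grouped by country, largest population first,
    # and count how many lines each country has contributed so far.
    my (%count, @out);
    for my $row (sort { $a->{country} cmp $b->{country}
                        or $b->{population} <=> $a->{population} } @rows) {
        next if ++$count{ $row->{country} } > $n;
        push @out, $row->{line};
    }
    return @out;
}

print "$_\n" for top_with_counter(2,
    '47294:ZW:Chegutu:Africa', '20470:ZM:Samfya:Africa',
    '699385:ZW:Bulawayo:Africa', '61739:ZW:Chinhoyi:Africa');
```

The per-country counter lives in a hash, so the input does not need to arrive grouped by country.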
Re: Using map function to print few elements of list returned by sort function
by davido (Cardinal) on May 26, 2014 at 01:48 UTC

    I hope you don't mind deviating from the "one-liner" requirement. You mentioned you're doing this to learn something along the way, so I wanted to make another suggestion, to that end.

    Your current implementation must sort the entire city list for each country, just to retrieve the top four items. When you need the top-n of anything, in sorted order, it's rather unfortunate that the simplest approach is usually to sort the entire list. What you could get away with using is a "partial sort"; one that partitions the input into two parts: a part you want, and a part you don't want. ...and then sorts and returns just the part you want.

    It turns out there's a module on CPAN that does this. It's called Sort::Key::Top. Its interface is a little complicated to learn at first, but once you do, it works fairly well. Here is an example:

    use Sort::Key::Top 'rnkeytopsort';

    my %countries;

    while( <DATA> ) {
        my( $country ) = m/:([^:]{2}):/;
        push @{$countries{$country}}, $_;
    }

    print map { rnkeytopsort { /^(\d+):/; $1; } 4 => @{$countries{$_}} } keys %countries;

    __DATA__
    20470:ZM:Samfya:Africa
    20149:ZM:Sesheke:Africa
    18638:ZM:Siavonga:Africa
    26459:ZW:Beitbridge:Africa
    37423:ZW:Bindura:Africa
    699385:ZW:Bulawayo:Africa
    47294:ZW:Chegutu:Africa
    61739:ZW:Chinhoyi:Africa
    18860:ZW:Chipinge:Africa
    28205:ZW:Chiredzi:Africa

    The way this works is it takes your original data set, and divides it into smaller sets, each set representing a country. Then it does a "top-n" partial sort within each country, and prints out the result.

    I first went looking for a module like this one a while ago, after using C++'s std::partition and std::partial_sort algorithms in a C++ project I was working on at the time. The concepts are pretty simple, but sometimes it takes seeing them in use somewhere else (in this case in a different language) to "discover" their usefulness.
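To make the "keep only the part you want" idea concrete without a CPAN dependency, here is a hand-rolled sketch. It is not how Sort::Key::Top is implemented; it just keeps a small, already-ordered buffer of at most N entries per country, so the full city list is never sorted. The name `top_n_partial` is an assumption made for this example.

```perl
use strict;
use warnings;

# top_n_partial() is a hypothetical helper. It keeps, per country, a buffer of
# at most $n [population, line] pairs ordered from largest to smallest.
sub top_n_partial {
    my ($n, @lines) = @_;
    return () if $n < 1;

    my %best;
    for my $line (@lines) {
        my ($pop, $country) = split /:/, $line;
        my $top = $best{$country} //= [];

        # Once the buffer is full, anything not larger than its smallest
        # entry cannot make the cut.
        next if @$top == $n && $pop <= $top->[-1][0];

        # Find the insertion point, insert, then trim back to $n entries.
        my $i = 0;
        $i++ while $i < @$top && $top->[$i][0] >= $pop;
        splice @$top, $i, 0, [ $pop, $line ];
        pop @$top if @$top > $n;
    }

    return map { my $c = $_; map { $_->[1] } @{ $best{$c} } } sort keys %best;
}

print "$_\n" for top_n_partial(2,
    '26459:ZW:Beitbridge:Africa', '37423:ZW:Bindura:Africa',
    '699385:ZW:Bulawayo:Africa', '20470:ZM:Samfya:Africa');
```

For a small N such as 4, each input line costs only a handful of comparisons, regardless of how many cities a country has.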

    Update:

    After preaching about the wasted cycles caused by sorting the entire list of cities just to pick the top four, I went ahead and implemented a version that does just that. Why? It was one of those times where after walking away from the keyboard an idea came along that seemed like it would be fun to explore. Here it is:

    print do {
        my($c,$n) = ('',0);
        map  { $_->[0] }
        grep { ($c,$n) = ($_->[2],0) if $_->[2] ne $c; $n++ < 4 }
        sort { $a->[2] cmp $b->[2] || $b->[1] <=> $a->[1] }
        map  { [ $_, /^(\d+):([^:]{2}):/ ] }
        <DATA>;
    };

    __DATA__
    20470:ZM:Samfya:Africa
    20149:ZM:Sesheke:Africa
    18638:ZM:Siavonga:Africa
    26459:ZW:Beitbridge:Africa
    37423:ZW:Bindura:Africa
    699385:ZW:Bulawayo:Africa
    47294:ZW:Chegutu:Africa
    61739:ZW:Chinhoyi:Africa
    18860:ZW:Chipinge:Africa
    28205:ZW:Chiredzi:Africa

    Read this one from the bottom up:

    1. Create an anonymous array for each line in the input file. The first element is the line itself, followed by the population, and finally the country code. This is shaping up to look like a Schwartzian Transform.
    2. Sort based on two criteria; first, the country code, and second, the population. The result will be that all cities within a given country are grouped together, in descending order by population. And all countries will be in ascending order by country code. ...still a typical Schwartzian Transform, with a compound sort key.
    3. Grep the sorted list, keeping only the first four cities for each country. By keeping track of the last country seen, and running a counter that we increment on each iteration, but reset whenever a new country code is spotted, we can identify when we've reached the maximum wanted per country. ...this is a deviation from the basic Schwartzian Transform.
    4. Drop the computed keys, and keep only the original lines that survived the 'grep' filter... in sorted order.
    5. The do{...} block just creates a nice compact lexical scope with a return value (the list resulting from the outer map, which we feed into print). I like this because it means the lexical variables I declare are very narrowly scoped.
    6. Print the result.

    It seemed like a cool approach to me, even if it gives back a little efficiency by sorting the entire list. I would probably favor the partition/partial sort strategy posted at the top of my answer though; it's fairly clear what it does, and should be efficient.


    Dave

      Thanks so much Dave for taking out time for this exercise and to provide an excellent explanation.

      I am going to sit and dissect your answers now. :)

Re: Using map function to print few elements of list returned by sort function
by ww (Archbishop) on May 25, 2014 at 01:37 UTC
    Your desired output doesn't seem to match your stated objective, in that it lists cities from just two countries, ZM and ZW. Please clarify.

    Second, where/how do you make the determination of the "top 4 populated countries?" Nothing you've shown makes that clear.

    I, for one, would be a lot more inclined (and able) to help if there were fewer mysteries involved.



    Quis custodiet ipsos custodes. Juvenal, Satires

      Thank you for the comment. My apologies for the confusion. The objective was to print top 4 populated cities for each country. The sample data has 2 countries present ZM and ZW. Each country has many cities listed.

      So I would like to print just the top 4 populated cities for each of them. If a country does not have 4 cities as in the case of ZM then print all lines.

Re: Using map function to print few elements of list returned by sort function
by LanX (Saint) on May 25, 2014 at 17:48 UTC
    AFAICS your approach only works if population values are unique within a country, since you are using them as hash keys.

    My approach for the top 4 would be to use a slice of a sorted array, @sorted[0..3], so there is no need for map.

    Can't easily tell if this was mentioned already... That's the fundamental problem with one liners... ;)

    HTH! :)

    Cheers Rolf

    ( addicted to the Perl Programming Language)
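A small sketch of the slice idea, written so that it also sidesteps the uniqueness problem mentioned above: whole lines are stored per country instead of using population values as hash keys, and the upper bound of the slice is clamped so that countries with fewer than N cities do not produce undefined elements. The names `population_of` and `top_slice` are assumptions for this example.

```perl
use strict;
use warnings;

# Hypothetical helpers; lines are assumed to be chomped and shaped like
# population:country:city:continent.
sub population_of { my ($line) = @_; return (split /:/, $line)[0] }

sub top_slice {
    my ($n, @lines) = @_;

    # Store whole lines per country, so equal populations cannot collide.
    my %cities;
    for my $line (@lines) {
        my $country = (split /:/, $line)[1];
        push @{ $cities{$country} }, $line;
    }

    my @out;
    for my $country (sort keys %cities) {
        my @sorted = sort { population_of($b) <=> population_of($a) }
                     @{ $cities{$country} };
        # Clamp the slice so short lists do not yield undef entries.
        my $last = $#sorted < $n - 1 ? $#sorted : $n - 1;
        push @out, @sorted[ 0 .. $last ];
    }
    return @out;
}

print "$_\n" for top_slice(4,
    '20470:ZM:Samfya:Africa', '20149:ZM:Sesheke:Africa',
    '18638:ZM:Siavonga:Africa');
```

A plain @sorted[0..3] works as well when every country is known to have at least four cities.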

Re: Using map function to print few elements of list returned by sort function
by Laurent_R (Canon) on May 25, 2014 at 15:10 UTC
    I agree with what other monks have said, i.e. that it is probably not a very good idea to make things more complicated by trying to make them as compact as possible. However, for the sake of the exercise, this is a possible one-liner (well, really a two-liner) to do what you want:
    perl -F: -nale '
        push @d, [@F];
        END { print join "\n", map { join ":", @$_} grep { $c = $$_[1] eq $prec? $c+1:1; $prec = $$_[1]; $c>4? 0: 1} sort {$a->[1] cmp $b->[1] || $b->[0] <=> $a->[0]} @d;};' file.txt
    Note that it also sorts the input by country, in case the input is not grouped by country. The following is an example execution, piping your reshuffled input data (not grouped by country) into the Perl script:

    Edit 15:33 UTC: I posted my final result above only a few hours after I started looking for a solution, because I had to interrupt my work on it for family obligations. I had not seen your solution using splice; it is better than my grep solution. Overall, I probably spent about 45 minutes getting it (hopefully) right. Writing a solution with regular while or for loops would probably have taken less than a third of that time.