PerlMonks  

Sort on Number Embedded in String

by Dru (Hermit)
on Mar 22, 2005 at 19:04 UTC ( [id://441568]=perlquestion )

Dru has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

I have the following data:
fwlog.14Mar2005.gz
fwlog.15Mar2005.gz
fwlog.16Mar2005.gz
fwlog.17Mar2005.gz
fwlog.18Mar2005.gz
fwlog.19Mar2005.gz
fwlog.1Mar2005.gz
fwlog.20Mar2005.gz
fwlog.21Mar2005.gz
fwlog.2Mar2005.gz
fwlog.3Mar2005.gz
fwlog.4Mar2005.gz
fwlog.5Mar2005.gz
fwlog.6Mar2005.gz
fwlog.7Mar2005.gz
fwlog.8Mar2005.gz
fwlog.9Mar2005.gz
I want to sort on the date. I've been able to with the following code:
for (@files){
    $_ =~ /fwlogsum\.(\d+)\w+/;
    push (@days, $1);
}
for (sort{$a <=> $b} @days){
    print "fwlog" . $_ . ".Mar2005\n";
}
I'm fairly sure I can combine these two loops into one, but nothing is jumping out at me. I'm thinking map or grep is what I need, but I don't have enough experience with either to figure out how to use them appropriately in this situation.

I appreciate any suggestions.

Thanks,
Dru

Replies are listed 'Best First'.
Re: Sort on Number Embedded in String
by halley (Prior) on Mar 22, 2005 at 19:51 UTC
    Besides fixing this immediate problem, please fix your archive filename scheme. Dates in formats like "3Mar2005" are really VERY unhelpful. You can't sort them without lots of extra processing.

    Pick the ISO 8601 Standard Date Format for any dates that must be human-readable and machine-readable. It reduces language problems (what is "September" in French?), and cultural problems (is "01/06/05" in January or June?).

    Example: 2005-03-21.fwlog.gz

    (You can put the date first, or the filename first, but I prefer date-first so that multiple filenames related to a given date will naturally sort together.)

    Then all these things will sort properly. They will sort at the command line, they will sort in your GUI file managers, they will sort in your perl code, all without expensive and error-prone coding.
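    As a rough illustration of the point above, here is a minimal sketch that maps the existing names onto the date-first ISO 8601 form. The helper name iso_name, the English month abbreviations, and the exact "fwlog.DDMonYYYY.gz" pattern are assumptions made for the example, not something taken from the logging software.

```perl
use strict;
use warnings;

# Month abbreviations as they appear in the log names (assumed English).
my %month_num;
@month_num{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} = 1 .. 12;

# Hypothetical helper: "fwlog.3Mar2005.gz" -> "2005-03-03.fwlog.gz".
# Returns undef (or an empty list) for names that do not match the assumed pattern.
sub iso_name {
    my ($file) = @_;
    my ($day, $mon, $year) = $file =~ /^fwlog\.(\d{1,2})([A-Za-z]{3})(\d{4})\.gz$/
        or return;
    return unless exists $month_num{$mon};
    return sprintf '%04d-%02d-%02d.fwlog.gz', $year, $month_num{$mon}, $day;
}

my @files   = qw(fwlog.14Mar2005.gz fwlog.1Mar2005.gz fwlog.9Mar2005.gz);
my @iso     = map { iso_name($_) } @files;
my @renamed = sort @iso;    # a plain string sort is enough for ISO dates

print "$_\n" for @renamed;
```

    Once the names are in this form, no custom comparison is needed; the default string sort already gives chronological order.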

    --
    [ e d @ h a l l e y . c c ]

      please fix your archive filename scheme

      Sometimes, when dealing with proprietary software, you do not have such a luxury. In the present case, I'd bet my lunch on the fact that the OP is dealing with Checkpoint FW-1 logs. Yes, the naming scheme sucks, but I doubt filing a bug report would do anything, because they would probably reply that it would break existing code that deals with the current scheme. Renaming them to do local processing is just a hassle. This is exactly the sort of task Perl excels at, making the difficult things easy.

      - another intruder with the mooring in the heart of the Perl

Re: Sort on Number Embedded in String
by friedo (Prior) on Mar 22, 2005 at 19:14 UTC
    There's probably a more elegant way, but this works.

    @files = sort {
        my ($ad) = ( $a =~ /fwlog\.(\d+)\w+/ );
        my ($bd) = ( $b =~ /fwlog\.(\d+)\w+/ );
        $ad <=> $bd
    } @files;
    print join "\n", @files;

    OUTPUT:
    fwlog.1Mar2005.gz
    fwlog.2Mar2005.gz
    fwlog.3Mar2005.gz
    fwlog.4Mar2005.gz
    fwlog.5Mar2005.gz
    fwlog.6Mar2005.gz
    fwlog.7Mar2005.gz
    fwlog.8Mar2005.gz
    fwlog.9Mar2005.gz
    fwlog.14Mar2005.gz
    fwlog.15Mar2005.gz
    fwlog.16Mar2005.gz
    fwlog.17Mar2005.gz
    fwlog.18Mar2005.gz
    fwlog.19Mar2005.gz
    fwlog.20Mar2005.gz
    fwlog.21Mar2005.gz

    Update: I'm dumbfounded that so many have recommended the Schwartzian for such a trivial sorting operation. Is the added complexity worth it for sorting such a small amount of data? Neither [id://JediWizard]'s nor [id://Tanktalus]'s code appears to even work correctly. (See Readmore). This is a case of over-zealous premature optimization if I've ever seen it.

    Up-Update:Typos.

      Thanks friedo (and everyone else). I find your solution the easiest to understand. I know you didn't recommend the Schwartzian Transform for this problem, but I must read up on it since I've never even heard of it. It appears to be popular amongst the monks :-).

      This place is great: in less than 30 minutes, I received a half dozen replies.
        The ST is a neat trick, and fun to learn about. In some cases, it is a very good optimization. But you should pay attention to friedo's comments. Often times the naive sort is fast enough that the overhead of the ST is not worth it, and it will almost always be easier to read than an equivalent ST.

      I think in a case like this, it depends on what the code is meant for. If it's just to play around and to find a faster solution for yourself, using the Schwartzian Transform might be a nice idea. But if it's for a production environment, your code (IMHO) might be better suited since it is more clear. In this case, the comparison is cheap enough to warrant wasting more cycles in favour of readability.

      Remember, there will always be a programmer after you who has to read your code. And only in an ideal world is (s)he familiar with advanced Perl techniques.

Re: Sort on Number Embedded in String
by bpphillips (Friar) on Mar 22, 2005 at 19:12 UTC
    A classic case for a Schwartzian Transform
    for(map {$_->[0]}
        sort {$a->[1] <=> $b->[1]}
        map {[$_,m/fwlog\.(\d+)/]} @files){
        print $_,"\n";
    }
Re: Sort on Number Embedded in String
by JediWizard (Deacon) on Mar 22, 2005 at 19:12 UTC

    You could use a Schwartzian Transform. see http://www.sysarch.com/perl/sort_paper.html

    for (map $_->[1] =>
         sort {$a->[0] <=> $b->[0]}
         map {m/fwlogsum\.(\d+)\w+/; [$1, $_]} @files ){
        print "fwlog" . $_ . ".Mar2005\n";
    }

    A truly compassionate attitude towards others does not change, even if they behave negatively or hurt you

    —His Holiness, The Dalai Lama

Re: Sort on Number Embedded in String
by Roy Johnson (Monsignor) on Mar 22, 2005 at 19:15 UTC
    Sort::Naturally

    Or roll your own with a Schwartzian transform:

    print "$_\n" for map {$_->[1]} sort { $a->[0] <=> $b->[0] } map {[/(\d+)/, $_]} @files;
    or make it work a little harder in the sort routine:
    print "$_\n" for sort { my ($ad, $bd) = map /(\d+)/, ($a, $b); $ad <=> $bd } @files;

    Caution: Contents may have been coded under pressure.
Re: Sort on Number Embedded in String
by Tanktalus (Canon) on Mar 22, 2005 at 19:14 UTC

    The trivial sort operation looks like this:

    for (sort {
             (my $l = $a) =~ /fwlog\.(\d+)/;
             (my $r = $b) =~ /fwlog\.(\d+)/;
             $l <=> $r;
         } @files) {
        print "$_\n";
    }
    Trivial in terms of straight-forward thought and coding time. But that's rather inefficient in CPU time. Better is:
    for (map { $_->[1] } sort { $a->[0] <=> $b->[0] } map { /fwlog\.(\d+)/; [ $1, $_ ] } @files) { print "$_\n"; }
    The key here is to read it backwards. You have @files. You create a mapping of those @files to anonymous arrays where the first element in the array is the number you're sorting on, and the second is the whole filename you started with. You then sort this list of anonymous arrays, comparing on the first element in the array. You then use map to pull out the original filename in the order that sort created. It's a bit convoluted, but once you get your head around it, you'll be impressed by the sheer elegance. I know I was :-)
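    For readers who want to try the "read it backwards" description, here is a small self-contained sketch of both approaches on a shortened file list. In this version the comparison uses the captured day number itself; the variable names and the numbered comments are mine, added for illustration only.

```perl
use strict;
use warnings;

my @files = qw(fwlog.14Mar2005.gz fwlog.1Mar2005.gz fwlog.20Mar2005.gz fwlog.2Mar2005.gz);

# Straightforward sort: extract the day number inside every comparison.
my @naive = sort {
    my ($l) = $a =~ /fwlog\.(\d+)/;
    my ($r) = $b =~ /fwlog\.(\d+)/;
    $l <=> $r;
} @files;

# Same result with the transform; read the steps from the bottom up.
my @st =
    map  { $_->[1] }                   # 3. pull the filename back out
    sort { $a->[0] <=> $b->[0] }       # 2. compare the stored day numbers
    map  { [ /fwlog\.(\d+)/, $_ ] }    # 1. build [ day, filename ] pairs
    @files;

print "$_\n" for @st;
```

    Both lists come out in the same order; the only difference is how often the regex runs.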

    Update: Thanks to bmann on catching the typo in the sort - had a 1 where I should have had a 0.

Re: Sort on Number Embedded in String
by RazorbladeBidet (Friar) on Mar 22, 2005 at 19:16 UTC
    Wow, lots of replies, glad I didn't post mine :D

    BUT - you have to make sure you're sorting on the right thing... you won't be if you just cmp on xxMarxxxx. You should grab the year, month and day and convert the month to the numeric equivalent and then format it as YYYYMMDD. Then <=> it

    Update:
    Took me a little bit to get the hang of it, but this works:
    use strict;

    my @months = qw( Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec );
    my %months;
    my @files = qw (
        fwlog.14Mar2005.gz fwlog.15Mar2005.gz fwlog.16Mar2005.gz
        fwlog.17Mar2005.gz fwlog.18Mar2005.gz fwlog.19Mar2005.gz
        fwlog.1Mar2005.gz  fwlog.20Dec2005.gz fwlog.21Mar2005.gz
        fwlog.2Mar2005.gz  fwlog.3Mar2005.gz  fwlog.4Mar2005.gz
        fwlog.5Jan2006.gz  fwlog.6Mar2005.gz  fwlog.7Mar2005.gz
        fwlog.8Mar2005.gz  fwlog.9Mar2005.gz
    );
    my $i = 1;
    $months{$_} = sprintf( "%02d", $i++ ) for @months;

    print $_, "\n" for
        map  { $_->[0] }
        sort { $a->[1] <=> $b->[1] or $a->[2] <=> $b->[2] or $a->[3] <=> $b->[3] }
        map  {
            (split(/\./,$_,3))[1] =~ /^(\d+)([A-Za-z]+)(\d+)$/;
            [$_, sprintf( "%02d", $3), $months{$2}, $1 ]
        } @files;
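    The YYYYMMDD idea described at the top of this reply can also be written with a single numeric key per file instead of three separate comparisons. A minimal sketch, assuming English month abbreviations; the helper name yyyymmdd is made up for the example, and unparsable names are simply given the key 0.

```perl
use strict;
use warnings;

my %mm;
@mm{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} = 1 .. 12;

# Hypothetical helper: "fwlog.5Jan2006.gz" -> 20060105 (0 if the name cannot be parsed).
sub yyyymmdd {
    my ($file) = @_;
    my ($d, $m, $y) = $file =~ /\.(\d+)([A-Za-z]+)(\d+)\./ or return 0;
    return 0 unless $mm{$m};
    return sprintf '%04d%02d%02d', $y, $mm{$m}, $d;
}

my @files  = qw(fwlog.5Jan2006.gz fwlog.20Dec2005.gz fwlog.14Mar2005.gz fwlog.1Mar2005.gz);
my @sorted = sort { yyyymmdd($a) <=> yyyymmdd($b) } @files;

print "$_\n" for @sorted;
```

    Because the key is zero-padded to a fixed width, it would sort the same way with cmp as with <=>.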
    --------------
    It's sad that a family can be torn apart by such a such a simple thing as a pack of wild dogs
Re: Sort on Number Embedded in String
by Joost (Canon) on Mar 22, 2005 at 19:14 UTC
Re: Sort on Number Embedded in String
by borisz (Canon) on Mar 22, 2005 at 19:14 UTC
    Here is one way:
    print map { $_->[1] }
          sort {$a->[0] <=> $b->[0] }
          map { [ /fwlog\.(\d+)/, $_ ] } @data;
    Boris
Re: Sort on Number Embedded in String
by gam3 (Curate) on Mar 22, 2005 at 20:39 UTC
    Or to sort by the entire date.
    use strict;
    my @files = <DATA>;
    chomp for (@files);
    # Each element is [ day, month name, year, month number, filename ].
    my @sorted = map {$_->[4]}
        # Compare year, then month number, then day, all numerically.
        sort { $a->[2] <=> $b->[2] || $a->[3] <=> $b->[3] || $a->[0] <=> $b->[0] }
        map {[/(\d+)([A-Za-z]+)([0-9]+)/,
              { Jan => 1, Feb => 2, Mar => 3, Apr => 4, May => 5, Jun => 6,
                Jul => 7, Aug => 8, Sep => 9, Oct => 10, Nov => 11, Dec => 12,
              }->{$2},
              $_]} @files;
    print $_, "\n" for @sorted;
    __DATA__
    fwlog.14Jan2005.gz
    fwlog.14Dec2005.gz
    fwlog.14Dec2004.gz
    fwlog.14Dec1901.gz
    fwlog.14Mar2005.gz
    fwlog.15Mar2005.gz
    fwlog.16Mar2005.gz
    fwlog.17Mar2005.gz
    fwlog.18Mar2005.gz
    fwlog.19Mar2005.gz
    fwlog.1Mar2005.gz
    fwlog.20Mar2005.gz
    fwlog.21Mar2005.gz
    fwlog.2Mar2005.gz
    fwlog.3Mar2005.gz
    fwlog.4Mar2005.gz
    fwlog.5Mar2005.gz
    fwlog.6Mar2005.gz
    fwlog.7Mar2005.gz
    fwlog.8Mar2005.gz
    fwlog.9Mar2005.gz
    -- gam3
    A picture is worth a thousand words, but takes 200K.
Re: Sort on Number Embedded in String
by TilRMan (Friar) on Mar 23, 2005 at 05:06 UTC
    I want to sort on the date.

    Try Date::Parse.

    use Date::Parse qw( str2time );

    sub date {
        my ($file) = @_;
        $file =~ /fwlog\.(.*)\.gz/ or warn "Unrecognized filename format";
        return str2time($1);
    }

    my @sorted = sort { date($a) <=> date($b) } @files;
Re: Sort on Number Embedded in String
by tlm (Prior) on Mar 23, 2005 at 13:45 UTC

    With merlyn in the offing it's presumptuous of me to write about the Schwartzian Transform, but here it goes.

    As others have pointed out, you don't need the ST to handle this task, but it is a neat trick that one often finds in Perl code, so it's good to become familiar with it in any case.

    The ST is just an optimization of the sort operation. It often happens that one wants to sort the elements of an array by some property of the elements other than their alphabetic or numeric order. (In your case, this property is a particular substring, or rather, the number represented by a particular substring.) Suppose that the subroutine property computes this property, and let's say, for the sake of this example, that this property happens to be a number. Then, a naive (ascending) sort would be:

    my @sorted = sort { property($a) <=> property($b) } @unsorted;
    This works fine, but it is inefficient, because for every element $x in @unsorted, property($x) is computed in every sort comparison involving $x. That's a lot of redundant computation.
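    To make the redundancy concrete, here is a small sketch that counts how often the property subroutine runs. The call counter and the particular regex inside property are assumptions added for the demonstration; the exact number of comparisons depends on Perl's sort implementation, so no specific count is claimed.

```perl
use strict;
use warnings;

my $calls = 0;

# Stand-in for the "property" subroutine from the text, with a call counter added.
sub property {
    my ($file) = @_;
    $calls++;
    return ( $file =~ /fwlog\.(\d+)/ )[0];
}

# The original poster's file list, in its original order.
my @unsorted = map { "fwlog.${_}Mar2005.gz" }
    14 .. 19, 1, 20, 21, 2 .. 9;

# Naive sort: property() runs twice per comparison.
my @sorted      = sort { property($a) <=> property($b) } @unsorted;
my $naive_calls = $calls;

# Precomputing the keys: property() runs exactly once per element.
$calls = 0;
my @keys       = map { property($_) } @unsorted;
my $once_calls = $calls;

printf "naive: %d calls, precomputed: %d calls, for %d elements\n",
    $naive_calls, $once_calls, scalar @unsorted;
```

    With only 17 files the difference is small in absolute terms, which is why several replies argue the plain sort is good enough here.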

    The insight behind the ST is to perform the computation of the sorting property only once for each element of the array. Instead of sorting the elements of the original array, we sort "transformed" elements, each consisting of an original element (the "payload") and its property (the "key" or "index") packaged in an anonymous array. I.e. we go from sorting this:

    ($foo, $bar, $baz, ...)
    with the sorting function
    { property($a) <=> property($b) }
    to sorting this
    ([$foo, property($foo)], [$bar, property($bar)], [$baz, property($baz)], ... )
    with the sorting function
    { $a->[1] <=> $b->[1] }
    Now, for each element in the array, instead of computing its property many times, we only have to dereference the transformed element (an array ref), which is almost always a much cheaper operation.

    Once the transformed array is sorted, we recover the desired sorted array by pulling out the original elements from the corresponding transformed elements.

    Here is what this strategy looks like when applied to your problem:

    my @st        = map { [ $_, property($_) ] } @files;
    my @sorted_st = sort { $a->[1] <=> $b->[1] } @st;
    my @sorted    = map { $_->[0] } @sorted_st;

    sub property { ( $_[0] =~ /fwlog\.(\d+)\w+/ )[ 0 ] };
    The first line generates the transformed array, whose elements each consist of a "payload" (the original element) and its sorting "property". The second line sorts the transformed array. The third line recovers the "payloads" from the sorted transformed array, to produce the desired sorted array. Actually, notice that only the definition of property is truly specific to your problem; the other lines are pretty generic, modulo sort order and whether the sort is numeric or alphabetic.

    That's all there is to it, although there are a couple more comments worth making about this. One is that programmers often condense this procedure significantly by "pipelining" the three lines above, and building the property subroutine right into the first call to map. I.e., the above would be condensed to

    my @sorted = map  { $_->[0] }
                 sort { $a->[1] <=> $b->[1] }
                 map  { [ $_, ( $_ =~ /fwlog\.(\d+)\w+/ )[ 0 ] ] } @files;
    The second point is that it often happens that the sorting needs to be done on several properties: to order two objects whose first property is the same, we order by the second property; if the second property is also the same for both, we order by the third property, etc. In such a situation, it is useful to use a property that returns an array of all the sorting properties, and then the sort line of the above would look something like this:
    sort { $a->[1] <=> $b->[1] or $a->[2] cmp $b->[2] or $b->[3] <=> $a->[3] }
    In this particular example, the sorting is ascending/numerical by the first property, ascending/alphabetical by the second property, and descending/numerical by the third property.
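    Here is a minimal runnable sketch of such a multi-property sort. The choice of keys (day ascending, file prefix alphabetical, year descending) is purely illustrative, and both the "authlog" prefix and the properties helper are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical mixed file list; the "authlog" prefix is invented for the example.
my @files = qw(
    fwlog.2Mar2005.gz
    authlog.2Mar2005.gz
    fwlog.2Mar2006.gz
    fwlog.1Mar2005.gz
);

my @sorted =
    map  { $_->[0] }
    sort {
           $a->[1] <=> $b->[1]     # day, ascending numeric
        or $a->[2] cmp $b->[2]     # prefix, ascending alphabetic
        or $b->[3] <=> $a->[3]     # year, descending numeric
    }
    map  { [ $_, properties($_) ] }
    @files;

# Returns (day, prefix, year) for names shaped like "prefix.DDMonYYYY.gz".
sub properties {
    my ($file) = @_;
    my ($prefix, $day, $year) = $file =~ /^(\w+)\.(\d+)[A-Za-z]+(\d+)\./;
    return ($day, $prefix, $year);
}

print "$_\n" for @sorted;
```

    Swapping $a and $b on a single line, as on the year comparison, is all it takes to reverse the direction of that one key.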

    A third point, which is just a repetition of other comments, is that it is not a foregone conclusion that the ST is going to yield an improvement in speed, or an improvement in speed noticeable enough to make the added complexity of the code worthwhile.

    The Guttman-Rosler paper cited in another comment describes the ST on its way to presenting a significantly less straightforward, but more powerful, optimization, the Guttman-Rosler transform. It is nowhere near as handy or common as the ST, but if a sort is your program's bottleneck it's worth trying it out.
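    For comparison, here is a rough sketch of the general shape of a Guttman-Rosler style sort applied to these filenames: encode the key as a fixed-width string prefix, use the default string sort with no comparison block, then strip the prefix. The YYYYMMDD key and the month table are my assumptions, and the sketch assumes every name matches the pattern.

```perl
use strict;
use warnings;

my %mm;
@mm{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} = 1 .. 12;

my @files = qw(fwlog.14Mar2005.gz fwlog.5Jan2006.gz fwlog.1Mar2005.gz fwlog.20Dec2004.gz);

# 1. Prefix each name with a fixed-width, 8-character YYYYMMDD key.
my @packed = map {
    my ($d, $m, $y) = /\.(\d+)([A-Za-z]+)(\d+)\./;
    sprintf('%04d%02d%02d', $y, $mm{$m}, $d) . $_;
} @files;

# 2. Default string sort, with no comparison block at all.
my @ordered = sort @packed;

# 3. Strip the 8-character key to recover the original names.
my @sorted = map { substr $_, 8 } @ordered;

print "$_\n" for @sorted;
```

    The speed-up, where there is one, comes from step 2: the sort compares plain strings and never calls back into a Perl block.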

    the lowliest monk

    Update: Added "third point".

Re: Sort on Number Embedded in String
by cog (Parson) on Mar 22, 2005 at 19:14 UTC
    Untested :-)

    for (map { $_->[1] }
         sort { $a->[0] <=> $b->[0] }
         map { [substr($_,6,2), $_] } @files) {
        print "fwlog" . $_ . ".Mar2005\n";
    }

    Update: I didn't notice that some of the lines had only one digit, and not two; hence, my solution with the substr will not work properly. Still, as you can see by the already three similar answers, this (the schwartzian transform) is the right way to go :-)

Re: Sort on Number Embedded in String
by sh1tn (Priest) on Mar 23, 2005 at 10:00 UTC
    ...
    my $sort = sub { $_[0] =~ /$_[1]/ && $1 };
    @data = sort { $sort->($a, qr/(\d+)/) <=> $sort->($b, qr/(\d+)/) } @data;
    print "@data"
    ...
    __END__
    STDOUT:
    fwlog.1Mar2005.gz
    fwlog.2Mar2005.gz
    fwlog.3Mar2005.gz
    fwlog.4Mar2005.gz
    fwlog.5Mar2005.gz
    fwlog.6Mar2005.gz
    fwlog.7Mar2005.gz
    fwlog.8Mar2005.gz
    fwlog.9Mar2005.gz
    fwlog.14Mar2005.gz
    fwlog.15Mar2005.gz
    fwlog.16Mar2005.gz
    fwlog.17Mar2005.gz
    fwlog.18Mar2005.gz
    fwlog.19Mar2005.gz
    fwlog.20Mar2005.gz
    fwlog.21Mar2005.gz


Re: Sort on Number Embedded in String
by iradik (Novice) on Mar 23, 2005 at 05:51 UTC
    I'm surprised at all the solutions. Anyway, just write a better cmp function for sort: pass a glob into sort, and use Date::Manip if you don't care about efficiency. Strip the dates right out of the filename, make sure the date is parseable, and pass it into Date::Manip::ParseDate. This is like a one-liner.
Re: Sort on Number Embedded in String
by jdporter (Paladin) on Mar 23, 2005 at 17:21 UTC
    I'm surprised no one has suggested the GRT yet. This is an ideal case for it.
    my @a = <DATA>;
    chomp @a;
    @a = map { substr $_, 10 }
         sort
         map { sprintf "%10s%s", /(\d+\w{3}\d{4})/, $_ }
         @a;
    print join "\n", @a, '';
    __DATA__
    fwlog.14Mar2005.gz
    fwlog.15Mar2005.gz
    fwlog.16Mar2005.gz
    fwlog.17Mar2005.gz
    fwlog.18Mar2005.gz
    fwlog.19Mar2005.gz
    fwlog.1Mar2005.gz
    fwlog.20Mar2005.gz
    fwlog.21Mar2005.gz
    fwlog.2Mar2005.gz
    fwlog.3Mar2005.gz
    fwlog.4Mar2005.gz
    fwlog.5Mar2005.gz
    fwlog.6Mar2005.gz
    fwlog.7Mar2005.gz
    fwlog.8Mar2005.gz
    fwlog.9Mar2005.gz
Re: Sort on Number Embedded in String
by ww (Archbishop) on Mar 23, 2005 at 15:51 UTC
    Possible typo or cut'n'paste error in the snippet (or, some transformation not mentioned), but as shown, the initial match,
    $_ =~ /fwlogsum\.(\d+)\w+/;
    does not match on anything in the data.

    try:
    $_ =~ /fwlog\.(\d+)\w+/;
    instead... and then note that (unless you can guarantee that the data will never contain more than one month) this node is not helpful in the sense of providing a safe approach for production use.

    There are MANY good suggestions re parsing dates above.

Node Type: perlquestion [id://441568]
Approved by RazorbladeBidet
Front-paged by friedo