http://qs1969.pair.com?node_id=1192943

fredho has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,
I have several files whose names share the same root and differ only in their suffix (e.g. file.001, file.002, file.003), and I need to identify, for each unique root (file), the file whose extension has the highest value (003).
Do I need to push the matching files into an array before sorting the elements on the extension?
Or is there an easier way to proceed?
Thanks for the help

Replies are listed 'Best First'.
Re: Find the last item in a series of files
by tybalt89 (Monsignor) on Jun 16, 2017 at 15:27 UTC
    #!/usr/bin/perl
    # http://perlmonks.org/?node_id=1192943
    use strict;
    use warnings;

    my %names;
    /(.*)\.(.*)/ and $names{$1}[$2] = $_ while <DATA>;
    print $names{$_}[-1] for sort keys %names;

    __DATA__
    file.001
    file.003
    file.002
    one.004
    two.001
    two.003
    one.002
    one.001
    two.002
      Very nice solution!

      A small improvement makes the regex a bit more specific, so that it rejects filenames that do not match the expected file name template.

      use strict;
      use warnings;

      my %names;
      /^([^.]+)\.(\d{3})$/ and $names{$1}[$2] = $_ while <DATA>;
      print $names{$_}[-1] for sort keys %names;

      __DATA__
      file.001
      file.003
      not.good.001
      file.002
      file.10
      one.004
      two.001
      two.003
      two.five
      one.002
      one.001
      one.0039
      two.002
      .005
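To see what the stricter pattern accepts, here is a minimal sketch that runs the same regex over some of the sample names from the post. The helper name `split_series_name` is made up for illustration and is not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): in list context it returns
# (root, extension) when the name fits the "root.NNN" template, and an
# empty list otherwise.
sub split_series_name {
    my ($name) = @_;
    return $name =~ /^([^.]+)\.(\d{3})$/;
}

for my $name (qw(file.001 not.good.001 file.10 two.five one.0039 .005)) {
    my ($root, $ext) = split_series_name($name);
    printf "%-13s %s\n", $name,
        defined $root ? "accepted (root=$root, ext=$ext)" : "rejected";
}
```

Of these names, only file.001 fits the template: the root may not contain a dot, and the extension must be exactly three digits.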

      CountZero

      "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      My blog: Imperial Deltronics
      Here is my piece of code. I'm not sure this is the best way to do it.
      chdir ($folder);
      while ($file = <*>){
          my ($ext) = $file =~ /(\.[^.]+)$/;
          #Check file extension
          if (($ext =~ m/00./) and ($ext ne ".001")){
              next;
          }
          elsif ($ext eq ".001") { # first file is required
              my $filenameroot = $file;
              $filenameroot =~ s/(.+)\.[^.]+$/$1/; # File name root
              my @list = glob("$folder$filenameroot*");
              print "Last element : $list[(scalar @list-1)]\n"; # Last file of the series
          }
      }
        I see your thought process, this is very close. Since you were nice enough to comply with the "hey, show us what you got" request, I'll make a few comments which I hope will be helpful for you in writing future code...

        • if (($ext =~ m/00./) and ($ext ne ".001")){ The first conditional of the "and" is not needed.
          ($ext ne ".001") says it all.
        • elsif ($ext eq ".001") { With the simplified test above, this "if" is not needed either. $ext has to be equal to ".001" if you get to this point, because the previous lines have rejected any value that wasn't equal to ".001".
        • Adding comment about:
          my $filenameroot = $file; $filenameroot =~ s/(.+)\.[^.]+$/$1/; # File name root

          This is fine: you make a copy of "$file" by assigning it to a new variable, "$filenameroot". Then you use a substitute operation to modify $filenameroot. This works. However, consider:
          (my $filenameroot) = $file =~ m/(.+)\.[^.]+$/;
          In general, a substitute operation is more "expensive" than a simple "return a value" operation, because the input string must be modified instead of selected parts just being copied. If you put the LHS (left-hand side) into list context, you can assign $1, and even $2, $3, ..., directly from a match. Here $1 gets assigned to $filenameroot, with no substitution operation required. This also avoids having $filenameroot temporarily hold a value that is "not quite correct" yet.
        • my @list = glob("$folder$filenameroot*"); I am not sure whether glob() returns a sorted list. Even if it does, the order would be a character-string sort and not a numeric sort. This can make a big difference, as "13" sorts lower than "3". The difference between string and numeric sorting is something to consider whenever you have numeric values. I don't know for sure whether this is a problem here, but always include some double-digit numbers in your test cases.
        • The big issue with the glob() is that you are re-reading the directory multiple times. File system operations are "expensive" in terms of CPU. Get in the habit of trying to do a directory read "only once", and store the result in your own data structure if you have to. In your application I don't expect any performance issue, but this is something to be aware of in the future.
        • print "Last element : $list[(scalar @list-1)]\n"; That does indeed get the last element of @list. However, that last element might not be the file with the largest extension number, because of the potential sorting issues mentioned above. Note that this is better written as $list[-1]. In Perl, index -1 is the last item, -2 is the next to last, and so on. A very handy concept. Your code is correct; I am just mentioning that there is a better syntax for this.
        • I direct your attention to the code by BillKSmith, tybalt89 and CountZero. This is clever in how it works. I think some further explanation may be helpful to you.

          This builds a HoA (Hash of Arrays) called %names. What is special is that the array @{$names{"name"}} is what is called a "sparse array": not every element of the array has an assigned value. Perl allows this. If, say, @array only has 3 things in it, you can still assign $array[14]="Something";. The values in between will be "undef" (undefined), but that is just fine. A numeric sort to get the "largest suffix number" is unnecessary; just using the [-1] index is enough. The sort of keys %names just puts the root names in alphabetical order and has nothing to do with determining the highest-numbered suffix. Added: look at Laurent_R's code also.

          I recommend that you use some adaptation of the HoA code or Laurent_R's code. Both look great to me.

          Welcome to the group! You will get a lot of help here. In general more help is forthcoming when you demonstrate some effort on your part (which you did).
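The points above about list-context matching, sparse arrays and the -1 index can be put together in a small sketch. The sub name `last_of_each_series` and the sample names are illustrative assumptions; the approach is the hash-of-arrays idea from the earlier replies, not anyone's exact code.

```perl
use strict;
use warnings;

# Illustrative sketch: map each root name to its highest-numbered file.
sub last_of_each_series {
    my @names = @_;
    my %series;    # root => sparse array indexed by the numeric extension
    for my $name (@names) {
        # List-context match: the captures are assigned directly, no s/// needed.
        # Names that do not look like "root.digits" are skipped.
        my ($root, $ext) = $name =~ /^(.+)\.(\d+)$/ or next;
        $series{$root}[$ext] = $name;    # e.g. "013" is used as index 13
    }
    # The last slot of each array holds the largest extension seen.
    return map { $_ => $series{$_}[-1] } keys %series;
}

my %last = last_of_each_series(qw(log.3 log.13 log.2 one.004 one.001));
print "$_ => $last{$_}\n" for sort keys %last;
```

Because the extension is used as a numeric array index, log.13 ends up after log.3 even though a plain string sort would order them the other way round. Note that the array is allocated up to its highest index, so this suits small extension numbers.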

      okay. i haven't tested this code...but, here's what i got...
      # firstly, i'm gonna use the working directory, for laziness' sake! lol
      # secondly, i haven't thoroughly tested this. 'sub external_files($$)' is tested, and does work according to my tests
      # i'm working in a windows 10 environment, apache24 and activestate's perl 5.020002 (i think that version # is right)
      #
      # thirdly, this script assumes all the files in the folder are named with .xxx where each x is a digit 0..9
      # fourth, this will do no error checking! it will work perfect, so long as you adhere to the file extension convention
      # fifth, and finally, i have not tested this code
      ##############################
      # i copied this from a project i'm working on
      # yes. i use prototypes. SUE me!
      sub external_files($;$) {
        #*
        # lists files within a specified folder (eg: config, txt)
        # folders will not be included in this list - just the filenames only
        # if no type is provided, *.* is assumed
        # type should be just "png" or "txt", no need to include a leading dot
        #*
        my ($folder, $type) = @_; # a location (eg: users), relative to webroot && a file type

        if ($type) {
          # the following is just in case the user of this
          # subroutine ignores instructions (mainly me lol)
          $type =~ s/(\*)*//g; # remove stars
          $type =~ s/(\.)*//; # remove dots
          $type =~ s/\///g; # remove forward slashes
          if ($type) { $type = ".$type"; }
        }

        if ($folder) {
          # same idea here as for $type
          # this one, however, may seem weird, but i've
          # found it better to account for all possibilities
          # rather than leave it up to the user of this
          # code to ensure correct params are given
          #
          # besides, i tend to forget to follow my own
          # instructions, so this saves me tons of head
          # scratching, see?
          $folder =~ s/(\/)*$//; # remove trailing /'s
          $folder =~ s/^(\/)*//; # remove leading /'s
          $folder =~ s/\/\//\//g; # convert //'s to /
          $folder .= "/"; # attach trailing /*
        }

        my @fixed;
        my $filespec = $folder . "*" . $type;
        my @dirs = glob($filespec);
        $folder =~ s/\./\\./g;
        $folder =~ s/\//\\\//g;
        foreach my $dir (@dirs) {
          if (-f $dir) {
            $dir =~ s/$folder//;
            push (@fixed, $dir);
          }
        }

        return @fixed; # an array
        #usage: my @fileList = external_files("D:/", "txt");
      } # end of sub external_files($$);

      #sub get_last($) { # you could uncomment this line...and turn the following into a sub!
      #my ($folder) = @_; # and yes, i do this, too! again, sue me (i believe wholeheartedly, and pedantically so, in the K.I.S.S concept)

      # my @files = external_files($folder); # i'll leave it up to you to make sure $folder is a valid location, but give it whatever you like, really
      my @files = external_files("d:/myNumberedFiles");

      # @files should now contain all yer files stored in d:/myNumberedFiles/
      # now, you want the file with an extension that works out to being the highest #?
      # easy!
      # first, i'm gonna rip through the list, and build a new one.
      # the new one will contain just the extension with no dots.
      # leading zeros will be removed from the extension. this should
      # result in a list with elements that are just numbers.
      # then, i'm gonna sort the bugger, and pit out the last element.
      my @exts = ();
      foreach my $file (@files) {
        $file =~ s/^(.)*\.(0)*//; # remove everything before and including the dot and any leading zeros after the dot
        # now, pop that into your list
        push @exts, $file;
      }

      # now sort the list!
      sort @exts;
      print $exts[$#exts];
      #return $exts[$#exts];
      #}

      # and you have yer answer...
      #you could drop the above "main" code into a sub of it's own, too, of course.
      #just uncomment the #sub... line and the line after it, and the #return and #} lines at the bottom

      i hope this one works, and doesn't get too butchered by the rest of the monks here :D i like to think i'm pretty decent at this coding thing, so, go easy on me. i'm 100% self taught, and i have no personal group of PERL programmers in my midst - i'm alone, and i'm a one man band.

      sincerely,

      jamroll
        i haven't tested this code...

        Having a variety of test cases is important. I admit I haven't tested your code myself, but if you had tested it with multiple cases, you might have found that, for example, sort @exts; isn't doing what you want. Also, I can warmly recommend one of the filename manipulation modules like Path::Class, or perhaps File::Spec (a core module). If you use the former, you can even use its methods to list the files in the directory (->children). A few more suggestions: be careful with if ($folder), since that will test negative when $folder happens to be "0" (Truth and Falsehood). You probably want to use length or defined tests instead (the same goes for if ($type), of course). Also, I think you might have missed a /g on your "remove dots" regex?
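To make the point about sort @exts; concrete: sort returns a new list and leaves its argument untouched, and its default comparison is by string. A small sketch with made-up values, which also contrasts a plain truth test with a length test:

```perl
use strict;
use warnings;

my @exts = (3, 13, 2);

# sort does not modify @exts; the sorted list has to be assigned somewhere.
my @as_strings = sort @exts;                 # default: string comparison
my @as_numbers = sort { $a <=> $b } @exts;   # explicit numeric comparison

print "string sort : @as_strings\n";
print "numeric sort: @as_numbers\n";
print "largest     : $as_numbers[-1]\n";

# A length test treats the folder name "0" as a real value,
# whereas a plain truth test considers it false.
for my $folder ("0", "") {
    printf "folder '%s': truth=%s length=%s\n", $folder,
        ($folder ? "yes" : "no"), (length($folder) ? "yes" : "no");
}
```

With these values the string sort puts 13 first, while the numeric sort puts it last, which is the element the original code was after.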

        Update 2019-08-17: Updated the link to "Truth and Falsehood".

Re: Find the last item in a series of files
by 1nickt (Canon) on Jun 16, 2017 at 14:47 UTC

    Hi, what do you have so far? Can you show your code please? An array of file names seems like a good start, after you get the file suffixes.

    You might like Path::Tiny's iterator() method to find the files, and File::Basename::fileparse() for getting the filename suffix.

    Hope this helps!


    The way forward always starts with a minimal test.
Re: Find the last item in a series of files
by Laurent_R (Canon) on Jun 16, 2017 at 20:28 UTC
    I'm a bit reluctant to sort a whole array if all that is needed is to find a maximum value, because the algorithmic complexity is higher (meaning it is in theory less efficient). Having said that, I must admit it probably does not matter unless the number of files is very high.

    Anyway, you might avoid a sort with something like this:

    my %hash;
    for my $file (glob("*.*")) {
        my ($root, $ext) = split /\./, $file;
        if (defined $hash{$root}) {
            $hash{$root} = $ext if $ext > $hash{$root};
        } else {
            $hash{$root} = $ext;
        }
    }
    The if ... else statement could be reduced to a simple if statement and a Boolean operator:
    $hash{$root} = $ext if (not defined $hash{$root}) or $ext > $hash{$root};
    but I wanted to make it as easy to read as possible.
      A simple split to two parts may not be enough depending upon the OP's filenames and how many '.' characters might be contained within those names. I think that a regex assignment would be more appropriate instead of a split. But it could be that this is all the OP needs.

      I don't like your one line if..else because it is "hard to understand" and confers no execution advantage.

      Anyway, a nice algorithm idea that compares favorably with the HoA ideas previously posted.

        I agree with you. I chose split on the basis of the example provided in the OP (file.001, file.002, ...). A regex would be more robust for more complex filenames.

        I also think that the one-line version of the if ... else statement is less easy to understand (that's why I used the other version in the complete code version), I only wanted to show it could also be made more concise.

        A simple split to two parts may not be enough depending upon the OP's filenames and how many '.' characters might be contained within those names. I think that a regex assignment would be more appropriate instead of a split. But it could be that this is all the OP needs.

        That's why using File::Basename seems a good idea, as it provides an easy interface to extract the path, the file's basename and the extension.
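A minimal sketch of that idea, using fileparse() from the core File::Basename module. The suffix pattern qr/\.\d+/ and the helper name `split_numbered` are assumptions made for illustration, based on the numeric extensions in the original question.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Hypothetical helper: split a path into directory, root and numeric suffix.
# fileparse() matches the suffix pattern against the end of the file name,
# so extra dots inside the root are left alone.
sub split_numbered {
    my ($path) = @_;
    my ($root, $dir, $suffix) = fileparse($path, qr/\.\d+/);
    return ($dir, $root, $suffix);
}

for my $path ("data/my.report.002", "file.001", "notes.txt") {
    my ($dir, $root, $suffix) = split_numbered($path);
    print "$path: dir=$dir root=$root suffix=",
        ($suffix eq '' ? '(none)' : $suffix), "\n";
}
```

When no suffix matches, as for notes.txt here, fileparse() returns an empty suffix and the whole file name as the root.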

        CountZero

Re: Find the last item in a series of files
by BillKSmith (Monsignor) on Jun 16, 2017 at 15:16 UTC
    You need one hash. In one pass, store basename as key and the largest extension found so far as the corresponding value. At the end of that pass, the hash contains exactly what you want. All you have left to do is format it.
    Bill
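The one-hash, one-pass idea could be sketched as follows. The sub name `highest_extension` and the "root.digits" naming assumption are illustrative and not taken from the post.

```perl
use strict;
use warnings;

# Sketch: one pass over the names, keeping the largest extension per root.
sub highest_extension {
    my @names = @_;
    my %max;    # root => largest extension seen so far (kept as a string)
    for my $name (@names) {
        # Skip anything that does not look like "root.digits".
        my ($root, $ext) = $name =~ /^(.+)\.(\d+)$/ or next;
        $max{$root} = $ext
            if !defined $max{$root} or $ext > $max{$root};
    }
    return %max;
}

my %max = highest_extension(qw(file.001 file.003 file.002 one.004 two.001 two.003));
# Formatting step: rebuild the file name from root and extension.
print "$_.$max{$_}\n" for sort keys %max;
```

The extension is compared numerically but stored as the original string, so leading zeros survive when the file name is rebuilt for output.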