Ionizor has asked for the wisdom of the Perl Monks concerning the following question:

I have the following piece of code in which I'm trying to split a filename into a prefix, a number, and a suffix:

sub get_split {
    my $filename = shift;
    my $switches = shift;
    my %filesplit;

    my @digits = $filename =~ /(\d+)/g
        or die "Error: Could not extract a number from filename '$filename'.\n";

    $filesplit{digit} = $digits[$switches->{numindex}];
    ### This won't work for files like 01-file01.html!! FIX!!

    ($filesplit{prefix}, $filesplit{suffix}) = split (/$filesplit{digit}/, $filename, 2);
    ### This won't work for files like 01-file01.html!! FIX!!

    return \%filesplit;
}

Where this breaks down is if I'm trying to slice up a filename where the number is repeated more than once and I'm trying to slice the filename at a specific instance (the instance I'm interested in is recorded in $switches->{numindex}), e.g. I would like to be able to split 01-file-01.html into ('01-file-', '01', '.html'). The code above will always split the filename into ('', '01', '-file-01.html') which is frequently not what I want.

Help is, as always, appreciated.

--
Grant me the wisdom to shut my mouth when I don't know what I'm talking about.

Re: Specific instance of a repeated string
by antirice (Priest) on Aug 09, 2003 at 22:00 UTC

    What is $switches supposed to contain beyond the particular group of numbers it should split upon? Anyhow, this will be messy:

    #!/usr/bin/perl -wl
    use Data::Dumper;

    sub get_split {
        my $filename = shift;
        my $switches = shift;
        my %filesplit;

        my @digits = $filename =~ /(\d+)/g
            or die "Error: Could not extract a number from filename '$filename'.\n";
        $filesplit{digit} = $digits[$switches->{numindex}];

        # how many $filesplit{digit} can we find before this sucker?
        my $splits = 2 + grep($_ eq $filesplit{digit}, @digits[0..$switches->{numindex}-1]);
        my @temp = split (/$filesplit{digit}/, $filename, $splits);
        $filesplit{suffix} = pop(@temp);
        $filesplit{prefix} = join $filesplit{digit}, @temp;
        return \%filesplit;
    }

    print Dumper(get_split("01-file01.html",{numindex=>1}));
    print Dumper(get_split("01-file01and01.html",{numindex=>1}));
    print Dumper(get_split("02-file01tom34bill01.html",{numindex=>3}));

    __DATA__
    $VAR1 = {
              'digit' => '01',
              'suffix' => '.html',
              'prefix' => '01-file'
            };
    $VAR1 = {
              'digit' => '01',
              'suffix' => 'and01.html',
              'prefix' => '01-file'
            };
    $VAR1 = {
              'digit' => '01',
              'suffix' => '.html',
              'prefix' => '02-file01tom34bill'
            };

    That is some ugly code. Hope this helps.
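
One way to avoid counting repeated digit strings is to record where each run of digits starts and ends, then cut the filename at those offsets. The following is only a sketch of that idea, not the code from the original script: the name `split_at_number` and its 0-based `$index` argument are assumptions made for illustration (a negative index counts from the end, as with any Perl array).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split $filename around the number at position $index
# (0-based; negative values count from the end, like Perl array indices).
sub split_at_number {
    my ($filename, $index) = @_;

    # Record the start and end offset of every run of digits.
    my @spans;
    while ($filename =~ /\d+/g) {
        push @spans, [ $-[0], $+[0] ];
    }
    die "Error: Could not extract a number from filename '$filename'.\n"
        unless @spans;

    my $span = $spans[$index]
        or die "Error: No number at index $index in '$filename'.\n";
    my ($start, $end) = @$span;

    # Cut by position, so repeated digit strings cannot confuse the split.
    return {
        prefix => substr($filename, 0, $start),
        digit  => substr($filename, $start, $end - $start),
        suffix => substr($filename, $end),
    };
}

my $parts = split_at_number('01-file-01.html', 1);
print "$parts->{prefix}|$parts->{digit}|$parts->{suffix}\n";   # 01-file-|01|.html
```

Because the cut is made by offset rather than by matching the digit string again, a name such as 101-file-01.html should not trip it up either.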

    antirice    
    The first rule of Perl club is - use Perl
    The ith rule of Perl club is - follow rule i - 1 for i > 1

      Switches contains all the command line switch settings:

      SWITCHES

      All switches are optional.

      The numeric-index switch should only be used when a filename has multiple numbers in it, e.g. 01-file01.html. This switch defaults to -1 which is the last number in the filename. Specifying the index as 1 will force the script to increment the first set of numbers. Specifying the index as 2 will force the script to increment the second set of numbers (which is redundant since the last set of numbers is the default anyway). Again, you get enough rope to hang yourself so don't use an index higher than the number of numbers in the file name.

      The precision switch controls how many zeros are prepended to 'short' numbers, i.e. should the first file be file1.html, file01.html, file001.html, etc. For default values, the script first looks at the precision of min if it's present, then max. If neither value is specified, the script defaults to the precision in the input URI, meaning if you use the filename file23.html you'll get two digits of precision whether you want them or not.

      The reverse switch simply prints out the list of URIs in order from max to min rather than from min to max.

      The verbose switch turns on some basic warnings such as the detected precision and whether or not the min and max values were swapped.
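
For readers following along, the precision and reverse behaviour described above could be sketched roughly like this. The full script is not shown in this thread, so `make_list` and its parameters are hypothetical names, and the min/max swap is an assumption based on the verbose switch description.

```perl
use strict;
use warnings;

# Hypothetical helper: build the list of names from $min to $max,
# zero-padding each number to $precision digits.
sub make_list {
    my ($prefix, $suffix, $min, $max, $precision, $reverse) = @_;

    # Assumption: swap the bounds if they were given backwards.
    ($min, $max) = ($max, $min) if $min > $max;

    # '%0*d' takes the field width from the argument list.
    my @names = map { sprintf '%s%0*d%s', $prefix, $precision, $_, $suffix }
                $min .. $max;

    @names = reverse @names if $reverse;
    return @names;
}

# Default precision: the width of the number as written in the input name.
my $digit     = '01';
my $precision = length $digit;

print "$_\n" for make_list('01-file-', '.html', 1, 3, $precision, 0);
```

With the values shown this prints 01-file-01.html, 01-file-02.html and 01-file-03.html, one per line.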


Re: Specific instance of a repeated string
by graff (Chancellor) on Aug 09, 2003 at 22:29 UTC
    Unless there's some more definite design or convention for defining what the prefix and suffix are supposed to be, this endeavor is going to smell bad. Perhaps your notion of "prefix-number-suffix" began when there were only file names like "file-01.html" -- but now someone has invented file names like "01-file-01.html", and maybe next month, they'll come up with "01-file-01-new-01.html", "file-01-03-01-new.html", and who knows what else.

    Folks here may be able to help more if given a bit more information about your situation.

      The prefix can contain numbers, as can the suffix. The idea is to increment one number to generate a list, e.g.

      01-file-01.html
      01-file-02.html
      01-file-03.html
      ...

      Only one number is ever going to form the basis for the list so the other numbers in a filename should be part of either the prefix or the suffix. I get the feeling I'm not explaining this very well...


Re: Specific instance of a repeated string
by Ionizor (Pilgrim) on Aug 09, 2003 at 22:43 UTC

    I was hoping what I was trying to accomplish would be clear without having to post all the code but apparently it's not. Here is the full code for my script, which is a rewrite of the script in this node to create a sequential list of files.


      Wow, it's 03:45, so I'm going to have to tickle the back of my throat and spew what I've got so far.   Besides the comments and debug stuff there's really only 8-10 lines of real code in the loop that does the good stuff.   Anyway, pardon the mess, it's an inspiration still steaming from the source...

      The idea I had was getting an RE to create an array of string pieces, alternating number and non-number chunks.   If you could then figure out the array index of the requested number chunk you'd operate on that.   And then collapse the pieces back together into the updated filename string when needed.

      This is a lot just to show an idea, but you might not have seen an RE do something like this before ...
      my @a = $filename =~ m/ (\D+)? (\d+)? /xg;
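
A minimal sketch of the chunk idea follows, under some assumptions: it uses a simpler pattern, `/(\d+|\D+)/g`, so that no empty or undefined pieces come back, and the helper names `chunks` and `nth_number_pos` are invented here for illustration rather than taken from the code described above.

```perl
use strict;
use warnings;

# Break a name into alternating digit and non-digit chunks.
sub chunks {
    my ($filename) = @_;
    return $filename =~ /(\d+|\D+)/g;
}

# Return the position within @chunks of the $n-th (0-based) numeric chunk,
# or undef if there is no such number.
sub nth_number_pos {
    my ($n, @chunks) = @_;
    my @numeric = grep { $chunks[$_] =~ /^\d/ } 0 .. $#chunks;
    return $numeric[$n];
}

my @chunks = chunks('01-file-01.html');        # ('01', '-file-', '01', '.html')
my $pos    = nth_number_pos(1, @chunks);
die "no such number\n" unless defined $pos;

# Everything before the chosen chunk is the prefix, everything after the suffix.
my $prefix = join '', @chunks[0 .. $pos - 1];
my $suffix = join '', @chunks[$pos + 1 .. $#chunks];

# Increment the chosen number, keeping its original width.
$chunks[$pos] = sprintf '%0*d', length($chunks[$pos]), $chunks[$pos] + 1;

print join('', @chunks), "\n";                 # 01-file-02.html
```

Collapsing the pieces back into a filename is then just a `join` on the array, which is the attraction of this approach.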

        Some excellent inspiration. Thank you.


      Instead of using a regex followed by a split, try using just a regex.
      # (?<!\d) keeps the greedy (.*) from swallowing the leading digits of the number
      if ( $filename =~ /^(.*)(?<!\d)(\d+)\.(.*)$/ ) {
          $prefix = $1;
          $digit  = $2;
          $suffix = $3;
      }
      else {
          print "Error etc\n";
      }
      You may need to tweak the regex. I am unable to test it from my current location.

        Unfortunately this won't do what I need it to do. This will only work for files that look like foo01.bar. Some of my files look like: foo10bar.baz. This regex also assumes that it's the last number in the filename that I want to operate on which isn't always the case.

        Thanks for the suggestion though, it is appreciated.
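
A single regex can still be made to pick a particular number by stepping over a fixed count of earlier numbers first. The sketch below is one possible take on that, not something proposed in this thread: `split_by_regex` and `$skip` are hypothetical names, the possessive quantifiers need Perl 5.10 or later, and only non-negative counts are handled (so a "last number" default of -1 would need separate treatment).

```perl
use strict;
use warnings;

# Hypothetical helper: $skip is how many numbers to step over before the
# one we want, i.e. a 0-based index. Returns (prefix, digit, suffix),
# or an empty list when the name has too few numbers.
sub split_by_regex {
    my ($filename, $skip) = @_;
    die "skip must be a non-negative integer\n" unless $skip =~ /^\d+$/;

    # Possessive quantifiers (*+ and ++) stop the engine from backtracking
    # into a number and splitting it, e.g. treating "10" as "1" and "0".
    if ($filename =~ /^((?:\D*+\d++){$skip}\D*+)(\d+)(.*)$/s) {
        return ($1, $2, $3);
    }
    return;
}

my @parts = split_by_regex('01-file-01.html', 1);
print join('|', @parts), "\n";
```

For the example name this gives the prefix 01-file-, the digit 01 and the suffix .html, and a name like foo10bar.baz with `$skip` set to 0 yields foo, 10 and bar.baz.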


Re: Specific instance of a repeated string
by CombatSquirrel (Hermit) on Aug 10, 2003 at 16:41 UTC
    If you want to be able to change any number in the file name, you would probably like to have it split into chunks. That is my idea so far. I realize that shenme has already written a piece of code utilizing this, but TIMTOWTDI, so I decided to write another:
    #!perl -w
    use strict;

    for my $name (<DATA>) {
        chomp $name;
        my %parts;
        $name =~ s/(.*)(\.[^.]*)/$1/;
        $parts{"suffix"} = $2 or die "Not a valid file name: $name";
        $parts{"prefix"} = [];
        $parts{"number"} = [];
        while ($name =~ s/^(\D*)(\d+)//) {
            push @{$parts{"prefix"}}, $1;
            push @{$parts{"number"}}, $2;
        }
        $name and die "Invalid file name format. Rest '$name' remained";
        print "Filename splits as follows: ["
            . join("][", map { "(" . $parts{"prefix"}->[$_] . ")(" . $parts{"number"}->[$_] . ")" }
                         0..@{$parts{"number"}}-1)
            . "]<" . $parts{"suffix"} . ">\n";
    }

    __DATA__
    01-html02.html
    01-htm23-43.htm
    01-file-01.html

    The program first extracts the file suffix (file ending after the dot, I hope that I didn't misunderstand you here) and then loops through the file name, taking (possibly) a prefix and (definitely) a number from it and storing it in anonymous arrays in $parts{"prefix"} and $parts{"number"}.
    If you want to increment the $ith number now, you would just have to write
    ++$parts{"number"}->[$i];
    $filename = join('', map { $parts{"prefix"}->[$_] . $parts{"number"}->[$_] } 0..@{$parts{"number"}}-1)
              . $parts{"suffix"};


    Hope that helped.

      I guess I wasn't clear. The file suffix is the part of the file after the number I'm operating on (for the file foo10bar.baz the suffix would be bar.baz). In most cases the suffix is just the file extension, but I'm trying to make the script handle any filename regardless of the position of the digits.

      I've gotten some inspiration from your code though, so thanks! I'll let you know how it turns out.
