in reply to Date::Manip and German months names

I searched the Date::Manip source for [a-z] and found a lot of hits.

EDIT: Ignore that. I read more of the code, and it does know about international characters. Still, it is possible that a regex somewhere is faulty.

Re^2: Date::Manip and German months names (solved)
by almut (Canon) on Jul 09, 2008 at 20:27 UTC
    I searched the Date::Manip source for [a-z] and found a lot of hits.

    Good point. I did some more digging myself, and it appears to be a bug in Date::Manip. There is some replacement magic going on to allow the use of "m" in place of "Monat"...

    # Check for some special types of dates (next, prev)
    foreach $from (keys %{ $Lang{$L}{"Repl"} }) {
      $to=$Lang{$L}{"Repl"}{$from};
      s/(^|[^a-z])$from($|[^a-z])/$1$to$2/i;
    }

    As you correctly observed, this is one place that uses the character class [^a-z] to delimit tokens. Since "ä" is not in a-z, the "M" in "Mär" is treated as a standalone token and gets replaced, so "Mär" ends up as "Monatär" at this stage, which then cannot be parsed any further...
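For anyone who wants to reproduce this outside of Date::Manip, here is a minimal sketch of the same substitution. The %repl hash and the apply_repl() helper are hypothetical stand-ins for $Lang{$L}{"Repl"} and the loop above; only the "m" => "Monat" pair comes from the module.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $Lang{$L}{"Repl"} (German).
my %repl = (m => 'Monat');

# Apply the same substitution as the loop quoted above.
sub apply_repl {
    my ($text) = @_;
    for my $from (keys %repl) {
        my $to = $repl{$from};
        $text =~ s/(^|[^a-z])$from($|[^a-z])/$1$to$2/i;
    }
    return $text;
}

# "Mär" as Latin-1 bytes: \xe4 is not in a-z, so "M" looks like a lone token.
print apply_repl("M\xe4r") eq "Monat\xe4r" ? "mangled\n" : "intact\n";   # mangled

# "Mai" is safe because "a" is inside a-z.
print apply_repl("Mai") eq "Mai" ? "intact\n" : "mangled\n";             # intact
```

Only month names whose second character falls outside a-z are affected, which is presumably why this went unnoticed for the other months.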

    Substituting [^a-z\xe4] (for testing purposes) fixes the issue with "Mär", but a proper solution would of course have to dynamically construct the correct character set depending on the language being selected...
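One way such a per-language character set could be built is sketched below. This is only an illustration under assumptions: the @words list is hypothetical (real code would take it from the selected language's data), and apply_repl() is a made-up helper, not anything in Date::Manip.

```perl
use strict;
use warnings;

# Hypothetical word list; real code would take it from the selected language's data.
my @words = ("Januar", "M\xe4r", "M\xe4rz", "Dezember");

# Collect every character in the words that is not an ASCII letter.
my %extra;
for my $word (@words) {
    $extra{$_} = 1 for grep { !/[A-Za-z]/ } split //, $word;
}

# Build the body of a character class, here 'a-z\x{e4}'.
my $letters = 'a-z' . join '', map { sprintf('\\x{%x}', ord($_)) } sort keys %extra;
my $delim   = qr/[^$letters]/i;

# Same substitution as in Date::Manip, but with the generated delimiter class.
sub apply_repl {
    my ($text, $from, $to) = @_;
    $text =~ s/(^|$delim)$from($|$delim)/$1$to$2/i;
    return $text;
}

print apply_repl("M\xe4r", "m", "Monat") eq "M\xe4r" ? "intact\n" : "mangled\n";   # intact
print apply_repl("3 m", "m", "Monat"), "\n";                                       # 3 Monat
```

With the word list above this produces the same class as the hand-written [^a-z\xe4] test fix.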

    I'll submit a bug report.   (Update: done)

    For the moment, I can live with simply disabling that curious "m" => "Monat" mapping feature, as follows, in _Date_Init_German():

    ...
    #$$d{"replace"} =["m","Monat"];
    $$d{"replace"} =[];

      a proper solution would of course have to dynamically construct the correct character set depending on the language being selected.

      A simpler solution might be

      foreach $from (keys %{ $Lang{$L}{"Repl"} }) {
        $to=$Lang{$L}{"Repl"}{$from};
        utf8::upgrade($from); # Use Unicode semantics for \b
        s/\b$from\b/$to/i;
      }

      He's already assuming $from doesn't contain symbols, since he's not using quotemeta, so using \b doesn't introduce any limitations.

      My solution will also make "MÄR" work, unlike the current implementation and your proposed solution.

      Update: Shoot! \w includes digits, so \b won't do. There's a POSIX class that includes just letters that does the trick:

      utf8::upgrade($from); # Use Unicode semantics
      s/(^|[^[:alpha:]])$from($|[^[:alpha:]])/$1$to$2/i;
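To illustrate why \b falls short when a digit sits next to the token, here is a small ASCII-only sketch. The input "3m" is a hypothetical example of a unit letter directly following a number; the "m" => "Monat" pair is the one discussed above.

```perl
use strict;
use warnings;

my ($from, $to) = ("m", "Monat");

# Hypothetical input: the unit letter directly follows a number.
my $with_b     = "3m";
my $with_alpha = "3m";

# \b needs a \w / non-\w transition, but "3" and "m" are both \w.
$with_b     =~ s/\b$from\b/$to/i;

# The letters-only class treats the digit as a delimiter.
$with_alpha =~ s/(^|[^[:alpha:]])$from($|[^[:alpha:]])/$1$to$2/i;

print "$with_b\n";       # 3m
print "$with_alpha\n";   # 3Monat
```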

      Update: As discovered below, what needs to be upgraded is the string on which s/// acts.

      utf8::upgrade($_); # Use Unicode semantics
      s/(^|[^[:alpha:]])$from($|[^[:alpha:]])/$1$to$2/i;

        Yes, that looks like a good (simple) solution.  Interestingly though

        s/(^|[^[:alpha:]])$from($|[^[:alpha:]])/$1$to$2/i;

        only works for me when I use locale (which I may not necessarily want to do in this case), while

        s/(^|[^\p{IsAlpha}])$from($|[^\p{IsAlpha}])/$1$to$2/i;

        does work without...
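The difference can be checked in isolation with a sketch like the following. It assumes a reasonably recent Perl (5.14 or later) with neither use locale nor the unicode_strings feature enabled; older versions may behave differently.

```perl
use strict;
use warnings;

my $bytes    = "\xe4";       # "ä" as a single Latin-1 byte, not UTF-8 flagged
my $upgraded = $bytes;
utf8::upgrade($upgraded);    # same character, stored internally as UTF-8

# POSIX class: depends on how the target string is stored.
my $posix_on_bytes    = $bytes    =~ /[[:alpha:]]/ ? 1 : 0;
my $posix_on_upgraded = $upgraded =~ /[[:alpha:]]/ ? 1 : 0;

# Unicode property: a \p{...} in the pattern selects Unicode rules.
my $prop_on_bytes     = $bytes    =~ /\p{IsAlpha}/ ? 1 : 0;

printf "[[:alpha:]] on byte string:     %d\n", $posix_on_bytes;      # 0
printf "[[:alpha:]] on upgraded string: %d\n", $posix_on_upgraded;   # 1
printf "\\p{IsAlpha} on byte string:     %d\n", $prop_on_bytes;      # 1
```

This matches both observations in the thread: the POSIX class needs the target string upgraded (or a locale), while the Unicode property does not.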

        My solution will also make "MÄR" work

        ...presuming other changes are made as well, i.e. adding another list of month abbreviations to the definition of $$d{"month_abb"}=...