
I searched the date::manip source for [a-z] and found a lot of hits.

Good point. I did some more digging myself, and it appears to be a bug in Date::Manip. There is some replacement magic going on to allow the use of "m" in place of "Monat"...

# Check for some special types of dates (next, prev)
foreach $from (keys %{ $Lang{$L}{"Repl"} }) {
    $to=$Lang{$L}{"Repl"}{$from};
    s/(^|[^a-z])$from($|[^a-z])/$1$to$2/i;
}

As you correctly observed, this is one place that uses the character class [^a-z] to delimit tokens. Because "ä" is not in a-z, the "M" of "Mär" looks like a standalone token "m" and gets replaced. The net effect is that "Mär" ends up as "Monatär" at this stage, which then cannot be parsed properly any further...
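A minimal sketch of the effect, using a stand-alone copy of the substitution rather than Date::Manip itself. The %repl hash and the sample date string are assumptions made for illustration, and the text is assumed to be a Latin-1 byte string (hence "\xe4" for "ä").

```perl
use strict;
use warnings;

# Hypothetical stand-in for $Lang{$L}{"Repl"} in the German definition.
my %repl = ( m => 'Monat' );

# Sample input as a Latin-1 byte string; "\xe4" is "ä".
my $date = "1. M\xe4r 2008";

my $result = $date;
for my $from (keys %repl) {
    my $to = $repl{$from};
    # Same delimiters as the original code: anything outside a-z ends a token.
    $result =~ s/(^|[^a-z])$from($|[^a-z])/$1$to$2/i;
}

# "ä" is outside a-z, so the "M" of "Mär" is treated as the token "m".
print $result eq "1. Monat\xe4r 2008" ? "mangled\n" : "intact\n";
```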

Substituting [^a-z\xe4] (for testing purposes) fixes the issue with "Mär", but a proper solution would of course have to dynamically construct the correct character set depending on the language being selected...
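One way such a dynamically built character set might look, as a rough sketch only. The @words list is a hypothetical excerpt and not how Date::Manip actually stores its language data; the idea is just to collect every non-ASCII letter the language's words use and add it to the delimiter class.

```perl
use strict;
use warnings;

# Hypothetical excerpt of a language's word lists (Latin-1 bytes).
my @words = ("Januar", "Februar", "M\xe4rz", "M\xe4r", "Dezember");

# Collect every character the words use that lies outside a-z/A-Z.
my %extra;
for my $word (@words) {
    $extra{$_} = 1 for grep { !/[a-zA-Z]/ } split //, $word;
}

# Emit them as \xNN escapes so the result is safe inside a bracketed class.
my $letters = 'a-z' . join '', map { sprintf '\\x%02x', ord($_) } sort keys %extra;

my $text = "1. M\xe4r 2008";
(my $result = $text) =~ s/(^|[^$letters])m($|[^$letters])/$1Monat$2/i;

print "class: [^$letters]\n";   # class: [^a-z\xe4]
print $result eq $text ? "intact\n" : "mangled\n";
```

Note that this only covers characters that actually occur in the lists, so an uppercase "Ä" would still be missed unless it is added as well.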

I'll submit a bug report.   (Update: done)

For the moment, I can live with just disabling that curious "m" => "Monat" mapping feature. In _Date_Init_German():

...
#$$d{"replace"} =["m","Monat"];
$$d{"replace"} =[];

Re^3: Date::Manip and German months names (solved)
by ikegami (Patriarch) on Jul 09, 2008 at 22:20 UTC

    a proper solution would of course have to dynamically construct the correct character set depending on the language being selected.

    A simpler solution might be

    foreach $from (keys %{ $Lang{$L}{"Repl"} }) {
        $to=$Lang{$L}{"Repl"}{$from};
        utf8::upgrade($from); # Use Unicode semantics for \b
        s/\b$from\b/$to/i;
    }

    He's already assuming $from doesn't contain symbols, since he's not using quotemeta, so using \b doesn't introduce any limitations.

    My solution will also make "MÄR" work, unlike the current implementation and your proposed solution.

    Update: Shoot! \w includes digits, so \b won't do. There's a POSIX class that includes just letters that does the trick:

    utf8::upgrade($from); # Use Unicode semantics
    s/(^|[^[:alpha:]])$from($|[^[:alpha:]])/$1$to$2/i;
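    To illustrate why \b falls short, here is a small sketch with a hypothetical input in which a digit sits directly next to the token. Whether Date::Manip accepts this exact form is an assumption; the point is only how the two kinds of delimiter differ.

```perl
use strict;
use warnings;

my ($from, $to) = ('m', 'Monat');   # the replacement pair from the thread
my $input = "in 3m";                # hypothetical input: digit directly before the token

# \b needs a \w/\W transition, and both "3" and "m" are \w, so nothing matches.
(my $with_b = $input) =~ s/\b$from\b/$to/i;

# A digit is not [:alpha:], so it counts as a delimiter here.
(my $with_alpha = $input) =~ s/(^|[^[:alpha:]])$from($|[^[:alpha:]])/$1$to$2/i;

print "$with_b\n";       # in 3m
print "$with_alpha\n";   # in 3Monat
```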

    Update: As discovered below, what needs to be upgraded is the string on which s/// acts.

    utf8::upgrade($_); # Use Unicode semantics
    s/(^|[^[:alpha:]])$from($|[^[:alpha:]])/$1$to$2/i;
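    A small sketch of why it is the target string that matters. The replace_token helper and the sample text are hypothetical, and the script assumes a perl where neither the unicode_strings feature nor the /u modifier is in effect, so that the internal encoding of the target decides which rules [:alpha:] follows.

```perl
use strict;
use warnings;

my ($from, $to) = ('m', 'Monat');

# Hypothetical helper: apply the substitution to a copy of the text.
sub replace_token {
    my ($text) = @_;
    $text =~ s/(^|[^[:alpha:]])$from($|[^[:alpha:]])/$1$to$2/i;
    return $text;
}

my $bytes = "1. M\xe4r 2008";
utf8::downgrade($bytes);        # byte semantics: "\xe4" is not a letter

my $chars = $bytes;
utf8::upgrade($chars);          # Unicode semantics: "\xe4" is a letter

print replace_token($bytes) eq $bytes ? "intact\n" : "mangled\n";   # mangled
print replace_token($chars) eq $chars ? "intact\n" : "mangled\n";   # intact
```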

      Yes, that looks like a good (simple) solution.  Interestingly though

      s/(^|[^[:alpha:]])$from($|[^[:alpha:]])/$1$to$2/i;

      only works for me when I use locale (which I may not necessarily want to do in this case), while

      s/(^|[^\p{IsAlpha}])$from($|[^\p{IsAlpha}])/$1$to$2/i;

      does work without...
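      A sketch of that difference on a plain byte string, with no locale and no upgrade. The sample text is an assumption, and the behaviour shown for [:alpha:] assumes the unicode_strings feature is not enabled.

```perl
use strict;
use warnings;

my ($from, $to) = ('m', 'Monat');
my $text = "1. M\xe4r 2008";
utf8::downgrade($text);    # make sure we start with a plain byte string

# POSIX class on a byte string: "\xe4" is not seen as a letter.
(my $posix = $text) =~ s/(^|[^[:alpha:]])$from($|[^[:alpha:]])/$1$to$2/i;

# A Unicode property switches the match to Unicode rules on its own.
(my $prop = $text) =~ s/(^|[^\p{IsAlpha}])$from($|[^\p{IsAlpha}])/$1$to$2/i;

print $posix eq $text ? "intact\n" : "mangled\n";   # mangled
print $prop  eq $text ? "intact\n" : "mangled\n";   # intact
```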

      My solution will also make "MÄR" work

      ...presuming other changes will be made as well — i.e. adding another list of month abbreviations to the definition of $$d{"month_abb"}=...

        Sounds like you forgot to use utf8::upgrade($from);.

        only works when I use locale

        No, using Unicode semantics is enough.

        presuming other changes will be made as well

        No, using Unicode semantics is enough.

        use HTML::Entities qw( decode_entities );
        use locale qw();

        my $lc = decode_entities('&auml;');
        my $uc = decode_entities('&Auml;');
        utf8::downgrade($uc);

        # Run the same three checks under byte, locale and Unicode semantics.
        for (0..2) {
            if ($_ == 0) {
                utf8::downgrade($lc);
                locale->unimport();
                print("Byte Semantics\n");
                print("--------------\n");
            }
            elsif ($_ == 1) {
                utf8::downgrade($lc);
                locale->import();
                print("Locale Semantics\n");
                print("----------------\n");
            }
            elsif ($_ == 2) {
                utf8::upgrade($lc);
                locale->unimport();
                print("Unicode Semantics\n");
                print("-----------------\n");
            }

            # Does lowercase "ä" match uppercase "Ä"?
            if ($lc =~ /^\Q$uc\E\z/) {
                print("case sensitive match\n");
            } elsif ($lc =~ /^\Q$uc\E\z/i) {
                print("case insensitive match\n");
            } else {
                print("no match\n");
            }

            # Is "ä" a letter according to the POSIX class?
            if ($lc =~ /^[[:alpha:]]\z/) {
                print("[:alpha:]\n");
            } else {
                print("Not [:alpha:]\n");
            }

            # Is "ä" a letter according to the Unicode property?
            if ($lc =~ /^[\p{IsAlpha}]\z/) {
                print("\\p{IsAlpha}\n");
            } else {
                print("Not \\p{IsAlpha}\n");
            }

            print("\n");
        }
        Byte Semantics
        --------------
        no match
        Not [:alpha:]
        \p{IsAlpha}

        Locale Semantics
        ----------------
        no match
        Not [:alpha:]
        \p{IsAlpha}

        Unicode Semantics
        -----------------
        case insensitive match
        [:alpha:]
        \p{IsAlpha}