almut has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to parse and reformat date strings which look like this on input:

11 Mär 08 22:54 CET

The formats may vary slightly, but they all have in common that the month names are in German. Date::Manip supports parsing dates in several languages, and although this generally works fine, there seems to be a problem with the month "Mär" (March).

Here's some demo code:

use Date::Manip;

Date_Init("Language=German", "DateFormat=non-US", "TZ=UTC");

for my $date_in (
    "11 Jan 08 22:54 CET",
    "11 Mär 08 22:54 CET",
    "11 Mai 08 22:54 CEST",
    "11 Okt 08 22:54 CEST",
    "11 Dez 08 22:54 CET",
) {
    my $date_out = UnixDate( ParseDate($date_in), "%Y-%m-%d_%H:%M:%S" );
    printf "%-20s --> %s\n", $date_in, $date_out;
}

The output:

11 Jan 08 22:54 CET --> 2008-01-11_21:54:00
11 Mär 08 22:54 CET -->
11 Mai 08 22:54 CEST --> 2008-05-11_20:54:00
11 Okt 08 22:54 CEST --> 2008-10-11_20:54:00
11 Dez 08 22:54 CET --> 2008-12-11_21:54:00

The expected output for 11 Mär 08 22:54 CET would be 2008-03-11_21:54:00

Suspecting some encoding issue, I also tried to supply the input as Unicode (UTF-8) instead of ISO Latin-1. Same result. (Taking a look at the source suggests the module isn't using Unicode. But just in case.)

So what am I doing wrong?

(I'm using Date::Manip v5.54, as currently available from CPAN.)

Re: Date::Manip and German months names
by moritz (Cardinal) on Jul 09, 2008 at 20:16 UTC
    IMHO this is a bug in Date::Manip. Passing a decoded text string to a module that works on text should always work. Sadly it doesn't in this case, and there are quite a few other modules on CPAN that suffer from not being Unicode-aware (GD::Graph or GD::Text, for example).

      You'll find that most XS modules don't handle Unicode properly. They tend to work with the internal representation of the string without heeding the flag that tells them which of the two internal formats is in use.

      That's not as big a problem for pure-Perl code, since Perl automatically converts the internal format when applying an operation to strings with different internal formats (such as when concatenating them).
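To make the point about the two internal formats concrete, here is a minimal sketch using only core Perl. The variable names are made up for illustration; `utf8::upgrade` and `utf8::is_utf8` are built in and need no `use utf8`.

```perl
use strict;
use warnings;

my $native   = "M\xe4r";      # stored in the native 8-bit internal format
my $upgraded = "M\xe4r";
utf8::upgrade($upgraded);     # same characters, UTF-8 internal format

# eq compares characters, not internal bytes, so the two strings are equal.
my $same_text = ( $native eq $upgraded ) ? 1 : 0;

# Only the internal flag differs between the two copies.
my $flags_differ = ( utf8::is_utf8($upgraded) && !utf8::is_utf8($native) ) ? 1 : 0;

# Concatenating with a wide character converts the native string as needed.
my $joined     = $native . "\x{263a}";
my $joined_len = length $joined;   # counts characters: M, a-umlaut, r, smiley
```

Pure-Perl operations like `eq` and `.` hide the internal format, which is why this kind of bug shows up mostly in XS code or, as in this thread, in hand-written character classes.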

Re: Date::Manip and German months names
by jethro (Monsignor) on Jul 09, 2008 at 19:59 UTC
    I searched the date::manip source for [a-z] and found a lot of hits.

    EDIT: Ignore that. I read more of the code, and it does know about international characters. Still, it is possible that a regex somewhere is faulty.

      I searched the date::manip source for [a-z] and found a lot of hits.

      Good point. I did some more digging myself, and it appears to be a bug in Date::Manip. There is some replacement magic going on to allow the use of "m" in place of "Monat"...

      # Check for some special types of dates (next, prev)
      foreach $from (keys %{ $Lang{$L}{"Repl"} }) {
        $to=$Lang{$L}{"Repl"}{$from};
        s/(^|[^a-z])$from($|[^a-z])/$1$to$2/i;
      }

      As you correctly observed, this is (one place) using the charset [^a-z] to delimit tokens. The net effect of this is that "Mär" ends up as "Monatär" at this stage, which then cannot be parsed properly any further...
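The effect can be reproduced outside Date::Manip. The following is a minimal sketch, not the module's code: `apply_repl` is a hypothetical helper, and `%repl` stands in for `$Lang{$L}{"Repl"}` with only the "m" => "Monat" entry modelled.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the substitution loop quoted above.
sub apply_repl {
    my ($text) = @_;
    my %repl = ( m => "Monat" );   # assumed stand-in for $Lang{$L}{"Repl"}
    for my $from ( keys %repl ) {
        my $to = $repl{$from};
        $text =~ s/(^|[^a-z])$from($|[^a-z])/$1$to$2/i;
    }
    return $text;
}

# "\xe4" (Latin-1 a-umlaut) is not in [a-z], so it is treated as a delimiter
# and the lone-looking "M" in front of it gets replaced.
my $mangled = apply_repl("11 M\xe4r 08");   # "11 Monat\xe4r 08"

# In "Mai" the "M" is followed by "a", which is in [a-z], so nothing happens.
my $intact = apply_repl("11 Mai 08");       # "11 Mai 08"
```

This also explains why only "Mär" fails in the demo: it is the only abbreviation in which "M" is directly followed by a non-ASCII letter.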

      Substituting [^a-z\xe4] (for testing purposes) fixes the issue with "Mär", but a proper solution would of course have to dynamically construct the correct character set depending on the language being selected...
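One way such a dynamically constructed character set could look is sketched below. This is only an illustration under assumptions: `letter_class` is a hypothetical helper, and it assumes the language's month names are available as a plain list of Latin-1 strings.

```perl
use strict;
use warnings;

# Hypothetical helper: return the inside of a character class consisting of
# a-z plus every non-ASCII character found in the given words.
sub letter_class {
    my (@words) = @_;
    my %extra;
    for my $word (@words) {
        $extra{$_} = 1 for grep { ord($_) > 0x7f } split //, $word;
    }
    # Emit \x{..} escapes so the class can be interpolated into a regex safely.
    return 'a-z' . join '', map { sprintf '\\x{%x}', ord($_) } sort keys %extra;
}

my $class = letter_class( "J\xe4n", "Feb", "M\xe4r" );

my $text = "11 M\xe4r 08";
( my $out = $text ) =~ s/(^|[^${class}])m($|[^${class}])/$1Monat$2/i;

my $plain = "in 2 m";
( my $expanded = $plain ) =~ s/(^|[^${class}])m($|[^${class}])/$1Monat$2/i;
```

With the German abbreviations, `$class` ends up as `a-z\x{e4}`, so "Mär" is left alone while a standalone "m" is still expanded to "Monat".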

      I'll submit a bug report.   (Update: done)

      For the moment, I can live with just disabling that curious "m" => "Monat" mapping feature as follows, in _Date_Init_German():

      ...
      #$$d{"replace"} =["m","Monat"];
      $$d{"replace"} =[];

        a proper solution would of course have to dynamically construct the correct character set depending on the language being selected.

        A simpler solution might be

        foreach $from (keys %{ $Lang{$L}{"Repl"} }) {
          $to=$Lang{$L}{"Repl"}{$from};
          utf8::upgrade($from);  # Use Unicode semantics for \b
          s/\b$from\b/$to/i;
        }

        He's already assuming $from doesn't contain symbols, since he's not using quotemeta, so using \b doesn't introduce any limitations.

        My solution will also make "MÄR" work, unlike the current implementation and your proposed solution.

        Update: Shoot! \w includes digits, so \b won't do. There's a POSIX class that includes just letters that does the trick:

        utf8::upgrade($from);  # Use Unicode semantics
        s/(^|[^[:alpha:]])$from($|[^[:alpha:]])/$1$to$2/i;

        Update: As discovered below, what needs to be upgraded is the string on which s/// acts.

        utf8::upgrade($_);  # Use Unicode semantics
        s/(^|[^[:alpha:]])$from($|[^[:alpha:]])/$1$to$2/i;
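The difference that upgrading the target string makes can be seen in a small standalone sketch. `replace_m` is a hypothetical helper, and the behaviour shown assumes the script does not enable `use locale`, the `unicode_strings` feature, or the /u modifier.

```perl
use strict;
use warnings;

# Hypothetical helper: run the [:alpha:]-delimited substitution on a copy of
# $text, optionally upgrading that copy first.
# Assumes no 'use locale' and no 'unicode_strings' feature in effect.
sub replace_m {
    my ( $text, $upgrade ) = @_;
    utf8::upgrade($text) if $upgrade;   # target string now gets Unicode semantics
    $text =~ s/(^|[^[:alpha:]])m($|[^[:alpha:]])/$1Monat$2/i;
    return $text;
}

# Native 8-bit string: "\xe4" is not considered alphabetic, so "M" is replaced.
my $from_native = replace_m( "11 M\xe4r 08", 0 );

# Upgraded string: "\xe4" is alphabetic, so "M\xe4r" is left alone.
my $from_upgraded = replace_m( "11 M\xe4r 08", 1 );
```

Upgrading `$from` alone does not help because it is the string being matched against, not the interpolated pattern fragment, that decides whether `[:alpha:]` covers the high-bit characters here.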
Re: Date::Manip and German months names
by EvanCarroll (Chaplain) on Jul 09, 2008 at 19:12 UTC
    $$d{"month_abb"}=
      [["Jan","Feb","Mar","Apr","Mai","Jun",
        "Jul","Aug","Sep","Okt","Nov","Dez"],
       ["J${a}n","Feb","M${a}r","Apr","Mai","Jun",
        "Jul","Aug","Sep","Okt","Nov","Dez"]];
    Update: Make sure your $a is \xe4:
    $$hash{"a:"} = "\xe4"; # LATIN SMALL LETTER A WITH DIAERESIS


    Evan Carroll
    I hack for the ladies.
    www.EvanCarroll.com
      The source wants the abbreviation Mar

      Not sure. As I read the source, it should (or is trying to) also support "Mär", as set up by the "M${a}r" entry. $a is defined as

      my(%h)=();
      _Char_8Bit(\%h);
      my($a)=$h{"a:"};

      with the routine _Char_8Bit() mapping "a:" to the char \xe4

      sub _Char_8Bit {
        my($hash)=@_;
        ...
        $$hash{"a:"} = "\xe4";  # LATIN SMALL LETTER A WITH DIAERESIS
        ...
      }

      In other words, "M${a}r" corresponds to ISO Latin "Mär". So it should work, I think.

        I think he is just not encoding properly. Try using Encode::encode_utf8:
        perl -MEncode -Mutf8 -E'say 1 if encode_utf8("ä") eq encode_utf8("\xe4");'
        perl -MEncode -Mutf8 -E'say 1 if "ä" eq encode_utf8("\xe4");'
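For reference, the distinction those one-liners rely on (a character versus its UTF-8 encoded bytes) can be spelled out with the core Encode module. This is just an illustrative sketch; the variable names are made up.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

my $char  = "\xe4";               # one character: LATIN SMALL LETTER A WITH DIAERESIS
my $bytes = encode_utf8($char);   # its UTF-8 encoding: the two bytes 0xC3 0xA4

my $char_len = length $char;      # 1
my $byte_len = length $bytes;     # 2

# The encoded form is a different string from the character itself,
# but decoding it gives the original character back.
my $differs   = ( $char ne $bytes ) ? 1 : 0;
my $roundtrip = ( decode_utf8($bytes) eq $char ) ? 1 : 0;
```

A module whose tables contain "\xe4" will therefore only match input that carries the same single character, not the two-byte encoded form.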


      I tried using \xE4.
      I tried byte 0xE4 (iso-latin-1).
      I tried bytes 0xC3 0xA4 (utf-8) with "use utf8;".
      I tried bytes 0xC3 0xA4 (utf-8) without "use utf8;" (shouldn't work).
      I tried utf8::upgrade($date_in).
      I tried utf8::downgrade($date_in).

      None worked.

      I'd contact the author. It's obvious that he intends one of those to work.

      Update: Added last two, but they are redundant with earlier tests in this case.

Re: Date::Manip and German months names
by EvanCarroll (Chaplain) on Jul 10, 2008 at 19:05 UTC
    Check out my new patch for Date::Manip.
    It has three large components:
    * YAML translations, making Manip.pm thousands of lines shorter
    * Failed test for bug 37573
    * New M:I build system for 35728
    
    I didn't include a solution for this problem. I left the implementation up to the author and just submitted the failing test.
