j.goor has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I am trying to walk a directory tree to re-tag my mp3 collection.

I have a file (mp3.csv) in which each line has the form path;metadata. The script processes this file and writes the metadata into the file at the corresponding path.

However, when my Perl script goes through the file, it mostly fails to find the files and directories as soon as their names contain non-ASCII (UTF-8) characters.

Example: a line in the mp3.csv file reads:
/media/usbdisk/music/checkit/Crosby, Stills, Nash & Young/Déjà Vu/01 - Carry On.mp3;Crosby, Stills, Nash & Young;Déjà Vu;Carry On;1;Rock;1970

When I run the script, it says it cannot find the file and displays:

/media/usbdisk/music/checkit/Crosby, Stills, Nash & Young/D▒j▒ Vu/10 - Everybody I Love You.mp3;Crosby, Stills, Nash & Young;D?j?▒ Vu;Everybody I Love You;10;Rock;1970

As you can see, the characters have been messed up, even after I opened the file and manually entered the right characters (or did a copy-paste from the correct directory and file name).

How do I get the script to work properly?

Please monks - help me!

Regards,
John
Ubuntu Hardy, ext3 filesystem, Perl 5.8.8

Re: encoding problem on Ubuntu Linux
by moritz (Cardinal) on Jun 20, 2008 at 08:08 UTC
    Perl generally handles UTF-8 and Unicode very well, but there's a limit: file names. Linux has no encoding-aware API for file name operations (to the kernel, a file name is just a sequence of bytes), so it's not really Perl's fault.

    That being said, the normal approach is to decode the data from the outside world into text strings, work with it, and encode it back to byte strings before you print it or perform operations on the file system.
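
    The decode / work / encode cycle described above could look roughly like the sketch below. It assumes the CSV data is UTF-8; the sample line and the variable names are hypothetical, and the bytes are written as \x escapes so the script itself stays plain ASCII.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical input: one raw line of bytes as it might be read from mp3.csv
# (assumed to be UTF-8).
my $raw_line = "/music/D\xC3\xA9j\xC3\xA0 Vu/01 - Carry On.mp3;"
             . "Crosby, Stills, Nash & Young;D\xC3\xA9j\xC3\xA0 Vu";

# 1. Decode: bytes from the outside world become a string of characters.
my $line = decode('UTF-8', $raw_line);

# 2. Work with text: split and length now operate on characters.
my ($path, $artist, $album) = split /;/, $line;
printf "album title has %d characters\n", length $album;

# 3. Encode: turn the path back into bytes before touching the file system.
my $path_bytes = encode('UTF-8', $path);
my $status = (-e $path_bytes) ? 'found' : 'not found';
print "$status\n";
```

    The same rule applies to anything printed: either encode explicitly, or put an encoding layer on the output handle with binmode.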

    However, if file names and input data have the same encoding, everything (except some string operations like substr and regex matches) should work just fine. That suggests that some of your data or file names use an encoding other than the system default of UTF-8.
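
    To illustrate the caveat about substr and regex matches, here is a small sketch comparing the same name as undecoded UTF-8 bytes and as decoded text. The variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw( decode );

my $bytes = "D\xC3\xA9j\xC3\xA0 Vu";    # "Deja Vu" with accents, as UTF-8 bytes
my $text  = decode('UTF-8', $bytes);    # the same name as characters

printf "byte string length: %d\n", length $bytes;   # 9 bytes
printf "text string length: %d\n", length $text;    # 7 characters

# substr on the byte string cuts the two-byte sequence for the accented e in half.
my $cut = substr $bytes, 0, 2;          # "D" followed by a lone \xC3

# "." matches one byte in the byte string but one character in the text string.
my $bytes_match = ($bytes =~ /^D.j/) ? 1 : 0;
my $text_match  = ($text  =~ /^D.j/) ? 1 : 0;
printf "regex matches bytes: %d, text: %d\n", $bytes_match, $text_match;
```

    Comparing two byte strings with eq, or handing them to open and opendir, is unaffected, which is why same-encoding data mostly works without decoding.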

    There's a lot to say about it, and I already said much here. There's also perluniintro, the excellent Encode module (it's a core module), and perlunicode.

Re: encoding problem on Ubuntu Linux
by ikegami (Patriarch) on Jun 20, 2008 at 07:55 UTC

    Could you do two things for me?

    • Could you use Devel::Peek on the file name coming out of the CSV? I'd like to find out how your string is encoded and whether it's a string of characters.

      use Devel::Peek;
      Dump($filename);
    • Could you use Devel::Peek on the file name from the disk? I'd like to find out how it's encoded on disk.

      use strict;
      use warnings;
      use Devel::Peek qw( Dump );

      my $dir = '/media/usbdisk/music/checkit/Crosby, Stills, Nash & Young';
      opendir(my $dh, $dir) or die "Cannot open $dir: $!";
      while (defined(my $fn = readdir($dh))) {
          # Match against $fn (not $_), ignoring case so that "... Vu" is dumped
          Dump($fn) if $fn =~ / vu$/i;
      }
      closedir($dh);

      Here are the answers.

      Ad 1:
      SV = PV(0x8154064) at 0x8153594
      REFCNT = 1
      FLAGS = (PADBUSY,PADTMP,POK,READONLY,pPOK)
      PV = 0x81757e8 "mp3_v4.csv"\0
      CUR = 10
      LEN = 12


      Ad 2:
      SV = PV(0x8153ae8) at 0x81535b8
      REFCNT = 1
      FLAGS = (PADBUSY,PADMY,POK,pPOK)
      PV = 0x817dad8 "D\303\251j\303\240 Vu"\0
      CUR = 9
      LEN = 12
      SV = PV(0x8153ae8) at 0x81535b8
      REFCNT = 1
      FLAGS = (PADBUSY,PADMY,POK,pPOK)
      PV = 0x817db88 ".."\0
      CUR = 2
      LEN = 4
      SV = PV(0x8153ae8) at 0x81535b8
      REFCNT = 1
      FLAGS = (PADBUSY,PADMY,POK,pPOK)
      PV = 0x817dad8 "."\0
      CUR = 1
      LEN = 4
        Ad 1: ... PV = 0x81757e8 "mp3_v4.csv"\0

        I think the file name in the CSV file (the one with the "Déjà") would be more interesting :) That is what ikegami meant.

        Ad 2: ... PV = 0x817dad8 "D\303\251j\303\240 Vu"\0

        This verifies that the filesystem encoding of that file is UTF-8. So, presumably, the file name in the CSV file isn't in UTF-8.
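
        If the CSV does turn out to be in a different encoding, one possible workaround is to normalise each file name to UTF-8 bytes before using it. The sketch below assumes the only alternative is ISO-8859-1 (Latin-1), which is a guess and not something the dumps above confirm; to_utf8_bytes is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw( decode encode FB_CROAK );

# Hypothetical helper: return a file name as UTF-8 bytes.
# Assumption: input is either valid UTF-8 or ISO-8859-1.
sub to_utf8_bytes {
    my ($bytes) = @_;

    # Strict decode of a copy, because decode() with a CHECK value
    # may modify the string it is given.
    my $copy = $bytes;
    my $text = eval { decode('UTF-8', $copy, FB_CROAK) };
    return $bytes if defined $text;     # already valid UTF-8, leave as is

    # Not valid UTF-8: treat as Latin-1 and re-encode.
    return encode('UTF-8', decode('ISO-8859-1', $bytes));
}

my $from_latin1 = to_utf8_bytes("D\xE9j\xE0 Vu");             # Latin-1 input
my $from_utf8   = to_utf8_bytes("D\xC3\xA9j\xC3\xA0 Vu");     # UTF-8 input
print $from_latin1 eq $from_utf8 ? "same bytes\n" : "different bytes\n";
```

        Both calls should yield the byte sequence that readdir returned in the dump, so the result can be passed straight to open or -e. Running Devel::Peek on the CSV file name first would still be the better way to find out what the encoding really is.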

Re: encoding problem on Ubuntu Linux
by zentara (Cardinal) on Jun 20, 2008 at 13:21 UTC
    See utf8 filenames. I ran into a similar problem a while ago, and graff gave me this sub to fix the problem:

    #this decode utf8 routine is used so filenames with extended
    # ascii characters (unicode) in filenames, will work properly
    use Encode;

    opendir my $dh, $path or warn "Error: $!";
    my @files = grep !/^\.\.?$/, readdir $dh;
    closedir $dh;

    # @files = map{ "$path/".$_ } sort @files;
    #$_ = decode( 'utf8', $_ ) for ( @files );
    @files = map { decode( 'utf8', "$path/".$_ ) } sort @files;

    I'm not really a human, but I play one on earth CandyGram for Mongo