Parsing a Latin-1 Charset Data File

sumeetgrover has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks -

I am an experienced Perl programmer, but I will soon be working with Latin-1 (French) data files for the first time, so I am gathering some much-needed wisdom.

So I already have a script, which does the following:

1) Takes a list of regular expressions (REs) from a config file, and

2) Runs each RE against the content of a data file.

So far, the data files in Point 2 have been ASCII/English-language files, and I have never worked with non-English files before. My questions are:
1) Is it right to think that I need to make no changes to my parser, and that Perl will process Latin-1/French data like any other data?
2) Or do I need to do something special to 'enable' my parser to run French REs against French data files?

I would appreciate your advice.

Re: Parsing a Latin-1 Charset Data File by hippo (Archbishop) on Sep 08, 2017 at 09:11 UTC
You should not need to do anything special to process Latin-1 data - it has historically been the default anyway. However, this does not mean that your code will work as you intend. That would depend entirely on your regexen and the parser rules and implementation. If you don't have a test suite for the code you have written, now would be an excellent time to put one together. You can then throw datasets at it in ASCII and Latin-1 and confirm that the code produces the output you expect.
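A test along the lines suggested above could be sketched as follows. This is a minimal illustration using the core Test::More and Encode modules; the pattern and the sample data are hypothetical stand-ins for one RE from the config file and one line from a data file.

```perl
use strict;
use warnings;
use Test::More;
use Encode qw(decode);

# Hypothetical pattern standing in for one RE from the config file.
my $re = qr/caf\x{E9}/;

# Sample input as raw bytes: plain ASCII, and Latin-1 (0xE9 is "é").
my $ascii_bytes  = "un cafe noir";
my $latin1_bytes = "un caf\xE9 noir";

# Decode the bytes into character strings before matching.
my $ascii_text  = decode('iso-8859-1', $ascii_bytes);
my $latin1_text = decode('iso-8859-1', $latin1_bytes);

unlike($ascii_text,  $re, 'unaccented ASCII data does not match');
like($latin1_text,   $re, 'decoded Latin-1 data matches');

done_testing();
```

Each RE from the config file would get its own pair of expected matches and non-matches, so a change in encoding handling shows up as a failing test rather than as silently wrong output.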
Re: Parsing a Latin-1 Charset Data File by haukex (Archbishop) on Sep 08, 2017 at 14:58 UTC
When working with the "Latin" encodings it is easy to confuse things. The very common character encodings latin1, latin9, and CP-1252 have a lot in common, but also a few significant differences (like the € symbol). The same is true for some French characters. Also, the three aforementioned encodings share their lower 128 byte values with both UTF-8 and ASCII, and since your files have only been in ASCII so far, there is a potential for confusion there. And one shouldn't confuse the common name of the character encoding "Latin-1" (ISO/IEC 8859-1) with the Unicode Latin-1 Supplement.

So my first piece of advice is to be certain what your files are encoded with. If you're using a text editor, keep an eye on which encoding it uses, since it's easy to open a latin1 file, choose "Save As", and have the editor default to a different character encoding like UTF-8 or CP-1252 (sometimes labeled just "ANSI"). In addition, because of the similarities in the character sets, the editor can easily misidentify which encoding the file had in the first place!

IMO the critical part when working with different encodings is the decoding of bytes to characters and the encoding of characters to bytes. That means paying attention when reading and writing files, when displaying things on the terminal, and, if you're working with a website and/or database with questionable Unicode support, at those interfaces. If you get that part right and get your character data properly decoded into a Perl string, then you've won a major battle.

So despite what hippo wrote about Latin-1 being the default, personally I would still suggest you explicitly specify the encoding when opening the files, i.e. `open my $fh, '<:encoding(latin1)', $filename or die $!;`.
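How the same byte can mean different things in latin1, latin9, and CP-1252 can be sketched as follows. This is a minimal illustration, assuming the standard Encode module; the three sample bytes are chosen for this sketch and are not taken from the original post.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Sample bytes (chosen for illustration) whose meaning depends on the encoding.
my @bytes     = ("\x80", "\xA4", "\xBD");
my @encodings = qw(iso-8859-1 iso-8859-15 cp1252);

my %codepoint;    # $codepoint{encoding}{byte value} = decoded code point
for my $byte (@bytes) {
    for my $enc (@encodings) {
        my $char = decode($enc, $byte);
        $codepoint{$enc}{ ord $byte } = ord $char;
        printf "byte 0x%02X as %-11s -> U+%04X\n", ord($byte), $enc, ord($char);
    }
}
```

For instance, byte 0xA4 decodes to U+00A4 (the generic currency sign) as latin1 but to U+20AC (the euro sign) as latin9, and byte 0xBD is U+00BD (one half) in latin1 but U+0153 (the French "œ") in latin9. A file decoded with the wrong one of these encodings will therefore mostly look right and be wrong only in a few characters.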
(Note that even though I used the Encode module in the code above for demonstration purposes, if you're just reading files and they are properly encoded, you should never have to touch it, and if you do you might be doing something wrong.)

Once you've got your character data correctly into a Perl string, you have to worry less about it - Perl mostly tries to make its internal encoding transparent, and tries its best to let you think about the string as a sequence of Unicode characters (codepoints). Perl's Unicode handling is very good, including in regexes. I would recommend using a recent version of Perl though, since there have been continuous improvements made to Unicode handling. (See also the Perl Unicode Tutorial.)

This is also the answer to your two questions: If you open the file with the right encoding, then no other changes should be necessary to your code. One exception might be if you have used explicit character ranges like `[a-zA-Z0-9_]` instead of `\w` - the latter should automatically work with Unicode.

If you want to write Unicode characters directly in the Perl source code, add `use utf8;` and save the file with the UTF-8 encoding in your editor (just stay away from the functions provided in the utf8 module unless you really know what you are doing). Although personally, I tend to write my Perl source in ASCII and use the `\x{....}` and `\N{...}` sequences (for the latter see charnames).

If you want to print Unicode strings to your terminal and it supports UTF-8, you can use `use open qw/:std :utf8/;`, although be careful with that pragma because it changes the default encoding for opening files (which is another reason for my above suggestion to always be explicit about specifying the encoding).

See also The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
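The points above (decode on input, `\w` versus an explicit ASCII range, escapes in ASCII-only source, encode on output) can be put together in one minimal sketch. The sample text and the temporary file are hypothetical and exist only to keep the example self-contained; it also assumes the terminal accepts UTF-8.

```perl
use strict;
use warnings;
use charnames ':full';            # enables \N{...} names on older Perls too
use File::Temp qw(tempfile);

# Write a small Latin-1 sample file (hypothetical data).
my ($out, $filename) = tempfile(UNLINK => 1);
binmode $out;                                   # raw bytes, no layers
print {$out} "cr\xE8me br\xFBl\xE9e\n";         # "crème brûlée" as Latin-1 bytes
close $out or die "close failed: $!";

# Decode on input by naming the encoding explicitly.
open my $fh, '<:encoding(latin1)', $filename or die "Cannot open $filename: $!";
chomp(my $line = <$fh>);
close $fh;

# \w follows Unicode rules on the decoded string; the ASCII range does not.
my @unicode_words = $line =~ /(\w+)/g;             # whole words
my @ascii_runs    = $line =~ /([a-zA-Z0-9_]+)/g;   # broken at accented letters

# Non-ASCII characters written in ASCII-only source.
my $has_e_grave = $line =~ /\x{E8}/;
my $has_e_acute = $line =~ /\N{LATIN SMALL LETTER E WITH ACUTE}/;

# Encode on output, assuming a UTF-8 terminal.
binmode STDOUT, ':encoding(UTF-8)';
print "words: @unicode_words\n";
print "ASCII runs: @ascii_runs\n";
```

Here `\w+` yields the two words intact, while the explicit range splits them at every accented letter, which is the kind of RE in a config file that would need revisiting.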
Re^2: Parsing a Latin-1 Charset Data File by Laurent_R (Canon) on Sep 08, 2017 at 20:16 UTC
Great post! Thank you very much, ++haukex, for this. I'll definitely spend some time testing the leads you provide. Thanks again.
Re: Parsing a Latin-1 Charset Data File by sumeetgrover (Monk) on Sep 11, 2017 at 08:03 UTC
Thanks a lot everyone for the very useful information you shared.