Re: Is there a way to make these two regex lines cleaner?

You could replace the first line with a `tr///cd`, which should be a bit faster. The second line is the usual way to trim whitespace from a string in Perl, so it's fine the way it is.

However, "ï»¿" is the Byte Order Mark (BOM) as it appears when the file is encoded in UTF-8 but was opened with the incorrect encoding. So instead of that first regex, you probably want to open the file with `open my $fh, '<:raw:encoding(UTF-8)', $filename or die "$filename: $!";`, and then do a `$line =~ s/\A\N{U+FEFF}//;` on the first line of the file. This has the major advantage that any other UTF-8 encoded characters in the file will be decoded correctly - meaning you won't get "strange characters", you'll get the correct Unicode characters, assuming no other encoding issues - and this really is the correct way to solve this issue. If you then still want to turn the text into ASCII-only, see e.g. Text::Unidecode.
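As a minimal sketch of the approach described above, using only core modules: the helper name `read_utf8_lines` and the demo file are hypothetical, and the trim regex is just the common idiom, assumed here to match the second line from the root node.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: decode a UTF-8 file, drop a leading BOM,
# and trim surrounding whitespace from every line.
sub read_utf8_lines {
    my ($filename) = @_;
    open my $fh, '<:raw:encoding(UTF-8)', $filename
        or die "$filename: $!";
    my @lines;
    while ( my $line = <$fh> ) {
        # A BOM can only appear at the very start of the file.
        $line =~ s/\A\N{U+FEFF}// if $. == 1;
        # Common trim idiom; this also removes the trailing newline.
        $line =~ s/^\s+|\s+$//g;
        push @lines, $line;
    }
    close $fh;
    return @lines;
}

# Demo: write raw bytes (UTF-8 BOM, then "café" as UTF-8) and read them back.
my ( $out, $path ) = tempfile( UNLINK => 1 );
binmode $out, ':raw';
print {$out} "\xEF\xBB\xBF  caf\xC3\xA9  \n second \n";
close $out;

my @lines = read_utf8_lines($path);
```

With the file decoded on input, the first element is the four-character string "café" with no BOM in front of it, so no separate "strip the garbage" regex is needed.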

Updated: A few edits for clarification. Also: If you have further issues with encoding, I have some brief advice on what to post to get the best answers here.

Re^2: Is there a way to make these two regex lines cleaner? by bartender1382 (Beadle) on Apr 16, 2022 at 19:17 UTC
Awesome catch! I am using the Spreadsheet::Read module, which is called with `my $book = ReadData ("$upload_dir/$filename");`. It will allow me to open the file, read it into a buffer myself, then hand the buffer off to ReadData. Sadly that's failing, see below, and I will have to debug more. Again, awesome catch! Glad I included the garbage. `open my $fh, '<:raw:encoding(UTF-8)', "$upload_dir/$filename"; read $fh, my $string, -s $fh; close $fh; my $book = ReadData ($string);`
Re^3: Is there a way to make these two regex lines cleaner? by haukex (Archbishop) on Apr 16, 2022 at 19:30 UTC
"I am using the Spreadsheet::Read module." That's an important piece of information missing from the root node! I am guessing that your files are CSV files? Because opening any other file type (XLS, XLSX, etc.) with an `'<:raw:encoding(UTF-8)'` layer will likely corrupt those files, and `ReadData($filename)` should be preferred there. For CSV files, Spreadsheet::Read uses Text::CSV or Text::CSV_XS under the hood, both of which have a `detect_bom` option when used directly. Unfortunately, I currently don't see a way to get Spreadsheet::Read to apply that option, so unless Tux has any hints, you could use one of those two CSV modules directly. In any case, you may want to check your `$filename` to see if it's a CSV file first, before handing it off to the processing code appropriate for the file type. Update: Regarding `read $fh, my $string, -s $fh;`, the idiomatic way to slurp a file in Perl is `my $string = do { local $/; <$fh> };` (see $/). Other minor edits. And you need to check your open for errors, see "open" Best Practices.
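A rough sketch of the slurp idiom with a checked `open`, plus a filename check before choosing the processing path. Both helper names (`slurp_utf8`, `looks_like_csv`) are hypothetical, the extension test is only an assumption about how the uploaded files are named, and only core modules are used:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: slurp a UTF-8 text file into one decoded string.
sub slurp_utf8 {
    my ($path) = @_;
    open my $fh, '<:raw:encoding(UTF-8)', $path
        or die "Can't open $path: $!";
    # With $/ undefined, a single readline returns the whole file.
    my $string = do { local $/; <$fh> };
    close $fh;
    $string =~ s/\A\N{U+FEFF}//;    # drop a leading BOM, if present
    return $string;
}

# Hypothetical check: only treat files with a .csv extension as text.
sub looks_like_csv {
    my ($filename) = @_;
    return $filename =~ /\.csv\z/i ? 1 : 0;
}

# Demo: a small CSV written as raw bytes, with a UTF-8 BOM in front.
my ( $out, $path ) = tempfile( SUFFIX => '.csv', UNLINK => 1 );
binmode $out, ':raw';
print {$out} "\xEF\xBB\xBFname,city\nZo\xC3\xAB,K\xC3\xB6ln\n";
close $out;

my $text = looks_like_csv($path) ? slurp_utf8($path) : undef;
```

Binary formats such as XLS or XLSX would skip the decoding branch entirely and go to `ReadData($filename)` unchanged.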
Re^3: Is there a way to make these two regex lines cleaner? by swl (Prior) on Apr 16, 2022 at 23:59 UTC
You could also use File::BOM to open the file and then pass the file handle to Spreadsheet::Read. Untested: `use File::BOM qw( :all ); use Spreadsheet::Read; open_bom(my $fh, $file, ':utf8'); my $book = ReadData ($fh, parser => "csv");`