Hue-Bond has asked for the wisdom of the Perl Monks concerning the following question:

I've been using a program called par (offsite link) for quite some years now. It's a paragraph reformatter that I use mainly to justify text. When I changed my system configuration to use UTF-8 everywhere (keyboard input, terminal output and so on), par broke. Last weekend I searched for a Perl module to replace its functionality and, from the several results of my CPAN search, I picked one whose description sounded good: Text::Autoformat, by TheDamian. I'm encountering some problems with it, but first I'll show you the data (I replaced some characters with non-ASCII ones).

Original data

15:06. Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Fusce ligula. Curabitur blandit dui ut urna. Nullam vel eros. Mauris rhoncus sem. Dóńéc luctus velit eget quam. Mauris pellentesque. Vivamus quam. Mauris ságittis vulputate mauris. Nulla consequat est aliquam urna fringilla lacinia. Nunc auctor sagittis tortor.

Actual output

$ perl -MText::Autoformat -e' undef $/; print autoformat +<>, { right => 78, justify => "full", all => 1, lists => 0, }' < lorem-ipsum2
15:06. Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Fusce ligula. Curabitur blandit dui ut urna. Nullam vel eros. Mauris rhoncus sem. Dóńéc luctus velit eget quam. Mauris pellentesque. Vivamus quam. Mauris ságittis vulputate mauris. Nulla consequat est aliquam urna fringilla lacinia. Nunc auctor sagittis tortor.

Desired output

15:06. Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Fusce ligula. Curabitur blandit dui ut urna. Nullam vel eros. Mauris rhoncus sem. Dóńéc luctus velit eget quam. Mauris pellentesque. Vivamus quam. Mauris ságittis vulputate mauris. Nulla consequat est aliquam urna fringilla lacinia. Nunc auctor sagittis tortor.

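There are two issues here:

1. Lines containing multibyte (non-ASCII) characters are not justified to the same right margin as the rest.
2. Every line of the paragraph gets indented, whereas I want only the first line indented.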

Is there anything I can do to sort these out? I suspect the first problem can only be solved by modifying the code but maybe the second is easier.

Update: Tagged the par link as offsite.

--
David Serrano

Replies are listed 'Best First'.
Re: Text::Autoformat: usage and multibyte-encoded text
by duckyd (Hermit) on Jul 11, 2006 at 00:35 UTC
    Try adding
    binmode STDOUT, ':utf8';
    Without it, I get erroneous output, but the output *looks* correct to me with it. Note that your input file does not appear to be UTF-8, which may also be causing you some grief.
Re: Text::Autoformat: usage and multibyte-encoded text
by graff (Chancellor) on Jul 11, 2006 at 04:32 UTC
    If I store your sample text on my local drive, then convert that text file to utf8 (because what you have posted is iso-8859-1 or equivalent -- i.e. single-byte per accented character), I can solve one of your problems by adding "-CS" to the perl command line:
    perl -CS -MText::Autoformat -e'...
    That extra option tells perl to apply the utf8 layer to the standard handles STDIN, STDOUT and STDERR (the equivalent of doing binmode ..., ":utf8" on each of those file handles).
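A minimal sketch of why decoding the input matters for justification, assuming the formatter measures line width with something like length: on undecoded input every byte counts as one character, so a word with accented letters looks longer than it is. The helper names length_as_bytes and length_as_chars are made up for this illustration, and Encode's decode stands in here for what the utf8 layer does on an input handle.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Length of a string as Perl sees it when no layer has decoded the input:
# every byte counts as one "character".
sub length_as_bytes {
    my ($raw) = @_;
    return length($raw);
}

# Length after decoding the same bytes as UTF-8, roughly what a utf8
# layer on the input handle arranges before the program sees the data.
sub length_as_chars {
    my ($raw) = @_;
    return length(decode('UTF-8', $raw));
}

# The UTF-8 encoding of "Dóńéc" from the sample text, as explicit bytes.
my $raw = "D\xC3\xB3\xC5\x84\xC3\xA9c";

printf "as bytes: %d, as characters: %d\n",
    length_as_bytes($raw), length_as_chars($raw);
```

The three accented letters take two bytes each, so the undecoded word is three units longer than the decoded one, which is enough to throw off a right margin.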

    As for the left-margin problem (why is it indenting all the lines), if I delete the initial whitespace from the beginning of the sample data, the indentation goes away completely (including on the first line). That also happens if I add an explicit option for the left margin in the hash of config settings:  left => 1

    Apparently, the docs are a bit misleading about what the default behavior is: the actual behavior is that if a string begins with whitespace, the default is to prepend that much whitespace to all the wrapped lines on output. I haven't found anything yet in the man page that talks about indenting only the first line of a paragraph.
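For comparison only, and not as a fix for Text::Autoformat: the core Text::Wrap module takes one prefix for the first line and another for the following lines, which is the "indent only the first line" behaviour in question. It wraps but does not justify, so this is just a sketch of the indentation part. The helper name wrap_first_line_indented, the shortened sample paragraph and the width of 40 are assumptions made for the illustration.

```perl
use strict;
use warnings;
use Text::Wrap qw(wrap);

# Wrap one paragraph to at most $width columns, indenting only its first line.
sub wrap_first_line_indented {
    my ($text, $indent, $width) = @_;

    # Text::Wrap keeps every line shorter than $columns, hence the + 1.
    local $Text::Wrap::columns  = $width + 1;
    # Keep spaces as spaces instead of turning runs of them into tabs.
    local $Text::Wrap::unexpand = 0;

    $text =~ s/^\s+//;                 # drop the original leading whitespace
    return wrap($indent, '', $text);   # first-line prefix, later-line prefix
}

my $para = '  15:06. Lorem ipsum dolor sit amet, consectetuer adipiscing '
         . 'elit. Fusce ligula. Curabitur blandit dui ut urna.';

print wrap_first_line_indented($para, '  ', 40), "\n";
```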

    (update: If your input data is really 8859-1, you can use "-CO" (capital letter o) instead of "-CS", and perl will do the Right Thing. If the input is actually some other non-utf8 encoding, you'll need to use  binmode STDIN, ":encoding(whatever)" for perl to read it properly, and then still use "-CO" to output utf8.)
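The encoding layers mentioned in the update can be tried without touching STDIN by opening in-memory filehandles. This is a sketch under that assumption: the helper name transcode_to_utf8 is hypothetical, and iso-8859-1 is only an example of a source encoding.

```perl
use strict;
use warnings;

# Re-encode text from a given source encoding to UTF-8 by way of I/O layers.
sub transcode_to_utf8 {
    my ($raw_bytes, $source_encoding) = @_;

    # Reading through an :encoding layer yields decoded characters.
    open my $in, "<:encoding($source_encoding)", \$raw_bytes
        or die "cannot open input: $!";
    my $text = do { local $/; <$in> };
    close $in;

    # Writing through a UTF-8 layer encodes those characters again.
    my $utf8_bytes = '';
    open my $out, '>:encoding(UTF-8)', \$utf8_bytes
        or die "cannot open output: $!";
    print {$out} $text;
    close $out;

    return $utf8_bytes;
}

# "s\xE1gittis" is "ságittis" in ISO-8859-1: one byte per accented letter.
my $latin1 = "s\xE1gittis";
my $utf8   = transcode_to_utf8($latin1, 'iso-8859-1');

printf "%d bytes in, %d bytes out\n", length($latin1), length($utf8);
```

In a real filter the same layers would go on STDIN and STDOUT via binmode, with the formatting step sitting between the read and the print.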

      what you have posted is iso-8859-1 or equivalent -- i.e. single-byte per accented character

      I don't know exactly what I have posted (i.e. didn't run a sniffer to see the actual request from my browser) but I do know that perlmonks.org is sending me Content-Type: text/html, charset=ISO-8859-1. Unfortunately, that's not my fault. I can assure you that my files are in UTF-8:

      $ file lorem-ipsum2
      lorem-ipsum2: UTF-8 Unicode text

      I can solve one of your problems by adding "-CS" to the perl command line

      Great! I've never paid much attention to -C because I thought that Perl would auto-detect those things for me. But it seems I should add it to my bash alias and always use it on this system.

      Apparently, the docs are a bit misleading about what the default behavior is

      I didn't get confused by the docs and was expecting that result before getting it, but it was possible that I was overlooking something in the documentation that would let me get the output I wanted. That's why I asked. So where I had been thinking about patching the code to fix my encoding problem, now I'm thinking of patching it to address this indentation issue :^).

      --
      David Serrano