Text::Autoformat: usage and multibyte-encoded text

Hue-Bond has asked for the wisdom of the Perl Monks concerning the following question:

I've been using a program called par (offsite link) for quite some years now. It's a paragraph reformatter that I use mainly to justify text. When I changed my system configuration to use UTF-8 everywhere (keyboard input, terminal output and so on), par broke. Last weekend I searched for a Perl module to replace its functionality and, from the several results of my CPAN search, I picked one whose description sounded good: Text::Autoformat, by TheDamian. I'm encountering some problems with it, but first I'll show you the data (I replaced some characters with non-ASCII ones).

Original data

  15:06. Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Fusce
ligula. Curabitur blandit dui ut urna. Nullam vel eros. Mauris rhoncus sem.
Dóńéc luctus velit eget quam. Mauris pellentesque. Vivamus quam. Mauris
ságittis vulputate mauris. Nulla consequat est aliquam urna fringilla
lacinia. Nunc auctor sagittis tortor.

Actual output

$ perl -MText::Autoformat -e'
undef $/;
print autoformat +<>, {
  right   => 78,
  justify => "full",
  all     => 1,
  lists   => 0,
}' < lorem-ipsum2
  15:06. Lorem ipsum dolor  sit  amet,  consectetuer  adipiscing  elit.  Fusce
  ligula. Curabitur blandit dui ut urna. Nullam vel eros. Mauris rhoncus  sem.
  Dóńéc luctus velit eget quam. Mauris pellentesque. Vivamus  quam.  Mauris
  ságittis vulputate mauris.  Nulla  consequat  est  aliquam  urna  fringilla
  lacinia. Nunc auctor sagittis tortor.

Desired output

  15:06.  Lorem ipsum  dolor  sit amet,  consectetuer  adipiscing elit.  Fusce
ligula. Curabitur  blandit dui ut urna.  Nullam vel eros. Mauris  rhoncus sem.
Dóńéc  luctus  velit eget  quam.  Mauris  pellentesque. Vivamus  quam.  Mauris
ságittis  vulputate  mauris.  Nulla   consequat  est  aliquam  urna  fringilla
lacinia. Nunc auctor sagittis tortor.

There are two issues here:

1. This seems to be a byte-oriented program. Multibyte-encoded characters are treated as individual bytes, so the output lines are 78 bytes long, not 78 characters. par has the same incorrect behaviour.
2. Text::Autoformat indents the whole paragraph instead of just the first line. I can understand that this is by design; the problem is that I couldn't find an option to prevent it.
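
The byte-versus-character difference behind the first issue can be seen with core Perl alone. The following is a minimal sketch, not a look inside Text::Autoformat: it only shows that `length` reports bytes for an undecoded UTF-8 string and characters once the string has been decoded, which is presumably why any width computed on undecoded input comes out too large. The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "Dóńéc" as the UTF-8 bytes perl sees when no decoding layer is set:
# ó, ń and é occupy two bytes each.
my $bytes = "D\xC3\xB3\xC5\x84\xC3\xA9c";

# The same word after decoding: now a string of characters.
my $chars = decode('UTF-8', $bytes);

printf "undecoded length: %d\n", length $bytes;    # 8 (bytes)
printf "decoded length:   %d\n", length $chars;    # 5 (characters)
```

A line containing this one word would therefore be measured three columns too wide if it were never decoded.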

Is there anything I can do to sort these out? I suspect the first problem can only be solved by modifying the code but maybe the second is easier.

Update: Tagged the par link as offsite.

--
David Serrano

Replies are listed 'Best First'.
Re: Text::Autoformat: usage and multibyte-encoded text by duckyd (Hermit) on Jul 11, 2006 at 00:35 UTC
Try adding `binmode STDOUT, ':utf8';`. Without it, I get erroneous output, but with it the output looks correct to me. Note that your input file does not appear to be UTF-8, which may also be causing you some grief.
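
To see what that `binmode` call changes, here is a small sketch that prints the same character string through a filehandle with and without the `:utf8` layer. It assumes an in-memory filehandle as a stand-in for STDOUT so the emitted bytes can be inspected; `bytes_written` and `hex_dump` are hypothetical helper names, not part of any module.

```perl
use strict;
use warnings;

# Print $text to an in-memory filehandle, optionally after binmode(),
# and return the raw bytes a terminal or file would receive.
sub bytes_written {
    my ($text, $layer) = @_;
    my $buffer = '';
    open my $fh, '>', \$buffer or die "cannot open in-memory file: $!";
    binmode $fh, $layer if defined $layer;
    print {$fh} $text;
    close $fh or die "cannot close in-memory file: $!";
    return $buffer;
}

# Render a byte string as space-separated hex pairs.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord $_) } split //, $bytes;
}

my $word = "s\x{E1}g";    # the first three characters of "ságittis"

print 'default: ', hex_dump(bytes_written($word)),          "\n";   # 73 E1 67
print ':utf8:   ', hex_dump(bytes_written($word, ':utf8')), "\n";   # 73 C3 A1 67
```

Without the layer the á leaves as the single Latin-1 byte 0xE1, which a UTF-8 terminal cannot display; with it, the character is encoded as the two-byte UTF-8 sequence.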
Re: Text::Autoformat: usage and multibyte-encoded text by graff (Chancellor) on Jul 11, 2006 at 04:32 UTC
If I store your sample text on my local drive and then convert that file to utf8 (because what you have posted is iso-8859-1 or equivalent, i.e. a single byte per accented character), I can solve one of your problems by adding "-CS" to the perl command line: `perl -CS -MText::Autoformat -e'...`. That extra option tells perl to set utf8 discipline on STDIN, STDOUT and STDERR (the equivalent of doing `binmode ..., ":utf8"` on each of those file handles).

As for the left-margin problem (why it is indenting all the lines): if I delete the initial whitespace from the beginning of the sample data, the indentation goes away completely (including on the first line). That also happens if I add an explicit option for the left margin in the hash of config settings: `left => 1`. Apparently, the docs are a bit misleading about the default behavior: if a string begins with whitespace, the default is to prepend that much whitespace to all the wrapped lines on output. I haven't found anything yet in the man page that talks about indenting only the first line of a paragraph.

Update: If your input data is really 8859-1, you can use "-CO" (capital letter O) instead of "-CS", and perl will do the Right Thing. If the input is actually some other non-utf8 encoding, you'll need to use `binmode STDIN, ":encoding(whatever)"` for perl to read it properly, and then still use "-CO" to output utf8.
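
The input side of this advice can be sketched the same way. The example below assumes in-memory filehandles in place of STDIN and shows that, once the right `:encoding(...)` layer is applied, ISO-8859-1 input and UTF-8 input yield the same character string for the formatter to measure. `read_decoded` is a hypothetical helper written for this illustration.

```perl
use strict;
use warnings;

# Slurp an in-memory "file" holding $bytes through the given input layer,
# much as binmode STDIN, $layer would do for real input.
sub read_decoded {
    my ($layer, $bytes) = @_;
    open my $in, "<$layer", \$bytes or die "cannot open in-memory file: $!";
    local $/;                 # slurp mode, like "undef $/" in the one-liner
    my $text = <$in>;
    close $in;
    return $text;
}

my $latin1 = "D\xF3nec luctus\n";        # ó as the single ISO-8859-1 byte 0xF3
my $utf8   = "D\xC3\xB3nec luctus\n";    # ó as the UTF-8 byte pair 0xC3 0xB3

my $from_latin1 = read_decoded(':encoding(ISO-8859-1)', $latin1);
my $from_utf8   = read_decoded(':encoding(UTF-8)',      $utf8);

print "decoded strings are identical\n" if $from_latin1 eq $from_utf8;
printf "length in characters: %d\n", length $from_utf8;
```

Pairing such an input layer with a UTF-8 output layer gives the combination described in the update above.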
Re^2: Text::Autoformat: usage and multibyte-encoded text by Hue-Bond (Priest) on Jul 11, 2006 at 09:41 UTC
> what you have posted is iso-8859-1 or equivalent -- i.e. single-byte per accented character

I don't know exactly what I have posted (i.e. I didn't run a sniffer to see the actual request from my browser), but I do know that `perlmonks.org` is sending me `Content-Type: text/html, charset=ISO-8859-1`. Unfortunately, that's not my fault. I can assure you that my files are in UTF-8:

`$ file lorem-ipsum2`
`lorem-ipsum2: UTF-8 Unicode text`

> I can solve one of your problems by adding "-CS" to the perl command line

Great! I've never paid much attention to `-C` because I thought that Perl would auto-detect those things for me. But it seems I should add it to my `bash` alias and always use it on this system.

> Apparently, the docs are a bit misleading about what the default behavior is

I wasn't confused by the docs and was expecting that result before getting it, but it was possible that I had overlooked something in the documentation that would let me get the output I wanted. That's why I asked. So, having first thought about patching the code to fix my encoding problem, I'm now thinking of patching it to address this indentation issue :^).

--
David Serrano
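
For the first-line-only indentation, one possible stopgap that avoids patching is the core Text::Wrap module, whose `wrap` function takes separate prefixes for the first and the subsequent lines. This is a swapped-in technique, not a Text::Autoformat option, and it only wraps: it does not produce full justification. The sketch below assumes the input has already been decoded to characters; `wrap_first_line_indent` is a hypothetical name.

```perl
use strict;
use warnings;
use Text::Wrap qw(wrap);

# Text::Wrap keeps every line at most $columns - 1 long, so 79 allows 78.
$Text::Wrap::columns = 79;

# Re-wrap one paragraph, keeping its leading whitespace on the first line only.
sub wrap_first_line_indent {
    my ($text) = @_;
    my ($indent) = $text =~ /\A([ \t]*)/;    # indentation of the first line
    (my $body = $text) =~ s/\s+/ /g;          # join lines, collapse spacing
    $body =~ s/\A\s+|\s+\z//g;                # trim both ends
    return wrap($indent, '', $body) . "\n";
}

# Sample paragraph as a character string (accented letters written as escapes).
my $sample = "  15:06. Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Fusce\n"
           . "ligula. Curabitur blandit dui ut urna. Nullam vel eros. Mauris rhoncus sem.\n"
           . "D\x{F3}\x{144}\x{E9}c luctus velit eget quam. Mauris pellentesque. Vivamus quam. Mauris\n"
           . "s\x{E1}gittis vulputate mauris. Nulla consequat est aliquam urna fringilla\n"
           . "lacinia. Nunc auctor sagittis tortor.\n";

binmode STDOUT, ':encoding(UTF-8)';
print wrap_first_line_indent($sample);
```

Because the string holds characters rather than bytes, the 78-column limit is applied per character here.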