antjock has asked for the wisdom of the Perl Monks concerning the following question:

Well,

I'm writing a general purpose script to handle form submissions, and am wondering about cleaning up textareas. My current function to clean up the text is this:

sub clean_data
{
    my @fields = map
    {
        s/\r+/ /g;
        s/\n+/ /g;
        s/\s+/ /g;
        s/\|/-/g;
        s/^\s//;
        s/\s$//;
        $_;
    } @_;
}

This seems to do everything I want, but since I don't really know what I'm doing, there is likely a better way.

I also am looking through the O'Reilly "CGI Programming with Perl" which uses a similar sub to replace returns with \r, tabs with \t, etc. so that the original formatting of the text is 'preserved' essentially. I would like to do this as well (preserve the formatting, as much as possible), but I need to be able to write it out to a pipe delimited file that folks can import into various programs.

I haven't looked at any modules yet, as I wanted to try to figure it out on my own first, but if anyone can recommend a module that would do the work, that's cool.

I would appreciate any pointers.

cheers.

Replies are listed 'Best First'.
Re: Tidying up textarea fields
by Fastolfe (Vicar) on Dec 14, 2000 at 03:40 UTC
    You may be interested in Text::Wrap and/or Text::Autoformat, which attempt to automate the "cleaning up" of text.

    In any event, \s also matches \r and \n ("vertical" whitespace), so your first two lines are unneeded.
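    A minimal sketch of what the sub might look like with those two lines dropped. Copying @_ and returning the list explicitly are my own additions, not something the original post does; the original edits the caller's values in place through the aliases in @_.

```perl
use strict;
use warnings;

# Hypothetical rewrite of clean_data: works on copies so the caller's
# values are left untouched, and returns the cleaned list explicitly.
sub clean_data {
    my @fields = @_;        # copy, so the aliases in @_ are not modified
    for (@fields) {
        s/\s+/ /g;          # \s covers \r, \n and \t; collapse each run to one space
        s/\|/-/g;           # keep the pipe delimiter out of the data
        s/^ //;             # at most one leading space is left after the collapse
        s/ $//;             # likewise for a trailing space
    }
    return @fields;
}

my ($clean) = clean_data("  first line\r\nsecond | line\t\n");
print "$clean\n";
```

    Call it in list context, as above, since it returns a list.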

Re: Tidying up textarea fields
by jptxs (Curate) on Dec 14, 2000 at 04:07 UTC

    What Fastolfe says above is dead on. I just want to throw in that you should without a doubt be using CGI.pm for collecting and parsing information from a web form - if I am not mistaken, the older edition of the O'Reilly book fails to mention this very important module. Once it gives you the data, then you can go about cleaning it up as you say. When you say 'general form processing' it sounds as if you may be reinventing the largest wheel of them all (CGI.pm), which would be a shame and a waste of your time. You may also want to see the homenode of princepawn for some of the other neat HTML form processing engines. Good luck.

    "A man's maturity -- consists in having found again the seriousness one had as a child, at play." --Nietzsche
Re: Tidying up textarea fields
by dws (Chancellor) on Dec 14, 2000 at 03:02 UTC
    Answers to this depend on what you mean by "general". It looks at first glance as if application-specific translations have crept into your cleanup rules.

    My general-purpose form handling code (yeah, yeah, I should use CGI.pm, but..) does:

            s/\r\n/\n/g;
            s/\r/\n/g;

    Any further translations are done further up the foodchain. In the case of some of my apps, blank lines (e.g., \n\n) between blocks of text are used to delimit paragraphs. Decisions like that are specific enough to warrant leaving them out of general cleanup code. The same goes for collapsing multiple spaces into one.
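    As a rough sketch, the two substitutions can be wrapped in a helper and the paragraph splitting left to the caller. The name normalize_newlines and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: normalize CRLF and bare CR line endings to a
# single \n, leaving everything else untouched.
sub normalize_newlines {
    my ($text) = @_;
    $text =~ s/\r\n/\n/g;   # CRLF first, so it does not turn into two newlines
    $text =~ s/\r/\n/g;     # then any remaining bare CR
    return $text;
}

# An application-level decision made "further up": blank lines delimit paragraphs.
my $raw        = "para one\r\n\r\npara two\rlast line\n";
my @paragraphs = split /\n{2,}/, normalize_newlines($raw);
print scalar(@paragraphs), " paragraphs\n";
```

    The order matters: replacing bare \r first would turn every \r\n into \n\n and invent paragraph breaks.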
Re: Tidying up textarea fields
by wardk (Deacon) on Dec 14, 2000 at 04:34 UTC

    In addition to formatting the text, you may want to check the text for cut-n-paste oddities from Office-type apps (like 'bullet' characters).

    There was a node specifically dedicated to that subject yesterday: Stripping funny characters. There are suggestions in that thread that may be beneficial to you in addition to the great suggestions in this thread.
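    For illustration only, here is one way such a cleanup might look. It is not taken from that thread, and it assumes the form data arrives as raw Windows-1252 bytes; the mapping table and the name strip_office_chars are my own choices.

```perl
use strict;
use warnings;

# Hypothetical mapping, assuming the input is raw Windows-1252 bytes.
my %ascii_for = (
    "\x85" => '...',    # ellipsis
    "\x91" => "'",      # left single quote
    "\x92" => "'",      # right single quote
    "\x93" => '"',      # left double quote
    "\x94" => '"',      # right double quote
    "\x95" => '*',      # bullet
    "\x96" => '-',      # en dash
    "\x97" => '--',     # em dash
);

sub strip_office_chars {
    my ($text) = @_;
    # Replace each listed byte with its plain ASCII stand-in.
    $text =~ s/([\x85\x91-\x97])/$ascii_for{$1}/g;
    return $text;
}

print strip_office_chars("\x95 It\x92s \x93quoted\x94 \x96 done\x85"), "\n";
```

    If the data is UTF-8 instead, these byte values mean something else entirely, so check the encoding before applying anything like this.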

    good luck!

Re: Tidying up textarea fields
by rrwo (Friar) on Dec 14, 2000 at 05:53 UTC

    Lessee... \r and \n are considered whitespace already, so they're covered in \s. You also only care about converting double-whitespace to singles, so something like s/\s{2,}/ /g; is better.

    That's off the top of my head. I'm sure a lot of other cleanups can be done.
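    One caveat worth checking before swapping the pattern: s/\s{2,}/ /g leaves a lone \n or \t where it is, while s/\s+/ /g turns it into a space. A quick comparison, with a made-up input string:

```perl
use strict;
use warnings;

my $input = "one\ntwo\t\tthree";

(my $plus = $input) =~ s/\s+/ /g;       # every whitespace run becomes one space
(my $two  = $input) =~ s/\s{2,}/ /g;    # only runs of two or more are replaced

print "with \\s+:     [$plus]\n";
print "with \\s{2,}: [$two]\n";
```

    So the {2,} form only works as a drop-in replacement if single newlines and tabs are handled separately or are meant to survive.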

    (Actually, I think web browsers do not send \r characters in forms. I could be wrong.)

    Question is: why do you want to remove all newlines?

    There are times when you want to keep them. When people type in a TEXTAREA field, they often care about keeping their paragraphs separate. I usually preserve newlines and when I display the paragraph I convert things like double-newlines to paragraph breaks.
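    A sketch of that display step, under the assumption that the output is HTML. The helper name text_to_html, and the choice to turn single newlines into <br>, are illustrative rather than a description of any particular site's code.

```perl
use strict;
use warnings;

# Hypothetical helper: escape HTML metacharacters, wrap each
# blank-line-separated block in <p> tags, and turn the remaining
# single newlines into <br>.
sub text_to_html {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;              # normalize line endings first
    $text =~ s/\A\s+|\s+\z//g;          # drop leading and trailing whitespace
    my %escape = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
    $text =~ s/([&<>"])/$escape{$1}/g;  # escape before adding our own tags
    my @paragraphs = grep { /\S/ } split /\n{2,}/, $text;
    s/\n/<br>\n/g for @paragraphs;
    return join "\n", map { "<p>$_</p>" } @paragraphs;
}

print text_to_html("First <para>\r\nstill first\r\n\r\nSecond & last"), "\n";
```

    The stored text stays as the user typed it; the conversion happens only at display time.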

      (Actually, I think web browsers do not send \r characters in forms. I could be wrong.)

      False. Not all browsers send \r, but some do (most notably Windows-based MSIE (not sure about Mac)). This can be a real pain for first time textarea handlers because (of course) \r isn't the same thing as \n, but it looks the same.

        This is exactly what I was after, or exactly the kind of info I was trying to find out.

        I guess I should have been a little clearer about what my purposes are, and why I'm 'cleaning up' the textarea in the first place.

        Firstly, the cgi is intended to be a general purpose script to process surveys we do on our intranet. The cgi allows you to pass in some parameters to configure it (such as the delimiter to use in the output file, the name of the output file, etc.). The form data is 'encoded' and both written to an output file and sent in an email to the person administering the survey.

        Now this 'encoding' or cleaning up is mainly meant for the textarea, where people tend to wax philosophic and ramble on for many lines. I need to be able to capture that in a flat file (typically pipe delimited, but I tried to make it so that it was flexible/extensible) and keep it in a format that is reasonably sane.

        I don't want to lose the formatting, so that's kind of what I was originally trying to get at, "How do you deal with the data you get back from a textarea from a form?" Do you just write it out, or do you do anything to it to remove all the nefarious tabs, newlines, microsoft spew, weird browser cruft, etc?
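        For what it's worth, a rough sketch of the kind of 'encoding' I have in mind: newlines, tabs and the delimiter are replaced by backslash escapes so each record stays on one line and the formatting can be restored on import. The names encode_field and decode_field and the \d escape for the delimiter are invented here, and the sketch assumes the delimiter is a single punctuation character such as | or a comma.

```perl
use strict;
use warnings;

# Hypothetical encoder: make a field safe for one line of a delimited
# file while keeping enough information to restore the layout later.
sub encode_field {
    my ($text, $delim) = @_;
    $delim = '|' unless defined $delim;
    $text =~ s/\r\n?/\n/g;          # normalize line endings
    $text =~ s/\\/\\\\/g;           # escape the escape character first
    $text =~ s/\n/\\n/g;            # newline   -> literal \n
    $text =~ s/\t/\\t/g;            # tab       -> literal \t
    $text =~ s/\Q$delim\E/\\d/g;    # delimiter -> literal \d (an arbitrary choice)
    return $text;
}

# Inverse of encode_field (line endings come back as \n).
sub decode_field {
    my ($text, $delim) = @_;
    $delim = '|' unless defined $delim;
    my %plain = (n => "\n", t => "\t", d => $delim, '\\' => '\\');
    $text =~ s/\\([ntd\\])/$plain{$1}/g;    # one pass, so \\n is not misread
    return $text;
}

my @record = ("Jane", "line one\r\nline two | with pipe\tand tab");
print join('|', map { encode_field($_) } @record), "\n";
```

        Whoever imports the file needs to know about the escapes, so this only helps if the consuming program can run the decode step or tolerate the literal \n and \t.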

        By the way, I am using CGI.pm to bring in all the form data, but CGI doesn't really give you any way to 'clean' it once you've got it. That's what we got Perl for. :->

        I'm starting to mess with CGI::Validate, but that doesn't seem to address this directly, really.

        Anyway, thanks a lot for the feedback so far.

        This whole thing has also got me thinking why the textarea on this form is not wrap="virtual"... I hate when the textareas are not wrapped and the lines go on and on and on, but the people who made perlmonks are smart, they had to have done this for a reason...