Stripping funny characters

by electronicMacks (Beadle)
on Dec 12, 2000 at 05:47 UTC

electronicMacks has asked for the wisdom of the Perl Monks concerning the following question:

Fellow Monks:

I’m writing a CGI script that includes a large <textarea> field as an input. I need the final output of the script to be simple, clean text, but some of the users are going to be pasting their input from MS Word, bringing unholy characters into play.

I’m not concerned about really weird symbols, like the smiley face, but I am concerned about “smart quotes”, for example (quotes that know whether they're beginning or ending).

Is there a Perl library, or Unix command, that will strip the funny characters away and replace “smart quotes” with "simple quotes"?

Replies are listed 'Best First'.
Re: Stripping funny characters
by extremely (Priest) on Dec 12, 2000 at 06:03 UTC
    As I recall, the smart quotes are \x91-\x94. You should be able to do $text =~ tr/\x91-\x94/''""/; and clean it up nicely. If they get pasted as "&#145;" or similar HTML nasties, you'll need a different approach, of course.
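A minimal sketch of the suggestion above, assuming the form data arrives as raw Windows-1252 bytes rather than decoded UTF-8. The helper name `straighten_quotes` is hypothetical, and the entity handling covers only the numeric forms &#145; through &#148;.

```perl
use strict;
use warnings;

# Hypothetical helper: straighten Windows-1252 "smart quotes" in a byte string.
sub straighten_quotes {
    my ($text) = @_;

    # Raw bytes 0x91-0x94: left/right single quote, left/right double quote.
    $text =~ tr/\x91-\x94/''""/;

    # The same characters when they were pasted as numeric HTML entities.
    $text =~ s/&#14[56];/'/g;
    $text =~ s/&#14[78];/"/g;

    return $text;
}

print straighten_quotes("\x93Hello,\x94 she said. It\x92s &#147;fine&#148;.\n");
```

The tr/// handles the byte form in one pass; the two substitutions are the "different approach" needed once the quotes have turned into entities.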

    --
    $you = new YOU;
    honk() if $you->love(perl)

Re: Stripping funny characters
by chromatic (Archbishop) on Dec 12, 2000 at 10:35 UTC
    The amusingly-named Demoroniser is a Perl program that cleans up yucky stuff like that.

    I haven't looked at it, but I've heard good things about it. Of course, it claims to use Perl 4 features, so perhaps you'll be the one to bring it into the 21st century. (By the time the 21st century rolls around, that is.)

      Ah, I'd forgotten that little gem existed. I've used it before. This quote from the page you link says it all: (Rule of thumb--every time Microsoft use the word "smart," be on the lookout for something dumb).

      The code needs serious work, OTOH. It is not only perl4ish but also full of goofiness like this:

      # Eliminate idiot MS-DOS carriage returns from line terminator
      $s =~ s/\s+$//;
      $s .= "\n";

      All in all it isn't too bad; I'd use it as is, myself =)
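One reading of the complaint about that snippet is that s/\s+$// removes all trailing whitespace, not only the carriage return named in its comment. A narrower sketch that touches only the line terminator might look like this; the helper name `fix_line_ending` is hypothetical and is not taken from the demoroniser.

```perl
use strict;
use warnings;

# Hypothetical helper: normalize one line's terminator to "\n"
# without touching any other trailing whitespace.
sub fix_line_ending {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;    # drop an LF or CRLF terminator, if present
    $line =~ s/\r\z//;       # drop a stray bare CR
    return "$line\n";
}

print fix_line_ending("trailing spaces kept  \r\n");
```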

      --
      $you = new YOU;
      honk() if $you->love(perl)

Re: Stripping funny characters
by wardk (Deacon) on Dec 12, 2000 at 20:39 UTC

    I can really relate to this... I have 4 textareas where this occurs daily. The Oracle database can handle the special chars (how many characters CAN they use for a "bullet"?), but my particular data needs to be forwarded to some job search engines, which do not take kindly to these chars...

    Even if you do not use the demoroniser itself, it's got all the transformations you should need to eradicate this big pain-in-the-ascii.

    Happy hunting!

    PS: anyone else out there get co-workers and friends rolling on the floor laughing by explaining the whole "de-moron-izer" thing??
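For readers who want the table-of-transformations idea without the demoroniser itself, here is a rough sketch. It assumes Windows-1252 byte input; the hash `%ascii_for`, the helper `to_plain_ascii`, and the ASCII stand-ins chosen are all illustrative and are not copied from the demoroniser.

```perl
use strict;
use warnings;

# Assumed mapping from Windows-1252 bytes to plain ASCII stand-ins.
# The replacement strings are illustrative choices only.
my %ascii_for = (
    "\x85" => '...',     # horizontal ellipsis
    "\x91" => "'",       # left single quote
    "\x92" => "'",       # right single quote
    "\x93" => '"',       # left double quote
    "\x94" => '"',       # right double quote
    "\x95" => '*',       # bullet
    "\x96" => '-',       # en dash
    "\x97" => '--',      # em dash
    "\x99" => '(TM)',    # trademark sign
);

# Hypothetical helper: apply the table, then delete any remaining byte
# that is not printable ASCII, a tab, or a newline.
sub to_plain_ascii {
    my ($text) = @_;
    $text =~ s/([\x80-\x9F])/exists $ascii_for{$1} ? $ascii_for{$1} : ''/ge;
    $text =~ tr/\x20-\x7E\t\n//cd;
    return $text;
}

print to_plain_ascii("\x95 Perl\x99 \x96 \x93smart\x94 quotes\x85\n");
```

The final tr/// is deliberately blunt: it also deletes legitimate accented characters, which matches the original request for simple clean text but may be too aggressive for other data.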

Re: Stripping funny characters
by kilinrax (Deacon) on Dec 12, 2000 at 18:41 UTC
    Have a look at Catdoc, a GPL program which reads Microsoft Word files and outputs plain text.
