whohasit has asked for the wisdom of the Perl Monks concerning the following question:

I've been stripping meta characters for a while, but when looking into large Windows-based documents I'm finding lots of non-printing characters.

For example, using cat -vte on a file:
[william@horse ~]# cat -vte character.txt
Bill^K^ISmith
I know that ^I is a tab and can be stripped using \t, but what is ^K (and other junk like it), and how can it be removed?

Replies are listed 'Best First'.
Re: Stripping Meta/Control Key, etc
by GrandFather (Saint) on Apr 03, 2007 at 03:06 UTC

    What sort of documents? Word, Wordpad, Excel, Acrobat PDFs and so on all have their own document formats, and just running through stripping out characters that you don't understand won't get you the document text. Or at least, it probably will, but it will also get you a pile of other stuff that may look like text but is junk. Perhaps repeated parts of the same document, perhaps stuff that happened to be in memory when the document was created, perhaps symbol tables from the application, but in any case junk, and piles of it.

    So, again, what sort of documents and what do you expect to be able to get out of them?


    DWIM is Perl's answer to Gödel
Re: Stripping Meta/Control Key, etc
by parv (Parson) on Apr 03, 2007 at 03:04 UTC
    In perlre, see the POSIX character classes, namely print, cntrl, and graph, and strip accordingly. Or, to put it another way, strip everything that does not match what you want to preserve.
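A minimal sketch of the character-class approach, assuming you want to keep tabs and line endings and drop every other control character. The helper name strip_controls is made up for illustration and is not from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: delete control characters, but leave tab,
# line feed and carriage return alone.
sub strip_controls {
    my ($text) = @_;
    # The negative lookahead skips the three characters we want to keep;
    # [[:cntrl:]] is the POSIX class documented in perlre.
    $text =~ s/(?![\t\n\r])[[:cntrl:]]//g;
    return $text;
}

# "\x0B" is the vertical tab that cat -v displays as ^K.
my $clean = strip_controls("Bill\x0B\tSmith\n");
print $clean;
```

Because this only removes control characters, accented and other non-ASCII letters are left untouched, which is not the case if you instead keep only [[:print:]] on a byte string.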
      Thanks.. what I was looking for was:

      "\ck" (which worked)

      I had only come across \v which did not work.

      I've been loading the contents of an Excel spreadsheet (what was actually supposed to be just plain text); however, it contained probably 50 random occurrences of the vertical tab within 3000 rows of 50 columns. This character imparted a line break when inspecting DB rows in a terminal but was otherwise invisible. I considered stripping everything except a subset, but lots of Unicode characters were necessary.

      But, regarding my original question, is there some relationship between the non-printing character formatting used with "cat -vte <file>" and determining the correct character class to use in a regex?

        Read the bit about the -e, -t, and -v options in the cat(1) man page.

        For future reference (at least in the context of things Unix), note that '^' is used to denote 'Ctrl'; the exception to remember is '^[', which is 'Escape'. '^I' denotes a tab (in vim it can be changed to any other character sequence; see ':help listchars').
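To make the link between the caret notation and Perl's \cX escape concrete, here is a small sketch. It relies on the usual ASCII convention that the character shown after '^' is the control character's code with bit 6 flipped; the helper name caret_to_char is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: given the symbol that follows '^' in cat -v
# output, return the control character it stands for.
sub caret_to_char {
    my ($symbol) = @_;
    # Flipping bit 6 (XOR 64) maps 'K' (75) to 11, '[' (91) to 27, '?' (63) to 127.
    return chr( ord( uc $symbol ) ^ 64 );
}

printf "^%s = 0x%02X\n", $_, ord( caret_to_char($_) ) for qw(I K [ ?);
# ^I = 0x09
# ^K = 0x0B
# ^[ = 0x1B
# ^? = 0x7F

# Perl's \cX escape uses the same mapping, so ^K from cat -v is \cK here.
( my $name = "Bill\cK\cISmith" ) =~ s/\cK//g;
```

So whatever letter cat -v prints after the caret can be dropped straight into a regex as \c followed by that letter.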

        I think you would be better off dumping the file through a hexdump(1)-like program, which can show you the numeric representation of the characters. Then you can note the values of the characters that you want to replace and plug them into the substitution.
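If hexdump(1) is not to hand, roughly the same information can be pulled out in Perl itself. This is only a sketch: the helper name count_controls is invented for illustration, and it works on a string you have already read in rather than on a file.

```perl
use strict;
use warnings;

# Hypothetical helper: count each control character in a string,
# keyed by its hexadecimal code, so the unexpected ones stand out.
sub count_controls {
    my ($text) = @_;
    my %count;
    # Each successful match leaves the control character in $1.
    $count{ sprintf '0x%02X', ord $1 }++ while $text =~ /([[:cntrl:]])/g;
    return \%count;
}

my $report = count_controls("Bill\x0B\tSmith\n");
print "$_ seen $report->{$_} time(s)\n" for sort keys %$report;
```

Once a code such as 0x0B shows up in the report, it can be plugged into a substitution directly, for example s/\x0B//g.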