Stripping Meta/Control Key, etc

whohasit has asked for the wisdom of the Perl Monks concerning the following question:

Re: Stripping Meta/Control Key, etc by GrandFather (Saint) on Apr 03, 2007 at 03:06 UTC
What sort of documents? Word, WordPad, Excel, Acrobat PDFs and so on all have their own document formats, and just running through them stripping out characters that you don't understand won't get you the document text. Or at least, it probably will, but it will also get you a pile of other stuff that may look like text but is junk: perhaps repeated parts of the same document, perhaps stuff that happened to be in memory when the document was created, perhaps symbol tables from the application, but in any case junk, and piles of it. So, again, what sort of documents, and what do you expect to be able to get out of them? DWIM is Perl's answer to Gödel
Re: Stripping Meta/Control Key, etc by parv (Parson) on Apr 03, 2007 at 03:04 UTC
In perlre, see the character classes, namely `print`, `cntrl`, and `graph`, and strip accordingly. Or, to put it another way, strip everything that does not match what you want to preserve.
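As a minimal sketch of that suggestion, the POSIX class `[[:cntrl:]]` can be combined with a negative lookahead so that ordinary whitespace survives. The sample string and the choice to keep tab, newline, and carriage return are assumptions for illustration, not something stated in the thread.

```perl
use strict;
use warnings;

# Hypothetical sample: text with an embedded vertical tab (\x0B) and a bell (\x07).
my $text = "first\x0Bline\tkept\x07\n";

# Delete every control character except tab, newline, and carriage return.
# The lookahead vetoes the three characters we want to preserve.
(my $clean = $text) =~ s/(?![\t\n\r])[[:cntrl:]]//g;

print $clean;   # prints "firstline", a tab, "kept", and a newline
```

Note that deleting the vertical tab joins the words on either side of it; substitute a space instead if the character was acting as a separator.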
Re^2: Stripping Meta/Control Key, etc by whohasit (Novice) on Apr 03, 2007 at 03:51 UTC
Thanks. What I was looking for was "\ck" (which worked); I had only come across \v, which did not work. I've been loading the contents of an Excel spreadsheet (which was actually supposed to be just plain text), but it contained probably 50 random occurrences of the vertical tab within 3000 rows of 50 columns. This character imparted a line break when inspecting DB rows in a terminal but was otherwise invisible. I considered stripping everything except a subset, but lots of Unicode characters were necessary. But, regarding my original question, is there some relationship between the non-printing character formatting used with "cat -vte <file>" and determining the correct character class to use in a regex?
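For reference, a small sketch of the fix described above. "\ck" (Ctrl-K) names the same character as "\x0B", the vertical tab. Perl's `\v` is not the C-style vertical tab escape: from Perl 5.10 on it matches any vertical whitespace in a regex, newlines included, so it is a poor fit for removing only this one character. The cell contents below are a made-up example.

```perl
use strict;
use warnings;

# "\cK" (Ctrl-K), "\x0B", and chr(11) all name the same character: vertical tab.
my $same = ( "\cK" eq "\x0B" and "\x0B" eq chr(11) ) ? 1 : 0;

# Hypothetical spreadsheet cell containing a stray vertical tab.
my $cell = "Acme\cKCorp";

# Replace each vertical tab with a single space.
(my $fixed = $cell) =~ s/\cK/ /g;

print "$fixed\n";   # prints "Acme Corp"
```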
Re^3: Stripping Meta/Control Key, etc by parv (Parson) on Apr 03, 2007 at 08:10 UTC
Read the bit about the -e, -t, and -v options in cat(1). For the future (at least in the context of things Unix), note that '^' is used to denote 'Ctrl', except in '^[', where it is 'Escape'. '^I' denotes a tab (in vim the display can be changed to any other character sequence; see ':help listchars'). I think you would be better off dumping the file through a hexdump(1)-like program, which can show you the numeric representation of the characters. Then you can note the values of the characters that you want to replace and plug them into the substitution.
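A rough sketch of both ideas in Perl, assuming ASCII input. The helper names `caret_to_char` and `control_chars` are hypothetical, invented for this example. The first maps caret notation as printed by `cat -v` to the character it stands for (the code of the letter XOR 64); the second reports which control characters occur in a string, as a stand-in for eyeballing hexdump output.

```perl
use strict;
use warnings;

# Convert caret notation (e.g. "^K") to the control character it denotes.
# "^K" is 75 XOR 64 = 11, "^I" is 9 (tab), "^[" is 27 (Escape).
sub caret_to_char {
    my ($caret) = @_;
    my ($letter) = $caret =~ /^\^(.)$/ or die "not caret notation: $caret";
    return chr( ord( uc $letter ) ^ 64 );
}

# Count each control character in a string, keyed by its hex code.
sub control_chars {
    my ($text) = @_;
    my %seen;
    $seen{ sprintf '0x%02X', ord $1 }++ while $text =~ /([[:cntrl:]])/g;
    return \%seen;
}

# Hypothetical row with two vertical tabs and one ordinary tab.
my $row    = "a\x0Bb\tc\x0Bd";
my $counts = control_chars($row);
printf "%s seen %d time(s)\n", $_, $counts->{$_} for sort keys %$counts;

# Plug the character shown as ^K by cat -v into a substitution.
my $vt = caret_to_char('^K');
(my $fixed = $row) =~ s/\Q$vt\E/ /g;
```

With the sample row, the report lists 0x09 once and 0x0B twice, and `$fixed` keeps the tab while the vertical tabs become spaces.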