in reply to Re^5: What's the 'M-' characters and how to filter/correct them?
in thread What's the 'M-' characters and how to filter/correct them?
Yes, you got the point, and thanks for helping me untangle my thoughts.
After thinking about it a bit more, I think what I want is the following:
If they are just non-English language characters, I would rather keep them.
For the other cases, like the Excel empty character, I think they should be removed as long as the visible content remains unchanged.
So I think my script needs to know what exactly the extended characters in the data file really are...
Looking through the long 8859-1 list, it seems I either have to list every non-English language character (in decimal/hex form) in my regex, or list all the garbage-like characters...
As this may make the code hard to maintain, is there a way to cover all the useful (or non-useful) characters with a single category in the regex?
The ideal code I'm dreaming about probably looks like the following:
    # if any non-english language chara in the line
    if ( $line =~ /[[:non-english_chara_class:]]/g ) {
        print "Keep the content as is";
    }
    elsif ( $line =~ /[[:garbage_chara_class:]]/g ) {
        # do some filtering
    }
Would that be possible?
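For what it's worth, Perl's Unicode properties seem to get close to the two "classes" I have in mind. The following is only a rough sketch, not a tested solution for my data: it assumes each line has already been decoded from bytes to characters (for example by opening the file with an `:encoding(...)` layer), the sub names `has_foreign_letters` and `clean_line` are made up for illustration, and treating the no-break space (U+00A0) as the "Excel empty character" is my own guess.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the line contains a letter outside ASCII,
# i.e. something like a "non-English language character".
sub has_foreign_letters {
    my ($line) = @_;
    # The lookahead demands a non-ASCII character; \p{L} demands a letter.
    return $line =~ /(?=\P{ASCII})\p{L}/ ? 1 : 0;
}

# Hypothetical helper: strip characters that should not change what is visible.
sub clean_line {
    my ($line) = @_;
    # Delete control characters (\p{Cc}) except tab, LF and CR, and
    # format characters (\p{Cf}: soft hyphen, BOM, zero-width space, ...).
    $line =~ s/(?![\t\n\r])[\p{Cc}\p{Cf}]//g;
    # Assumption: a no-break space is noise here, so turn it into a plain space.
    $line =~ s/\x{A0}/ /g;
    return $line;
}
```

Unlike the if/elsif above, this cleans every line and leaves the letters alone, since one line could contain both an accented letter and an invisible character. With `"caf\x{E9}\x{A0}au\x{AD} lait"` as input, `has_foreign_letters` is true because of the `é`, and `clean_line` keeps the `é` while dropping the soft hyphen and replacing the no-break space with an ordinary space.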
Re^7: What's the 'M-' characters and how to filter/correct them?
by shmem (Chancellor) on Jan 21, 2016 at 15:28 UTC | |
|
Re^7: What's the 'M-' characters and how to filter/correct them?
by Anonymous Monk on Jan 20, 2016 at 22:01 UTC | |
by sylph001 (Sexton) on Jan 21, 2016 at 09:25 UTC | |
|
Re^7: What's the 'M-' characters and how to filter/correct them?
by AnomalousMonk (Archbishop) on Jan 21, 2016 at 20:05 UTC |