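Looks like you've almost got it right. I think the problem is this regexp: $line =~ s/^\D{0,2}|\s{0,2}//; Note that \D matches any non-digit character (letters included), not just control characters, and that the ^ anchors only the first alternative, because | has the lowest precedence in the pattern.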

I'm not sure what you're trying to do there, especially with the \s trimming (are some of these junk characters spaces?). Assuming you actually want to purge control characters (i.e. ASCII 0-31 and 127) and spaces, use the POSIX [:cntrl:] character class, like this (see perlre for more information): $line =~ s/^([[:cntrl:]]|\s){2,}//;

This should delete all control characters and spaces from the beginning of any line that starts with two or more of them. (Unfortunately it will also strip the indentation from lines that have only leading spaces and no control characters; without seeing the data I don't know if this matters to you.) But why not just forget the {2,} and eliminate any leading control characters and spaces, even a single one? $line =~ s/^([[:cntrl:]]|\s)+//;
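To make the difference between these patterns concrete, here is a small sketch that runs the original substitution and the two replacements over a few lines. The sample strings and the helper names strip_with and show are my own inventions, since the real data wasn't posted, so treat this as an illustration rather than a test against your file.

```perl
use strict;
use warnings;

# Hypothetical helper: apply one substitution to a copy of the line.
sub strip_with {
    my ($re, $line) = @_;
    $line =~ s/$re//;
    return $line;
}

# Hypothetical helper: make control characters visible when printing.
sub show {
    my ($s) = @_;
    $s =~ s/([[:cntrl:]])/sprintf('\\x%02X', ord $1)/ge;
    return $s;
}

my @patterns = (
    [ 'original' => qr/^\D{0,2}|\s{0,2}/ ],        # pattern from the question
    [ '{2,}'     => qr/^([[:cntrl:]]|\s){2,}/ ],   # two or more junk characters
    [ '+'        => qr/^([[:cntrl:]]|\s)+/ ],      # one or more junk characters
);

# Made-up sample lines (an assumption; the real data was not shown).
my @samples = ("\x01\x02\x03data", "\x01data", "  indented", "hello");

for my $line (@samples) {
    for my $p (@patterns) {
        my ($name, $re) = @$p;
        printf "%-10s %-18s -> %s\n",
            $name, show($line), show(strip_with($re, $line));
    }
}
```

On these samples the original pattern turns "hello" into "llo" and leaves \x03 at the front of the first line, the {2,} version leaves a lone leading \x01 in place, and the + version removes the whole leading run in every case (including the indentation of "  indented").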

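If you want to keep leading spaces unless they're also mixed in with control characters: $line =~ s/^(?:[[:cntrl:]]|\s)+// if ($line =~ /^((?:[[:cntrl:]]|\s)+)/ && $1 =~ /[[:cntrl:]]/); Note that the + has to sit inside the capturing parentheses here: with ([[:cntrl:]]|\s)+ the capture is overwritten on every repetition, so $1 would hold only the last character of the leading run instead of the whole run.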

I'm not sure if that "clever" trick with the "$1=~" is legit (it syntax checks OK at least); maybe some other monk could clarify this. Unfortunately I don't know what your data looks like, so I can't really test these too well. Hope this helps though.
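One thing to watch with this kind of check: when a capture group is itself quantified, as in ([[:cntrl:]]|\s)+, $1 holds only the last repetition, not the whole run. Here is a minimal sketch of the "strip only when a control character is present" idea that captures the whole leading run instead. The sub name strip_junk and the sample strings are hypothetical, and I haven't run it against the real data.

```perl
use strict;
use warnings;

# Hypothetical helper: remove the leading run of control characters and
# whitespace, but only if that run contains at least one control character.
sub strip_junk {
    my ($line) = @_;
    # The quantifier sits inside the capture, so $1 is the whole leading run.
    if ($line =~ /^((?:[[:cntrl:]]|\s)+)/ && $1 =~ /[[:cntrl:]]/) {
        $line =~ s/^(?:[[:cntrl:]]|\s)+//;
    }
    return $line;
}

# With the quantifier outside the group, $1 is just the last repetition.
my $last;
if (" \x01 data" =~ /^([[:cntrl:]]|\s)+/) {
    $last = $1;    # a single space, not " \x01 "
}

print strip_junk("  indented"), "\n";     # space-only indentation is kept
print strip_junk(" \x01 data"), "\n";     # mixed run is removed
print strip_junk("\tindented"), "\n";     # tab counts as a control character
```

Be aware that [[:cntrl:]] also matches tab, newline and carriage return, so tab-indented lines still lose their indentation with this approach. If that matters, the inner check could use a narrower class that excludes those characters.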


In reply to Re: cleaning up control characters by blackmateria
in thread cleaning up control characters by skinnymofo
