Dear Monks,

I am having difficulty reading in lines from a txt file that contains text that was copied from a pdf file.

I shortened the text file to 3 lines to take up less space. It looks as follows:

AA 12
BB 34
CC 56

I want to read the file one line at a time, but I cannot find a way to do this. I checked the txt file in a hex editor, which shows a carriage return at the end of each line. I tried to deal with this as shown below, but it prints only the final line, with bits of another line inserted into it, and it fails to print the preceding '>'.

open( FH, $f ) or die;
while( my $str = <FH> ){
    $str =~ s/\r\n//g;
    print ">$str<\n";
}
close(FH);

# Output:

CCA 56<

If I change s/\r\n//g; to s/\r//g; then it prints everything:

# Output:

>AA 12BB 34CC 56<

I also tried s/[^[:ascii:]]//g; and tr/\x80-\xFF//d; but they do not solve the problem.

Some strange invisible or non-ASCII characters from the pdf file are likely the cause of this, but I am now stumped as to how to solve this problem.
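One way to check what is really in the file is to make every non-printable byte visible. The following is only a minimal sketch: show_bytes is a hypothetical helper name, and the sample string is an assumption about what the file might contain (lines ending in a bare carriage return), not the actual data.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of $raw in which every character
# outside printable ASCII (0x20-0x7E) is replaced by a visible \xNN escape.
sub show_bytes {
    my ($raw) = @_;
    (my $visible = $raw) =~ s/([^\x20-\x7E])/sprintf('\\x%02X', ord $1)/ge;
    return $visible;
}

# Assumed sample data: two lines terminated by a bare carriage return.
print show_bytes("AA 12\rBB 34\r"), "\n";   # prints AA 12\x0DBB 34\x0D
```

A carriage return is 0x0D, which is inside the ASCII range, so removing non-ASCII characters with s/[^[:ascii:]]//g or tr/\x80-\xFF//d would leave it untouched.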

Obviously, one answer is "Do not copy text from pdf files!", but I hope someone can help me out with a Perl solution. My workaround at the moment is to read the contents of the file into a matrix in R (the language) and then export that matrix to a file, which Perl then has no trouble reading one line at a time.
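The symptoms above (everything printed as one record once \r is stripped) would fit a file whose lines end in a bare carriage return with no line feed, in which case <FH> sees the whole file as a single line. Assuming that is the cause, a possible approach is to split the text on any of the common line terminators. This is a sketch only: split_lines is a hypothetical name and the sample string stands in for the real file contents.

```perl
use strict;
use warnings;

# Hypothetical helper: split a buffer into lines on CRLF, bare LF or bare CR.
sub split_lines {
    my ($text) = @_;
    # \r\n is tried first so a CRLF pair is not treated as two line breaks.
    my @lines = split /\r\n|\n|\r/, $text;
    return @lines;
}

# Stand-in for the file contents, assuming bare-CR line endings.
my $sample = "AA 12\rBB 34\rCC 56\r";

print ">$_<\n" for split_lines($sample);
# prints:
# >AA 12<
# >BB 34<
# >CC 56<
```

For a real file, the buffer could be obtained by slurping (local $/; my $text = <$fh>;). If the endings are known to be bare carriage returns, setting local $/ = "\r" before the while loop and calling chomp on each record should also work.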


In reply to Dealing with non-ascii characters when reading file. by rjbioinf
