Create a hex dump of the problematic file, and post a relevant part of it.

If you are on Unix (Linux, BSD, Mac OS X), try the command od -tx1 -c filename.txt:

>find . -name '*.txt' -print -exec od -tx1 -c {} \;
./linux-file.txt
0000000  41  20  73  69  6d  70  6c  65  20  66  69  6c  65  20  67  65
          A       s   i   m   p   l   e       f   i   l   e       g   e
0000020  6e  65  72  61  74  65  64  0a  6f  6e  20  4c  69  6e  75  78
          n   e   r   a   t   e   d  \n   o   n       L   i   n   u   x
0000040  20  77  69  74  68  20  55  6e  69  78  0a  6c  69  6e  65  20
              w   i   t   h       U   n   i   x  \n   l   i   n   e
0000060  65  6e  64  69  6e  67  73  2e  0a
          e   n   d   i   n   g   s   .  \n
0000071
./windows-file.txt
0000000  41  20  73  69  6d  70  6c  65  20  66  69  6c  65  20  67  65
          A       s   i   m   p   l   e       f   i   l   e       g   e
0000020  6e  65  72  61  74  65  64  0d  0a  6f  6e  20  57  69  6e  64
          n   e   r   a   t   e   d  \r  \n   o   n       W   i   n   d
0000040  6f  77  73  20  77  69  74  68  20  57  69  6e  64  6f  77  73
          o   w   s       w   i   t   h       W   i   n   d   o   w   s
0000060  0d  0a  6c  69  6e  65  20  65  6e  64  69  6e  67  73  2e  0d
         \r  \n   l   i   n   e       e   n   d   i   n   g   s   .  \r
0000100  0a
         \n
0000101
./mac-file.txt
0000000  41  20  73  69  6d  70  6c  65  20  66  69  6c  65  20  67  65
          A       s   i   m   p   l   e       f   i   l   e       g   e
0000020  6e  65  72  61  74  65  64  0d  6f  6e  20  57  69  6e  64  6f
          n   e   r   a   t   e   d  \r   o   n       W   i   n   d   o
0000040  77  73  20  77  69  74  68  20  4f  6c  64  20  4d  61  63  0d
          w   s       w   i   t   h       O   l   d       M   a   c  \r
0000060  6c  69  6e  65  20  65  6e  64  69  6e  67  73  2e  0d
          l   i   n   e       e   n   d   i   n   g   s   .  \r
0000076
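If od is not at hand, the difference between the three dumps can also be checked with a few lines of script. What follows is only a sketch, written in Python rather than Perl, and the name count_line_endings is made up for illustration. It counts CRLF pairs, lone LF bytes and lone CR bytes in raw file data:

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF pairs, lone LF bytes, and lone CR bytes in raw file data."""
    crlf = data.count(b"\r\n")
    # Every CRLF pair contains one CR and one LF, so subtract the pairs
    # to get the bytes that stand alone.
    lf = data.count(b"\n") - crlf
    cr = data.count(b"\r") - crlf
    return {"crlf": crlf, "lf": lf, "cr": cr}


# The contents of the three sample files from the dumps above.
samples = {
    "linux-file.txt": b"A simple file generated\non Linux with Unix\nline endings.\n",
    "windows-file.txt": b"A simple file generated\r\non Windows with Windows\r\nline endings.\r\n",
    "mac-file.txt": b"A simple file generated\ron Windows with Old Mac\rline endings.\r",
}
for name, content in samples.items():
    print(name, count_line_endings(content))
```

For a real file, read it in binary mode (open(name, "rb")) so that nothing translates the newlines before they are counted. A file that reports more than one kind of line ending has mixed line endings, which is worth mentioning when you post.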

A plain ASCII file should not contain any bytes outside the range 0x20 to 0x7E, except for 0x0D and/or 0x0A for newlines. Any other byte value below 0x20 is very fishy, as is 0x7F. Bytes from 0x80 to 0xFF should not appear in ASCII files; they may indicate some other encoding, such as UTF-8 or one of the various legacy encodings.
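The byte-range rule is easy to turn into a quick check. This is a rough sketch in Python, not a polished tool; suspicious_bytes and its allowed_controls parameter are names invented here. It reports the offset and value of every byte that falls outside 0x20 to 0x7E and is not one of the allowed control bytes:

```python
def suspicious_bytes(data: bytes, allowed_controls=(0x0A, 0x0D)):
    """Yield (offset, byte) for each byte a plain ASCII text file should not contain."""
    for offset, byte in enumerate(data):
        printable = 0x20 <= byte <= 0x7E
        # Anything that is neither printable ASCII nor an allowed control
        # byte (LF and CR by default) is reported.
        if not printable and byte not in allowed_controls:
            yield offset, byte


# A NUL byte, a DEL byte, and a high byte (0xE9) hidden in otherwise plain text.
for offset, byte in suspicious_bytes(b"a\x00b\x7f\xe9\r\n"):
    print(f"offset {offset}: 0x{byte:02x}")
```

Tabs (0x09) are common in otherwise plain text files. If your data is expected to contain them, pass allowed_controls=(0x09, 0x0A, 0x0D) so they are not flagged.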

If (nearly) every second byte is 0x00, it is very likely a text file encoded in UTF-16 or UCS-2; if only every fourth byte is not 0x00, the file is probably encoded in UTF-32.
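The 0x00 pattern can be tested mechanically as well. The sketch below is a heuristic only, in Python; guess_wide_encoding and the 0.9 threshold are assumptions chosen for illustration, and it only works for text that is mostly in the ASCII range. It looks at aligned groups of four and two bytes and checks how many of them have the expected number of 0x00 bytes:

```python
def guess_wide_encoding(data: bytes, threshold: float = 0.9):
    """Guess 'UTF-32', 'UTF-16/UCS-2', or None from the pattern of 0x00 bytes."""
    if len(data) < 4:
        return None  # too little data to say anything

    # UTF-32: in each complete 4-byte group, three bytes are 0x00
    # for characters in the ASCII range.
    groups = [data[i:i + 4] for i in range(0, len(data) - 3, 4)]
    if sum(g.count(0) == 3 for g in groups) / len(groups) >= threshold:
        return "UTF-32"

    # UTF-16 / UCS-2: in each complete 2-byte pair, exactly one byte is 0x00.
    pairs = [data[i:i + 2] for i in range(0, len(data) - 1, 2)]
    if sum(p.count(0) == 1 for p in pairs) / len(pairs) >= threshold:
        return "UTF-16/UCS-2"

    return None


print(guess_wide_encoding("A simple file".encode("utf-16-le")))
print(guess_wide_encoding("A simple file".encode("utf-32-be")))
print(guess_wide_encoding(b"A simple file"))
```

The threshold is below 1.0 because a byte order mark or the odd non-ASCII character breaks the pattern, hence the "(nearly)" above. The result is a hint about what to try next, not proof of the encoding.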

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

In reply to Re^5: Regular expressions across multiple lines by afoken
in thread Regular expressions across multiple lines by abcd
