in reply to Re^2: hexdump/od/perl question
in thread hexdump/od/perl question

You're awesome, thanks a ton! $line =~ tr/\375//d; works fine.
So you only prefix octals with a 0 if they are 2 digits. Grr... Perl not doing what I want again!
Could you please now show how this would be done if I wanted to address it in hex notation from the start, i.e. which tool to use and how to target the byte within Perl, just so I don't have to revisit this topic later? Thanks again.


Evan Carroll
www.EvanCarroll.com

Re^4: hexdump/od/perl question
by jbert (Priest) on Aug 10, 2007 at 18:39 UTC
    Use hexdump -C for more (imho) readable output. If you're an emacs person, check out hexl-mode for a read-write version with a similar layout.

    Also, it looks like the data you are scraping is unicode. &#65533 (U+FFFD) appears to be the Unicode replacement character.

    If you're dealing with unicode data, you may want to be a bit more careful and convert characters rather than stripping them out, but it depends on your environment and application.

      Dude - hexl-mode rocks. I've never seen that before - thanks!
Re^4: hexdump/od/perl question
by ikegami (Patriarch) on Aug 10, 2007 at 18:16 UTC

    0375 = 0xFD, so "\xFD".
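
    As a minimal sketch of the same deletion written with a hex escape (the contents of $line are a made-up sample, not the original data): the octal escape "\375" and the hex escape "\xFD" name the same byte. An octal escape in a Perl string takes at most three digits, so adding a leading 0 changes the meaning rather than just marking the number as octal.

```perl
use strict;
use warnings;

# Hypothetical sample line containing the stray byte (octal 375 = hex FD).
my $line = "price\375: 10";

# "\375" and "\xFD" are two spellings of the same single byte.
my $same = "\375" eq "\xFD";

# An octal escape consumes at most three digits, so "\0375" is the
# two-character string "\037" followed by a literal "5".
my $len = length "\0375";

# tr/// accepts the hex escape directly; /d deletes the matched byte.
(my $clean = $line) =~ tr/\xFD//d;

printf "same=%d len=%d clean=%s\n", $same, $len, $clean;
```

    Run as-is, this should print "same=1 len=2 clean=price: 10".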

    You actually saw it in od, although it was grouped with a dash ("\x2D").

    I can't play with od at the moment, so I don't know if it can display individual hex bytes instead of grouping them into 16-bit words.
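
    On systems with a POSIX od, od -t x1 should print one hex byte per unit instead of 16-bit words. If you would rather stay inside Perl, a byte-by-byte hex view is a few lines. This is only a sketch: hex_bytes is a hypothetical helper name, and it assumes the string holds raw bytes (no characters above 0xFF).

```perl
use strict;
use warnings;

# Return the bytes of a string as space-separated two-digit hex values,
# one value per byte, with no grouping into 16-bit words.
# Assumes $bytes is a byte string (every character is <= 0xFF).
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $bytes);
}

# Hypothetical sample: the stray byte followed by a dash.
print hex_bytes("\xFD-"), "\n";
```

    For the sample above this prints "FD 2D", i.e. the two bytes that od had grouped into one word.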

Re^4: hexdump/od/perl question
by graff (Chancellor) on Aug 11, 2007 at 15:42 UTC
    Using hex notation for octets and characters is just "better" than octal (or decimal), IMHO -- more consistent, less confusing, easier to understand and keep track of.

    BTW, if your web scraping, etc. is really giving you strings that contain � (a.k.a. "\x{fffd}", the unicode "replacement" character), this would be a symptom of something gone wrong, either in what the content provider (web service) is giving you, or else in what you are doing with the data once you get it.

    That character is used when there is a conversion from some non-unicode encoding into unicode (or from one style of unicode to another, e.g. UTF16 to UTF8), but the input data contained a byte (or byte sequence) that is "unmappable" (unknown or invalid) for the stated input encoding.

    Also, it could be worrisome that your various attempts to "visualize" the data yielded just "0xFD". If the input really contained �, I would expect to see either a three-byte utf8 sequence ("\xEF\xBF\xBD"), or a two-byte utf16 sequence ("\xFF\xFD" or "\xFD\xFF", depending on whether the data was big- or little-endian).
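
    The byte sequences mentioned above can be checked with the core Encode module. A small sketch (the %hex hash is just an illustrative holder for the results):

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+FFFD, the Unicode replacement character.
my $char = "\x{FFFD}";

# Encode it three ways and format the resulting bytes as hex.
my %hex;
for my $enc ('UTF-8', 'UTF-16BE', 'UTF-16LE') {
    my $octets = encode($enc, $char);
    $hex{$enc} = join ' ', map { sprintf('%02X', ord($_)) } split(//, $octets);
    print "$enc: $hex{$enc}\n";
}
```

    This should report EF BF BD for UTF-8, FF FD for UTF-16BE and FD FF for UTF-16LE, so a lone FD in a dump is not a complete encoding of that character in any of the three.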

    (update: OTOH, if the original data contains just "\xFD", and that's what you see in a hex dump of the original data, then you'll want to know what the content provider "means" by that value -- i.e. what character encoding they are using -- and make sure you interpret/decode it correctly. The "\x{fffd}" could be the result of one of your processes trying to convert "\xFD" to unicode the wrong way.)
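
    To illustrate the update, here is a sketch of how the same byte decodes under two different assumptions. ISO-8859-1 is only a guess made for the example; the real encoding is whatever the content provider actually uses.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The single byte seen in the hex dump.
my $raw = "\xFD";

# Assumption for illustration only: the provider meant ISO-8859-1 (Latin-1),
# where byte 0xFD is U+00FD (y with acute accent).
my $as_latin1 = decode('ISO-8859-1', $raw);

# Decoding the same byte as UTF-8 is the "wrong way" in that case: a lone
# 0xFD is malformed UTF-8, and Encode's default handling substitutes U+FFFD.
my $as_utf8 = decode('UTF-8', $raw);

printf "latin1: U+%04X\n", ord $as_latin1;
printf "utf8:   U+%04X\n", ord $as_utf8;
```

    With those assumptions the first line shows U+00FD and the second U+FFFD, which is one way a stray "\xFD" can turn into the replacement character.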

      I think I should clarify the confusion pertaining to the &#65533 stuff =). That character was inserted when I copied the verbatim output of the funky char (\375) to pm -- because pm isn't unicode, and my terminal was... Or that's my guess.
      Anyway, my problem with web scraping is because HTML::TreeBuilder encodes   as some funky encoded html_entity thingy, and that always bites me. I often just want to remove them for simplicity rather than decode the entities and fumble with the complexities of the module.


      Evan Carroll
      www.EvanCarroll.com