Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I want to search for matches between these weird box characters. Every message is delimited by these and there are thousands of messages. How would I create a regex to match the text in between these boxes? What are these boxes?

Text:
 FAILURE: NOCALLORIG 02/08/07 14:22:18 #890071 


How would I match?
my $match = m/^(*.?)/gS;

Re: Searching between wierd box characters.
by graff (Chancellor) on Feb 08, 2007 at 21:32 UTC
    What are those boxes, you ask? That's a very good question, and you'll need a very exact answer to that in order to be able to match them, but there's no way for us to answer it based on what you've told us so far.

    I'm guessing that you are using some specific window application (or maybe a browser?) to view the data, and the boxes stand for characters that this application is unable to display correctly. The problem is, there might be a broad assortment of character values that fall into this "undisplayable" category, and if you just keep using this app the same way to look at the data, you'll never be able to figure out what those characters really are.

    That's why many people use one of the various "hex dump" tools on data files, so they can see the actual numeric values of the bytes that make up these undisplayable characters, and figure out what needs to be done once they know what these characters really are (in the binary sense at least, if not in a human-language sense).

    Perl itself can be used easily to create a hex dump of the data -- something like:

    #!/usr/bin/perl
    $/ = undef;   # turn on slurp mode for reading
    $_ = <>;      # read all data into $_ (from STDIN or @ARGV file)
    while (length()) {
        @s = split //, substr( $_, 0, 16 );
        print join " ", map { sprintf "%02x", ord($_) } @s;
        print "\n";
        print join " ", map { (/[ -~]/) ? " $_" : '~~' } @s;
        print "\n\n";
        $_ = substr( $_, 16 );
    }

    That will provide the hex codes for all the bytes in your data, along with the ASCII characters for the bytes that happen to be in the printable ASCII range.

    Another thing you could do is figure out what character encoding is being assumed by your display application (whatever it is), and then see if you can find out what character encoding is represented in your data file. I expect there's a mismatch between those two encodings, and that is why you are seeing those boxes.

    (The problem also would relate to the font that your application is using, because the boxes are the symbol provided by that font to represent code points for which it does not have a displayable character glyph.)

    (update: obviously, almut has shown that you did in fact post enough information in order for someone here to tell you what your box characters are... but please be aware that you might encounter some other piece of data (which you haven't posted here yet) that will also show little boxes when you view it this way -- and it's not guaranteed that every box you see will always be a 0x19 byte value.)

Re: Searching between wierd box characters.
by Fletch (Bishop) on Feb 08, 2007 at 20:48 UTC

    Erm, I see no "weird box characters" in that (which could be an artifact of how you copied the text into your post) so I can't really tell how to match between them. At any rate, many terminal emulators will show control characters or otherwise unprintable characters as boxen. Some ideas to consider:

    • Make sure you're using the correct encoding for the text in question (perhaps it's UTF-8 or the like)
    • Maybe they're control characters (I've seen several formats which use \x1C "FS" as a field separator)
    • Something like the *NIX od utility or a hex editor can help track down exactly what octets you've got
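
    If od or a hex editor isn't handy, the same check can be sketched in Perl: tally every byte outside the printable ASCII range so the delimiter stands out. The sample string is an assumption for illustration, with \x19 standing in for whatever byte the real data contains; in practice you would read $data from your file.

```perl
use strict;
use warnings;

# Hypothetical sample; \x19 stands in for the unknown delimiter byte.
my $data = "\x19 FAILURE: NOCALLORIG 02/08/07 14:22:18 #890071 \x19\n";

# Count each character that is neither printable ASCII nor ordinary whitespace.
my %count;
$count{ sprintf "0x%02X", ord $1 }++ while $data =~ /([^\x20-\x7E\r\n\t])/g;

print "$_ occurs $count{$_} time(s)\n" for sort keys %count;
```

    For this sample the only entry reported is 0x19, seen twice.
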
Re: Searching between wierd box characters.
by almut (Canon) on Feb 08, 2007 at 21:20 UTC

    Various programs display little boxes when the font in use doesn't have a glyph for the character in question. In your case it's the control char \x19 (EM, or "End of Medium").

    The regex /^\x19(.*)\x19/s should match, if you match it against a string which holds all the lines, beginning with \x19.
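
    With thousands of messages in one string, a greedy (.*) would capture everything from the first \x19 to the last one. Here is a minimal sketch of pulling out each message separately, assuming every message is wrapped in its own pair of \x19 bytes; the second sample message is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data: two messages, each wrapped in \x19 bytes.
my $text = "\x19 FAILURE: NOCALLORIG 02/08/07 14:22:18 #890071 \x19\n"
         . "\x19 FAILURE: NOCALLORIG 02/08/07 14:25:02 #890072 \x19\n";

# [^\x19]* cannot run past the closing delimiter, so each message
# is captured on its own; /g in list context returns all captures.
my @messages = $text =~ /\x19([^\x19]*)\x19/g;

print scalar(@messages), " messages\n";
print "[$_]\n" for @messages;
```

    If the \x19 bytes turn out to be single separators between messages rather than pairs around them, split /\x19/, $text is probably the better fit.
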

    Update: A little clarification as to "How would I match?". If you write my $match = m/.../, the variable $match would hold whether the regex matched. In case you want to extract the part in between the parentheses, you'd have to write my ($match) = m/.../ to provide list context... (as often, context makes Perl behave differently). Also, both matches assume that the string you match against is in $_. If you want to match against another variable (e.g. $text), you'd have to write my ($match) = $text =~ m/.../.
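
    The context difference described above can be seen side by side in a small sketch. The variable names $matched and $capture are only illustrative, and the sample string assumes the delimiter really is \x19.

```perl
use strict;
use warnings;

# Sample message wrapped in the assumed \x19 delimiter.
my $text = "\x19 FAILURE: NOCALLORIG 02/08/07 14:22:18 #890071 \x19";

my $matched   = $text =~ /^\x19(.*)\x19/s;   # scalar context: did it match?
my ($capture) = $text =~ /^\x19(.*)\x19/s;   # list context: contents of ( )

print "matched: $matched\n";
print "capture: [$capture]\n";
```

    Here $matched ends up holding 1 (true), while $capture holds the text between the two delimiters.
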