Elijah has asked for the wisdom of the Perl Monks concerning the following question:

I have been messing around with IRC connections and parsing data, but I have run into a problem that I am not sure a regexp can take care of. Most text on IRC channels carries color/formatting codes, which come through the socket as a binary control character (displayed as a heart) followed by the number of the corresponding color, e.g. 5, 14, 12.

Currently I have the following:

$filename =~ s/.*?[^\w\.\-\_]+\d+//g;

This removes all binary formatting data and any text that is NOT part of the filename. However, if the filename begins with digits, they get cut off too, because the \d+ swallows them along with the color number.

A common string would look something like:
[]14APPS []7Some.filename.here.tar
I want to grab only the filename and remove the rest. I run into issues when the string is something like this:
[]14APPS []750_cent_album.tar

In situations like this the 50 gets stripped off also. Is there a better way of doing this?

Replies are listed 'Best First'.
Re: Remove text formatting from raw irc socket data?
by ikegami (Patriarch) on Sep 25, 2005 at 04:58 UTC

    The heart is character 3. mIRC uses "\x{03}$f" and "\x{03}$f,$b" for colours, where "$f" and "$b" are numbers from 0 to 15. Given the lack of a terminator, the best you can do is:

    $filename =~ s/\03(?:1[0-5]|[0-9])(?:,(?:1[0-5]|[0-9]))?//g;

    I don't know how mIRC handles leading 0s, or numbers greater than 15, so there may be discrepancies. There are problems in the design of the colour code (it uses neither fixed-length fields nor a terminator), so you may still have problems with filenames starting with digits. There's no way around that.

    Apparently, other clients use similar but incompatible codes for colours.
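
As a minimal sketch of how the substitution above behaves on the two strings from the question, assuming the unprintable character in them is \x03: the helper names strip_colours and last_token are hypothetical, and taking the last whitespace-separated token as the filename is an assumption about the input, not something the protocol guarantees.

```perl
use strict;
use warnings;

# Remove mIRC-style colour codes: \x03, a foreground number 0-15,
# and an optional ",background" number 0-15.
sub strip_colours {
    my ($text) = @_;
    $text =~ s/\x03(?:1[0-5]|[0-9])(?:,(?:1[0-5]|[0-9]))?//g;
    return $text;
}

# Hypothetical helper: assume the filename is the last
# whitespace-separated token once the colours are gone.
sub last_token {
    my ($text) = @_;
    my @tokens = split ' ', strip_colours($text);
    return $tokens[-1];
}

print last_token("\x{03}14APPS \x{03}7Some.filename.here.tar"), "\n";
print last_token("\x{03}14APPS \x{03}750_cent_album.tar"), "\n";
```

With colour 7 the leading "50" survives, because "75" is not a valid colour number and the pattern stops after the "7".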

Re: Remove text formatting from raw irc socket data?
by sgifford (Prior) on Sep 25, 2005 at 04:40 UTC
    A quick Google search says the color numbers are 0-15, and can be followed by a comma and a background color. Something like this should match it:
    (?:1[012345]|\d)(?:,(?:1[012345]|\d))?

    Looks like there are some other commands for bold, italic, etc. You might want to make sure your code deals with those properly, too.
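
A rough sketch of handling those other codes as well. The set of control characters below is an assumption based on common mIRC-style usage (\x02 bold, \x0F reset, \x16 reverse, \x1D italic, \x1F underline), and the name strip_formatting is hypothetical; other clients may use different codes.

```perl
use strict;
use warnings;

# Assumed mIRC-style toggles: \x02 bold, \x0F reset, \x16 reverse,
# \x1D italic, \x1F underline. None of them takes an argument.
sub strip_formatting {
    my ($text) = @_;
    # Colours first, so their numeric arguments are removed with the \x03.
    $text =~ s/\x03(?:1[0-5]|[0-9])(?:,(?:1[0-5]|[0-9]))?//g;
    # Drop the remaining toggles, plus any bare \x03 left without a number.
    $text =~ tr/\x02\x03\x0F\x16\x1D\x1F//d;
    return $text;
}

print strip_formatting("\x{02}\x{03}4,1NEW\x{0F} file.tar"), "\n";
```

Running the colour substitution before the tr/// matters: deleting \x03 first would leave the colour numbers stranded in the text.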

    Update: Fix bug ikegami pointed out (oops!).

      That won't match colour 10 or 11 properly.
Re: Remove text formatting from raw irc socket data?
by davidrw (Prior) on Sep 25, 2005 at 03:59 UTC
    Maybe match on the possible colors? Something like (5|14|12|7) instead of \d+? Note that I'm not familiar with the context here, though. Is it a limited set? Even if it is, there might still be problems: if 1 and 12 are both valid colors, then any filename starting with a 2 when the color is 1 will be very problematic.
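
The ambiguity described above can be reproduced in a few lines. This is only an illustration, assuming the 0-15 colour range mentioned elsewhere in the thread; the second filename, 9_lives.tar, is made up for contrast.

```perl
use strict;
use warnings;

my $colour = qr/\x03(?:1[0-5]|[0-9])(?:,(?:1[0-5]|[0-9]))?/;

# The sender meant colour 1 followed by "50_cent_album.tar", but "15"
# is also a valid colour, so the pattern consumes both digits.
(my $ambiguous = "\x{03}150_cent_album.tar") =~ s/$colour//g;
print "$ambiguous\n";

# Colour 1 followed by a 9: "19" is not a valid colour, so only the
# "1" is consumed and the name survives intact.
(my $safe = "\x{03}19_lives.tar") =~ s/$colour//g;
print "$safe\n";
```

Nothing in the byte stream distinguishes the two readings of "\x{03}15", so no pattern can recover the first filename without outside knowledge of the colour that was actually sent.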