in reply to Re^2: substr function
in thread substr function

Write an ordinary Perl program that filters the text, then measures its length using the built-in substr function.
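Here is a minimal sketch of that filter-then-measure approach. It assumes that "<194>" means "the single character with decimal code 194" and that every other "<...>" tag is markup to be discarded. Both are guesses about the input format, and filter_tags is a made-up helper name, not anything from the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce the tagged text to plain text.
# ASSUMPTION: <digits> stands for one character with that decimal code;
# any other <...> tag (e.g. <bold>, </bold>) is markup and is dropped.
sub filter_tags {
    my ($text) = @_;
    # One pass, so a character produced by chr() is never re-read as a tag.
    $text =~ s/<(\d+)>|<[^<>]*>/defined $1 ? chr($1) : ''/ge;
    return $text;
}

my $raw   = 'A<bold>bc<194>d</bold>e';
my $plain = filter_tags($raw);

# After filtering, the built-ins work on ordinary characters.
print length($plain), "\n";              # prints 6
print ord(substr($plain, 3, 1)), "\n";   # prints 194
```

Once the text is filtered, length and substr need no special knowledge of the tag notation.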

You're trying to solve an everyday text processing problem in a very peculiar and unconventional way. Writing a custom substr function to measure a particular kind of "string" is like inventing a bathroom scale that knows when your hair is wet and you haven't removed your shoes, yet accurately measures your bone-dry, barefoot weight.

I'm curious: Do you want to measure the lengths of the strings in bytes, in encoded characters (Unicode code points), or in real-world characters (Unicode extended grapheme clusters)?
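To show why the answer matters, here is a small sketch that measures one decoded string all three ways. The sample string and the measure helper are invented for illustration, and the byte count assumes UTF-8 as the encoding.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: returns (bytes, code points, grapheme clusters)
# for a decoded character string. ASSUMPTION: bytes means UTF-8 bytes.
sub measure {
    my ($text) = @_;
    my $bytes      = length(encode('UTF-8', $text));
    my $codepoints = length($text);
    my $graphemes  = () = $text =~ /\X/g;   # count matches of \X
    return ($bytes, $codepoints, $graphemes);
}

# "e" followed by U+0301 COMBINING ACUTE ACCENT, then "x"
my ($bytes, $codepoints, $graphemes) = measure("e\x{301}x");

print "bytes:      $bytes\n";        # prints 4
print "codepoints: $codepoints\n";   # prints 3
print "graphemes:  $graphemes\n";    # prints 2
```

The same text is 4, 3, or 2 units long depending on which unit you pick.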

CLARIFICATION: I admit I sort of conflated substr and length in this post. My excuse is that I was fixated on the words "count" and "one character" in tej's restatement of his Y problem:

Suppose i have string that contains tags like "<bold>" it should not count this tag. If string has something like "<194>" I want substr to consider it as one character.

Re^4: substr function
by ikegami (Patriarch) on Jan 13, 2011 at 16:56 UTC
    I suspect <194> refers to a character, and that this notation is used for non-ASCII characters. If so, your last question is moot.

      Huh?

      If <194> is a character entity that represents a "non-ASCII character," then this is precisely something that makes my question germane to the problem of measuring the length of the text in which the character entity occurs.

      For Unicode text, there are at least three valid, meaningful ways to measure the size of the text: in bytes, in characters (code points), and in grapheme clusters.

        Oops, it makes bytes vs encoded characters moot, but characters vs graphemes is still relevant.

        By the way, there is indeed a fourth: Some characters are double-wide, so you could also talk about visual width.