wyvern has asked for the wisdom of the Perl Monks concerning the following question:

Hey everybody. I do quite a bit of work in text formatting languages like groff and nroff, and there are some simple organizational tasks that I think should be fairly easy to automate (e.g. complex page headers/footers, tables of contents, etc.), so for the past month or so I've been toying with the idea of writing a front end for groff in Perl. Until about a year ago my primary platform was Pascal, and having written a few lexical scanners in that language I am very familiar with (and incidentally quite dependent on) the pos() function. This function returns the index of a character in a string, and is useful for determining the location of whitespace and tokens. Although I have considered doing this in lex or yacc, I think it would be a shot in the arm for my Perl skills to sit down and write a good ol' lexical scanner in this language. I'm about 90% sure that it's just some library function I'm missing, but it would be helpful if someone could point me in the right direction. Thanks.

Replies are listed 'Best First'.
Re: Character Index
by jbert (Priest) on Mar 27, 2000 at 17:50 UTC
    Perl gives you many facilities for processing strings, and it is generally (though not always) more idiomatic to use them than low-level routines like 'index'.

    Often processing using regular expressions is more suitable.

    For example, code like (not tested, you get the idea):

    # Pull token from line (not good Perl style)
    $end   = index( $line, ' ' );        # index() takes a string, not a regex
    $token = substr( $line, 0, $end );
    $line  = substr( $line, $end );      # still need to lose whitespace
    You could instead do...:
    ( $token, $line ) = split( /\s+/, $line, 2 );
    or even:
    ( $token, $line ) = $line =~ /^(\S+)\s+(.*)$/;
    It perhaps looks a little funnier if you are used to other languages and you are of course free to code how you want. But, IMHO you don't get the warm fuzzy perl feeling unless you are in the idiom.

    Now someone is going to tell me how I should be doing the above in a much more efficient way ;-)

    Hmmm...last thought is that if you don't have to deal with quoting issues (or you do and you are red hot at regexps ;-) you get to do things like:

    @tokens = split( /\s+/, $line );    # Split all words into the array
    foreach my $word ( @tokens ) {
        # drive state machine
    }
    And lastly, if you do have to deal with quoted whitespace etc. (isn't the real world a tough place), you might find what you need in the Text::ParseWords module. (You shouldn't even have to go to CPAN for that...it's part of the base install.)
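
    A minimal sketch of the Text::ParseWords route (not from the original post; the input line is made up for illustration). shellwords() splits on whitespace but keeps quoted strings together and strips the quotes:

    ```perl
    use strict;
    use warnings;
    use Text::ParseWords qw(shellwords);

    # Hypothetical input: a groff-like request with a quoted argument.
    my $line = q{.header "Chapter One" left 12};

    # Split on whitespace, treating the quoted string as a single token.
    my @tokens = shellwords($line);

    print "$_\n" for @tokens;
    ```

    If you need a delimiter other than whitespace, or want to keep the quotes, the same module also exports quotewords().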
Re: Character Index
by chromatic (Archbishop) on Mar 27, 2000 at 07:31 UTC
    How about index?
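
    A quick sketch of how index behaves, for anyone coming from Pascal (the sample string is made up). Note that index is 0-based and returns -1 when nothing is found, whereas Pascal's Pos is 1-based and returns 0:

    ```perl
    use strict;
    use warnings;

    # Hypothetical sample line with a double space before the last word.
    my $line = "token1 token2  token3";

    my $first = index( $line, ' ' );              # first space
    my $next  = index( $line, ' ', $first + 1 );  # search again, starting after it
    my $last  = rindex( $line, ' ' );             # last space, searching backwards
    my $none  = index( $line, '#' );              # not present, so -1

    # Everything before the first space is the first token.
    my $token = substr( $line, 0, $first );

    print "first=$first next=$next last=$last none=$none token=$token\n";
    ```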

    Alternately, split may be useful, as are general regexes.
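
    For the regex route, here is one possible sketch of a small scanner (the token classes and the tokenize name are invented for illustration, not anything groff-specific). It relies on Perl's own pos(), which reports where the last m//g match on a string left off, together with the \G anchor and the /c flag so that a failed match does not reset the position:

    ```perl
    use strict;
    use warnings;

    # Hypothetical token classes; a real groff front end would define its own.
    sub tokenize {
        my ($text) = @_;
        my @tokens;
        pos($text) = 0;                       # start scanning at the beginning
        while ( pos($text) < length $text ) {
            my $start = pos($text);           # offset of the token about to be read
            if    ( $text =~ /\G\s+/gc )               { next }    # skip whitespace
            elsif ( $text =~ /\G(\d+)/gc )             { push @tokens, [ NUMBER => $1, $start ] }
            elsif ( $text =~ /\G(\.?[A-Za-z_]\w*)/gc ) { push @tokens, [ WORD   => $1, $start ] }
            elsif ( $text =~ /\G(.)/gcs )              { push @tokens, [ OTHER  => $1, $start ] }
        }
        return @tokens;
    }

    # Each token is [ type, text, starting offset ].
    printf "%-6s %-8s at %d\n", @$_ for tokenize(".sp 2");
    ```

    Each branch either consumes at least one character or falls through to the next, and the last branch accepts any single character, so the loop always advances.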