in reply to Easy Things

I'm not too sure I have a very clear view of what the skillset of the "average programmer" looks like. I'm fairly sure if you took the average skill level of everyone who works with Perl, it would be pretty low because of all the poor CGI programming out there. (I'm also sure it'd be worse without the Monastery, where everyone tries to improve their skills.)

I imagine the spirit of your post could be phrased as something like "What little corner of Perl have you worked hard to master that other Perl programmers might have overlooked?"

So, in the spirit of that question, my work in Corpus Linguistics has forced me to become a fair hand at data munging. We constantly need to get data from some raw form into something that we can run through a natural language tokenizer, then through a tagger, then do some post-tagging cleanup (parsing), and then export our data to a MySQL database.

I recently rewrote our tokenizer using Parse::RecDescent. The net result is that the tokenization is slower, but the accuracy and recall are much higher. A lot of that improvement had to do with re-thinking/re-factoring the algorithms I was using to do this.

Another thing that is really necessary and is often overlooked (or neglected) by Perl programmers--especially in the US--is dealing with various encodings that are not ISO-8859-1. It saves us work in the long run if our scripts handle ISO-8859-15 and UTF-8 just as well. So perllocale is one thing I've really had to master.
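
To make the encoding point concrete, here is a minimal sketch of why the distinction matters. It uses the core Encode module rather than locales, and the helper name to_utf8 is made up purely for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not from our real scripts): take raw bytes in a
# known source encoding and return the same text as UTF-8 bytes.
sub to_utf8 {
    my ($bytes, $encoding) = @_;
    my $text = decode($encoding, $bytes);   # bytes -> Perl characters
    return encode('UTF-8', $text);          # characters -> UTF-8 bytes
}

# Byte 0xA4 is the euro sign in ISO-8859-15, but the generic
# currency sign in ISO-8859-1.
my $raw = "100 \xA4";

my $as_latin9 = to_utf8($raw, 'iso-8859-15');
my $as_latin1 = to_utf8($raw, 'iso-8859-1');

# Show the resulting bytes as hex so the difference is visible
# regardless of the terminal's own encoding.
print "ISO-8859-15: ", unpack('H*', $as_latin9), "\n";
print "ISO-8859-1:  ", unpack('H*', $as_latin1), "\n";
```

The same input byte comes out as two different characters depending on which encoding you assume, which is exactly the sort of thing that quietly corrupts a corpus if the script just assumes ISO-8859-1.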

--
Allolex

Re:x2 Easy Things (a plug for Parse::YAPP)
by grinder (Bishop) on Jan 26, 2004 at 16:45 UTC

    The (Perl) programmers I know who do lots of parsing tend to avoid Parse::RecDescent because it's so slow. They all use Parse::YAPP which is much faster. You might want to give that a spin instead. It's meant to be yacc-compatible, so if you're familiar with that already then you're in good company.

    It would be interesting to see a real-world comparison between the two. (hint hint :-)

      Yes, indeed. :) I think both bart and Corion suggested I use Parse::YAPP, so you're in good company. The tokenizer was my project for last week and I've got another project for the next fortnight, but I will definitely give it a burl when I get the chance. After all, it works (and we can leave it running overnight). ;)

      I'm not really from the yacc/bison/(f)lex crowd, but I am familiar with Benchmark, so maybe I'll take a hint and grant your wish.

      --
      Allolex