in reply to regexp text parsing issue.

Regular expressions have their uses. SGML parsing is not one of them. You've already found one of the situations where they simply don't work. (Also, try embedded tables.) It's even worse when you try to deal with badly formatted HTML, and there's a whole lot of it out there, thanks to incorrectly written WYSIWYG editors and 'webmasters' who have no idea what HTML is.

Would you care to explain your reasons for not wanting to use existing parsers? It's possible that there are other ways to solve your problem.

(I'd personally try to build a tree, if I knew I was always going to be working with well-formed SGML, but you haven't even mentioned why you're trying to do this.)
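
A minimal sketch of that tree-building idea, using only core Perl. It assumes the input really is well formed: every tag is closed and properly nested, there are no comments or doctype declarations, and no '>' appears inside an attribute value. The sub name build_tree and the node layout (tag, attrs, children) are made up for illustration; this is not how any existing parser module works.

```perl
use strict;
use warnings;

# Build a nested tree from markup that is ASSUMED to be well formed.
# Each node is a hash: { tag => ..., attrs => raw attribute text, children => [...] }.
# Text between tags is stored as plain strings in the parent's children list.
sub build_tree {
    my ($markup) = @_;
    my $root  = { tag => '#root', attrs => '', children => [] };
    my @stack = ($root);

    # Tokenize from left to right: either one tag or one run of text.
    while ($markup =~ m{\G(?:<(/?)([A-Za-z][\w:-]*)([^>]*?)(/?)>|([^<]+))}gc) {
        my ($closing, $tag, $attrs, $selfclose, $text) = ($1, $2, $3, $4, $5);
        if (defined $text) {
            # Keep text unless it is only whitespace.
            push @{ $stack[-1]{children} }, $text if $text =~ /\S/;
        }
        elsif ($closing) {
            # A closing tag must match the innermost open element.
            die "unexpected </$tag>\n"
                if @stack < 2 || $stack[-1]{tag} ne lc $tag;
            pop @stack;
        }
        else {
            my $node = { tag => lc $tag, attrs => $attrs, children => [] };
            push @{ $stack[-1]{children} }, $node;
            push @stack, $node unless $selfclose;   # <br/> has no children
        }
    }

    my $pos = pos($markup) || 0;
    die "cannot tokenize input at offset $pos\n" if $pos < length $markup;
    die "unclosed <$stack[-1]{tag}>\n"          if @stack > 1;
    return $root;
}

# Print the tree with one level of indentation per nesting depth.
sub dump_tree {
    my ($node, $depth) = @_;
    for my $child (@{ $node->{children} }) {
        if (ref $child) {
            print '  ' x $depth, "<$child->{tag}>\n";
            dump_tree($child, $depth + 1);
        }
        else {
            print '  ' x $depth, "text: $child\n";
        }
    }
}

# A table embedded in a table cell, the case that trips up a single regexp.
my $tree = build_tree(
    '<table><tr><td>outer</td><td><table><tr><td>inner</td></tr></table></td></tr></table>'
);
dump_tree($tree, 0);
```

The stack is what handles nesting: an inner table is simply pushed on top of the outer one and popped when its own closing tag arrives. The sketch dies on anything it does not understand (for example an unclosed <br>), which is the right behaviour only under the well-formedness assumption.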

Re^2: regexp text parsing issue.
by Anonymous Monk on Mar 19, 2005 at 01:37 UTC
    Would you care to explain your reasons for not wanting to use existing parsers? It's possible that there are other ways to solve your problem.
    Sure. I really do not want to force users to install a third-party module just to use the application. If the HTML parser were part of every standard Perl distribution, that might be a possibility. I am not trying to parse other people's web pages, so I am confident that the HTML I am trying to parse will be the same every time. The HTML is generated by my CGI script.

      You're trying to parse something that you're generating from a CGI? So you have control of what's being generated in the first place. Then why are you using HTML, which is difficult to parse? Generate an alternate output that can be parsed more easily (or used directly by whatever it is that you're trying to do).

      This is exactly what SOAP, WDDX, XML, and all those other acronyms are for. (They do have some overhead, but you're sure to get your data across cleanly.) Here's another simple way to pass data out of your CGI:

      use Data::Dumper;
      print "Content-type: text/plain\n\n", Dumper($my_data);

      CGIs don't have to generate HTML. XML can be your friend. So can plain text, when used right (tab-delimited, CSV, etc.).
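
      As a rough sketch of the tab-delimited option, using nothing outside core Perl: the record data, the column names, and the sub names tsv_line and tsv_response are invented for illustration, and the choice to replace embedded tabs and newlines with spaces is an assumption about what the consumer can tolerate.

```perl
use strict;
use warnings;

# Hypothetical records; in a real CGI these would come from the application.
my @columns = qw(name qty note);
my @records = (
    { name => 'widget', qty => 3,  note => 'plain' },
    { name => 'gadget', qty => 12, note => "has\ttab" },
);

# Build one tab-delimited line. Tabs and line breaks inside a value are
# replaced with spaces so they cannot be mistaken for delimiters.
sub tsv_line {
    my @fields = map { defined $_ ? $_ : '' } @_;
    s/[\t\r\n]/ /g for @fields;
    return join("\t", @fields) . "\n";
}

# Build the whole response: header, blank line, column names, one line per record.
sub tsv_response {
    my ($columns, $records) = @_;
    my $out = "Content-type: text/plain\n\n";
    $out .= tsv_line(@$columns);
    $out .= tsv_line(@{$_}{@$columns}) for @$records;
    return $out;
}

print tsv_response(\@columns, \@records);
```

      On the receiving side, each line after the header row can be taken apart with chomp and split /\t/, -1 (the -1 keeps trailing empty fields), so no HTML parsing is needed at all.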

        Can't do that. The application I am developing for is already mature and a rewrite is not possible. Isn't there anyone who has an idea of how to actually do this, instead of suggesting a workaround?

        I can't change the format, and I can't require the user to install third-party modules. That is what I am dealing with.