Parsing SGML-ish Data Files

coolmichael has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, it's been a while (only seven years!).

I'm working on a project to parse, validate, and transform some large data files which look like SGML tags, but most definitely are not. For example, <FOO=4> is perfectly valid and well defined, and the tags do not have to nest properly as they do in SGML and XML. I've tried the SGML:: tools on CPAN, but they don't quite work. I've also tried HTML::Parser, but it chokes on attributes which use "smart quotes" (0x201D in Unicode).

I've written a pure Perl finite state machine parser (and test suite) which creates a data structure I can validate, but it is very slow: about 45 seconds on a 900 KB file. The bottleneck is the parsing phase, so I'd like to speed that up somehow.

I've squeezed as much performance out of it as I can with Devel::NYTProf, but I think if I want to get it down to 10 seconds a file I need to rewrite the parser somehow. I could go the C/XS route for it, but that would be a massive learning curve.

I haven't tried Parse::RecDescent yet or Parse::Yapp. What are your thoughts on them?

If you were writing a parser for something SGMLish (but not SGML), where would you start?


Replies are listed 'Best First'.
Re: Parsing SGML-ish Data Files by mirod (Canon) on Aug 02, 2012 at 06:47 UTC
It is pretty difficult to answer you with the little information you give us, but I'll try anyway ;--( If the data is not SGML, XML or HTML, I don't think you should try to use SGML/XML/HTML tools on it. SGML and XML tools will simply not accept the data, and HTML tools will try their best, but their guess may not be what you expect. It really depends on the format of your data files, but I would probably try to first convert the data to XML, using regexps, and then use XML tools, which are usually pretty fast. But that's because I am used to processing XML, and my output is usually either XML or HTML, so an XML transformation gives me the result I want. Also, the problem with your finite state machine may be the tokenizer. If you scan the input one character at a time, C-style, this may not be optimal; tokenizing using regexps may be faster.
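A minimal sketch of what regexp-based tokenizing could look like, since the actual file format is not shown in this thread. The token shapes (`<NAME>`, `</NAME>`, `<NAME=VALUE>`, and plain text between tags) and the `tokenize` function name are assumptions made up for illustration, not the real format. The idea is to let `\G` with `/gc` consume a whole tag or a whole run of text per match, rather than stepping through the buffer one character at a time in Perl code.

```perl
use strict;
use warnings;

# Hypothetical token shapes; the real format may differ:
#   <NAME>   </NAME>   <NAME=VALUE>   plain text between tags
sub tokenize {
    my ($buf) = @_;
    my @tokens;
    pos($buf) = 0;
    while (pos($buf) < length $buf) {
        # \G anchors at the current position; /c keeps pos() on failure
        if ($buf =~ /\G<(\/?)([A-Za-z][\w.-]*)(?:=([^>]*))?>/gc) {
            push @tokens, {
                type  => ($1 ? 'close' : 'open'),
                name  => $2,
                value => $3,    # undef when the tag has no =VALUE part
            };
        }
        elsif ($buf =~ /\G([^<]+)/gc) {
            # A whole run of text is consumed in a single match
            push @tokens, { type => 'text', text => $1 };
        }
        else {
            # A '<' that does not start a recognizable tag: keep it as text
            $buf =~ /\G(.)/gcs;
            push @tokens, { type => 'text', text => $1 };
        }
    }
    return \@tokens;
}

# Nesting is not checked here, so overlapping tags are simply
# reported in the order they appear.
my $tokens = tokenize('<a><b>text</a></b><FOO=4>');
for my $t (@$tokens) {
    my $label = defined $t->{name} ? $t->{name} : $t->{text};
    print "$t->{type} $label\n";
}
```

Because the tokenizer only emits a flat list of open, close, and text tokens, it makes no assumption about proper nesting; any validation or conversion to XML would happen in a later pass over that list. Whether this is actually faster than a character-at-a-time scan would need to be measured on real data.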
Re^2: Parsing SGML-ish Data Files by coolmichael (Deacon) on Aug 15, 2012 at 19:27 UTC
Well, the end goal is converting to XML. Regular expressions for the conversion aren't going to work very well, as the tags aren't properly nested as they are in XML/HTML/SGML. For example, `<a><b></a></b>` is considered valid. I do think the speed problem is in the tokenizer. I am doing the scan one character at a time (from a buffer in memory, at least). I'm not sure how I could do that with regular expressions, but it's a good idea to look into.
Re^3: Parsing SGML-ish Data Files by GrandFather (Saint) on Aug 16, 2012 at 03:29 UTC
If you can show us enough of the actual structure of the data and describe the constraints on tags, attributes, etc., we should be able to at least sketch a regex-based solution or offer other alternatives for you.
True laziness is hard work
Re: Parsing SGML-ish Data Files by Anonymous Monk on Aug 01, 2012 at 22:09 UTC
"Like 45 seconds on a 900Kb file." I beat you. This can take 40 seconds on a 44-char string :) "The bottleneck is the parsing phase, so I'd like to speed that up somehow." I hear Marpa::XS is good for speed if you can write a BNF grammar for your language, but I've no practical experience with it :|