Parsing pseudo XML files

BMaximus has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Parsing pseudo XML files by Masem (Monsignor) on Dec 17, 2001 at 23:26 UTC
One idea (thought process):

Until you empty the string:
- Strip leading whitespace.
- If the string starts with a '<', grab up to the next '>' (use a non-greedy match) and store that into an array.
- Otherwise, grab up to the next '<' and put that into the array.
- Repeat.

You should now have an array like (for the above) `( '<Reference1>', '<reference1_name>', 'jvdsj', ...)`.

Now, with a fresh hash, repeat until the array is empty:
- If the top element starts with < but not with </, then create a key with the text of that tag, and create a new hash for it. Push that text onto a stack array, and the hash onto a stack of hashes. (That is, at all times, your current level is denoted by the `[-1]` element of the stack array, and the hash to insert into is the `[-1]` element of the stack of hashes.)
- If the element starts with </, then pop the last items off the two stack arrays until the `[-1]` of the stack array is equal to the rest of this current tag. If you pop all the elements off, you've got a malformed element.
- Otherwise, put the element into the `[-1]` hash of the stack of hashes, say in the 'data' element.

This assumes that leading whitespace in the data itself is unneeded (though that can be worked around), and that < will not appear unescaped in the data.

Update: I'm watching other responses to this node, but I'm wondering if a very lightweight XML parser module based on this idea might be worth it. I see that someone's pointed out XML::SAX, which appears not to require external libs, but getting an XML parser out of it appears to require 3 additional modules. The above pseudo code, on the other hand, could be made into a single module, say, XML::PureSimple, which would not handle bad XML gracefully, but could be used to handle anything that follows the basic XML patterns. If anyone thinks there might be a use for such a module, drop me a msg or similar.

-----------------------------------------------------
Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain
"I can see my house from here!"
It's not what you know, but knowing how to find it if you don't know that's important
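The two passes Masem describes (split the string into tags and data, then walk the pieces with a pair of stacks) might look something like the sketch below. This is only one reading of the pseudo code, not Masem's own implementation: the sub names `tokenize` and `build_tree` are made up for illustration, tag names are stored without their angle brackets, and popping the matching tag after the "pop until it matches" step is an assumption about what was intended. Attributes, comments and self-closing tags are not handled.

```perl
use strict;
use warnings;

# Pass 1: break the input into a flat list of tags and data chunks.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    while (length $text) {
        $text =~ s/^\s+//;                       # strip leading whitespace
        last unless length $text;
        if ($text =~ s/^(<[^>]*>)//) {           # a tag: grab up to the next '>'
            push @tokens, $1;
        }
        elsif ($text =~ s/^([^<]+)//) {          # data: grab up to the next '<'
            (my $data = $1) =~ s/\s+$//;         # drop trailing whitespace too
            push @tokens, $data;
        }
        else {
            die "Unterminated tag near: $text\n";
        }
    }
    return @tokens;
}

# Pass 2: walk the tokens, keeping a stack of open tag names and a
# parallel stack of hashes; $hashes[-1] is always the current level.
sub build_tree {
    my @tokens = @_;
    my %root;
    my @names;
    my @hashes = (\%root);
    for my $tok (@tokens) {
        if ($tok =~ m{^</\s*([^>\s]+)\s*>$}) {           # end tag
            my $name = $1;
            while (@names && $names[-1] ne $name) {      # unwind to the match
                pop @names;
                pop @hashes;
            }
            die "Malformed input: no open tag for </$name>\n" unless @names;
            pop @names;                                  # close the match itself
            pop @hashes;
        }
        elsif ($tok =~ m{^<\s*([^>\s/]+)[^>]*>$}) {      # start tag
            my %child;
            $hashes[-1]{$1} = \%child;
            push @names,  $1;
            push @hashes, \%child;
        }
        else {                                           # plain data
            $hashes[-1]{data} = $tok;
        }
    }
    return \%root;
}

my $tree = build_tree(tokenize(
    '<Reference1> <reference1_name>jvdsj</reference1_name></Reference1>'
));
print $tree->{Reference1}{reference1_name}{data}, "\n";   # prints "jvdsj"
```

Since each tag name is used as a hash key, a repeated tag at the same level would overwrite the earlier one; storing arrays of hashes instead would be one way around that.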
Re: Re: Parsing pseudo XML files by clintp (Curate) on Dec 18, 2001 at 00:14 UTC
On a quest for a pure-perl replacement for XML::Parser I made this post to comp.lang.perl.misc which closely follows this strategy (follow the thread for a small bug fix). The OP might want to look at this for an example of how you might go about writing such a beast. There are better pure-perl alternatives, I know. This is just a short example of one that works.
Re: Parsing pseudo XML files by mirod (Canon) on Dec 18, 2001 at 01:45 UTC
OK, I guess it's time to put on my XML-Ayatollah hat once again. You have 2 choices here:

If the data you want to process is really XML, then please, please, please, don't write your own parser but use either XML::SAX::PurePerl (a SAX parser) or XML::Parser::Lite (emulates XML::Parser, just check it thoroughly as it is used to parse SOAP messages in SOAP::Lite, which does not cover all of the XML spec).

If the data is not XML (does it include & or < or HTML entities? Are tag names sometimes not valid XML names? What about encoding: does it include some latin1 encoded accented characters?) then you can write your own parser (as an XML parser would complain and die), but then please, please, please don't call it XML. It will save you a huge amount of trouble when someone or something that expects XML tries using it.

XML is not the answer for everything. If this is an internal format then it's fine for it to be anything you want. But if you want to use XML then do it right from the start. It will pay off in the long run. And remember that there are tools that can help you create valid XML, such as XML::Writer, or databases' XML export utilities.
Re: Parsing pseudo XML files by joealba (Hermit) on Dec 17, 2001 at 23:39 UTC
Not elegant, but kinda cute. It may work, depending on how strictly your data conforms to this format (minus the typo you had at reference1_title).

    use Data::Dumper;

    my $line = qq(<Reference1> <reference1_name>jvdsj</reference1_name> <reference1_address>1234 gjrdkjpigkdj jkgpifodsjgi</reference1_address> <reference1_title>njhdaslj</reference1_title> <reference1_company> jhfdsalh</reference1_company> <reference1_csz>Los Angeles, CA,91406</reference1_csz><reference1_phone> 818-555-1212</reference1_phone> <reference1_email>wabbit\@acme.com</reference1_email></Reference1>);

    # split on the end tags; after reverse, each captured tag name is a key
    # and the text that preceded it is the value
    my %record = reverse split /<\/(\w+)>/, $line;
    foreach (keys %record) {
        $record{$_} =~ s/<[^>]+>//g;   # remove start tags
        $record{$_} =~ s/^\s+//;       # remove extra whitespace
        $record{$_} =~ s/\s+$//;
        delete $record{$_} unless $record{$_};   # kill the outermost record tag
    }
    print Dumper(\%record);

PRINTS:

    $VAR1 = {
        'reference1_name' => 'jvdsj',
        'reference1_address' => '1234 gjrdkjpigkdj jkgpifodsjgi',
        'reference1_title' => 'njhdaslj',
        'reference1_company' => 'jhfdsalh',
        'reference1_csz' => 'Los Angeles, CA,91406',
        'reference1_phone' => '818-555-1212',
        'reference1_email' => 'wabbit@acme.com'
    };

BTW, sorry about the unnecessary 'consider' for code tags -- I need more coffee.

Updated: Thanks, ar0n. Blame it on the coffee again.
(ar0n) Re (2): Parsing pseudo XML files by ar0n (Priest) on Dec 17, 2001 at 23:48 UTC
It'd probably be more efficient to write this: `$record{$_} =~ s/<.*?>//g; # remove start tag` as this: `$record{$_} =~ s/<[^>]+>//g;` as `[^>]` will match anything but a '>' (this way it doesn't have to backtrack).

[ ar0n -- want job (boston) ]
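For anyone who wants to check ar0n's suggestion on their own data, a rough sketch along these lines compares the two substitutions and, optionally, times them with the core Benchmark module. The sub names and the sample string are invented for this illustration, and whether the speed difference is measurable will depend on the Perl version and the input.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample value, shaped like the data in this thread.
my $sample = ' <reference1_csz>Los Angeles, CA,91406';

# Lazy version: '.*?' grows one character at a time, testing for '>' at each step.
sub strip_lazy {
    my ($s) = @_;
    $s =~ s/<.*?>//g;
    return $s;
}

# Negated character class: takes everything that is not '>' in a single run.
sub strip_negated {
    my ($s) = @_;
    $s =~ s/<[^>]+>//g;
    return $s;
}

print "same result\n" if strip_lazy($sample) eq strip_negated($sample);

# Call compare_speed() to time the two variants for about a CPU second each.
sub compare_speed {
    cmpthese(-1, {
        lazy    => sub { strip_lazy($sample) },
        negated => sub { strip_negated($sample) },
    });
}
```

The two patterns are not exact equivalents in every case: without the /s modifier `.` does not match a newline, so the lazy version leaves a tag that spans two lines alone, while `[^>]+` needs at least one character and therefore leaves an empty `<>` alone.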
Re: Parsing pseudo XML files by runrig (Abbot) on Dec 18, 2001 at 00:05 UTC
There's XML::SAX::Simple, which is a SAX version of XML::Simple but can use XML::SAX::PurePerl (I assume from your description of the data that you just need XML::Simple-like parsing...).

Update: re: your reply... as the name implies, XML::SAX::PurePerl does not require anything but perl. It does require XML::Simple, but remove the 'die ... if ($fatal)' line from the Makefile.PL file and install it (with 'make install' on *nix). Disregard any warnings about not having XML::Parser or failing test results. Then install XML::Handler::Trees and then XML::SAX::PurePerl (actually the whole XML::SAX bundle) and XML::SAX::Simple.
Re: Re: Parsing pseudo XML files by BMaximus (Chaplain) on Dec 18, 2001 at 00:17 UTC
It needs to be pure perl. No external libs can be used, unfortunately. I need to make it easy to propagate to 90-some-odd servers. Compiling isn't an option as the servers have had all their compilers removed for safety.

BMaximus

Update: That -1 is uncalled for. It's an informational opinion.
Re: Parsing pseudo XML files by BMaximus (Chaplain) on Dec 18, 2001 at 00:14 UTC
Apparently I need lots of coffee. I thought I converted all the >'s and <'s correctly. :/ I guess not. No more staying up and trying to read a Harry Potter book in one night. ++ to you Masem, joealba and ar0n. You've been a big help.

Update: Let's try to make the thank you more general. The help from others is much appreciated as well, not just from the aforementioned :)

Thank you,
BMaximus

Since everyone is putting clever sigs at the bottom. Just imagine a really stupid one here.