Re: Parsing pseudo XML files
by Masem (Monsignor) on Dec 17, 2001 at 23:26 UTC
One idea (thought process):
Until you empty the string:
- Strip leading whitespace
- If you start with a '<', grab up to the next '>' (use non-greedy), and store into the array.
- Otherwise grab up to next '<', and put into array.
- Repeat
You should now have an array like (for the above) ( '<Reference1>', '<reference1_name>', 'jvdsj', ...)
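The splitting steps above could be sketched roughly like this. It's only a minimal sketch: tokenize() is a made-up helper name, and the die on an unterminated '<' is my own assumption, since the steps above don't say what to do in that case.

```perl
use strict;
use warnings;

# tokenize() is a hypothetical helper, not part of any module.
# It splits a pseudo-XML string into a flat list of tags and text chunks.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    while (1) {
        $text =~ s/^\s+//;                 # strip leading whitespace
        last unless length $text;          # stop once the string is empty
        if ($text =~ s/^(<.*?>)//s) {      # tag: grab up to the next '>' (non-greedy)
            push @tokens, $1;
        }
        elsif ($text =~ s/^([^<]+)//) {    # text: grab up to the next '<'
            push @tokens, $1;
        }
        else {                             # assumption: a '<' with no '>' is fatal
            die "Unterminated tag near: $text\n";
        }
    }
    return @tokens;
}

my @tokens = tokenize('<Reference1> <reference1_name>jvdsj</reference1_name> </Reference1>');
print join(' | ', @tokens), "\n";
# prints: <Reference1> | <reference1_name> | jvdsj | </reference1_name> | </Reference1>
```

Note that only leading whitespace is stripped, as in the steps above, so a text chunk keeps any trailing whitespace it had before the next tag.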
Now repeat until the array is empty, with a fresh hash:
- If the top element starts with < but not with </, then:
  - Create a key with the text of that tag, and create a new hash for this.
  - Push that text into a stack array, and the hash into a stack of hashes. (That is, at all times, your current level is denoted by the [-1] element of the stack array, and the hash to insert into is the [-1] element of the stack of hashes.)
- If the element starts with </, then:
  - Pop the last items off the two stack arrays until the [-1] of the stack array is equal to the rest of this current tag. If you pop all the elements off, you've got a malformed element.
- Otherwise, push the element into the [-1] hash of the stack of hashes, say in the 'data' element.
This assumes that leading whitespace in the data itself is unneeded (though that can be worked around), and that < will not appear unescaped in the data.
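Here is one way the stack-of-hashes pass might look. It's a sketch under a few assumptions of my own: build_tree() is a made-up name, the 'data' element is kept as an array ref, the matching open tag is popped along with anything above it, and a tag left open at the end is treated as an error.

```perl
use strict;
use warnings;

# build_tree() is a hypothetical name. It takes the flat token list
# (tags and text chunks) and returns a reference to a nested hash.
sub build_tree {
    my @tokens = @_;
    my %root;
    my @names;                  # stack of open tag names
    my @hashes = (\%root);      # stack of hashes; [-1] is the current level

    for my $token (@tokens) {
        if ($token =~ m{^</(.*)>$}s) {          # closing tag (checked first)
            my $name = $1;
            # Pop both stacks until the matching open tag has been removed.
            while (1) {
                die "Malformed input: unexpected </$name>\n" unless @names;
                pop @hashes;
                last if pop(@names) eq $name;
            }
        }
        elsif ($token =~ m{^<(.*)>$}s) {        # opening tag
            my $name = $1;
            my %child;
            $hashes[-1]{$name} = \%child;       # key named after the tag
            push @names,  $name;
            push @hashes, \%child;
        }
        else {                                  # text goes under 'data'
            push @{ $hashes[-1]{data} }, $token;
        }
    }
    # Assumption: anything still open at the end is an error.
    die "Malformed input: unclosed <$names[-1]>\n" if @names;
    return \%root;
}

my $tree = build_tree('<Reference1>', '<reference1_name>', 'jvdsj',
                      '</reference1_name>', '</Reference1>');
print $tree->{Reference1}{reference1_name}{data}[0], "\n";   # prints "jvdsj"
```

Two limits worth knowing about: sibling tags with the same name overwrite each other, and a tag that is itself called "data" would collide with the text slot.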
Update: I'm watching other responses to this node, but I'm wondering if a very lightweight XML parser module based on this idea might be worth it. I see that someone's pointed out XML::SAX, which appears not to require external libs, but getting an XML parser out of it appears to require 3 additional modules. The above pseudocode, on the other hand, could be made into a single module, say, XML::PureSimple, which would not handle bad XML gracefully, but could be used to handle anything that follows the basic XML patterns. If anyone thinks there might be a use for such a module, drop me a msg or similar.
-----------------------------------------------------
Dr. Michael K. Neylon - mneylon-pm@masemware.com
"You've left the lens cap of your mind on again, Pinky" - The Brain
"I can see my house from here!"
It's not what you know, but knowing how to find it if you don't know that's important
Re: Parsing pseudo XML files
by mirod (Canon) on Dec 18, 2001 at 01:45 UTC
OK, I guess it's time to put on my XML-Ayatollah hat once again:
You have 2 choices here: if the data you want to process is really XML, then please, please, please, don't write your own parser but use either XML::SAX::PurePerl (a SAX parser) or XML::Parser::Lite (emulates XML::Parser; just check it thoroughly, as it is used to parse SOAP messages in SOAP::Lite, and those do not cover all of the XML spec). If the data is not XML (does it include & or < or HTML entities? Are tag names sometimes not valid XML names? What about encoding: does it include some latin1 encoded accented characters?) then you can write your own parser (as an XML parser would complain and die), but then please, please, please don't call it XML. It will save you a huge amount of trouble when someone or something that expects XML tries using it.
XML is not the answer for everything. If this is an internal format then it's fine for it to be anything you want. But if you want to use XML then do it right from the start. It will pay off in the long run.
And remember that there are tools that can help you create valid XML, such as XML::Writer, or database XML export utilities.
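To make the & and < point concrete, here is a tiny sketch of the escaping that a writer tool does for you when it outputs text. xml_escape() is a made-up helper and the company name is made-up sample data; in real code I'd let XML::Writer handle this rather than hand-rolling it.

```perl
use strict;
use warnings;

# xml_escape() is a hypothetical helper, shown only to illustrate the
# escaping that a writer module handles when it outputs character data.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must come first, or the entities below get double-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# 'Acme & Sons <West>' is invented sample data.
print '<reference1_company>', xml_escape('Acme & Sons <West>'), "</reference1_company>\n";
# prints: <reference1_company>Acme &amp; Sons &lt;West&gt;</reference1_company>
```

Data written without this step is exactly the kind of thing a real XML parser will complain about and die on.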
Re: Parsing pseudo XML files
by joealba (Hermit) on Dec 17, 2001 at 23:39 UTC
Not elegant, but kinda cute. It may work, depending on how strictly your data conforms to this format (minus the typo you had at reference1_title).
use strict;
use warnings;
use Data::Dumper;

# Sample record; the line breaks between elements are harmless because
# surrounding whitespace is trimmed below.
my $line = qq(<Reference1> <reference1_name>jvdsj</reference1_name>
 <reference1_address>1234 gjrdkjpigkdj jkgpifodsjgi</reference1_address>
 <reference1_title>njhdaslj</reference1_title>
 <reference1_company> jhfdsalh</reference1_company>
 <reference1_csz>Los Angeles, CA,91406</reference1_csz>
 <reference1_phone> 818-555-1212</reference1_phone>
 <reference1_email>wabbit\@acme.com</reference1_email></Reference1>);

# Splitting on the end tags gives (text, name, text, name, ...);
# reversing that list turns it into name => text pairs.
my %record = reverse split /<\/(\w+)>/, $line;
foreach (keys %record) {
    $record{$_} =~ s/<[^>]+>//g;  # remove start tags
    $record{$_} =~ s/^\s+//;      # remove extra whitespace
    $record{$_} =~ s/\s+$//;
    delete $record{$_} unless length $record{$_};  # kill the outermost record tag
}
print Dumper(\%record);
PRINTS:
$VAR1 = {
'reference1_name' => 'jvdsj',
'reference1_address' => '1234 gjrdkjpigkdj jkgpifodsjgi',
'reference1_title' => 'njhdaslj',
'reference1_company' => 'jhfdsalh',
'reference1_csz' => 'Los Angeles, CA,91406',
'reference1_phone' => '818-555-1212',
'reference1_email' => 'wabbit@acme.com'
};
BTW, sorry about the unnecessary 'consider' for code tags -- I need more coffee.
Updated: Thanks, ar0n. Blame it on the coffee again.
It'd probably be more efficient to write this
$record{$_} =~ s/<.*?>//g; # remove start tag
as this:
$record{$_} =~ s/<[^>]+>//g;
as [^>] will match anything but a '>' (this way it doesn't have to backtrack)
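If you want to check the difference on your own data, something like this should do. It's a sketch: the sample string and the sub names are made up, and the iteration count is arbitrary. Benchmark ships with Perl.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up sample: one start tag plus a value, repeated 50 times.
my $sample = '<reference1_phone> 818-555-1212' x 50;

# strip_lazy and strip_class are hypothetical names for the two forms.
sub strip_lazy  { my $s = shift; $s =~ s/<.*?>//g;   return $s }
sub strip_class { my $s = shift; $s =~ s/<[^>]+>//g; return $s }

# Both forms should give the same result on this input.
print "same result\n" if strip_lazy($sample) eq strip_class($sample);

# Run each form 20,000 times and print a comparison table.
cmpthese(20_000, {
    lazy  => sub { strip_lazy($sample) },
    class => sub { strip_class($sample) },
});
```

The two patterns aren't strictly identical: without /s the dot won't match a newline inside a tag while [^>] will, and <.*?> also matches an empty <> while <[^>]+> does not. For data like the sample above that shouldn't matter.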
[ ar0n -- want job (boston) ]
Re: Parsing pseudo XML files
by runrig (Abbot) on Dec 18, 2001 at 00:05 UTC
It needs to be pure Perl. No external libs can be used, unfortunately. I need to make it easy to propagate to 90-some-odd servers. Compiling isn't an option, as the servers have had all their compilers removed for safety.
BMaximus
Update: That -1 is uncalled for. It's an informational opinion.
Re: Parsing pseudo XML files
by BMaximus (Chaplain) on Dec 18, 2001 at 00:14 UTC
Apparently I need lots of coffee. I thought I converted all the >'s and <'s correctly. :/ I guess not. No more staying up and trying to read a Harry Potter book in one night. ++ to you Masem, joealba and ar0n. You've been a big help.
Update: Let's try to make the thank you more general. The help from others is much appreciated as well, not just from the aforementioned :)
Thank you,
BMaximus
Since everyone is putting clever sigs at the bottom, just imagine a really stupid one here.