Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks!
I have XML files in a directory, and I am trying to keep things simple by just opening each file and checking for the first and last elements, <order> and </order>, so that no bad XML reaches my other Perl code. Eventually I want to delete the badly formatted XML file(s). I am having problems getting my regular expression to match. Can anyone help me with that?

my $xml_dir = "c://apache//htdocs//xml";
opendir(DIR, $xml_dir);
my @files = grep { /\.xml$/ } readdir(DIR);
closedir(DIR);

foreach my $file (@files) {
    if($file =~/<order>(.*)<\/order>/g){
        print "FILE::: $file\n";
    }
}

XML Sample File 1:
<order>
  <customer>
    <name>Coyote, Ltd.</name>
    <shipping_info>
      <address>1313 Desert Road</address>
      <city>Nowheresville</city>
      <state>AZ</state>
      <zip>90210</zip>
    </shipping_info>
  </customer>
  <item>
    <product id="1111">Acme Rocket Jet Pack</product>
    <quantity type="each">1</quantity>
  </item>
  <item>
    <product id="2222">Roadrunner Chow</product>
    <quantity type="bag">10</quantity>
  </item>
</order>


XML Sample File 2 - This file is not valid:
<order>
  <customer>
    <name>Coyote, Ltd.</name>
    <shipping_info>
      <address>1313 Desert Road</address>
      <city>Nowheresville</city>
      <state>AZ</state>
      <zip>90210</zip>
    </shipping_info>
  </customer>
  <item>
    <product id="1111">Acme Rocket Jet Pack</product>
    <quantity type="each">1</quantity>
  </item>
  <item>
    <product id="2222">Roadrunner Chow</product>
    <quantity type="bag">10</quantity>
  </item>


Thanks!!!

Replies are listed 'Best First'.
Re: Regular Expression XML Searching Help
by Your Mother (Archbishop) on Jun 16, 2008 at 17:03 UTC

    I'm guessing your real intention is to check for well-formedness. The question also implies you're doing XML parsing in a fragile way elsewhere. If I'm right, the following might be a good start to doing things in a way that's more bomb-proof and easier to extend. See XML::LibXML for more.

    use XML::LibXML;

    local $/ = "::FILE::";

    my $parser = XML::LibXML->new();
    # $parser->recover(1); <-- turn on to "save" many bad docs.

    while ( my $xml = <DATA> ) {
        chomp($xml);
        my $doc = eval { $parser->parse_string($xml) };
        if ( $doc ) {
            print "File $. is valid.\n";
            # Do whatever you want with your valid $doc here.
        }
        else {
            print "File $. is NOT valid.\n";
            # Deal with bad docs here...
        }
    }

    __DATA__
    <order>
      <customer>
        <name>Coyote, Ltd.</name>
        <shipping_info>
          <address>1313 Desert Road</address>
          <city>Nowheresville</city>
          <state>AZ</state>
          <zip>90210</zip>
        </shipping_info>
      </customer>
      <item>
        <product id="1111">Acme Rocket Jet Pack</product>
        <quantity type="each">1</quantity>
      </item>
      <item>
        <product id="2222">Roadrunner Chow</product>
        <quantity type="bag">10</quantity>
      </item>
    </order>
    ::FILE::
    <order>
      <customer>
        <name>Coyote, Ltd.</name>
        <shipping_info>
          <address>1313 Desert Road</address>
          <city>Nowheresville</city>
          <state>AZ</state>
          <zip>90210</zip>
        </shipping_info>
      </customer>
      <item>
        <product id="1111">Acme Rocket Jet Pack</product>
        <quantity type="each">1</quantity>
      </item>
      <item>
        <product id="2222">Roadrunner Chow</product>
        <quantity type="bag">10</quantity>
      </item>

      That is a better solution!!!
      Thanks for the help!
      I can't install the module XML::LibXML on this windows box, is there any other Perl module that would work with this code example?

        Not directly, no. But you could try to install XML::Twig (or one of the other good ones) and try to adapt the recipe. Even an eval around an XML::Simple::XMLin() might work. I don't recommend the module but if you've got it already...
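
A minimal sketch of what adapting the recipe to XML::Twig might look like, assuming XML::Twig is installed. It relies on the module's safe_parsefile method, which wraps the parse in an eval and returns false on failure. The helper name is_well_formed is made up for illustration, and the directory is the one from the original question.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: returns 1 if the file parses as well-formed XML, 0 otherwise.
sub is_well_formed {
    my ($path) = @_;
    # safe_parsefile traps parse errors and returns false instead of dying;
    # the error message is left in $@.
    my $twig = XML::Twig->new->safe_parsefile($path);
    return $twig ? 1 : 0;
}

my $xml_dir = 'c:/apache/htdocs/xml';   # directory taken from the original question

if ( opendir my $dh, $xml_dir ) {
    my @files = grep { /\.xml$/ } readdir $dh;
    closedir $dh;

    for my $file (@files) {
        my $path = "$xml_dir/$file";    # readdir returns bare names, so add the directory
        if ( is_well_formed($path) ) {
            print "$file is well-formed\n";
        }
        else {
            print "$file is NOT well-formed\n";
            # Deal with the bad file here (for example, unlink $path).
        }
    }
}
```

This checks well-formedness only, which is what the missing </order> in sample file 2 breaks. It does not validate against a DTD or schema.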

        I'm no expert on Win installs but you could try to install the C lib for libxml before trying to install the Perl modules. Might be the only problem. There is some really good work lately with Strawberry Perl to make Perl behave more like it does on other OSes. If you don't have it, try it maybe(?).

Re: Regular Expression XML Searching Help
by mirod (Canon) on Jun 16, 2008 at 17:31 UTC

    In case you ever have other types of searches to perform on the XML, you can try xml_grep2, which you will find in my tool box. Or of course xml_grep, which comes with XML::Twig (and I am sure Jenda will have a similar tool based on XML::Rules in the next 5 minutes ;--)

Re: Regular Expression XML Searching Help
by ikegami (Patriarch) on Jun 16, 2008 at 16:48 UTC
    That's not valid XML. It's not even well-formed XML. An XML validator would take care of that. Searching on CPAN for "xml validator" finds some Perl solutions, and Google finds non-Perl solutions. No need to reinvent the wheel.
Re: Regular Expression XML Searching Help
by pc88mxer (Vicar) on Jun 16, 2008 at 17:10 UTC
    To do it your way, you need to read in the contents of the file and use the /s modifier on your regular expression:
    use File::Slurp;
    ...
    for my $file (@files) {
        my $content = read_file($file);
        unless ($content =~ m{<order>(.*)</order>}s) {
            # bad file
        }
        else {
            # found <order>...</order>
        }
    }
    Without the /s modifier, a dot (.) will not match a newline.
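
A small illustration of the /s point, using a made-up multi-line string rather than a real file. The sub names matches_without_s and matches_with_s are hypothetical and exist only for this demo.

```perl
use strict;
use warnings;

# Made-up sample data: an <order> element spread over several lines.
my $content = "<order>\n  <item>1</item>\n</order>\n";

# Without /s the dot stops at the first newline, so the closing tag is never reached.
sub matches_without_s { return $_[0] =~ m{<order>(.*)</order>}  ? 1 : 0 }

# With /s the dot also matches newlines, so the match can span lines.
sub matches_with_s    { return $_[0] =~ m{<order>(.*)</order>}s ? 1 : 0 }

printf "without /s: %d\n", matches_without_s($content);   # prints "without /s: 0"
printf "with /s:    %d\n", matches_with_s($content);      # prints "with /s:    1"
```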
Re: Regular Expression XML Searching Help
by ikegami (Patriarch) on Jun 16, 2008 at 16:52 UTC
    Why did you use the g modifier on your match op? This is the third time since the beginning of the month someone's come to Perl Monks with that bug.

    if (/.../g) is almost guaranteed to be a bug.
    while (/.../g) would make sense.
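
    A short sketch of why the g modifier tends to bite in an if: in scalar context a successful /g match remembers where it stopped, and the next /g match on the same variable resumes from that position. The string below is made up for the demo.

```perl
use strict;
use warnings;

my $text = "<order>one</order>";

# With /g in scalar context, a successful match sets pos($text),
# and the next /g match on the same variable starts from there.
my $first  = ( $text =~ /<order>/g ) ? 1 : 0;   # matches at the start
my $second = ( $text =~ /<order>/g ) ? 1 : 0;   # resumes after the first match and fails

# Without /g, every test starts again from the beginning of the string.
my $third  = ( $text =~ /<order>/ ) ? 1 : 0;
my $fourth = ( $text =~ /<order>/ ) ? 1 : 0;

print "with /g:    $first $second\n";    # prints "with /g:    1 0"
print "without /g: $third $fourth\n";    # prints "without /g: 1 1"
```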

      I tried with and without the "/g", and it still didn't work. I also tried using a "while" instead of an "if"; no luck!
        I didn't say it would work without the 'g' or with a while. I said it was wrong as is.
Re: Regular Expression XML Searching Help
by Jenda (Abbot) on Jun 16, 2008 at 17:25 UTC
Re: Regular Expression XML Searching Help
by Jenda (Abbot) on Jul 07, 2008 at 23:50 UTC

    Thinking about this one more time ... in case the question actually is not "is the XML valid", but rather "is the XML already complete, assuming it's being uploaded from a valid source", then solutions based on XML parser modules are overkill. For a quick check with a minimal memory footprint, something like this may be better:

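    sub XMLisComplete {
        my $file = shift();
        # If I can't open it, it's probably locked. Therefore it's not complete.
        open my $IN, '<', $file or return;

        # Read the start of the file and pick out the name of the first tag.
        my $main_tag;
        read $IN, $main_tag, 1024;
        if ($main_tag =~ m{<(\w+)}) {
            $main_tag = $1;
        }
        else {
            return; # there's not even the opening tag!
        }

        # seek takes (FILEHANDLE, POSITION, WHENCE): jump to 100 bytes before
        # the end (whence 2), or back to the start if the file is shorter than that.
        seek $IN, -100, 2 or seek $IN, 0, 0;
        my $end = do {local $/; <$IN>};
        close $IN;

        # Complete only if the matching closing tag is the last thing in the file.
        if ($end =~ m{</$main_tag>\s*$}s) {
            return 1;
        }
        else {
            return;
        }
    }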

    It's most likely not 100% standard proof (e.g. I bet \w+ doesn't match all allowed tag names), but it works for me to test whether I can already start parsing the uploaded file or whether to wait a bit more. The actual parsing would of course be best left to a module.

Re: Regular Expression XML Searching Help
by richb (Scribe) on Jun 17, 2008 at 18:05 UTC
    In your loop "foreach my $file (@files)" -- $file will have the name of the XML file, not the contents of the file. As mentioned above, you need to read the contents of each file and use the s modifier on the regex to make the . match newlines:
    my $xml_dir = "c://temp";
    opendir(DIR, $xml_dir);
    my @files = grep { /\.xml$/ } readdir(DIR);
    closedir(DIR);

    foreach my $file (@files) {
        open my $fh, "$xml_dir//$file" or die "can't open $file: $!";
        local $/;
        my $contents = <$fh>;
        close $fh;
        print "$file is ";
        if($contents !~ /<order>(.*)<\/order>/s){
            print "NOT ";
        }
        print "valid\n";
    }
    Results:

    invalid.xml is NOT valid #this is your sample file 2

    valid.xml is valid #this is your sample file 1