matrixmadhan has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I need your esteemed pointers or advice to improve the performance of my script.

Following is a code snippet for validating a generated XML file against the base XSD:

# TMPXML generation goes here by some other script call
xmllint --noout --schema base-schema.xsd $TMPXML 2>/dev/null 1>&2

Only if the above validation succeeds is the record collected and sent to another process.
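
For context, the check around that call could look roughly like this in Perl; the $TMPXML path and the use of system() are only an illustration, not the exact script:

#!/usr/bin/perl
use strict;
use warnings;

# path of the generated file (illustrative)
my $TMPXML = '/tmp/record.xml';

# xmllint exits 0 when $TMPXML validates against the schema
my $status = system("xmllint --noout --schema base-schema.xsd $TMPXML 2>/dev/null 1>&2");

if ( $status == 0 ) {
    print "valid: collect the record\n";    # push to the next stage here
}
else {
    print "invalid: skip the record\n";
}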

Currently, for processing 100 such records, the xmllint part alone takes about 6.5 minutes on average.

Is there any way to improve the approach, or to cache base-schema.xsd so that it is not loaded and processed each time xmllint is invoked?

Any other pointers to improve the approach?

Many thanks in advance.

Replies are listed 'Best First'.
Re: Improving script that uses xmllint to validate
by Anonymous Monk on Dec 26, 2008 at 07:55 UTC
    That would depend on xmllint. The xmllint program parses one or more XML files, specified on the command line as xmlfile.
      Thanks for the reply.

      Currently I am parsing only one XML file with xmllint.

      Passing multiple files to xmllint involves the overhead of creating multiple files, cleaning them up after they are used, etc.
        Hmm, try
        #!/usr/bin/perl --
        use strict;
        use warnings;
        use XML::LibXML;

        my $schemafile = 'base-schema.xsd';
        # base-schema.xsd is a W3C XML Schema (XSD), so XML::LibXML::Schema
        # is the class to use; XML::LibXML::RelaxNG is for RelaxNG schemas
        my $schema = XML::LibXML::Schema->new( location => $schemafile );

        my $filename = 'foo.xml';
        my $parser   = XML::LibXML->new();
        my $doc      = $parser->parse_file($filename);

        # validate() dies if $doc does not conform to the schema
        $schema->validate($doc);

        print "\n\nWe've made it this far without dying (ie you read this message)\n",
              "$schemafile is valid, and so is $filename\n\n";
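        If the point is to avoid re-reading base-schema.xsd for every record, the same idea can be taken one step further: compile the schema once and validate all the generated files from a single long-running process. A minimal sketch along those lines, assuming the generated documents sit under records/ (that path, the glob, and the print statements are illustrative, not from the thread):

        #!/usr/bin/perl --
        use strict;
        use warnings;
        use XML::LibXML;

        # compile the XSD once, then reuse it for every document
        my $schema = XML::LibXML::Schema->new( location => 'base-schema.xsd' );
        my $parser = XML::LibXML->new();

        my @files = glob 'records/*.xml';    # assumed location of the generated files

        for my $file (@files) {
            my $doc = $parser->parse_file($file);
            # validate() dies if the document does not match the schema
            if ( eval { $schema->validate($doc); 1 } ) {
                print "$file is valid\n";       # e.g. push the record to the success file
            }
            else {
                print "$file failed: $@\n";     # e.g. push the record to the failure file
            }
        }

        Since the schema is compiled only once, the per-record cost drops to parsing and validating each small document.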
Re: Improving script that uses xmllint to validate
by matrixmadhan (Beadle) on Dec 27, 2008 at 17:56 UTC
    I found the reason why the script was running so slowly.

    Actually, there is no problem with the XML validation part; the problem is with the XML generation module, which depends on a hash typically containing 1M entries.

    Each and every record needs to be validated before it is pushed to an external module, something like:
    for record in ( list of records )
    do
        create xml file with 1 record of data
        validate the above xml file
        if validated
        then
            push record to success file
        else
            push record to failure file
        fi
    done
    use success file to generate final xml file
    So each and every XML file generated with 1 record of data (for validation) rebuilds the hash with 1M entries; that is the reason the script performs so badly.

    Now the question is: is there any way to 'pin' the created hash structure to memory so that any process making use of the data can refer to it using some 'memory namespace'?

    In the case of XSD parsing, such a file caching operation is possible using File::Cache; is there anything similar to that available here?

    If I could pin the hash data structure to memory without re-creating it each and every time, there would be a mega improvement in the performance of my script.

    Thanks in advance ! :)
      Now the question is: is there any way to 'pin' the created hash structure to memory so that any process making use of the data can refer to it using some 'memory namespace'?
      What does that mean? Maybe I'm ignorant, but isn't that how every program works?
        The hash gets created in memory and stays there only as long as the instance (process) that built it is resident in memory.

        What I am looking for is:
        Process 'A' is spawned
        Build hash data structure with 1M entries
        Assign a namespace ( example name: h-m1 ) to the above
        Now process 'A' dies off, but h-m1 should still be resident in memory;
        that is, the entries have to stay resident even after the process which created them terminates.
        Now, when another instance of process 'A' is spawned, it should simply do a lookup using the namespace 'h-m1' (or something like that) instead of having to reconstruct the hash data structure with 1M entries.
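
        One way to get close to that, sketched here as an illustration rather than as part of the thread: serialize the hash to disk once with Storable and have later runs retrieve it instead of rebuilding it. This is a disk cache rather than true shared memory (something like IPC::Shareable or a memcached-style cache would be needed for that), and the cache file name and build_hash() routine below are placeholders:

        use strict;
        use warnings;
        use Storable qw(store retrieve);

        # stand-in for the 'h-m1' namespace: a file on disk
        my $cache = 'h-m1.storable';
        my $href;

        if ( -e $cache ) {
            # a later instance of process 'A': load the prebuilt hash
            $href = retrieve($cache);
        }
        else {
            # first run: build the 1M-entry hash and persist it for later runs
            $href = build_hash();
            store( $href, $cache );
        }

        # placeholder for the existing hash-building code
        sub build_hash {
            return { map { "key$_" => $_ } 1 .. 1_000_000 };
        }

        Retrieving a million-entry hash from a Storable file is usually much faster than rebuilding it, but it is still a deserialization cost on every run, so it is worth benchmarking against the current build time.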

        I am trying to save the time taken to construct the hash of 1M entries, which would be a big win for my script.

        Is that possible? Many thanks in advance :)