DreamT has asked for the wisdom of the Perl Monks concerning the following question:

Hi y'all,
I'm modifying an application to switch its database source from MySQL (via DBI) to an XML-generating database engine that delivers the result set via a URL (using LWP for now).

Today, the system creates a web page by populating an HTML template using traditional search/replace of tags with database values (i.e., no framework).

Initially I started out by using XML::Bare to create a hash from the XML response, which would mean that I could "imitate" the previous behaviour of the MySQL iteration (i.e. iterating over a hash and parsing the HTML template based on the row values). That would mean that I could keep a lot of the structure of the rest of the code, since I'm only switching "hash source".
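Roughly, the idea looks like this (the URL, attribute names and template tags below are made up, since I can't post the real code):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use XML::Bare;

    # Hypothetical endpoint -- the real one I can't reveal.
    my $url = 'http://internal-host:12345/query?filter=ggg';
    my $xml = get($url) or die "could not fetch $url";

    my $root = XML::Bare->new( text => $xml )->parse();

    # One <article> gives a hashref, several give an arrayref,
    # so normalize to an arrayref first.
    my $articles = $root->{response}{function}{result}{articles}{article};
    $articles = [ $articles ] if ref($articles) eq 'HASH';

    # Same search/replace idea as with the old DBI rows; the text
    # content of an XML::Bare node lives in its {value} slot.
    my $template = "<tr><td>%%ATTR1%%</td><td>%%ATTR2%%</td></tr>\n";
    my $html = '';
    for my $art ( @$articles ) {
        my $row = $template;
        $row =~ s/%%ATTR1%%/$art->{attribute1}{value}/g;
        $row =~ s/%%ATTR2%%/$art->{attribute2}{value}/g;
        $html .= $row;
    }
    print $html;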

But. The purpose of switching database engines is performance related, since MySQL is sort of a bottleneck today, and when switching to the new engine I don't want the HTML parsing to become the new bottleneck. I don't have much of an example to show you, but according to my own test (using Time::HiRes for setting measuring points) I get the following output:

- Parsing the data into an XML::Bare object takes c. 1/20 of the time compared to fetching the data
- Iterating over the object rows and parsing takes about 1/4 of the time compared to fetching the data
(The source XML for the test values above is about 78 kB on average; the relationship is "the more source data, the more time parsing takes relative to fetching it".)
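The measuring points themselves are nothing fancier than this, wrapped around the fetch/parse/iterate steps from the sketch above:

    use Time::HiRes qw(gettimeofday tv_interval);

    my $t0 = [gettimeofday];
    # ... LWP fetch ...
    my $t1 = [gettimeofday];
    # ... XML::Bare parse ...
    my $t2 = [gettimeofday];
    # ... iterate rows and fill the template ...
    my $t3 = [gettimeofday];

    printf "fetch: %.4fs  parse: %.4fs  iterate: %.4fs\n",
           tv_interval($t0, $t1),
           tv_interval($t1, $t2),
           tv_interval($t2, $t3);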



So, I basically want to know how much better performance I can get by using, for example, XML::XSLT, and whether it is worth it. As you may understand, switching to XSL transformation means that I have to change a lot of other things (code, templates).

Additional questions:
1. Is LWP::Simple a good choice for fetching the data?
2. Is XML::Bare a good choice performance wise? (I've tried XML::Parser but can't get comfortable with it).
3. Are there other methods that you can recommend?

As I said, I don't have any good examples to show you, but I want your "general thoughts" on the subject.

I simply want to change as little as possible, without performance loss when using the new "method". The system itself is quite complex, so I want to reuse as much of the existing structure as possible.

Re: XSL transformation vs. parsing of html - Performance benefits
by codechild (Initiate) on Jun 10, 2010 at 19:01 UTC
    As far as XML parsers go, XML::Bare is one of the fastest. Some faster parsers have been released since, but they trade faster parsing for slower access to the data once parsed.

    Based on what you are saying, you are outputting all, or most, of your data each time you parse. Because of this it is unlikely you could gain higher speed via another parser.

    There are speed losses because an in-between structure is created (a Perl hash). That becomes bloated and inefficient when you have large XML files (over 3 MB). If you have XML files that large you should look at using a SAX parser. XML::Bare has a SAX parser as well, but it is pre-beta and very unstable. (There are a bunch of other SAX parsers on CPAN.)
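    For illustration, a stream-style approach looks roughly like this with XML::Parser (the element name is hypothetical); nothing is kept in memory beyond the current element:

        use strict;
        use warnings;
        use XML::Parser;

        # Print the text of each <title> element as it streams past,
        # without building a tree for the whole document.
        my ( $inside, $buf ) = ( 0, '' );
        my $parser = XML::Parser->new( Handlers => {
            Start => sub {
                my ( undef, $el ) = @_;
                if ( $el eq 'title' ) { $inside = 1; $buf = '' }
            },
            Char  => sub { $buf .= $_[1] if $inside },
            End   => sub {
                my ( undef, $el ) = @_;
                if ( $el eq 'title' ) { print "$buf\n"; $inside = 0 }
            },
        } );
        $parser->parsefile('sample.xml');   # hypothetical file name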

    Note that I wrote XML::Bare, and have been using it in a way similar to what you are describing for the last several years, so I am sort of biased.

    Note also that there are various glitches and problems with the current version of XML::Bare. The biggest one is that you cannot have nodes named 'value', e.g. <value>. Using such a node *will* crash the parser.
Re: XSL transformation vs. parsing of html - Performance benefits
by dHarry (Abbot) on Jun 11, 2010 at 07:36 UTC

    I basically want to know how much better performance I can get by using, for example, XML::XSLT

    Who says it will be faster? ;) Although I'm biased towards doing things the XML way, i.e. using XSLT to transform an XML document, I sometimes encounter performance issues with XSLT. How complex is the resultset, and what kind of transformation are you aiming for? The 78 kB doesn't sound too scary to me. I suppose you can also limit the size of the resultset, much like you can limit the number of rows returned by a MySQL query? I wonder why MySQL was the bottleneck and what product you use now. Please provide some more info and maybe an example of a (representative) resultset and the kind of transformation you want to do.
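    One thing to watch out for: the pure-Perl XML::XSLT you mention is itself quite slow in my experience. If you benchmark XSLT at all, use the libxslt bindings instead. A minimal sketch (the stylesheet name is made up):

        use strict;
        use warnings;
        use XML::LibXML;
        use XML::LibXSLT;

        my $xml   = do { local $/; <STDIN> };   # the fetched response
        my $doc   = XML::LibXML->new->parse_string($xml);
        my $xslt  = XML::LibXSLT->new;
        my $sheet = $xslt->parse_stylesheet_file('articles.xsl');
        print $sheet->output_string( $sheet->transform($doc) );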

    Cheers

    Harry

      Ok. Let's see...

      The XML result consists of 'articles', where each row starts with 'article' and contains 36 attributes (there is some header data as well):
      <response>
        <serv>
          <host>xxx</host>
          <app>yyy</app>
          <port>12345</port>
        </serv>
        <function>
          <query>
            <func>aaa</func>
            <parameters>
              <parameter>
                <name>bbb</name>
                <value>1</value>
              </parameter>
              <parameter>
                <name>ccc</name>
                <value>10</value>
              </parameter>
              <parameter>
                <name>filter</name>
                <value>ggg</value>
              </parameter>
            </parameters>
          </query>
          <result>
            <articles>
              <!-- article starts here -->
              <article>
                <attribute1>..</attribute1>
                ...
                <attribute36>..</attribute36>
              </article>
            </articles>
          </result>
          <time>13</time>
        </function>
      </response>

      Every article entry varies from c. 1.4 to 2.6 kB in size, which gives us an average of 2 kB/article. The number of articles varies from 1 to 36 (it's hard to give an average here, since it varies a lot), so using the average value the resultset varies between 2 and 72 kB; let's say 1-100 kB to be on the safe side.
      So, it's not really that much data to process. As you said, I can limit the size of the resultset, but I want to calculate using these values, also to be on the safe side.

      The MySQL database is a bottleneck because it holds so much more data than just article information. The new data source (which I can't reveal) is designed for just this kind of data, and will provide better performance. So, I'm basically trying to make the parse step as fast as possible compared to the "query".

        I'm not convinced of the business case for using XML. It's hard to believe that a relational database is outperformed by the new data source. But then again, the data source is a bit of a mystery :) I've worked with different types of databases, including native XML databases. In my experience it is hard to beat an RDBMS in terms of retrieval speed, unless you have some exotic data structure and need many joins to gather the data. You can often de-normalize data to fix that, though. Size of the database is hardly ever a problem in my experience.

        I agree that it's not much data to process, but XML is (very) verbose; you probably have even less data to process when you use an RDBMS. From what I can derive from your example, the data structure is quite simple, i.e. rows in a table with a variable number of columns (your attributes). So I guess the transformation is also not too complex. I would not call the attributes attribute1 .. attribute36 though, but give them meaningful names.

        It's not too difficult to write some test cases and Benchmark the stuff to gain confidence in the solution.
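        Something as simple as this, run against a representative response saved to disk (the file name is hypothetical), would already tell you a lot:

            use strict;
            use warnings;
            use Benchmark qw(cmpthese);
            use XML::Bare;
            use XML::LibXML;

            # Slurp one representative resultset from disk.
            my $xml = do {
                open my $fh, '<', 'sample.xml' or die $!;
                local $/;
                <$fh>;
            };

            # Run each parser for ~3 CPU seconds and compare rates.
            cmpthese( -3, {
                'XML::Bare'   => sub { XML::Bare->new( text => $xml )->parse() },
                'XML::LibXML' => sub { XML::LibXML->new->parse_string($xml) },
            } );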

        Cheers

        Harry

Re: XSL transformation vs. parsing of html - Performance benefits
by Jenda (Abbot) on Jun 14, 2010 at 06:44 UTC

    Either your MySQL is set up all wrong or you're running the new data source on a much quicker machine. Unless you are looking for an excuse to use XML, you'd better fix your MySQL.

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.