in reply to Re^3: LiBXML: New markup while preserving earlier tags?
in thread LiBXML: New markup while preserving earlier tags?

Thank you so much again, choroba! I am incredibly grateful for the patch you have provided for your initial solution. This is my second question at PerlMonks, and I could not have dreamed how fruitful it would turn out to be.

I benchmarked the two alternatives you provided: the method from the patched answer (= method one) and the updated optimization (= method two). As the test reports below show, method two is indeed faster than method one (test-1), and in some contexts significantly faster (test-2).
But: method 2 sometimes tags fewer items than method 1 (test-3 to test-5). I think the reason for this is that method 2 only tags the first occurrence of a queried item in a parent's text node (test-6).

I have organized the scripts used in this discussion in a repository on GitHub. If you feel uncomfortable with that, or if this is considered bad practice here, please let me know and I will remove the scripts.

The reports below can be reproduced with the script "4_benchmark" from the repository. Entering "4_benchmark test-1", etc. at my Ubuntu 16.04 terminal produces the following reports:


test-1 (Application of the methods to the very small canid XML string)

Size of document to string: 122
Number of queried items: 4
---
Runtime method 1: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
Runtime method 2: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
---
7 item(s) tagged with both methods.
Unique item(s): he, e, quick brown fox, lazy dog
---
Benchmark: running Method One, Method Two for at least 5 CPU seconds...
Method One: 6 wallclock secs ( 5.37 usr + 0.00 sys = 5.37 CPU) @ 1375.98/s (n=7389)
Method Two: 5 wallclock secs ( 5.40 usr + 0.00 sys = 5.40 CPU) @ 1673.33/s (n=9036)
Rate Method One Method Two
Method One 1376/s -- -18%
Method Two 1673/s 22% --

I report the individual runtime of each method because the actual runtime is distorted in the Benchmark comparison: for now, I have no choice but to clone the DOM for every iteration.
Both methods tag the same number of items; method two is about a fifth faster here. In other runs the improvement ranges between 19% and 23%.
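
To make the setup explicit, here is a minimal sketch of how such a comparison can be arranged with the Benchmark module; the sub names tag_method_one and tag_method_two are placeholders of mine, not the names used in the repository script, and the file/query values are simply the test-2 inputs:

use strict;
use warnings;
use Benchmark qw(cmpthese);
use XML::LibXML;

my ($file, @queries) = ('5_sample_file', 'śrī', 'syād', 'āpta');
my $doc = XML::LibXML->load_xml(location => $file);

# Each timed sub works on a fresh deep clone of the DOM, so the clone time
# is included in the measurement; this is the distortion mentioned above.
cmpthese(-5, {
    'Method One' => sub { tag_method_one($doc->cloneNode(1), @queries) },
    'Method Two' => sub { tag_method_two($doc->cloneNode(1), @queries) },
});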


test-2 to test-5 (Application of the methods to the relatively small XML file "5_sample_file", rich in non-ASCII Latin Unicode characters)

test-2 (method 2 is faster)


file: 5_sample_file
Size of document to string: 65715
Number of queried items: 3
---
Runtime method 1: 0 wallclock secs ( 0.04 usr + 0.00 sys = 0.04 CPU)
Runtime method 2: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
---
46 item(s) tagged with both methods.
Unique item(s): śrī, syād, āpta
---
Benchmark: timing 350 iterations of Method One, Method Two...
Method One: 10 wallclock secs (10.40 usr + 0.00 sys = 10.40 CPU) @ 33.65/s (n=350)
Method Two: 4 wallclock secs ( 3.54 usr + 0.00 sys = 3.54 CPU) @ 98.87/s (n=350)
Rate Method One Method Two
Method One 33.7/s -- -66%
Method Two 98.9/s 194% --

Both methods tag the same number of items, which is satisfactory. Method Two is nearly three times as fast.


test-3 to test-5 (method 2 tags fewer items)

test-3


file: 5_sample_file
Size of document to string: 65715
Number of queried items: 1
---
Runtime method 1: 1 wallclock secs ( 0.04 usr + 0.00 sys = 0.04 CPU)
Runtime method 2: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
---
Method one tagged 62 item(s), method two 54!
Method one tagged: patra
Method two tagged: patra

There is no need to benchmark methods that produce different results. With Text::Diff I could verify that method 1 here always tags, firstly, the same occurrences as method 2 and, secondly, additional occurrences.
The same holds for test-4 and test-5, which I carried out because I suspected Unicode conflicts with the system, the DOM, or method 2.
But no, the differences occur regardless of Unicode characters in the queried items and/or the document.
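
For reference, the comparison boils down to something like this minimal sketch ($dom1 and $dom2 are assumed to hold the documents tagged by method 1 and method 2; the repository script may name things differently):

use Text::Diff qw(diff);

# Serialize both tagged documents and print a unified diff of the results.
my $out1 = $dom1->toString(1);
my $out2 = $dom2->toString(1);
print diff(\$out1, \$out2, { STYLE => 'Unified' });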

In short, test-4:
Method one tagged 47 item(s), method two 45!
Tagged: artha, vacana


test-5:
Method one tagged 26 item(s), method two 23!
Tagged: ekāṃta, pratyakṣa


The matter becomes clearer when the methods are applied to an extensively tagged XML file:


test-6 (Application of the methods to 6_sample_file2)

file: 6_sample_file2
Size of document to string: 4354
Number of queried items: 1
---
Runtime method 1: 0 wallclock secs ( 0.02 usr + 0.00 sys = 0.02 CPU)
Runtime method 2: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
---
Method one tagged 96 item(s), method two 21!
Method one tagged: a
Method two tagged: a
---
Show Differences: y/n?
y

@@ -10 +10 @@
- <genre>dr<start query="a"/>a<end query="a"/>m<start query="a"/>a<end query="a"/></genre>
+ <genre>dr<start query="a"/>a<end query="a"/>ma</genre>
...
@@ -95 +95 @@
- <title>The M<start query="a"/>a<end query="a"/>rti<start query="a"/>a<end query="a"/>n</title>
+ <title>The M<start query="a"/>a<end query="a"/>rtian</title>
...
+ During <start query="a"/>a<end query="a"/> manned mission to Mars, Astronaut Mark Watney is
+ presumed dead after a fierce storm and left behind by his crew.
+ But Watney has survived and finds himself stranded and alone on
+ the hostile planet. With only meager supplies, he must draw upon
+ his ingenuity, wit and spirit to subsist and find a way to
+ signal to Earth that he is alive.


Method 2 (the + lines above) only tags the first occurrence of the queried item within the parent's text node.
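
Purely to illustrate the behaviour that method 1 shows and method 2 currently lacks, here is a minimal sketch of tagging every occurrence of a query inside the text nodes with XML::LibXML; this is neither the repository code nor choroba's patch, just one straightforward (and unoptimized) way to express it:

use strict;
use warnings;
use XML::LibXML;

sub tag_all_occurrences {
    my ($doc, $query) = @_;

    for my $text ($doc->findnodes('//text()')) {
        my $content = $text->data;
        next if index($content, $query) < 0;

        my $parent = $text->parentNode;
        my $pos    = 0;

        while ((my $found = index($content, $query, $pos)) >= 0) {
            # Keep the text before this match, if any.
            $parent->insertBefore(
                $doc->createTextNode(substr($content, $pos, $found - $pos)),
                $text) if $found > $pos;

            # Insert <start query="..."/>match<end query="..."/>.
            my $start = $doc->createElement('start');
            my $end   = $doc->createElement('end');
            $_->setAttribute(query => $query) for $start, $end;
            $parent->insertBefore($start, $text);
            $parent->insertBefore($doc->createTextNode($query), $text);
            $parent->insertBefore($end, $text);

            $pos = $found + length $query;
        }

        # Keep the remaining text and drop the original text node.
        $parent->insertBefore($doc->createTextNode(substr($content, $pos)), $text)
            if $pos < length $content;
        $parent->removeChild($text);
    }
}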

***
The implementation of method 1 is sufficient for what I am now excited to do. One last report:

Size of document to string: 66252
Number of queried items: 1
---
Runtime method 1: 95 wallclock secs (95.37 usr + 0.00 sys = 95.37 CPU)
Runtime method 2: 0 wallclock secs ( 0.04 usr + 0.00 sys = 0.04 CPU)
---
Benchmark: timing 1 iterations of Method One, Method Two...
Method One: 78 wallclock secs (77.91 usr + 0.03 sys = 77.94 CPU) @ 0.01/s (n=1)
(warning: too few iterations for a reliable count)
Method Two: 0 wallclock secs ( 0.03 usr + 0.00 sys = 0.03 CPU) @ 33.33/s (n=1)
(warning: too few iterations for a reliable count)
s/iter Method One Method Two
Method One 77.9 -- -100%
Method Two 3.00e-02 259700% --
---
Method one tagged 9135 item(s), method two 502!
Tagged: a

I am happy to wait 1.5 minutes for 9135 items in this document to be tagged. Still, I dream of more complex operations, if only the consistency of method 1 could be combined with the performance of method 2 (as in test-2).

The test can be reproduced with "4_benchmark a 1". The script accepts three arguments: (1) the queried items, separated by commas without spaces, (2) a numeral to enable benchmarking, and (3) a file location.
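
A hedged sketch of how such arguments might be taken apart (the actual handling in 4_benchmark may differ; the file path is just an illustrative placeholder):

# e.g.: 4_benchmark artha,vacana 1 path/to/file.xml
my ($items, $do_benchmark, $location) = @ARGV;
my @queries = split /,/, $items // '';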

Re^5: LiBXML: New markup while preserving earlier tags?
by choroba (Cardinal) on Jun 26, 2018 at 20:23 UTC
    Oh sorry, I didn't have enough test data. Thank you for providing it.

    Does the following line fix the optimized version?

    $from += $from == $to++ ? 1 : 2;
      oh.my.holy.pope -- sorry bishop ;-) yes it does!

      test-6

      file: 6_sample_file2
      Size of document to string: 4354
      Number of queried items: 1
      ---
      96 item(s) tagged with both methods.
      Unique item(s): a
      ---
      Benchmark: timing 1000 iterations of Method One, Method Two...
      Method One: 19 wallclock secs (18.96 usr + 0.00 sys = 18.96 CPU) @ 52.74/s (n=1000)
      Method Two: 5 wallclock secs ( 5.01 usr + 0.00 sys = 5.01 CPU) @ 199.60/s (n=1000)
      Rate Method One Method Two
      Method One 52.7/s -- -74%
      Method Two 200/s 278% --

      All other tests are also successful. With all iterations timed over at least 5 wallclock seconds, the better rates of method 2 are now 287%, 300%, and 435%. The longer a run takes, the better the rate.
      Finally, the test where every letter a is marked in the document:

      a-test

      Size of document to string: 65715
      Number of queried items: 1
      ---
      9103 item(s) tagged with both methods.
      Unique item(s): a
      ---
      Runtime method 1: 77 wallclock secs (77.06 usr + 0.12 sys = 77.18 CPU)
      Runtime method 2: 0 wallclock secs ( 0.43 usr + 0.00 sys = 0.43 CPU)

      An optimized version indeed. Thank you.