Thank you so much again, choroba! I am incredibly grateful for the patch you provided to your initial solution. This is my second question at PerlMonks, and I could not have dreamed how fruitful it would turn out to be.
I benchmarked the two alternatives you provided: the method from the patched answer (= method one) and the updated optimization (= method two). As the test reports below show, method two is indeed faster (test-1), and in some contexts significantly faster (test-2), than method one.
But: method 2 sometimes tags fewer items than method 1 (test-3 to test-5). I think the reason is that method 2 only tags the first occurrence of a queried item in a parent's text node (test-6).
I have organized the scripts used in this discussion in a repository at github. If you feel uncomfortable with that, or if this is considered bad practice here, please let me know and I will remove the scripts.
The reports below can be reproduced with the script "4_benchmark" from the repository. When I enter "4_benchmark test-1", etc. at my Ubuntu 16.04 terminal, the following reports are produced:
test-1 (Application of the methods to the very small canid XML string)
Size of document to string: 122
Number of queried items: 4
---
Runtime method 1: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
Runtime method 2: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
---
7 item(s) tagged with both methods.
Unique item(s): he, e, quick brown fox, lazy dog
---
Benchmark: running Method One, Method Two for at least 5 CPU seconds...
Method One: 6 wallclock secs ( 5.37 usr + 0.00 sys = 5.37 CPU) @ 1375.98/s (n=7389)
Method Two: 5 wallclock secs ( 5.40 usr + 0.00 sys = 5.40 CPU) @ 1673.33/s (n=9036)
Rate Method One Method Two
Method One 1376/s -- -18%
Method Two 1673/s 22% --
I report the individual single runtimes of the methods because the actual runtime is distorted in the Benchmark comparison of the two methods: for now, I have no choice but to clone the DOM n times.
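Roughly, the timings are taken like this (a minimal sketch, not the literal contents of 4_benchmark; the inline XML string and the placeholder subs method_one()/method_two() are stand-ins for the real document and for your two routines):

  use strict;
  use warnings;
  use Benchmark qw(timeit timethese cmpthese timestr);
  use XML::LibXML;

  # Stand-in for the canid string; the real tests load the sample files.
  my $doc = XML::LibXML->load_xml(
      string => '<animals>The quick brown fox jumps over the lazy dog</animals>'
  );

  # Placeholders for the two tagging routines (patched answer / optimization).
  sub method_one { my ($dom) = @_; return $dom }
  sub method_two { my ($dom) = @_; return $dom }

  # Individual single runtimes, each taken on a fresh copy of the DOM.
  print 'Runtime method 1: ',
      timestr(timeit(1, sub { method_one($doc->cloneNode(1)) })), "\n";
  print 'Runtime method 2: ',
      timestr(timeit(1, sub { method_two($doc->cloneNode(1)) })), "\n";

  # Comparison: every iteration has to clone the DOM again, so the cloning
  # cost is included in the measured rates.  A negative count means
  # "run for at least that many CPU seconds".
  cmpthese(timethese(-5, {
      'Method One' => sub { method_one($doc->cloneNode(1)) },
      'Method Two' => sub { method_two($doc->cloneNode(1)) },
  }));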
Both methods tag the same number of items; method two is about a fifth faster here. The improvement in other runs ranges between 19% and 23%.
test-2 to test-5 (Application of the methods to the relatively small XML file "5_sample_file", rich in non-ASCII Latin Unicode characters)
test-2 (method 2 is faster)
file: 5_sample_file
Size of document to string: 65715
Number of queried items: 3
---
Runtime method 1: 0 wallclock secs ( 0.04 usr + 0.00 sys = 0.04 CPU)
Runtime method 2: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
---
46 item(s) tagged with both methods.
Unique item(s): śrī, syād, āpta
---
Benchmark: timing 350 iterations of Method One, Method Two...
Method One: 10 wallclock secs (10.40 usr + 0.00 sys = 10.40 CPU) @ 33.65/s (n=350)
Method Two: 4 wallclock secs ( 3.54 usr + 0.00 sys = 3.54 CPU) @ 98.87/s (n=350)
Rate Method One Method Two
Method One 33.7/s -- -66%
Method Two 98.9/s 194% --
Both methods tag the same number of items, which is satisfactory. Method two is nearly three times as fast here.
test-3 to test-5 (method 2 tags fewer items)
test-3
file: 5_sample_file
Size of document to string: 65715
Number of queried items: 1
---
Runtime method 1: 1 wallclock secs ( 0.04 usr + 0.00 sys = 0.04 CPU)
Runtime method 2: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
---
Method one tagged 62 item(s), method two 54!
Method one tagged: patra
Method two tagged: patra
There is no point in benchmarking methods that produce different results. With Text::Diff I could verify that method 1 here always tags, firstly, the same occurrences as method 2 and, secondly, additional occurrences.
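The check itself is nothing elaborate, roughly this (a sketch; the sub name tagging_diff is made up, and $dom1 and $dom2 are assumed to hold the document after tagging with method 1 and method 2 respectively):

  use strict;
  use warnings;
  use Text::Diff qw(diff);

  # Serialize both tagged documents and return a unified diff of the results.
  sub tagging_diff {
      my ($dom1, $dom2) = @_;
      my $out1 = $dom1->toString;
      my $out2 = $dom2->toString;
      return diff(\$out1, \$out2, { STYLE => 'Unified' });
  }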
Similarly with test-4 and test-5, which I carried out because I suspected Unicode conflicts with the system, the DOM or method 2. But no, the differences occur regardless of whether the queried items and/or the document contain Unicode characters.
In short, test-4:
Method one tagged 47 item(s), method two 45!
Tagged: artha, vacana
test-5:
Method one tagged 26 item(s), method two 23!
Tagged: ekāṃta, pratyakṣa
The matter becomes clearer when the methods are applied to an extensively tagged XML file:
test-6 (Application of the methods to 6_sample_file2)
file: 6_sample_file2
Size of document to string: 4354
Number of queried items: 1
---
Runtime method 1: 0 wallclock secs ( 0.02 usr + 0.00 sys = 0.02 CPU)
Runtime method 2: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
---
Method one tagged 96 item(s), method two 21!
Method one tagged: a
Method two tagged: a
---
Show Differences: y/n?
y
@@ -10 +10 @@
- <genre>dr<start query="a"/>a<end query="a"/>m<start query="a"/>a<end query="a"/></genre>
+ <genre>dr<start query="a"/>a<end query="a"/>ma</genre>
...
@@ -95 +95 @@
- <title>The M<start query="a"/>a<end query="a"/>rti<start query="a"/>a<end query="a"/>n</title>
+ <title>The M<start query="a"/>a<end query="a"/>rtian</title>
...
+ During <start query="a"/>a<end query="a"/> manned mission to Mars, Astronaut Mark Watney is
+ presumed dead after a fierce storm and left behind by his crew.
+ But Watney has survived and finds himself stranded and alone on
+ the hostile planet. With only meager supplies, he must draw upon
+ his ingenuity, wit and spirit to subsist and find a way to
+ signal to Earth that he is alive.
Method 2 (the "+" lines above) only tags the first occurrence of the queried item within the parent's text node.
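For the record, this is how I picture a fix: split the text node at every match instead of only the first, and rebuild the children. A minimal sketch (my own guess, not your code; the sub name and the plain-string matching via index() are assumptions, only the <start>/<end> markers with the query attribute are taken from the output above):

  use strict;
  use warnings;
  use XML::LibXML;

  # Replace one text node with text fragments interleaved with
  # <start query="..."/> and <end query="..."/> markers around every
  # occurrence of $query (plain string matching, no regex).
  sub tag_all_occurrences {
      my ($doc, $text_node, $query) = @_;
      my $parent = $text_node->parentNode;
      my $data   = $text_node->data;
      my $last   = 0;
      while ((my $pos = index($data, $query, $last)) >= 0) {
          # text before the current match
          $parent->insertBefore(
              $doc->createTextNode(substr($data, $last, $pos - $last)),
              $text_node) if $pos > $last;
          my $start = $doc->createElement('start');
          $start->setAttribute(query => $query);
          $parent->insertBefore($start, $text_node);
          # the matched text itself
          $parent->insertBefore($doc->createTextNode($query), $text_node);
          my $end = $doc->createElement('end');
          $end->setAttribute(query => $query);
          $parent->insertBefore($end, $text_node);
          $last = $pos + length $query;
      }
      # remaining text after the last match
      $parent->insertBefore(
          $doc->createTextNode(substr($data, $last)), $text_node)
          if $last < length $data;
      $parent->removeChild($text_node);
  }

Combined with a fast preselection of only those text nodes that actually contain the query, something like this might give the consistency of method 1 at closer to the speed of method 2.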
***
The implementation of method 1 is sufficient for what I am now excited to do. One last report:
Size of document to string: 66252
Number of queried items: 1
---
Runtime method 1: 95 wallclock secs (95.37 usr + 0.00 sys = 95.37 CPU)
Runtime method 2: 0 wallclock secs ( 0.04 usr + 0.00 sys = 0.04 CPU)
---
Benchmark: timing 1 iterations of Method One, Method Two...
Method One: 78 wallclock secs (77.91 usr + 0.03 sys = 77.94 CPU) @ 0.01/s (n=1)
(warning: too few iterations for a reliable count)
Method Two: 0 wallclock secs ( 0.03 usr + 0.00 sys = 0.03 CPU) @ 33.33/s (n=1)
(warning: too few iterations for a reliable count)
s/iter Method One Method Two
Method One 77.9 -- -100%
Method Two 3.00e-02 259700% --
---
Method one tagged 9135 item(s), method two 502!
Tagged: a
I am happy to wait 1.5 minutes for 9135 items in this document to be tagged. Still, I dream of more complex operations if the consistency of method 1 could be combined with the performance of method 2 (as in test-2).
The test can be reproduced with "4_benchmark a 1". The script accepts three arguments: (1) the queried items, separated by commas without spaces, (2) a number for the benchmark (the iteration count) and (3) a file location.
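For example, test-2 roughly corresponds to an invocation like this (with the file given explicitly as the third argument):

  4_benchmark śrī,syād,āpta 350 5_sample_file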