E-hm... hello? Are we still playing?
Long thread is long. I should have known better than to even start mumbling "paralle..li...", given the massive barrage of fire that ensued immediately after :). I hope I chose the correct version to test, and my dated PC (its number of workers in particular) is a poor workbench, but:
$ time ./llil3vec_11149482 big1.txt big2.txt big3.txt >vec6.tmp
llil3vec (fixed string length=6) start
get_properties CPU time : 1.80036 secs
emplace set sort CPU time : 0.815786 secs
write stdout CPU time : 1.39233 secs
total CPU time : 4.00856 secs
total wall clock time : 4 secs
real 0m4.464s
user 0m3.921s
sys 0m0.445s
$ time ./llil3vec_11149482_omp big1.txt big2.txt big3.txt >vec6.tmp
llil3vec (fixed string length=6) start
get_properties CPU time : 2.06675 secs
emplace set sort CPU time : 0.94937 secs
write stdout CPU time : 1.40311 secs
total CPU time : 4.41929 secs
total wall clock time : 4 secs
real 0m3.861s
user 0m4.356s
sys 0m0.493s
----------------------------------------------
Then I sent my workers into retirement to plant or pick flowers or something, i.e. (temporarily) reverted to single-threaded code, walked around (snow, no flowers), and made a few changes. Here is a comparison of the previous and new versions:
$ time ../j903/bin/jconsole llil4.ijs big1.txt big2.txt big3.txt out_j.txt
Read and parse input: 1.6121
Classify, sum, sort: 2.23621
Format and write output: 1.36701
Total time: 5.21532
real 0m5.220s
user 0m3.934s
sys 0m1.195s
$ time ../j903/bin/jconsole llil5.ijs big1.txt big2.txt big3.txt out_j.txt
Read and parse input: 1.40811
Classify, sum, sort: 1.80736
Format and write output: 0.373946
Total time: 3.58941
real 0m3.594s
user 0m2.505s
sys 0m0.991s
$ diff vec6.tmp out_j.txt
$
New script:
NB. -----------------------------------------------------------
NB. --- This file is "llil5.ijs"
NB. --- Run as e.g.:
NB.
NB. jconsole.exe llil5.ijs big1.txt big2.txt big3.txt out.txt
NB.
NB. --- (NOTE: last arg is output filename, file is overwritten)
NB. -----------------------------------------------------------
pattern =: 0 1
args =: 2 }. ARGV
fn_out =: {: args
fn_in =: }: args
filter_CR =: #~ ~: & CR
read_file =: {{
'fname pattern' =. y
text =. TAB, filter_CR fread fname
text =. TAB (I. text = LF) } text
selectors =. I. text = TAB
width =. # pattern
height =. width <. @ %~ # selectors
append_diffs =. }: , 2& (-~/\)
shuffle_dims =. (1 0 3 & |:) @ ((2, height, width, 1) & $)
selectors =. append_diffs selectors
selectors =. shuffle_dims selectors
literal =. < @: (}."1) @: (];. 0) & text "_1
numeric =. < @: (0&".) @: (; @: (<;. 0)) & text "_1
extract =. pattern & {
using =. 1 & \
or_maybe =. `
,(extract literal or_maybe numeric) using selectors
}}
read_many_files =: {{
'fnames pattern' =. y
,&.>/"2 (-#pattern) ]\ ,(read_file @:(; &pattern)) "0 fnames
}}
'words nums' =: read_many_files fn_in ; pattern
t1 =: (6!:1) '' NB. time since engine start
idx =: i.~ words
nums =: idx +//. nums
idx =: nums </. ~. idx
words =: (/:~ @: { &words)&.> idx
erase < 'idx'
nums =: ~. nums
'words nums' =: (\: nums)& { &.:>"_1 words ; nums
t2 =: (6!:1) '' NB. time since engine start
text =: ; words (, @: (,"1 _))&.(>`a:)"_1 TAB ,. (": ,. nums) ,. LF
erase 'words' ; 'nums'
text =: (#~ ~: & ' ') text
text fwrite fn_out
erase < 'text'
t3 =: (6!:1) '' NB. time since engine start
echo 'Read and parse input: ' , ": t1
echo 'Classify, sum, sort: ' , ": t2 - t1
echo 'Format and write output: ' , ": t3 - t2
echo 'Total time: ' , ": t3
exit 0
echo ''
echo 'Finished. Waiting for a key...'
stdin ''
exit 0
----------------------------------------------
I don't know the C++ "tools" chosen above ("modules" or whatever they are called) at all. Is capping the length to "6" in the code just a matter of convenience, so that any longer value, like "12" or "25", could be hard-coded instead (with obvious other fixes)? I mean, would no catastrophic (cubic, etc.) slow-down hit the sorting after some threshold, forcing one to comment out the define and use the alternative set of "tools"? Or perhaps input would be slower if cutting into words of unequal length is expected?
Anyway, here's the output if the define is commented out:
$ time ./llil3vec_11149482_no6 big1.txt big2.txt big3.txt >vec6.tmp
llil3vec start
get_properties CPU time : 3.19387 secs
emplace set sort CPU time : 0.996694 secs
write stdout CPU time : 1.32918 secs
total CPU time : 5.5198 secs
total wall clock time : 6 secs
real 0m6.088s
user 0m5.294s
sys 0m0.701s
$ time ./llil3vec_11149482_no6_omp big1.txt big2.txt big3.txt >vec6.tmp
llil3vec start
get_properties CPU time : 3.99891 secs
emplace set sort CPU time : 1.13424 secs
write stdout CPU time : 1.41112 secs
total CPU time : 6.54723 secs
total wall clock time : 4 secs
real 0m4.952s
user 0m6.207s
sys 0m0.842s
Should my time be compared to these? :) (Blimey, my solution doesn't have to compete when participants are capped selectively (/grumpy_on around here)). Or I can use the powerful magic secret turbo mode:
turbo_mode_ON =: {{
assert. 0 <: c =. 8 - {: $y
h =. (3 (3!:4) 16be2), ,|."1 [3 (3!:4)"0 (4:,#,1:,#) y
3!:2 h, ,y ,"1 _ c # ' '
}}
turbo_mode_OFF =: {{
(5& }. @: (_8& (]\)) @: (2& (3!:1))) &.> y
}}
Inject these definitions, and these two lines immediately after t1 =: and immediately before t2 =: respectively:
words =: turbo_mode_ON words
words =: turbo_mode_OFF words
Aha:
$ time ../j903/bin/jconsole llil5.ijs big1.txt big2.txt big3.txt out_j.txt
Read and parse input: 1.40766
Classify, sum, sort: 1.24098
Format and write output: 0.455868
Total time: 3.1045
real 0m3.109s
user 0m1.815s
sys 0m1.210s
(and no cutting to pieces of pre-defined equal length was used ...yet) :)
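For readers who don't speak J: as far as the turbo mode code can be read, it pads every word with spaces to 8 bytes and reinterprets those bytes as 64-bit integers in standard (big-endian) byte order, so the classify and sort steps work on integers; turbo_mode_OFF converts back, and the padding spaces are stripped later. Below is a rough Python sketch of that idea only, not a translation of the J. The names pack, unpack and WIDTH are made up for illustration, and it assumes plain ASCII words of at most 8 bytes.

```python
# Sketch: pack short ASCII words into 64-bit integers so that integer
# order matches the byte order of the space-padded words.
WIDTH = 8  # bytes per packed word (assumption, mirrors the 8 in the J code)


def pack(word: bytes) -> int:
    """Pad with spaces to WIDTH bytes and read the result as a big-endian integer."""
    if len(word) > WIDTH:
        raise ValueError("word longer than %d bytes" % WIDTH)
    return int.from_bytes(word.ljust(WIDTH, b" "), "big")


def unpack(key: int) -> bytes:
    """Inverse of pack: back to bytes, padding spaces removed."""
    return key.to_bytes(WIDTH, "big").rstrip(b" ")


words = [b"banana", b"apple", b"fig", b"applet"]
keys = sorted(pack(w) for w in words)    # an integer sort, no string compares
restored = [unpack(k) for k in keys]
print(restored)
```

Big-endian order matters here: the first character lands in the most significant byte, so comparing the integers gives the same order as comparing the padded words byte by byte.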
----------------------------------------------
I can revert to parallel reading/parsing at any time, with the effect shown in the parent node. As implemented there it was kind of passive, though: files can be of unequal sizes, or there can be just one huge file. I think a serious solution would probe inside the input to find newlines near approximate offsets, then pass chunk coordinates to workers to parse in parallel.
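A minimal sketch of that probing idea, in Python rather than J, and not something that was benchmarked here. It assumes input lines of the form word, TAB, count; the function names chunk_bounds, parse_chunk and count_parallel are hypothetical. Threads are used only to show the structure; in CPython real speed-up would need processes.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def chunk_bounds(path, n_chunks):
    """Return (start, end) byte offsets; every cut falls just after a newline."""
    size = os.path.getsize(path)
    cuts = [0]
    with open(path, "rb") as f:
        for i in range(1, n_chunks):
            guess = size * i // n_chunks   # approximate address
            if guess <= cuts[-1]:
                continue
            f.seek(guess)
            f.readline()                   # move to the end of the current line
            pos = f.tell()
            if cuts[-1] < pos < size:
                cuts.append(pos)
    cuts.append(size)
    return list(zip(cuts[:-1], cuts[1:]))


def parse_chunk(path, start, end):
    """Sum the counts per word for the lines inside one chunk."""
    totals = {}
    with open(path, "rb") as f:
        f.seek(start)
        data = f.read(end - start)
    for line in data.splitlines():
        if not line:
            continue
        word, _, count = line.partition(b"\t")
        totals[word] = totals.get(word, 0) + int(count)
    return totals


def count_parallel(path, n_workers):
    """Parse all chunks with a pool of workers and merge the partial sums."""
    merged = {}
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = pool.map(lambda b: parse_chunk(path, *b), chunk_bounds(path, n_workers))
        for part in parts:
            for word, n in part.items():
                merged[word] = merged.get(word, 0) + n
    return merged
```

The same chunk_bounds works for one huge file or for several files of unequal size (call it per file with a chunk count proportional to the file size), which is the point of probing instead of handing out whole files.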
The puny 2-worker attempt to sort, in the parent node, was just a kind of #pragma omp parallel sections... thing with 2 sections; there is no use in sending bus-loads of workers and expecting quiet fans. There's some hope for "parallelizable primitives" in release (not beta) 9.04 or later. Maybe that is a long time to wait. Or, if I could write code to merge 2 sorted arrays faster than the built-in primitive sorts either of the halves, then, bingo, I'd have a fast multi-threaded merge sort. But no success yet: the built-in sorts one large array faster, single-threaded.
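To make the "2 sections, then merge" shape concrete, here is a small Python sketch (again not J, and not the code from the parent node). The function name two_worker_sort is made up for illustration. It shows only the structure: each half is sorted by its own worker, then a serial merge joins them. In CPython the threads will not run the two sorts truly in parallel, so no speed-up should be expected from this snippet itself.

```python
from concurrent.futures import ThreadPoolExecutor
from heapq import merge


def two_worker_sort(items):
    """Sort two halves with one worker each, then merge the sorted halves."""
    mid = len(items) // 2
    with ThreadPoolExecutor(max_workers=2) as pool:
        left = pool.submit(sorted, items[:mid])    # "section" 1
        right = pool.submit(sorted, items[mid:])   # "section" 2
        # The merge is serial: it has to be cheaper than sorting one half,
        # otherwise the whole scheme loses to a single plain sort.
        return list(merge(left.result(), right.result()))


print(two_worker_sort([5, 3, 9, 1, 4, 8, 2]))
```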