in reply to Rosetta Code: Long List is Long
Please feel free to respond away with solutions...
Short version: the J script runs in ~5.7 sec using ~650 MB to produce exactly the same output, which makes it the fastest of all solutions so far.
(One caveat: the "words" are assumed to be of roughly similar length, since they are temporarily padded to the longest word during the program run (i.e. not padded at all with the current test sample). Expect deterioration if their lengths differ by two or more orders of magnitude, especially if only a few words are very long. I didn't check.)
###################### (Prose and previous results, can be skipped)
I was shocked to notice this thread is almost a month old already. While I'm in no hurry and have been pursuing what follows at leisure and only rarely (kind of "for dessert"), it's better to publish this at long last, whether it's the "final" optimized version or not (I'm sure it can be improved a lot), before the thread goes dead cold and whoever participated has to make an effort to re-read their own code because of the time elapsed.
As a reference frame, here is an assortment of results from previous solutions on my hardware:
llil2grt.pl (see 11148713) (fastest "pure core-only Perl"), Windows:
llil2grt start
get_properties : 16 secs
sort + output : 31 secs
total : 47 secs

Peak working set (memory): 2,744,200K
Same code & data & PC, on Linux (I didn't investigate why there is such a difference):
llil2grt start
get_properties : 11 secs
sort + output : 20 secs
total : 31 secs

2,152,848 Kbytes of RAM were used
I assume that Judy (see 11148585) is the best of the non-parallel Perl solutions in both speed and memory, with the same caveat as at the very top: "words" are temporarily padded to a fixed length, e.g. here to 10 bytes:
my_Judy_test start
get_properties : 13 secs
sort + output : 7 secs
total : 20 secs

349,124 Kbytes of RAM were used
Being a lazy bum, I didn't compile the C++ solutions (nor do I code in C++); here is a copy-paste from 11148969, which I assume is the best result so far among C++ and all others (on my PC, I expect the time would be worse):
llil2grt start
get_properties CPU time : 4.252 secs
emplace set sort CPU time : 1.282 secs
write stdout CPU time : 1.716 secs
total CPU time : 7.254 secs
total wall clock time : 7 secs

Memory use (Windows Private Bytes): 1,626,728K
###################### (End of prose and previous results)
The code below generates the following output, with RAM usage taken from Windows Task Manager (to be on par with how it was measured for Perl), while the script pauses for a keypress (the usual Ctrl-D on Linux or Ctrl-Z + Enter on Windows) after finishing:
Read and parse input:     1.636
Classify, sum, sort:      2.206
Format and write output:  1.895
Total time:               5.737

Finished. Waiting for a key...

Peak working set (memory): 657,792K
The injection of CR into output lines is only required on Windows (actually, not required at all) to ensure the output later diffs clean against the Perl output. The "magic constant" 3 for number width could be any value; it is only used for an intermediate step.
I had to make this code slightly less readable than it was during development by somewhat aggressively re-using the same variable names for words and nums over and over as the data are processed and modified while the script progresses. At each stage they had different "self-explanatory" names, but because the arrays are huge, it's better to immediately overwrite a variable on successive assignments to conserve memory. "Erasing" throw-away helper arrays (similar to undef in Perl) serves the same purpose.
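The same consideration translates to other languages; for instance, a minimal C++ analogue (illustrative only, not from this thread; the variable name echoes the script's text array) of releasing a throw-away array's memory:

#include <string>
#include <vector>

// C++ analogue of the "erase" trick described above: clear() keeps the
// vector's capacity reserved, so swap with an empty vector (or call
// shrink_to_fit()) to actually release the memory before the next big
// allocation.
int main() {
    std::vector<std::string> text(10000000, "padding");
    std::vector<std::string>().swap(text);  // capacity released here
    return 0;
}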
Actually, during development I was playing with this toy dataset; here are the original data and the result:
text =: noun define
tango 1
charlie 2
bravo 1
foxtrot 4
alpha 3
foxtrot 1
bravo 1
foxtrot 7
)

NB. Do work here...

] text
foxtrot 12
alpha 3
bravo 2
charlie 2
tango 1
The script:
NB. -----------------------------------------------------------
NB. --- This file is "llil.ijs"
NB. --- Run as e.g.:
NB.
NB.     jconsole.exe llil.ijs big1.txt big2.txt big3.txt out.txt
NB.
NB. --- (NOTE: last arg is output filename, file is overwritten)
NB. -----------------------------------------------------------

args =: 2 }. ARGV
fn_out =: {: args
fn_in =: }: args

NUM_LENGTH =: 3
PAD_CHAR =: ' '

make_sel =: [: (1 2 0& |:) @ ,: ([ ,. ] - [: , [)
sort_some =: ([: /:~ {)`[`] }

text =: , freads " 0 fn_in
lf_pos =: I. text = LF
tab_pos =: I. text = TAB

words =: ((0 , >: }: lf_pos) make_sel tab_pos) ];.0 text
nums =: 0&". (tab_pos make_sel lf_pos) ; @: (<;.0) text
erase 'text' ; 'lf_pos' ; 'tab_pos'

t1 =: (6!:1) ''  NB. time since engine start

nums =: words +//. nums
words =: ~. words
'words nums' =: (\: nums)& { &.:>"_1 words ; nums

starts =: I. ~: nums
ranges =: starts ,. (}. starts , # nums) - starts
count =: # starts

sort_words =: monad define
'ranges words' =. y
range =. ({. + i. @ {:) {. ranges
(}. ranges) ; range sort_some words
)

words =: > {: sort_words ^: count ranges ; words
erase 'starts' ; 'ranges'

t2 =: (6!:1) ''  NB. time since engine start

nums =: (- NUM_LENGTH) ]\ NUM_LENGTH ": nums
text =: , words ,. TAB ,. (nums ,. CR) ,. LF
erase 'words' ; 'nums'

text =: (#~ ~: & PAD_CHAR) text
text fwrite fn_out
erase < 'text'

t3 =: (6!:1) ''  NB. time since engine start

echo 'Read and parse input:     ' , ": t1
echo 'Classify, sum, sort:      ' , ": t2 - t1
echo 'Format and write output:  ' , ": t3 - t2
echo 'Total time:               ' , ": t3
echo ''
echo 'Finished. Waiting for a key...'

stdin ''
exit 0
Replies are listed 'Best First'.
Re^2: Rosetta Code: Long List is Long (faster)
by eyepopslikeamosquito (Archbishop) on Dec 27, 2022 at 21:38 UTC
I was shocked to notice this thread is almost a month old already ... it's better to publish this at long last, be it "final" optimized version or not (I'm sure it can be improved a lot), before the thread is dead cold and whoever participated have to make effort to read their own code because of time elapsed

Thanks for posting. I suggest you just take your time and enjoy it without worrying too much about how old the thread is. This is actually one of my favourite features of Perl Monks, so much so that I wrote a meditation about it: Necroposting Considered Beneficial. :)

Your reply motivated me into finally getting around to installing Linux on my newish Windows laptop. Being lazy, I did this with a single wsl --install command, which installs the default Ubuntu distribution of Linux from the Microsoft Store. AFAIK, the main alternative is to install VMware, followed by multiple different Linux distros. Anyways, after doing that I saw similar performance from both my fastest Perl and C++ versions. For llil2grt.cpp on Windows 11:
On Ubuntu WSL2 Linux (5.15.79.1-microsoft-standard-WSL2), running on the same hardware, compiled with the identical g++ -o llil2grt -std=c++11 -Wall -O3 llil2grt.cpp, we see it runs slightly faster:
Sadly, it's becoming increasingly obvious that to make this run significantly faster, I'll probably have to change the simple and elegant: into something much uglier, possibly containing "emplace_hint" or other std::map claptrap ... and I just can't bring myself to do that. :) More attractive is to leave the simple and elegant one-liner alone and instead try to inject a faster custom std::map memory allocator (I have no idea how to do that yet).

Conversely, my fastest Perl solution llil2grt.pl runs slightly slower on Ubuntu: compared to 10, 20, 30 secs on Windows 11. That's Perl v5.34.0 on Linux vs Strawberry Perl v5.32.0 on Windows. Update: this shortened version llil2cmd-long.pl runs a bit faster on Ubuntu (but not on Windows):
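For the curious, a minimal hypothetical sketch (not code from this thread) of the emplace_hint idea mentioned above: carry the iterator returned by the previous update as a hint, which pays off when successive keys arrive in nearly sorted order. The helper name and map types are illustrative assumptions; the negated count matches llil2grt's descending-sort trick.

#include <map>
#include <string>

// Hypothetical sketch: update a count map while passing the previous
// iterator as a hint. A good hint turns the O(log n) tree descent into
// amortized O(1) for nearly sorted keys; a bad hint costs nothing extra.
void add_word(std::map<std::string, long long>& hash_ret,
              std::map<std::string, long long>::iterator& hint,
              const std::string& word, long long count)
{
    auto it = hash_ret.emplace_hint(hint, word, 0LL);  // no insertion if word exists
    it->second -= count;  // accumulate negated count, as in llil2grt
    hint = it;
}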
The injection of CR into output lines is only required on Windows (actually, not required at all)

Yes, you're right, Windows nowadays seems perfectly happy with text files terminated with \n rather than the traditional DOS \r\n. By default, Perl and C++ both output text files with "\n" on Unix and "\r\n" on Windows. I'll update my test file generator to generate identical "\n"-terminated files on both Unix and Windows. Update: test file generators updated here: Re^3: Rosetta Code: Long List is Long (Test File Generators). Curiously, \n seems to be slower than \r\n on Windows if you don't set binmode! I am guessing that chomp is slower with \n than with \r\n on a Windows text stream.

Update: Are you running Strawberry Perl on Windows? Which version? (Trying to understand why your Windows Perl seems slower than mine.)

Update: The processor and SSD disk (see Novabench top scoring disks) on my HP laptop:
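On the binmode point: in C++ terms, the "\r\n" translation described above is a property of text-mode streams, and opening the stream in binary mode is the rough equivalent of Perl's binmode. A minimal sketch (file names are arbitrary):

#include <fstream>

int main() {
    // Text mode (the default): on Windows each '\n' written is translated
    // to "\r\n" by the runtime; on Unix it is written as-is.
    std::ofstream text_out("out_text.txt");
    text_out << "alpha\n";

    // Binary mode: no translation on any platform, so the file is
    // byte-identical everywhere -- roughly what Perl's binmode() does.
    std::ofstream bin_out("out_bin.txt", std::ios::binary);
    bin_out << "alpha\n";
    return 0;
}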
References Added Later
Updated: Noted that llil2grt.pl runs slightly slower on Ubuntu Linux than on Windows, along with detailed timings. Clarified that I'm running Windows 11. Added more detail on my laptop hardware. Mentioned llil2cmd-long.pl, developed later.
by Anonymous Monk on Dec 28, 2022 at 13:09 UTC
By "same PC" to run a test under Windows vs. Linux, I meant a classic dual boot with GRUB, as I have; but perhaps virtualization is no longer a hindrance, and, moreover, not in this case. Because I compiled llil2grt.cpp in both OSes (with the command line and options you advised), here's what I got:
I observe the same relative difference running the Perl script. So it looks like it's not an issue of the Perl and Strawberry version you asked me about (which is the latest available "5.32.1.1-64bit-PDL", BTW). I further compiled llil2grt.cpp using the MinGW shell and g++ which came with the older 5.26 Strawberry, and got the same 10 secs. I'm clueless why this PC is slow with Windows. Perhaps either MS or Dell issued a patch in recent years to address a Kaby Lake CPU (mine is an i5-7500T) "vulnerability"? I vaguely remember it was said performance would suffer if such a patch to guard against mythical attackers were applied. Just a guess, sorry if it's ridiculous. On the other hand, the J script performs the same in both OSes. Speaking of which, to return to "interpreted language is faster than compiled C++" -- of course it's dataset bias to a large extent. I ran a test with "long" files from the other test file generator you suggested previously:
(NUM_LENGTH changed to 10, otherwise same code)
by eyepopslikeamosquito (Archbishop) on Dec 28, 2022 at 21:54 UTC
Thanks for the extra information. Very useful.
By "same PC" to run a test under Windows vs. Linux, I meant classic dual boot and GRUB as I have

I'd totally forgotten that 20 years ago I rushed out to buy a Unix magazine (with free CD-ROM attached!) containing GRUB, which I installed on my good old Dell PC to happily dual boot Windows and Linux for several years. :)

I've felt a bit overwhelmed recently by the massive changes in the virtualization arena. AFAICT, WSL2 adds about 5% overhead compared to running Linux natively, so your dual-booting performance figures should (in theory) be more accurate than mine. Even with the 5% WSL2 overhead, llil2grt runs slightly faster on Linux on my PC.
I'm clueless why this PC is slow with Windows

I'm clueless why my C++ program runs slightly faster on Ubuntu, while my Perl program runs slightly slower (timings here). :)

Virtualization and Multi-boot References
Cloud References
Linux Distribution References
Cygwin References
Other
See Also
Updated: many references added long after original reply was made.
Re^2: Rosetta Code: Long List is Long (faster)
by marioroy (Prior) on Dec 29, 2022 at 08:51 UTC
I tried the J script demonstration by Anonymous Monk on a box running Clear Linux. The following are the steps taken.

Installation:
Script modification: Next, I made a change to not output CR, so as to be able to compare against the original output without CR. For some reason, pressing a key or the return key did not exit the script (resolved by removing two lines).
Non-AVX results:
AVX2 results: From the wiki, "The initial installation is non-AVX and finishing the steps upgrades to AVX or AVX2 as appropriate."
Anonymous Monk, blessings and grace. Thank you for the J Programming Language demonstration.
Re^2: Rosetta Code: Long List is Long (faster - vec)
by eyepopslikeamosquito (Archbishop) on Jan 09, 2023 at 06:02 UTC
J script runs in ~5.7 sec using ~650 MB to produce exactly same output. Which makes it fastest of all solutions so far.

The good news is that I've now found two different C++ versions that are faster. The bad news is that they're much uglier than my original llil2grt.cpp (which ran in about 6.2 seconds on Ubuntu). After getting nowhere trying to speed up llil2grt.cpp, I reluctantly decided that the simple and elegant hash_ret[word] -= count had to go. First, the timings on Ubuntu. Because there's quite a bit of natural variation between runs, I did three runs of each:
The first one still uses std::string, so there are no limitations on word length. The second one is limited to a word length of six characters, and so works with big1.txt while crashing spectacularly with long1.txt. Funny, I'd originally forgotten that way back in 2014, when desperately trying to make a search program run a hundred times faster, I'd stumbled upon this classic quote:
Back then, because each memory access into my (huge) 4 GB lookup tables was essentially random, most memory accesses missed the L1 cache, missed the L2 cache, missed the L3 cache, then waited for the cache line to be loaded into all three caches from main memory, while often incurring a TLB cache miss, just to rub salt into the wounds. Thankfully, there are no 4 GB lookup tables in this problem. But hashes (and linked lists) still tend to be mind-bogglingly slower than vectors on modern hardware, as indicated by Stroustrup's quote above. So I reluctantly decided that the simple and elegant hash_ret[word] -= count had to go. Though I really hate this new std::vector-based solution, it does run faster than my earlier std::map-based one. This is just a first attempt; further improvements are possible. The llil2vec.cpp source code is shown below. Though there are two sets of timings, there is only one program, compiled with or without MAX_STR_LEN_L defined. Suggestions for improvements welcome.

llil2vec.cpp
Read more... (9 kB)
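Since the full 9 kB listing is collapsed above, here is a heavily simplified sketch (not the actual llil2vec.cpp; names are illustrative) of the vector idea it is built on: keep (word, count) pairs in one contiguous vector, sort so duplicate words become adjacent, then sum each run in a single sequential pass.

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

using Pair = std::pair<std::string, long long>;

// Simplified sketch of the vector idea: sort so equal words are adjacent,
// then reduce each run in one pass -- contiguous memory traffic instead of
// the pointer chasing a tree or hash incurs.
std::vector<Pair> reduce(std::vector<Pair> v) {
    std::sort(v.begin(), v.end());
    std::vector<Pair> out;
    for (std::size_t i = 0; i < v.size(); ) {
        long long sum = 0;
        std::size_t j = i;
        while (j < v.size() && v[j].first == v[i].first) sum += v[j++].second;
        out.emplace_back(v[i].first, sum);
        i = j;
    }
    return out;
}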
by marioroy (Prior) on Jan 09, 2023 at 10:31 UTC
For fixed string length (I ran it several times), clang++ computes ~0.3 seconds faster compared to g++. See also the J script result with AVX2 enabled (total time 3.48103 secs).
by eyepopslikeamosquito (Archbishop) on Jan 09, 2023 at 11:55 UTC
Thanks! Yes, I'm seeing clang++ slightly faster too, but only for the limited-length fixed string case. Funny, I'd earlier tried clang++ for the non-fixed-string case, but it seemed to be very slightly slower than g++. That result, combined with Google searches suggesting that g++ usually produces slightly faster executables, caused me to give up on clang++. I also fiddled with some of the many compiler parameters, but felt overwhelmed by their sheer number and complexity, so I just stuck to the basic ones for now. The natural variation in the timing of each run also makes it hard to be certain. After googling, I was thinking of something like: but was too gutless to try it, given it needs root permissions and I didn't really know what I was doing.
by marioroy (Prior) on Jan 09, 2023 at 12:59 UTC
by eyepopslikeamosquito (Archbishop) on Jan 10, 2023 at 12:12 UTC
Thanks to the excellent feedback from marioroy, I've been able to improve the code as follows:

New Timings for Long String Version
This is an average time of 5.35 secs (compared to 5.5 secs in the original post).

New Timings for Short String (MAX_STR_LEN_L=6) Version
This is an average time of 3.4 secs (compared to 4.3 secs in the original post).

New version of llil2vec.cpp
Read more... (9 kB)
Interim OpenMP Versions

Note: The interim versions below are just early (conservative) openmp-safe versions, made thread-safe by expunging the dreaded strtok function. The one with the most potential for multi-threading seems to be llil3vec.cpp, because it uses vectors, which can be safely used lock-free in thread-local vec_loc vectors, as shown by marioroy in llil4vec.cpp (a sketch of this pattern follows the collapsed listing below). While replacing std::sort and std::stable_sort with their __gnu_parallel versions to sort std::vectors is a no-brainer, it seems problematic to parallelise access to a global std::map (as used in llil3grt.cpp and llil3m.cpp below), because I think you'd need some sort of locking to safely allow multiple threads to update a shared global std::map. Ideas welcome.

Interim version of llil3vec.cpp
Read more... (9 kB)
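As a rough illustration of the lock-free pattern described above (a sketch under assumptions, not marioroy's llil4vec.cpp; the parsing itself is elided), each thread fills its own vec_loc and only the final append into the shared vector is serialized:

#include <parallel/algorithm>  // __gnu_parallel::sort (GCC; needs -fopenmp)
#include <string>
#include <utility>
#include <vector>

using Pair = std::pair<std::string, long long>;

// Threads fill thread-local vectors without locking; only the short append
// into the shared vector runs inside a critical section.
void gather(const std::vector<std::string>& files, std::vector<Pair>& all) {
    #pragma omp parallel
    {
        std::vector<Pair> vec_loc;  // thread-local, lock-free
        #pragma omp for nowait
        for (long i = 0; i < (long)files.size(); ++i) {
            // ... parse files[i] into vec_loc (elided) ...
        }
        #pragma omp critical
        all.insert(all.end(), vec_loc.begin(), vec_loc.end());
    }
    __gnu_parallel::sort(all.begin(), all.end());  // parallel std::sort drop-in
}

Compile with g++ -O3 -fopenmp. A shared std::map has no equally cheap equivalent of vec_loc, which is exactly the locking problem mentioned above.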
Timings of llil3vec
For the -fopenmp version, note that the Unix real time is lower but the user time is higher.

Interim version of llil3grt.cpp
Read more... (9 kB)
Interim version of llil3m.cpp
Read more... (11 kB)
Timings
References Added Later
Updated: The source code for llil2vec.cpp was updated after the node was first created, based on feedback from the replies. Added interim version llil3vec.cpp (with some marioroy improvements). Plus interim versions of llil3grt.cpp and llil3m.cpp. Added a few missing #include <array> lines and a reference for the OpenMP Little Book (thanks marioroy).
by marioroy (Prior) on Jan 10, 2023 at 17:25 UTC
I'm not seeing a noticeable difference between the initial llil2vec demonstration and this one w.r.t. total time. So I searched the web and tried fast_io, with success. It requires a C++20-capable compiler.
Add the header file above other includes and comment out the cstdio include line.
Specify the path to the include folder, and compile as C++20.
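The collapsed snippets presumably amount to something like the following reconstruction, based on the fast_io project's public README rather than on this thread (the include path and the print spelling are assumptions):

// Reconstruction (assumption): fast_io is header-only; include it ahead of
// the other headers and drop the <cstdio> include, as described above.
#include <fast_io.h>

int main() {
    ::fast_io::io::print("hello, fast_io\n");  // printf/puts replacement
}

with a compile line along the lines of g++ -std=c++20 -O3 -I fast_io/include llil2vec.cpp -o llil2vec, where the -I path depends on where the repository was cloned.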
Before (C++11) results:
C++20 + fast_io results:
by eyepopslikeamosquito (Archbishop) on Jan 11, 2023 at 11:12 UTC
by marioroy (Prior) on Jan 11, 2023 at 16:51 UTC
by marioroy (Prior) on Jan 17, 2023 at 14:28 UTC
I tried the following for the interim version of llil3m.cpp. The overhead of running parallel was greater than I had hoped for using std::map. Then I found parallel hashmap. From the link: "By using parallel SSE2 instructions, these hashmaps are able to look up items by checking 16 slots in parallel." Running:
llil4map.cc
Read more... (20 kB)
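For orientation, a minimal sketch of the drop-in (not marioroy's llil4map.cc; the template arguments are reconstructed from the library's documented defaults, so treat them as assumptions): parallel_flat_hash_map shards the table into 2^N submaps, and supplying a real mutex type makes each submap independently lockable, so concurrent writers mostly hit different shards.

#include <mutex>
#include <string>
#include <parallel_hashmap/phmap.h>

// Sketch: same interface as std::unordered_map, but internally sharded.
// The final two arguments select the shard count and per-shard lock; the
// library default is N=4 submaps and phmap::NullMutex (no locking).
using Map = phmap::parallel_flat_hash_map<
    std::string, long long,
    phmap::priv::hash_default_hash<std::string>,
    phmap::priv::hash_default_eq<std::string>,
    phmap::priv::Allocator<phmap::priv::Pair<const std::string, long long>>,
    8,            // 2^8 = 256 submaps
    std::mutex>;  // one lock per submap

int main() {
    Map word_count;
    word_count["flower"] += 5;  // drop-in std::unordered_map usage
}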
by eyepopslikeamosquito (Archbishop) on Feb 22, 2023 at 10:10 UTC
by marioroy (Prior) on Mar 09, 2023 at 08:40 UTC
by Anonymous Monk on Jan 10, 2023 at 17:45 UTC
The good news is... The bad news is...

The good news is that I bring good news only! :) The modified J script is faster, more versatile, uses significantly less RAM, and has been tested with the 9.04 engine to parallelize obvious low-hanging fruit for an additional speed boost.
The code above doesn't (yet) include any 9.04 features and runs OK with 9.03, but I found 9.04 slightly faster in general. I also found 9.04 a bit faster on Windows, which is the opposite of what I had seen for 9.03 (the script ran faster on Linux); let's shrug it off because of 9.04's beta status and/or my antique PC. Results below are for beta 9.04 on Windows 10 (RAM usage taken from Windows Task Manager):
There are 3 star-marked lines. To patch in the new 9.04 features and enable parallelization, replace them with these counterparts:
As you see, the 1st line replaces a comment, and the 2nd and 3rd lines require just minor touches. The 2nd line launches reading and parsing of the input files in parallel. The 3rd line parallelizes filtering for unique words and summing the numbers according to the word classification -- kind of redundant double work, even as it was, as I see it. The 1st line starts 3 additional worker threads. I don't have more cores in my CPU anyway, and this script has no work easily dispatched to more workers. Then:
I would call my parallelization attempt, however crude, a success. Next is the output for our second "official" dataset in this thread:
########################################################

These are my results for the latest C++ solution (compiled using g++), to compare my efforts against:
I noticed that the new C++ code, supposed to be faster, is actually slower (compared to llil2grt) with the "long" dataset from the two "official" datasets used in this thread.
by eyepopslikeamosquito (Archbishop) on Jan 11, 2023 at 11:49 UTC
More impressive analysis from our mysterious anonymonk.
I noticed that new C++ code, supposed to be faster, is actually slower (compared to llil2grt) with "long" dataset from two "official" datasets used in this thread

Good catch! I was so hyper-focused on trying to beat the lowest time for the simple "big1.txt big2.txt big3.txt" test case that most folks were using that I didn't even test the "long1.txt long2.txt long3.txt" case. :( With a few seconds' thought, it seems obvious that very long words will favour the hash-based approach over the vector-based one. I'm pleased in a way, because my hash-based solution is so much simpler and clearer than my vector-based one ... plus it encourages us to explore different approaches. I'm happy for folks to continue to claim the fastest time for the short-string big1.txt big2.txt big3.txt test case. That's like the 100 metre sprint. We should add a marathon world record for the fastest long1.txt long2.txt long3.txt long4.txt long5.txt long6.txt test case. And maybe a 1500 metre world record for big1.txt big2.txt big3.txt long1.txt long2.txt long3.txt. :)

Update: Timings for these three Olympic Events, run under Ubuntu on my modest laptop against the code here: Interim version of llil3grt.cpp and llil4vec-tbb.cpp. llil4vec easily won the 100 metre sprint; llil3grt easily won the marathon; llil4vec narrowly won the decider, the 1500 metres.
by Anonymous Monk on Jan 10, 2023 at 17:54 UTC
What follows is mostly my attempt to describe and explain, i.e. a fair amount of prose, perhaps not worth a copper (insert name of local coin of lowest denomination). First of all, thanks for the warm words, marioroy; I'm glad you liked this cute toy. I think you'll be particularly interested in the parallelization and how they do it in J. And BTW, the key to press is Ctrl-D, to "slurp" from STDIN. Plus, to keep this node from being completely off-topic among the Perl congregation: some monks might also be interested to know that, if any FFI module for Perl is allowed for work or play, it's very easy to execute J phrases by calling the J DLL and reading/writing packed Perl scalars as J data. It's been a while since I had that fun, but see e.g. link [1] to get started; it's about calling J from the J REPL, but I remember those experiments were quite useful for learning. Whoever gets interested will easily find more tutorials.

In my 1st script, I actually posted code which is borderline buggy, and I have felt uneasy about it since last year:
What's to the right of the comma in the 1st line reads input into an N x M character table (then the comma converts it into a 1D list): number of files by longest file size, with short files padded. If the files were not the same size, the result would be erroneous. Further, if 100 files were 1 MB and the 101st file were 1 GB, this would end with an "out of RAM" or similar error. Neither line is used in the modified code, but with this off my chest, let's move on.

I noticed that the very slow part of my solution was (and still is) reading and parsing. Moreover, this trivial part (not the subsequent sorting!) is the main RAM consumer during the script run. Whatever I tried didn't help much. Then I decided: if that's so, let's make it at least potentially useful for the future. Instead of ad hoc code to consume strictly 2-column input, I wrote a function ("verb") to read from a (well-formed) TSV file with any number of columns, with a "pattern" provided to describe how to read the columns, as either text or numbers (an illustrative C++ analogue appears at the end of this post). In our case, the pattern is obviously 0 1. To keep the script interface consistent, this value is hard-coded, but it should of course be e.g. a CLI argument. Part of what makes a TSV file "well-formed" for this verb is that "text" doesn't contain the ASCII-32 space character. You maybe noticed I commented out the PAD_CHAR definition, because changing it won't be very useful. Some more substantial changes would be required to allow ASCII-32 (and then to pad with ASCII-0, perhaps), because the frame-fill for the literal data type is the space character, nothing else. I'm keeping the line commented out as a reminder.

Which leads us to the essential question -- why keep textual data as a padded N x M table, instead of keeping the text strings (which may then be of any different lengths) each in their own "box"? A box in J is an atomic data type and a sort of pointer to whatever is inside; data structures of any complexity are implemented with nested boxes. The answer is: because boxes are relatively expensive [2]. A few thousand (or 1e5 or so) are OK; an array of 10 million boxes (the total number of lines in our dataset) easily eats a few GB of RAM and is very slow. This also explains why I didn't use the DSV addon [3] from the standard library. For similar reasons (while I'm at explaining this stage of my route of decision making) I decided against symbols [4] (irreducible and a shared GST (global symbol table)? Don't know what to think of it).

So, textual columns are parsed into padded 2D tables; but then I saw that slurping all files into a single string prior to parsing is not required at all, if the files are not too tiny. And indeed, the RAM requirements for the script then dropped significantly. Further, parsing file by file makes an obvious choice as to where to try to parallelize. To finish with data input: the CRs are filtered out in the modified script unconditionally, and, also unconditionally, NOT injected on output, in both OSes. And BTW, for fread and fwrite from the standard library there exist counterparts freads and fwrites which strip CRs. I don't know why, but they are slow compared to manual filtering; the latter one is many seconds slower and was not used even in the original script. The modified script got rid of freads as well.

Moving on to how sorting was done in the 1st script version -- actually with somewhat complicated and slow code, trying to sort word sub-ranges (per the same numeric count) without creating intermediate boxed arrays. In the modified script, this is replaced with just a short simple phrase. As long as there are fewer than a few million unique counts, this would be OK w.r.t.
RAM usage (but the 1st version would get very slow then anyway). This leads to a verb new in the 9.04 version (Key dyad [5]). I thought using it would help to "pack" pairs of words and numbers and then do a kind of "packed sort" trick. It turned out to be a few times slower; I didn't investigate and abandoned this route.

Moving on to data output: requiring the user to provide a "numeric width" (or whatever it was called) was quite silly on my part; computing the decimal logarithm of the maximum would be OK :) And even so, it turned out that J formats a numeric column (as opposed to a row, i.e. a 1D array) automatically padded as required.

My final thought is that multithreading is a new feature in the 9.04 version, as I already said. The documentation says [6]: "Even if your code does not create tasks, the system can take advantage of threadpool 0 to make primitives run faster". At first I thought this meant that simply spawning workers would be enough for J to run parallel, kind of how e.g. PDL does [7] (or tries to). It turned out not to be so, either because the script is very simple, or because this automatic parallelization is not yet enabled in the beta. We'll see.

1. https://code.jsoftware.com/wiki/Guides/DLLs/Calling_the_J_DLL
2. https://www.jsoftware.com/help/jforc/performance_measurement__tip.htm#_Toc191734575
3. https://code.jsoftware.com/wiki/Addons/tables/dsv
4. https://code.jsoftware.com/wiki/Vocabulary/sco
5. https://code.jsoftware.com/wiki/Vocabulary/slashdot#dyadic
6. https://code.jsoftware.com/wiki/Vocabulary/tcapdot#dyadic
7. https://metacpan.org/dist/PDL/view/Basic/Pod/ParallelCPU.pod
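As a postscript to the column-pattern reader described above, an illustrative analogue in C++ rather than J (not the actual verb; the names and the 0-means-text / 1-means-number reading of the "0 1" pattern are assumptions):

#include <sstream>
#include <string>
#include <variant>
#include <vector>

// Pattern-driven TSV line reader: pattern entry 0 keeps a column as text,
// 1 parses it as a number, mirroring the "0 1" pattern for word/count input.
using Cell = std::variant<std::string, long long>;

std::vector<Cell> parse_line(const std::string& line,
                             const std::vector<int>& pattern) {
    std::vector<Cell> row;
    std::istringstream ss(line);
    std::string field;
    for (int kind : pattern) {
        std::getline(ss, field, '\t');
        if (kind == 0) row.emplace_back(field);
        else           row.emplace_back(std::stoll(field));
    }
    return row;
}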
by Anonymous Monk on Jan 17, 2023 at 18:44 UTC
E-hm... hello? Are we still playing? Long thread is long. I should have known better before even beginning to think of starting to mumble "paralle...li...", because of the massive barrage of fire that ensued immediately after :). Hoping I chose the correct version to test, and granting that my dated PC (its number of workers in particular) is a poor workbench, but:
----------------------------------------------

Then I sent my workers into retirement, to plant or pick flowers or something, i.e. (temporarily) reverted to single-threaded code, walked around (snow, no flowers), and made a few changes. Here's a comparison of the previous and new versions:
New script:
----------------------------------------------

I don't know the C++ "tools" chosen above ("modules" or whatever they are called) at all; is capping the length to "6" in the code just a matter of convenience? Could any longer value be hard-coded instead, like "12" or "25" (with the obvious other fixes)? I mean, no catastrophic (cubic, etc.) slow-down would happen to the sorting after some threshold, forcing one to comment out the define and use the alternative set of "tools"? Perhaps input would be slower if cutting into unequally long words is expected? Anyway, here's the output if the define is commented out:
Should my time be compared to theirs? :) (Blimey, my solution doesn't have to compete when participants are capped selectively (/grumpy_on around here).) Or I can use the powerful magic secret turbo mode:
Inject these definitions, and these couple of lines immediately after t1 =: and before t2 =:, respectively:
Aha:
(and no cutting into pieces of pre-defined equal length was used ...yet) :)

----------------------------------------------

I can revert to parallel reading/parsing anytime, with the effect shown in the parent node. As implemented, it was kind of passive; but files can be of unequal sizes, or there may be just one huge single file. I think a serious solution would probe inside to find newlines at approximate addresses, then pass chunk coordinates to workers to parse in parallel. The puny 2-worker attempt to sort, in the parent, was just a kind of "#pragma omp parallel sections..." thing with 2 sections; no use sending bus-loads of workers and expecting quiet fans. There's some hope for "parallelizable primitives" in release (not beta) 9.04 or later; maybe that's a long time to wait. Or, if I could write code to merge 2 sorted arrays faster than the built-in primitive sorts either half -- then, bingo, I'd have a multi-threaded fast merge-sort. But no success yet: the built-in sorts one large array faster, single-threaded.
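For what it's worth, the two-halves idea in C++ terms (an analogue, not J; whether the final merge can beat the built-in single-threaded sort is exactly the open question above):

#include <algorithm>
#include <thread>
#include <vector>

// Two-worker merge-sort: sort each half in parallel, then merge. It only
// wins if the merge is cheaper than one half's sort.
void two_thread_sort(std::vector<long long>& v) {
    auto mid = v.begin() + v.size() / 2;
    std::thread t([&] { std::sort(v.begin(), mid); });  // left half, worker thread
    std::sort(mid, v.end());                            // right half, main thread
    t.join();
    std::inplace_merge(v.begin(), mid, v.end());        // single-threaded merge
}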
by marioroy (Prior) on Jan 18, 2023 at 16:47 UTC |