WouterVG has asked for the wisdom of the Perl Monks concerning the following question:
Dear all,
I am currently trying to shuffle the codons of a DNA sequence. The sequence needs to be split in groups of 3 characters and those groups need to be shuffled. So far I was able to split the sequence in groups of 3. However, I do not succeed at shuffling them randomly. I am uncapable of installing the list utils... I do not seem to find a good tutorial for macOS.... And I am unable to correctly introduce the Fisher Yates algorithm into my source code... Another noob has entered the monastery... My code so far :
print "enter sequence and signal end with enter followed by ctrl d\n" +; $sequence = <STDIN>; chomp $sequence; print "sequence inserted : $sequence\n"; @trips = unpack("a3" x (length($sequence)-2), $sequence); @trips = join(" ", @trips);
so I am looking to shuffle @trips and then store it into @shuftrips for example.
Kind regards,
Wouter
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: Shuffling CODONS
by BrowserUk (Patriarch) on Jun 07, 2018 at 13:51 UTC | |
Here's a proper (tested), efficient, in-place, pure Perl, Fisher Yates shuffle:
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
In the absence of evidence, opinion is indistinguishable from prejudice.
Suck that fhit
| [reply] [d/l] |
|
Re: Shuffling codons
by hippo (Archbishop) on Jun 07, 2018 at 13:29 UTC | |
It's an FAQ! I am uncapable of installing the list utils List::Util is in core and has been since 5.7.3 so you probably don't even need to install it. Either way, I'd tackle this problem (module installation) if I were in your shoes as fixing it will solve many more problems that just the one in your post. In what way are you incapable of installing it? | [reply] |
by WouterVG (Novice) on Jun 07, 2018 at 13:44 UTC | |
Thanks for your reply. I know it is an FAQ, That's where I got the Fisher Yates option from, But unable to correctly implement it in my code.. If I try the list util it doesnt work... My code looks like this
What is wrong/missing here? I am incapable of installing the list::utils, I wouldn't know where to unpack them and how to properly install and run them. Do they need to be situated somewhere. I am on perl v5.18.2, so the list utils should already be there you mention? How would I implement the fisher Yates algorithm correctly? | [reply] [d/l] |
by pryrt (Abbot) on Jun 07, 2018 at 14:12 UTC | |
perl v5.18.2 came after perl v5.7.3, so yes, it should be installed for you. However, the name is List::Util, not list::utils. The following shows that list::utils doesn't exist, but that List::Util exists and has been in core since v5.7.3:
... Besides, if your example code, which included use List::Util 'shuffle'; compiled at all, then you already knew you had it installed properly. Oh, there, that's what you're doing wrong: @trips = join(" ", @trips);. You just replaced the contents of the @trips array with a single value, which is a string which joins all the old elements of @trips with a space. I think what you want is more akin to:
| [reply] [d/l] [select] |
by hippo (Archbishop) on Jun 07, 2018 at 14:13 UTC | |
If I try the list util it doesnt work Not enough information there. Error message? Segfault? Compilation error? What is wrong/missing here? This line:
ruins it. What do you understand this line to be doing? Without it I get at least some shuffling which may or may not be what you want:
which gives:
I am on perl v5.18.2, so the list utils should already be there you mention? Yes. | [reply] [d/l] [select] |
|
Re: Shuffling CODONS
by tybalt89 (Monsignor) on Jun 07, 2018 at 13:39 UTC | |
My favorite shuffle
| [reply] [d/l] |
by BrowserUk (Patriarch) on Jun 07, 2018 at 14:41 UTC | |
Why is that a favorite? I get it'll shuffle anything correctly, but trading O(NlogN) for O(1) is painful: <Reveal this spoiler or all in this thread>
Benchmark:
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
In the absence of evidence, opinion is indistinguishable from prejudice.
Suck that fhit
| [reply] [d/l] [select] |
by WouterVG (Novice) on Jun 07, 2018 at 14:02 UTC | |
Thanks for your reply! If I implement it like below. It doesn't change the order of codons.
| [reply] [d/l] |
by tybalt89 (Monsignor) on Jun 07, 2018 at 14:23 UTC | |
Because @trips = join(" ", @trips); joins all codons into ONE string! | [reply] [d/l] |
by tobyink (Canon) on Jun 07, 2018 at 14:48 UTC | |
|
Re: Shuffling CODONS
by thanos1983 (Parson) on Jun 07, 2018 at 13:39 UTC | |
Hello WouterVG, Welcome to the Monastery. You can search the forum it contains a lot of information as your question was asked before e.g. How do I shuffle an array? Something like that should do what you want, assuming you want to shuffle array element (3 characters).
Update: In case you want to use unpack see bellow:
Hope this helps, BR.
Seeking for Perl wisdom...on the process of learning...not there...yet!
| [reply] [d/l] [select] |
|
Re: Shuffling CODONS
by bliako (Abbot) on Jun 08, 2018 at 12:44 UTC | |
Apart from time benchmarks, the suitability of the shuffle algorithm must be assessed with respect to the quality of the randomness of the shuffled array. One way to do this is to calculate the auto-correlation of the shuffled sequence with lag 1 (looking at consecutive elements). The absolute value of the a-c coefficient approaches 1 when the sequence is highly auto-correlated (for example the test array 1..1000) and zero when the opposite happens. So, a good quality shuffle should produce auto-correlations approaching zero. Edit: suggested test scenario: start with a highly correlated array (e.g 1..1000: perl -MStatistics::Autocorrelation -e 'print Statistics::Autocorrelation->new()->coefficient(data=>[1..1000],lag=>1)."\n"' yields 0.997) and see how the shuffling algorithm de-auto-correlates it by lowering its auto-correlation coefficient towards zero. Edit 2: auto-correlation coefficient is in the range -1 to 1. Both extremes are for higlhy auto-correlated sequences and zero for no auto-correlation. In this test I take the absolute value of the coefficient. The following script compares the three methods mentioned here by BrowserUK, tybalt89, List::Util/shuffle with respect to auto-correlation and also, for each trial it plots a histogram of the differences between consecutive elements of the shuffled array, just for fun. The best shuffle is the one who produces the lowest mean auto-correlation with lowest variance and most successes (i.e. it had the minimum auto-correlation at a specific trial).
once more:
once more:
My opinion: all algorithms work well with respect to randomness (as assessed by auto-correlation) and now we can move to time benchmarks. TODO: try with a different random number generator (i.e. more reliably uniform). The test program:Read more... (5 kB) | [reply] [d/l] [select] |
by BrowserUk (Patriarch) on Jun 08, 2018 at 16:05 UTC | |
TODO: try with a different random number generator (i.e. more reliably uniform) Unless the size of the array is approaching the period length of the PRNG, then the quality of the PRNG has little or no effect upon the quality of the shuffling. Neither the quality of any given shuffle; nor the quality of successive shuffles; nor the quality of any group of shuffles. I'm not going to try and explain that beyond saying it is the nature of modular arithmetic; but by way of evidence I offer this. The standard PRNG used in perl 5.10 on windows is the notorious MSC rand that has only 15-bits: 0-32767. However, if you use F-Y to shuffle any size array with less that 32767 elements; no matter a how many times you do it (within the bounds of reasonable values: say a human lifetime) then you will not detect any bias using simple std deviations or other simple correlation tools. Eg. Run this code with any $L < 32767, and any $N, and try to find some bias using your test and you will fail:
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
In the absence of evidence, opinion is indistinguishable from prejudice.
Suck that fhit
| [reply] [d/l] [select] |
by BrowserUk (Patriarch) on Jun 08, 2018 at 16:25 UTC | |
Maybe this explains it better? The following program uses this heavily biased PRNG:
which is designed to produce 0, 50% of the time. That means that when asked for numbers 0-9, it produces '0', 10 times more often than any of the other values:
And when asked for 0..23, produces '0' 25 times more often than any other number:
But when that biased PRNG used within a standard Fisher Yates shuffle, the shuffles are fair, no matter how many times you run it:
The test code:
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
In the absence of evidence, opinion is indistinguishable from prejudice.
Suck that fhit
| [reply] [d/l] [select] |
by bliako (Abbot) on Jun 09, 2018 at 22:47 UTC | |
But when that biased PRNG used within a standard Fisher Yates shuffle, the shuffles are fair, no matter how many times you run it I am not convinced. A statistical test for assessing statistical fairness is chi-square: throw a dice a million times and feed chi-square with the counts for each outcome: '1' => ..., '2' => ... , ... '6' => .... It will tell you whether it thinks the dice is fair or not. So I run your test code with 3 rand-subs:
Then I run the chi-squared test (Statistics::ChiSquare) on the counts of the shuffle (re: the output of your test code). This is the result: bad_shuffle : There's a <1% chance that this data is random. good_shuffle : There's a >25% chance, and a <50% chance, that this data is random. best_shuffle : There's a >25% chance, and a <50% chance, that this data is random. or bad_shuffle : There's a <1% chance that this data is random. best_shuffle : There's a >50% chance, and a <75% chance, that this data is random. good_shuffle : There's a >10% chance, and a <25% chance, that this data is random. or bad_shuffle : There's a <1% chance that this data is random. best_shuffle : There's a >25% chance, and a <50% chance, that this data is random. good_shuffle : There's a >50% chance, and a <75% chance, that this data is random. The above test consistently rates "bad_shuffle" based on badRand() as producing data with less than 1% chance of being random. It is undecided, let's say, which is best for this particular shuffle: Mersenne or Perl. The results your test code produced may look fair but they aint according to this particular test - it very much depends on use-case (see end of post). Although you may be right with your hunch that Mersenne may not offer anything more than perl's standard rand() for FY shuffle but hey let's not exaggerate with badRand()! :) This is my test code: Read more... (2 kB)
sidenote: A particular shuffle may be appropriate for a particular situation. Such as?
| [reply] [d/l] [select] |
by BrowserUk (Patriarch) on Jun 10, 2018 at 08:48 UTC | |
by bliako (Abbot) on Jun 10, 2018 at 11:41 UTC | |
| |
by BrowserUk (Patriarch) on Jun 10, 2018 at 07:39 UTC | |
|
Re: Shuffling CODONS
by AnomalousMonk (Archbishop) on Jun 07, 2018 at 17:32 UTC | |
WouterVG: Further to thanos1983's reply: The reason thanos1983's use of the '(a3)*' unpacking template in the Update unpack example given there is much better than the template (Note that Data::Dumper is a core module.) Give a man a fish: <%-{-{-{-< | [reply] [d/l] [select] |
|
Re: Shuffling CODONS
by swl (Prior) on Jun 10, 2018 at 09:37 UTC | |
It is also worth looking at https://metacpan.org/pod/Math::Random::MT::Auto#shuffle for an implementation of the Fisher-Yates shuffle with the Mersenne Twister PRNG algorithm (which has been noted in several of the other responses). | [reply] |
|
Re: Shuffling CODONS
by Anonymous Monk on Jun 07, 2018 at 18:32 UTC | |
Hello there. I realize this may be just a learning exercise, but the mention of a 'proper' shuffle came up above... A shuffle is a random list permutation, of which there are n! possibilities. A 'proper' shuffle would emphasize the quality of the random source. PRNG needs at least log2(n!) bits of internal state to be able to generate all possible permutations. List::Util shuffle is likely using drand48. $ perl -MList::Util=sum -e 'print 1/log(2) * sum map log, 1..17;' 48.337603311133 $ perl -MList::Util=sum -e 'print 1/log(2) * sum map log, 1..1000;' 8529.39800420478 As you can see, 48 bits is not enough to properly shuffle a list of 17 elements. For one thousand element shuffle, more than a kilobyte of randomness is required. | [reply] |
by BrowserUk (Patriarch) on Jun 07, 2018 at 19:07 UTC | |
As you can see, 48 bits is not enough to properly shuffle a list of 17 elements. For one thousand element shuffle, more than a kilobyte of randomness is required. Utter twaddle! What I see is someone adding 2 + 1/log2 * sum of the log2 of an arbitrary list and drawing a random, and wrong, conclusion. The Knuth-Fisher-Yates shuffle only needs to be able to pick 1 from N -- that is, generate a uniform random number between 1 and N; where N is the number of elements in the array -- as proven by Donald Knuth, arguably the greatest Computer Scientist of the current era. Discuss. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
In the absence of evidence, opinion is indistinguishable from prejudice.
Suck that fhit
| [reply] |
by huck (Prior) on Jun 07, 2018 at 21:15 UTC | |
"arguably the greatest Computer Scientist of the current era. " But they were not called the "Donald of Computer Programming", he looked up to someone else "One of the first times I was ever asked about the title of my books was in 1966, during the last previous ACM national meeting held in Southern California. This was before any of the books were published, and I recall having lunch with a friend at the convention hotel. He knew how conceited I was, already at that time, so he asked if I was going to call my books "An Introduction to Don Knuth." I replied that, on the contrary, I was naming the books after him. His name: Art Evans. (The Art of Computer Programming, in person.)
" "Always Mount a Scratch Monkey" https://www.acme.com/jef/netgems/scratch_monkey.html | [reply] |
by Anonymous Monk on Jun 08, 2018 at 19:12 UTC | |
... uniform random number between 1 and N; where N is the number of elements in the array ...Correction: several such random numbers. Alternatively, one random number 1..N, where N is the number of permutations. The shuffle algorithm needs a random number sequence of finite length. A PRNG with 48 bits of internal state can generate at most 248 different sequences (of some particular length), because it is deterministic. If there are more permutations than that, some will never be selected. Those using GNU/Linux can take a quick glimpse at the working of the standard shuffle tool, shuf.
$ strace -s0 -e open,read shuf -o /dev/null -i 1-17
...
open("/dev/urandom", O_RDONLY) = 3
read(3, ""..., 11) = 11
...
$ strace -s0 -e open,read shuf -o /dev/null -i 1-1000
...
open("/dev/urandom", O_RDONLY) = 3
read(3, ""..., 1250) = 1250
...
Here, in order to shuffle a list of integers 1-17, shuf actually wants 11 random bytes. For 1000 elements, shuf reads 1250 bytes of /dev/urandom.
| [reply] |
by BrowserUk (Patriarch) on Jun 08, 2018 at 20:04 UTC | |