in reply to Re^7: Adding cols to 3d arrays - syntax
in thread Adding cols to 3d arrays - syntax
While inversion is common in flash interfaces, you have also seen large blocks of 0xFF, so I doubt that is going on, unless some pages are inverted and others not. My guess would be that the drive writes all bits clear before erasing as part of wear-leveling for some reason?
Re^9: Adding cols to 3d arrays - syntax
by peterrowse (Acolyte) on Sep 22, 2019 at 09:54 UTC
Yes, I guess that's the only likely reason. Perhaps checking for bad blocks: the datasheet says blocks will start showing errors as time goes on. According to the datasheet the chips only recognise errors in cells that are written to 0 (understandably), so writing a page of 0s checks cell integrity before each block is reused (this will happen transparently to the controller chip).

The datasheet states that 'the number of consecutive partial page programming operation within the same page without an intervening erase operation must not exceed 1 time for the page', which I take to mean the page can only be programmed once, but it does not say whether this is enforced by the chips themselves. If you didn't care about the integrity of the data you could write 0s to the whole block over the previous data and save some cell transitions (meaning lifespan, I would guess).

The smoother grey sections of each column before the dark sections are curious and I need to investigate them further, but quick checks just now show data that looks fairly normal (not addressing, quite random). Must be something, though.

Re the log structure of a log-structured disk: initially I was thinking this would be far too inefficient a way to use the disk, because blocks that didn't need to be copy-written (because they had not changed) would require as much moving as the rest of the disk. But is it the case that such a compromise is indeed made with this method? It seems to prioritise wear levelling over total lifespan. A more complicated system could surely do a better job in terms of total write cycles per block needed (I imagine, without having thought about it too deeply). Would a log structure really be this simple, a pure ring buffer (per each of the 32 chips)? It's certainly about as simple a way as there is to achieve 100% perfect wear levelling, but the cost seems very high.

Perhaps I can accurately determine the exact point where the 'new data' starts for each column (they are not even), write a map file to reflect this, and see if the file system is in better shape, as a simple exercise.
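Not the drive's actual format, just a minimal sketch of that exercise in Perl, assuming each of the 32 columns has been dumped to its own raw image (the column_NN.img and column_map.txt names are made up) and that the 'new data' boundary can be approximated as the first fully erased (all-0xFF) 16 KB block; swap in whatever test actually marks the transition seen in the plots.

```perl
#!/usr/bin/perl
# Sketch: for each per-chip column dump, record the offset of the first
# 16 KB block that is entirely 0xFF, as a guess at where the erased
# "new data" region starts.
use strict;
use warnings;

my $BLOCK  = 16 * 1024;            # page size mentioned in the thread
my $erased = "\xFF" x $BLOCK;      # an all-erased block for comparison

open my $map, '>', 'column_map.txt' or die "column_map.txt: $!";

for my $col (0 .. 31) {                            # 32 chips/columns assumed
    my $file = sprintf 'column_%02d.img', $col;    # hypothetical file names
    open my $fh, '<:raw', $file or die "$file: $!";
    my ($offset, $pos) = (-1, 0);                  # -1: no erased block found
    while (read($fh, my $buf, $BLOCK) == $BLOCK) {
        if ($buf eq $erased) { $offset = $pos; last }
        $pos += $BLOCK;
    }
    close $fh;
    printf $map "%02d %d\n", $col, $offset;
}
close $map;
```

The resulting column_map.txt is just column number and byte offset per line, which could then be fed back into whatever rebuilds the image.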
by jcb (Parson) on Sep 22, 2019 at 21:25 UTC
No, if the datasheet says the block must be erased after one page write, issuing another page write before an erase is a good way to destroy some flash cells. A double write does not "save some cell transitions" but rather gives a good chance that that cell will not erase properly. This rules out any kind of incremental writes on this device, unless the firmware is really badly written. So "bank 32" is probably not (simple) validity flags.

Those "smoother grey sections" might be the validity/log-state data. Presumably the controller maintains tables in RAM and flushes them to the NAND array at shutdown, possibly with some kind of checkpointing scheme ("bank 32"? the "extra LPNs"?) embedded in the map pages with normal writes. That this is not exactly robust against power failure does not rule out its use in this drive: we already know that the drive is not robust against power failure!

Recopying live blocks as the "rewrite zone" approaches is what the early log-structured filesystems did, if I understand correctly. Trading a theoretical total life span for better wear leveling is not as absurd as it may sound: the SSD is dead as soon as it loses the last "extra" block anyway, no matter how many write cycles may remain on other blocks that are storing live data, since it can no longer provide its stated capacity and has no way to tell the host that the total space is dwindling, nor can most PC-ish (including Macs) filesystems handle storage devices that slowly dwindle away as SSDs do.
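To make the recopy cost concrete, here is a toy model only (page granularity, one ring, nothing to do with this drive's firmware) of a pure ring-buffer log: every host write goes to the head, and any live page sitting where the head is about to write gets copied forward first. The slot count, the %map hash and the workload at the bottom are invented for illustration.

```perl
#!/usr/bin/perl
# Toy ring-buffer log: live pages in the path of the write pointer are
# copied forward before their slot is reused, mirroring the extra moves
# a pure log structure would incur.
use strict;
use warnings;

my $NSLOTS = 8;                        # tiny log; real drives over-provision
my @slot   = (undef) x $NSLOTS;        # physical slot -> LPN written there
my %map;                               # LPN -> slot holding its live copy
my $head   = 0;                        # next slot the log will write
my $copies = 0;                        # recopy overhead counter

sub slot_is_live {
    my ($pos) = @_;
    my $lpn = $slot[$pos];
    return defined $lpn && defined $map{$lpn} && $map{$lpn} == $pos;
}

sub relocate_if_live {                 # push a live page out of the way
    my ($pos) = @_;
    return unless slot_is_live($pos);
    my $next = ($pos + 1) % $NSLOTS;
    relocate_if_live($next);           # may cascade until a stale slot is hit
    $slot[$next]        = $slot[$pos];
    $map{ $slot[$pos] } = $next;
    $slot[$pos]         = undef;
    $copies++;
}

sub log_write {                        # host writes (or rewrites) an LPN
    my ($lpn) = @_;
    relocate_if_live($head);
    $slot[$head] = $lpn;
    $map{$lpn}   = $head;
    $head = ($head + 1) % $NSLOTS;
}

log_write($_) for 0 .. 5;              # six live pages in an eight-slot log
for (1 .. 10) { log_write(5); log_write(4) }   # keep rewriting a small hot set
print "extra copies caused by recopying live pages: $copies\n";
```

Even though only two LPNs are being rewritten, the counter keeps climbing because the cold pages sit in the write pointer's path, which is exactly the cost being weighed above; real firmware reduces it with over-provisioning and by cleaning whole erase blocks at a time.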
by peterrowse (Acolyte) on Sep 26, 2019 at 13:41 UTC
Well, I've been busy hacking away at it for the last few days, but no progress. I tried using Storable, which worked well for arrays, but when I tried to put the larger hashes into it (or reload them, I can't remember which) I kept getting out-of-memory errors and extreme slowdowns (needed REISUB once).

A couple of weeks ago I found I had originally read the pages out in a slightly wrong order; changing that led to the larger text pages I was able to read. But I realised I had scanned the second LBA area using the old addressing scheme (since I was looking for physical addresses rather than LPAs, this would have broken it). Running the scan again once more froze my machine repeatedly, due to the extremely large hashes I suppose. Whether I coded it wrong or not I don't know, but I gave up using perl for this and wrote it in C with mmap, which took a while of course, but it executes quite fast so it was worth doing.

However, running the scan again yielded nothing. I am looking here at whether the second LBA-like array in each LBA block corresponds to any of the other rows sharing the same LBAs. So I find two references to a particular LBA in two different parts of the image. I then look in the second LBA-like area of the LBA block for the physical address at which I found the other reference to this particular LBA. And I find too few matches for them to be anything but chance. This might be because the way I assign a physical address to each block that I read is wrong (i.e. the drive thinks of physical addresses differently to how I see it), or because there simply is no match. I just can't see what this second field is for, though.

I've done some more checking and have a decent description of the LBA block contents (each is for 127 data blocks of 16k). Word refers to a 32 bit word. Each 16k block which is last in an erase-size block of 128 blocks consists of:

In the OpenSSD source they state that, since the controller can't access the chip spare area, they store LPNs in the last block in each erase block, but I recently noticed they say this is for GC, since the GC can compare the LPN with its in-memory value and, if it does not match, it knows the block can be erased. This would be a fast and convenient method, so perhaps this is why the LPN is stored where it is. In the OpenSSD source, however, the structure they store is a simple single array; the second LBA area is not there. So I am still puzzled as to what it might be for.

So I wonder whether this LPN data (the first LPN field) is purely for GC, or whether it might serve a second purpose for crash recovery. In the first case it seems there will be no sequence number or way of determining sequence, since none is needed; in the second case a sequence is needed. It would make no sense whatsoever not to record this number in the erase block, since you are writing it anyway and there are many kB free. But the only place I can find that is a possible match for a sequence number is word 133, and it does not look like one, unless, as you mention, it is not a simple sequence but some kind of derivative of one.

Still, it seems there should be a map block somewhere on the disk for loading at boot time. It's only 64MB or so needed and there are bags of spare unused space. I'm looking around for that; I did find a couple of interesting areas with LBA or PBA range numbers, but they are peppered with the odd number an order of magnitude larger. I am wondering whether these could be sequence numbers or similar in blocks of subsets of map data, i.e. if the drive is writing log-style it is only concentrating on one region of its address space at once, so it only writes updates reflecting that (diffs in essence, addressed to one part of the map). If map files, or fragments thereof, are written out, I would imagine there are many of them, scattered around the disk. The blocks I am looking at do seem scattered around the 'smooth grey area' I mentioned. As you say, perhaps this is an area set aside for such use. And it could also include spare blocks, so that as the NAND blocks fail it replaces them, and once it is used up the drive fails fully. There are lots of fully 0x00 erase-size blocks in this area; why 0x00 rather than 0xffffffff, I wonder (better to store cells in the 0x00 state?).

Anyway, that's where I am up to. Not much to say, but I thought I would update. Any ideas appreciated.
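One way to chase the duplicate-LBA question without the giant in-memory hashes is to tie the LPN index to an on-disk DB_File hash. This is only a sketch under assumptions pulled from the thread (erase block = 128 x 16 KB pages, with the last page carrying 127 LPNs as 32-bit words); the little-endian 'V' unpack, the 0xFFFFFFFF skip, the single flat image file, the 'erase-block:page' notion of a physical address and the lpn_index.db name are all guesses to be adjusted.

```perl
#!/usr/bin/perl
# Sketch: build an LPN -> physical-page index from the last page of each
# erase block, keeping the index in an on-disk hash (DB_File) rather than RAM.
use strict;
use warnings;
use Fcntl;
use DB_File;

my $PAGE   = 16 * 1024;
my $EBLOCK = 128 * $PAGE;              # assumed erase-block size

my $image = shift @ARGV or die "usage: $0 image-file\n";
open my $fh, '<:raw', $image or die "$image: $!";
my $size = -s $fh;

tie my %by_lpn, 'DB_File', 'lpn_index.db', O_RDWR | O_CREAT, 0644, $DB_HASH
    or die "lpn_index.db: $!";

for (my $eb = 0; ($eb + 1) * $EBLOCK <= $size; $eb++) {
    # Read the last 16 KB page of this erase block.
    seek $fh, $eb * $EBLOCK + 127 * $PAGE, 0 or die "seek: $!";
    read $fh, my $page, $PAGE or last;
    my @lpn = unpack 'V127', $page;    # first 127 words, assumed little-endian
    for my $i (0 .. $#lpn) {
        next if $lpn[$i] == 0xFFFFFFFF;          # skip erased entries
        my $phys = sprintf '%d:%d', $eb, $i;     # one notion of a physical address
        $by_lpn{ $lpn[$i] } = defined $by_lpn{ $lpn[$i] }
            ? "$by_lpn{ $lpn[$i] } $phys"
            : $phys;
    }
}
close $fh;

# LPNs recorded in more than one place are candidate old/new pairs.
while (my ($lpn, $where) = each %by_lpn) {
    print "$lpn $where\n" if $where =~ / /;
}
untie %by_lpn;
```

Anything printed at the end is an LPN with more than one recorded physical copy, i.e. a candidate pair of old and new versions to test the second LBA-like field against; per-chip images could be indexed separately if even the tied hash gets unwieldy.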