in reply to Adding cols to 3d arrays - syntax

Dear all,

Thanks enormously for the interest. Firstly, I used your pointers to fix what I was originally trying to do and ran it tonight, which essentially produced a null result. Secondly, now that I have described my problem a bit more, any ideas as to how to pursue it further are much appreciated, @Jcb.

I should add detail, since you are all working with little. The disk, a 256 GB SSD, died, and two recovery companies who apparently own the best gear in the business for this work (PC3000) said they had no luck with it (although I wonder whether they were upfront about their equipment, but...). I switched it into 'engineering mode' and was able to communicate with it - running its default Indilinx Barefoot firmware, which knows nothing about the layout and type of the NAND it's connected to. Configuring the controller over the SATA bus was possible using the OpenSSD project, which was luckily made for this controller, and trial and error plus reading of datasheets allowed me to image the drive.

The total NAND capacity is significantly larger than the drive's as-sold size, and I can read all of it apart from the 'spare area', which it seems is used for ECC rather than data. ECC and descrambling are performed transparently by the controller once it is asked to do so. So I have an image around 273 GB in size.

In the image are 16 KB pages with text, HTML etc. in them. So the descrambling, byte-wise joining and other complicated stuff is being done by the controller, now that I have found the correct config to send it. It's now just a matter of putting the 16 KB pages (or blocks - terminology uncertain to me) back in order.

The structure of the disk is as follows. Erase blocks are 128 * 16 KB pages - NAND must be erased in chunks of this size, and this shapes the drive's way of working. The minimum writable unit is a 16 KB page, which must be written in a single operation. So collections of 128 pages are important units in the drive, because that is the minimum erase size.
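
To make the arithmetic concrete, here is a minimal Perl sketch of that geometry. It assumes the image is a flat concatenation of pages, which may not match the real chip interleaving, and `locate` is a name made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Geometry from the post: 16 KB pages, 128 pages per erase block.
my $PAGE_SIZE    = 16 * 1024;
my $PAGES_PER_EB = 128;
my $EB_SIZE      = $PAGE_SIZE * $PAGES_PER_EB;   # 2 MB per erase block

# Map a byte offset in the image to (erase block, page within it).
sub locate {
    my ($offset) = @_;
    my $eb   = int($offset / $EB_SIZE);
    my $page = int(($offset % $EB_SIZE) / $PAGE_SIZE);
    return ($eb, $page);
}

my ($eb, $page) = locate(10 * $EB_SIZE + 5 * $PAGE_SIZE + 123);
print "erase block $eb, page $page\n";   # prints "erase block 10, page 5"
```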

It seems that block 128, the last one in each erase 'chunk' (128 * 16 KB), clearly contains logical block numbers (LBAs): their range is 0-16M or so, in contrast to the 'user data' part of the disk. Looking closely, this last 16 KB block contains a 2-byte number which is the physical block number / 4096, then a 16-byte bit field (let's call it bitfield_1) which, I have figured out, marks 16 KB pages in this erase-size block as invalid (when they have been written more than once within the same erase-size block). Then there are 127 LPAs. Then there is another bit field, call it bitfield_2, which we will come back to. Then there is a variable number of 4-byte words again, all in the range of LBAs or PBAs (0-16M); let's call this address_field_2. This is what I am currently investigating today - I feel it must be for something, but I can't work out what.

Bitfield_2 appears to link to this second collection of addresses - perhaps it marks them valid or not, as bitfield_1 does for the LBAs (its length in bits corresponds to the number of addresses that appear in this second address area).
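
A Perl sketch of the map-page layout just described. The field order, sizes and little-endian byte order are assumptions read off the description above, not confirmed fact, and the helper names are hypothetical:

```perl
use strict;
use warnings;

# Parse one 16 KB map page per the layout described above:
# 2-byte (physical block / 4096), 16-byte bitfield_1, 127 LPAs,
# then bitfield_2 and address_field_2 of variable length.
sub parse_map_page {
    my ($raw) = @_;
    my %m;
    $m{pbn_div_4096} = unpack 'v', substr($raw, 0, 2);
    $m{bitfield_1}   = substr($raw, 2, 16);
    @{ $m{lpa} }     = unpack 'V127', substr($raw, 18, 127 * 4);
    $m{tail_offset}  = 18 + 127 * 4;   # where bitfield_2 etc. start
    return \%m;
}

# True if bitfield_1 marks page $i of this erase block invalid
# (the bit order within the field is itself a guess).
sub page_invalid {
    my ($m, $i) = @_;
    return vec($m->{bitfield_1}, $i, 1);
}
```

Checking the parsed fields against a hex dump of a real map page would confirm or correct the offsets.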

I modified the stackbd kernel module to let me feed it a map file, and mapped my disk image to a virtual drive according to the addresses in the LBA area. This worked well, and I now see large files many blocks long in the mapped virtual drive. However, around 4% of the LBAs appear more than once in the LBA area (i.e. the last 16 KB block of each 128 * 16 KB erase block). The duplicated LBAs must be from stale erase-size blocks. SSDs move blocks around and do copy-on-write: when the OS modifies a block, they write the changed copy to a new location and simply leave the old block where it is. But they can't mark the old blocks themselves as stale in place (NAND can't be rewritten without an erase), so the disk must be storing the staleness information somewhere else - and I can't find it yet. I have recovered a fair amount of the data on this disk - large files, 10 MB photos etc., perhaps 30% of it - showing that I am almost there, but these last 4% of blocks must be important ones (directories?) that are preventing a full recovery.
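
The duplicate detection can be sketched like this. It assumes the map page is the last 16 KB page of each 2 MB erase block and reuses the layout guess above; `find_duplicates` is a hypothetical helper:

```perl
use strict;
use warnings;

my $PAGE_SIZE = 16 * 1024;
my $EB_SIZE   = 128 * $PAGE_SIZE;

# Scan every map page (assumed to be the last page of each erase
# block) and return only the LPAs claimed by more than one erase
# block - all but one of each set must be stale.
sub find_duplicates {
    my ($image_path) = @_;
    open my $fh, '<:raw', $image_path or die "$image_path: $!";
    my %claims;    # LPA => [ erase block indexes claiming it ]
    my $eb = 0;
    while (1) {
        seek $fh, $eb * $EB_SIZE + 127 * $PAGE_SIZE, 0 or last;
        last unless read($fh, my $raw, $PAGE_SIZE) == $PAGE_SIZE;
        # Layout guess as above: 2-byte block number, 16-byte
        # bitfield, then 127 little-endian 32-bit LPAs.
        for my $lpa (unpack 'V127', substr($raw, 18, 127 * 4)) {
            next if $lpa == 0xFFFFFFFF;    # unused slot
            push @{ $claims{$lpa} }, $eb;
        }
        $eb++;
    }
    close $fh;
    return { map  { $_ => $claims{$_} }
             grep { @{ $claims{$_} } > 1 } keys %claims };
}
```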

My first feeling was that there should be a sequence number for each erase-size block, permitting the blocks to be ordered by when they were written and hence revealing the valid LBAs, but I can't find one - and if the drive keeps multiple erase-size blocks open at once, that method might not work anyway.

Alternatively, it would perhaps make sense, when copying a logical block to a new location, to use address_field_2 in the new erase-size block to store the physical addresses of the blocks that the newly written data has superseded. That was my theory today, but I could not find any evidence for it by manually checking data, so I wanted to scan the whole disk's address data for it. Having done so, I can now say there are far too few matches - zero exact matches and too few rough matches - for this to be correct.

A couple of years ago, when I started this, I had a thread on hddguru that explains a bit more, if anyone wants to take a look:

https://forum.hddguru.com/viewtopic.php?f=10&t=35428&mobile=mobile

The drive was unfortunately formatted HFS+. The translated image I have does not mount, but hfs rescue recovers many files from it. I did wonder, if my current approach fails, whether I could hack the mount code to try different logical->physical mappings during the mount attempt - i.e. keep track of which LBA it accesses and where it decides to abort, keep retrying different physical addresses from the list of possible PPNs for each LBN until it succeeds, and continue in that way until the drive has successfully mounted. Any comments on such an approach would be interesting.

Thanks, Pete

Re^2: Adding cols to 3d arrays - syntax
by jcb (Parson) on Sep 20, 2019 at 02:21 UTC

    Another step, now that we know how big this image is, would be to preserve the index arrays (@PPN, @LPN) on disk using Storable after loading them. Then you can reload them quickly instead of scanning the entire 273GB image just to build up indexes before being able to actually look at anything.
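
    In practice, the suggestion might look like this minimal sketch - the cache file name and the `load_indexes` helper are hypothetical, and the build callback stands in for the real image scan:

```perl
use strict;
use warnings;
use Storable qw(store retrieve);

my $cache = 'index-cache.stor';   # hypothetical cache file name

# Build the indexes once (the slow 273 GB scan, via $build_cb),
# then reload them from the Storable cache on subsequent runs.
sub load_indexes {
    my ($build_cb) = @_;
    if (-e $cache) {
        my $saved = retrieve($cache);   # fast binary reload
        return ($saved->{ppn}, $saved->{lpn});
    }
    my ($ppn, $lpn) = $build_cb->();
    my %save = (ppn => $ppn, lpn => $lpn);
    store \%save, $cache;
    return ($ppn, $lpn);
}

unlink $cache;   # start fresh for this demo
my ($ppn, $lpn) = load_indexes(sub {
    # stand-in for the real scan of the image
    return ([10, 20, 30], [3, 2, 1]);
});
print scalar(@$ppn), " PPN entries loaded\n";   # prints "3 PPN entries loaded"
```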

    The patterns in the "bank 32" map pages look suspiciously like flags of some type, likely for the preceding 4096 (=128*32) pages, with 8-bit or 16-bit values being possible fits. Is there a correlation between those values and duplicated LPNs? Perhaps only one of the LPNs in a duplicate set pairs with a particular value, suggesting that it is the valid copy? If "bank 32" holds a flag array, it is possible that the same space in other pages does in fact hold LPNs — whatever fragments of the LPN tables happened to be in the controller's memory when those maps were last written. We already know about the quality of the controller firmware, since the drive is dead.

    And to everyone else reading this: since that other thread mentions why we are after this data, I do have to hold this up as an example to others of why you should put backups of important things like baby pictures on write-once optical media from a reputable manufacturer; do not use the cheapest bargain-basement garbage you can find. (The last detail is from a different case of data loss: many of Barack Obama's earliest speeches were recorded only by random members of the audience using direct-to-DVD-R camcorders — on cheap media that was found to be unreadable only a few years later after he had been elected President of the United States.) At the very minimum, store at least some backups on "spinning rust" hard disks; the technology is mature and very reliable. Avoid putting all your data in flash.

    Learn from peterrowse's misfortune. The standard backup rule is "3-2-1": at least 3 copies, using at least 2 different storage technologies, with 1 off-site. ("Cloud" can be "off-site", but does not count as a storage technology, since you do not know how the data is actually stored, nor does it count as one of the 3 copies, since it can also disappear without warning.)

      First, re the backup, and the woeful lack of it in my case - this is so, so true. So much effort could have been avoided if I had one. The sad thing is that this is not the first time I have experienced significant data loss due to drive failure. In this case, although I had previously used backups, I mistakenly thought SSDs were super robust, and had moved a load of new data (these photos mainly, a few thousand of them) from camera SD cards to the SSD while I organised it ready for moving to the backup machine. I then reused the SD cards :-( The 'organising' project took longer than expected due to family illness, and I forgot about the vulnerable data sitting on the SSD, thinking it 'safe as houses' anyway. Meanwhile the power supply developed a fault. Only then did I discover SSDs' pitfalls.

      Even so, my old backup solution was, I can see now, not good enough. A fire or lightning strike, for instance, would have destroyed all my data, and many other scenarios might have too. I now have a system that keeps 5 copies distributed over 2 locations several miles apart, using different OSes and filesystems. One copy is usually offline. I think I still have holes in the system, though, and am looking to change a few aspects of it. I use spinning rust now, because it can usually be recovered from - certainly a lot more easily than SSD. SSD tech, I now realise, is a complete nightmare: a power failure at the wrong time can cause exactly what I have - a drive that is extremely difficult, if not impossible, to recover.

      As for optical media, I am wary of it, having had issues in the past with copies becoming unreadable several years later. Maybe that was poor-quality media, as you say. The other problem is that I have around 3 TB of data that is very important to me, once you figure in the video taken over the last few years, and optical media, with its small capacity, takes time to write and to make several copies on. I like HDDs now because they are very large and, as you say, extremely mature. Data recovery companies usually have excellent success with them if required; the only severe failure mode is physical damage to the disk, and with multiple copies in more than one location the chance of all of them suffering physical damage is very low. And they can be trusted to do their job reliably even during periods of my life when I am not being reliable!

      A second storage technology would be nice, but I wonder about the reliability of the higher-density stuff. I must admit I didn't know some discs can hold up to 128 GB before looking it up just now, and it's certainly something to look into. A few optical backups on media that can be trusted would certainly be nice to have.

      I'll get back to the technical side now in another post.

        There is one more advantage to optical media — it is the only commonly available format that is (or should be) entirely waterproof. Optical discs should retain data even after a flood — clean the mildew off and the disc reads fine. (Again, quality is important here, since poor quality discs might not be properly sealed, leading to "laser rot" even if not exposed to water.) This may or may not be relevant to your risk model, and 3TB is a very large amount of data.

        I have so far avoided the "unreadable several years later" problem by using good quality media from reputable manufacturers that I buy when the stores put it on sale (usually almost half-off if I am patient). Since I buy the blanks when they are on sale, I have a significant personal stock that I slowly rotate, and I suspect (and hope) that the blanks that will go bad will go bad before I get around to putting data on them. So far, this strategy has worked and I have yet to retrieve a disc from storage and find it unreadable, although I have had many discs fail verification immediately after writing them. Always read back an optical disc immediately after writing it — do not expect the drive to notice that the blank is bad while it is busy writing data.

        It is probably best to rank by importance (favoring more copies on lower-density media) and bulk (requiring fewer copies on higher-density media). This means the data with higher bulk-to-importance ratios (like high-def video) is exposed to greater risk of loss, but one partial mitigation is to store lower-resolution more-compressed copies of those videos in lower-density "bands" in your archive. I still use CD-Rs for some backups, even though I mostly use DVDs now. (But I do not have a significant collection of video.) So you might have full high-def video stored only on spinning hard disks, but lower-resolution "better than nothing" transcoded copies on DVDs or BDs.

        By now, you have probably learned better than to consider SSDs as valid backup storage. :-) (But they could still be a 3rd technology holding a 4th copy.)

      So, re the use of Storable: what I currently do, since I have mainly been using C up until now for the lower-level stuff like accessing the disk itself, is to use a cut-down disk image for holding the data. I'm probably still thinking more in C terms than Perl, though. I took all the 16 KB LBA blocks (131072 of them), chopped off the last 14 KB or so, which always contained just 0xFFFFFFFF, and wrote the rest to a file around 250 MB in size. Then I just read it back in when I need it - not always wholly, though; I might scan just a few fields using seek, pop them into an array, and then seek to disk locations to get the data I need as I process it. Since there are perhaps 40 million values there, whether this approach or Storable would be faster I don't know - your opinion would be appreciated. I am running the analysis on an SSD (will I never learn?! It's all backed up :-)), so seek times are short. I've assumed Storable uses text to store rather than raw binary, and that the overhead of this would be large, but maybe that's a false assumption.
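
      As a sketch of the seek/unpack access just described - the 2 KB record size is a guess at 250 MB / 131072 records, the 18-byte header offset is the layout guess from earlier in the thread, and `read_lpa_slot` is a hypothetical helper:

```perl
use strict;
use warnings;

# Each record in the condensed file is the retained prefix of one
# map page.
my $RECORD_SIZE = 2048;

# Fetch LPA slot $slot (0..126) of map record $record without
# reading the whole file.
sub read_lpa_slot {
    my ($fh, $record, $slot) = @_;
    seek $fh, $record * $RECORD_SIZE + 18 + 4 * $slot, 0 or die "seek: $!";
    read($fh, my $buf, 4) == 4 or die "short read";
    return unpack 'V', $buf;
}
```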

      Now, as for the bank_32 business. As you say, it does look like it does something significant. It doesn't look random enough for things like block write counts, and the pattern seems to suggest it's mapping something. I should extract those bank_32 fields and take a closer look, and I'll post some of it up here. The fact that they stored the LBAs in such a logical place - with a bitfield showing validity, a physical block start address, etc. - is all very sensible and simple. If this was the style of the coders doing this, it seems they would have put that last little bit of data somewhere accessible too. That's not to say I am crediting them - I think the firmware choking when it sees the bad-block area corrupt is poor - but they do seem to have designed the LBA id area fairly well.

      Perhaps, as you say, the bank_32 LPN area stores data relevant to the other banks in this superblock. I can't remember the exact layout of the NANDs and the rules for writing them, but the rules are restrictive regarding write order. Maybe bank_32 needs to be written last in the superblock. If so, it would make sense to write up-to-date validity data for the superblock in this zone (which is what I think you are saying).

      There are 2 other areas worthy of thought too. One is the space for LBA 128 in the LBA area, i.e. the last LBA 'slot'. It corresponds to the LBA area itself, which is obviously redundant; yet it is not empty, not 0xFFFFFFFF, and it must do something. IIRC there are also another 4 bytes after this which frequently (always?) are a copy of LBA 128. I should poke around with this a bit more to see what its characteristics are.

      And then there is of course this second LBA area: since it does not correlate in the way I hoped with the LBAs, what the heck is it? I wonder if I am looking at it wrongly - perhaps there is a fixed offset between its numbering and the one I am using for blocks, or something. It seems worth stripping down the data for some duplicate LBAs at this point and seeing if I can spot any patterns manually.

        Storable is an XS module that (quickly) serializes and unserializes Perl data structures to and from its own binary format. The idea is to build the @PPN and @LPN indexes once and then save those as (presumably much smaller) files alongside the image. Actual usage is to read the index arrays back in full, then open the image file and seek/read/unpack only the data that you need for each analysis from the full image.

        For efficiency, the controller is likely to batch writes until it has a full erase block and only then "flush the buffers" out to the NAND array, and there may even be structures larger than an erase block that are significant to the FTL. The odd "bank 32" data hints at such a structure. How long is that apparent field?

        If the FTL uses a log structure, the "LBA 128" field might be the write sequence number you have been looking for. The nonsensical "LBA" list may simply be garbage, "unused" space that gets written with whatever happened to be in the controller's memory when writing the block. In other words, it may be a list of LPNs, but not LPNs that are relevant to the current state of the NAND array. Or, in C terms, the contents of an uninitialized buffer.
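
        If that guess is right, resolving the duplicates would reduce to keeping, for each LPN, the copy from the erase block with the highest sequence number. A minimal sketch, with made-up data and a hypothetical helper name:

```perl
use strict;
use warnings;

# Given, for each duplicated LPA, the erase blocks claiming it and
# each block's (hypothetical) write sequence number, keep the copy
# from the block written last.
sub resolve_duplicates {
    my (%claims) = @_;   # LPA => [ [erase_block, seq], ... ]
    my %winner;
    for my $lpa (keys %claims) {
        my ($best) = sort { $b->[1] <=> $a->[1] } @{ $claims{$lpa} };
        $winner{$lpa} = $best->[0];
    }
    return \%winner;
}

# Made-up data: LPA 7 was written in erase block 12 (seq 900) and
# again in erase block 45 (seq 1500), so block 45 holds the live copy.
my $w = resolve_duplicates(
    7 => [ [12, 900], [45, 1500] ],
    9 => [ [13, 901] ],
);
print "LPA 7 lives in erase block $w->{7}\n";   # prints "... erase block 45"
```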

        Also, a small note about this site: there is a "reply" link for each post, and your post appears as a child of that post if you use it, instead of appearing at the top-level in the thread. PerlMonks also notifies the author of the post you replied to when a reply is made in this way. Please use it. I will request to have this subthread reparented, but please try to maintain the threaded nature of the discussion. The "reply" link for this post should appear to the right of this paragraph. --->