in reply to Adding cols to 3d arrays - syntax

Your problem is not one of referencing your data structure, but rather one of creating it. I think you want a structure that you can reference as:
$date = $myarr[$index][$subindex];

where $date is a reference to an array of ten integers and ($index, $subindex) is a pair of integers each in range 0..99.

With your data, you cannot assume that $index, $subindex, or the pair ($index, $subindex) are unique. The best solution very much depends on how you need to handle these special cases. I would only be adding to your confusion if I attempted to answer your original question before you clarify this issue.

The use of an array-of-arrays, as the reference above requires, has another problem: the arrays are very sparse. Referring to your data, the entire structure only stores fifteen dates. This is not a serious problem yet, but it would not scale up well if your array size were to increase. We seldom use hashes with integer keys, but they seem to be the perfect solution to this problem.
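For example, a minimal sketch of that idea (the index values here are invented):

# hash of hashes keyed by integers: only the (index, subindex) pairs
# that actually hold data take up space, so sparseness costs nothing
my %dates;

$dates{17}{42} = [ 1 .. 10 ];    # store a reference to an array of ten integers

# retrieval then looks almost the same as the array form above:
my $date = $dates{17}{42};       # undef if that slot was never filled
print "@$date\n" if $date;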

Bill

Replies are listed 'Best First'.
Re^2: Adding cols to 3d arrays - syntax
by peterrowse (Acolyte) on Sep 19, 2019 at 17:09 UTC

    Hi all,

    Thanks for the replies. I won't have a chance to try the suggestions until later tonight, but Bill, your point re data structures is right: it's a bit of an unusual lot of data (to me, anyway) and I have struggled to decide how to approach it.

    It's actually the 16 kB sectors of an SSD which broke and from which I retrieved an image of the NANDs. So I have a 'physical image' of the drive. However, the translation table is partially lost and I am searching for patterns. There are around 16 million sectors ($lpn), and each LPN is sometimes present in many places (up to 20000 for a very few examples, usually 2 places). Each sector has associated with it, as well as an ID which I have figured out is its LBA, up to 128 numbers which might indicate its validity (they might point to expired physical locations, but I am not sure yet).

    So I need to search it for patterns after I load it into some kind of structure. It will take a lot of memory.

    I tried at first with hashes but could not get it working, and after thinking about it more I decided arrays would be more appropriate (although very sparse, as you say).

    I need 3 levels of storage: logical address ($lpn, unique), physical address ($ppn, once again unique), and inside each $ppn I need 'sub PPNs', let's call them $subppn; I don't know how unique they will be per $ppn, maybe not at all. Every element will be a 32 bit integer (in fact it only uses 24 bits).

    So each $lpn owns several $ppns, each of which owns a list of $subppns.

    Structurally, what this represents is: PPNs are the actual physical location of the sector, LPN is the logical location that I believe is assigned to this PPN, and subppn is a list, owned by each PPN, of locations I need to analyse to see what pattern emerges in relation to the PPN and LPN. Many PPNs sometimes point to a single LPN.

    Then I need to do:

    foreach $lpn {
        foreach $ppn {
            # do stuff
        }
    }

    So in other words: for each $lpn, and then for each $ppn, I need to check the list of $subppns against all the other $ppns in this $lpn.
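    To put that in code terms, the shape I am picturing is roughly this (invented sample numbers, just to show the intent):

    use strict;
    use warnings;

    # $structure{$lpn}{$ppn} = [ list of subppns ]
    my %structure;
    my @records = ( [ 5, 1001, 77 ], [ 5, 1002, 78 ], [ 6, 1003, 77 ] );   # [ lpn, ppn, subppn ], invented

    for my $rec (@records) {
        my ($lpn, $ppn, $subppn) = @$rec;
        # push through a reference to the inner array; the intermediate
        # hash levels autovivify as needed
        push @{ $structure{$lpn}{$ppn} }, $subppn;
    }

    # the nested loop from above: for each LPN, look at each PPN's subppn
    # list alongside the other PPNs that share that LPN
    for my $lpn (sort { $a <=> $b } keys %structure) {
        my @ppns = sort { $a <=> $b } keys %{ $structure{$lpn} };
        for my $ppn (@ppns) {
            my @subppns = @{ $structure{$lpn}{$ppn} };
            my @others  = grep { $_ != $ppn } @ppns;
            print "LPN $lpn, PPN $ppn: subppns @subppns; compare against PPNs @others\n";
        }
    }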

    I hope that makes sense - I am sitting at a table with 2 screaming kids as I type this and having difficulty describing it. If it does not make sense I will provide a better description later.

    I guess, looking at my description, that ideally I would like $structure{hash}{hash}[array], but I could not get it to work, for similar reasons to the one I have hit above ("using scalar as reference", etc.). Edited to correct a mistake: when I said PPNs are only unique per LPN - that is not the case, now that I think about it. Thanks, Pete

      I will second NetWallah's suggestion for using a database here — it looks like you probably have too much data to fit in RAM all at once.

      That does not necessarily mean SQL; you could store image:offset tuples and have objects that abstract data retrieval from disk as-needed. The big problem that I see is that you are looking at possibly (2**24)*(2**24) arrays, which are a few hundred bytes each in AV overhead. Do you have 2**48 ~ 256 TiB of RAM?
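      What that could look like, very roughly (all names here are hypothetical; this is a sketch of the idea, not a finished module):

      package PageStore;

      use strict;
      use warnings;

      # keep only the image filehandle and page size in memory,
      # and pull individual pages off disk on demand
      sub new {
          my ($class, $image_file, $page_size) = @_;
          open my $fh, '<:raw', $image_file or die "$image_file: $!";
          return bless { fh => $fh, page_size => $page_size }, $class;
      }

      # fetch one page by physical page number
      sub read_page {
          my ($self, $ppn) = @_;
          seek $self->{fh}, $ppn * $self->{page_size}, 0 or die "seek: $!";
          read($self->{fh}, my $buf, $self->{page_size}) == $self->{page_size}
              or die "short read at page $ppn";
          return $buf;
      }

      1;

      Usage would then be along the lines of my $store = PageStore->new('nand.img', 16384); my $page = $store->read_page(42); so the working set in RAM stays small.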

      Start with what we actually, no question about it, know: the PPN for each page, which we know because it is the address where we read that page from the NAND array. Further, the PPNs form a contiguous range. So we can make an array of hashes (see also perldsc):

      # array mapping PPN to hash of info key => value
      # element keys:
      #   LPN  => logical page number stored at this block
      #   XPPN => array of unknown numbers found with this block
      my @PPN = ();

      We probably have the LPN for each block, but we have no guarantee that LPNs in the image are contiguous, and we already know that many PPNs store different versions of the same LPN. However, LPNs are probably mostly contiguous, so we can still use an array, this time of arrays:

      # array mapping LPN to array of PPNs that store copies
      # each element is an array of PPNs
      my @LPN = ();

      Loading these arrays should be relatively simple:

      use constant NAND_PAGE_SIZE => 16384; # or whatever...

      {
          open IMAGE, '<:raw', $image_file_name or die "$image_file_name: $!";
          local $/ = \NAND_PAGE_SIZE;    # read the image one page at a time
          while (<IMAGE>) {
              my $ppn = (tell IMAGE) / NAND_PAGE_SIZE - 1;   # page just read, numbered from 0
              my $lpn = get_LPN($_);
              $PPN[$ppn] = { LPN => $lpn, XPPN => [ get_XPPNs($_) ] };
              push @{$LPN[$lpn]}, $ppn;
          }
          close IMAGE;
      }
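      Once those are loaded, looking up the candidates for a given logical page is simple; a sketch (the LPN here is made up, and it assumes @PPN and @LPN were filled as above):

      my $lpn = 12345;
      for my $ppn (@{ $LPN[$lpn] || [] }) {
          my @xppn = @{ $PPN[$ppn]{XPPN} };
          print "LPN $lpn found at PPN $ppn with ", scalar @xppn, " extra numbers\n";
      }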

      Hopefully this helps...

        Dear all,

        Thanks enormously for the interest. Firstly, I used your pointers to fix what I was originally trying to do and ran it tonight, which produced essentially a null result. Secondly, now that I have described my problem a bit more, any ideas as to how to pursue it further are much appreciated, @Jcb.

        I should add detail, since you guys are working with little. The disk, a 256 GB SSD, died, and 2 recovery companies who apparently own the best gear in the business for recovering it (PC3000) said they had no luck with it (although I wonder if they were upfront about their equipment, but...). I switched it into 'engineering mode' and was able to communicate with it, running its default Barefoot Indilinx firmware, which does not know anything about the layout and type of the NAND it's connected to. Configuring the controller over the SATA bus was possible using the OpenSSD project, which was luckily made for this controller, and trial and error plus reading of datasheets allowed me to image the drive.

        The total NAND capacity is significantly larger than the drive's as-sold size, and I can read it all apart from the 'spare area', which it seems is used for ECC rather than data. ECC and descrambling are performed transparently by the controller once it is asked to do so. So I have an image around 273 GB in size.

        In the image are pages 16 kB long with text, HTML etc. in them. So the descrambling, joining by byte and other complicated stuff is being done by the controller, now that I have found the correct config to send it. It's just a matter now of putting the 16 kB pages (or blocks, the terminology is uncertain to me) back in order.

        The structure of the disk is as follows. Erase blocks are 128 * 16 kB pages - blocks must be erased in chunks of this size, and this moulds the drive's way of working. The minimum writable size is 16 kB, and a write needs to be done in a single operation. So collections of 128 pages are important units in the drive, because this is the minimum erase size.
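        (For scale, if my arithmetic is right: 128 * 16 kB makes each erase block 2 MB, and a 273 GB image in 16 kB pages is somewhere around 16-17 million pages, or roughly 130,000 erase blocks - which fits the 'around 16 million sectors' figure I gave earlier.)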

        It seems that block 128, the last one in each erase 'chunk' (128 * 16 kB), contains what are clearly logical block numbers (LBAs). Their range is 0-16M or so, in contrast to the 'user data' part of the disk. Looking closely at this last 16 kB block, it contains a 2 byte number which is the physical block number / 4096, then a 16 byte bit field (let's call it bitfield_1) which I have figured out contains data to mark 16 kB blocks in this erase size block as invalid (when they have been written more than once in the same erase size block). Then there are 127 LBAs. Then there is another bitfield, call it bitfield_2, which we will come back to. Then there is a variable number of 4 byte words again, all in the range of LBAs or PBAs (0-16M); let's call this address_field_2. This is what I am currently investigating today; I feel it must be for something, but I can't work out what.
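        In Perl terms, the layout I think I am seeing in that last page is roughly the following (a sketch only - the packing and endianness are still guesses on my part, and read_last_page() is a made-up helper):

        my $page = read_last_page();    # the 16384-byte page at the end of an erase block

        # assuming the fields sit back to back with no padding, little-endian:
        my $pbn_div_4096 = unpack 'v',    substr($page, 0, 2);         # 2 byte number = physical block number / 4096
        my $bitfield_1   =                substr($page, 2, 16);        # marks rewritten pages in this erase block as invalid
        my @lbas         = unpack 'V127', substr($page, 18, 127 * 4);  # the 127 LBAs for this erase block
        # after that: bitfield_2, then the variable run of 4 byte words (address_field_2)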

        Bitfield_2 appears to link to this second collection of addresses - perhaps it marks them valid or not, like bitfield_1 does for the LBAs (its length in bits corresponds to the number of addresses that appear in this second address area).

        I modified the stackbd kernel module to allow me to provide a map file to it, and mapped the disk image file I made to a virtual drive according to the addresses in the LBA area. This worked well, and I now see large files many blocks long in the mapped virtual drive. However, around 4% of the LBAs appear more than once in the LBA area (i.e. the last 16 kB block of each 128 * 16 kB erase block). The duplicated LBAs must be from stale erase size blocks. SSDs move blocks around and do copy-on-write, meaning that when the OS modifies a block they simply leave the old copy where it is and write the changed copy to a new location. But they don't delete the old blocks, and they can't mark them as stale in place. Somewhere the disk must be storing info as to which ones are stale, but I can't find it yet. I have recovered a fair amount of the data on this disk - large files, 10 MB photos etc., perhaps 30% of it - showing that I am almost there, but these last 4% of blocks must be important blocks (directories?) that are preventing the full recovery.
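        (Finding those duplicates is just a matter of tallying which erase blocks claim each LBA - roughly something like this, with $n_erase_blocks and lbas_in_block() as made-up stand-ins for reading the last page of each erase block:)

        my %seen_at;    # LBA => list of erase blocks that claim it
        for my $block (0 .. $n_erase_blocks - 1) {
            push @{ $seen_at{$_} }, $block for lbas_in_block($block);
        }
        my @dups = grep { @{ $seen_at{$_} } > 1 } keys %seen_at;
        printf "%d of %d LBAs appear in more than one erase block\n",
            scalar @dups, scalar keys %seen_at;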

        My first feeling was that there should be a sequence number for each erase size block, permitting ordering of the blocks by when they were written and hence revealing the valid LBAs, but I can't find it, and perhaps if the drive has multiple erase size blocks open at once that method would not work anyway.

        Alternatively, it would perhaps make sense, when copying a logical block to a new location, to use address_field_2 in the new erase size block to store the physical addresses of the logical block that the new block (written in this erase size block) has superseded. That was my theory today, but I could not find any evidence for it by manually checking data, so I wanted to scan the whole disk's address data for it. However, I can now say there are far too few matches - zero exact matches and too few rough matches - to indicate this might be correct.

        A couple of years ago, when I started this, I had a thread on hddguru that explains a bit more, if anyone wants to take a look:

        https://forum.hddguru.com/viewtopic.php?f=10&t=35428&mobile=mobile

        The drive was unfortunately formatted HFS+. The translated image I have does not mount, but hfs rescue recovers many files from it. I did wonder, if my current approach fails, whether I could hack the mount code to let it try different logical->physical mappings during the process of attempting to mount the drive. I.e. keep track of which LBA it accesses and where it decides to abort, and keep retrying different physical addresses from the list of possible PPNs for each LBN until it succeeds, and continue in that way until it has successfully mounted. Any comments on such an approach would be interesting.

        Thanks, Pete

      To my way of thinking, the requirements to solve this are best met using a database (e.g. SQLite).

      I read the requirements as 3 x one-to-many structures (tables) - and SQL can easily accommodate the queries to find your patterns.
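      As a rough sketch of what I mean (table and column names invented, via DBI and SQLite):

      use DBI;

      my $dbh = DBI->connect('dbi:SQLite:dbname=ssd.db', '', '', { RaiseError => 1 });

      # one row per physical page, one row per "sub PPN" hanging off a physical page
      $dbh->do('CREATE TABLE IF NOT EXISTS ppn  (ppn INTEGER PRIMARY KEY, lpn INTEGER)');
      $dbh->do('CREATE TABLE IF NOT EXISTS xppn (ppn INTEGER, value INTEGER)');
      $dbh->do('CREATE INDEX IF NOT EXISTS idx_ppn_lpn  ON ppn(lpn)');
      $dbh->do('CREATE INDEX IF NOT EXISTS idx_xppn_ppn ON xppn(ppn)');

      # e.g. every physical page that claims a given logical page, with its extra numbers
      my $rows = $dbh->selectall_arrayref(
          'SELECT p.ppn, x.value FROM ppn p LEFT JOIN xppn x ON x.ppn = p.ppn WHERE p.lpn = ?',
          undef, 12345,
      );

      Finding your patterns then becomes a matter of writing the right JOIN and GROUP BY, and nothing has to fit in RAM.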

                      "From there to here, from here to there, funny things are everywhere." -- Dr. Seuss

      It now appears that your problem is much more complicated than your original post implied. I cannot help you, because I do not know anything about SSD technology and cannot infer enough from your post. If no one else can help you at this point, your next step should be to find and post a link to documentation for the applicable technology. Then update your example using the exact vocabulary of that reference for function and variable names. Tell us clearly what data you have recovered and what you have to recreate. (Are you even sure that this task is possible?)
      Bill

      Hi again,

      Maybe this rudimentary implementation will help:

      use strict;
      use warnings;
      use feature 'say';

      my %data;
      for (0..14) {
          my ($lpn, $ppn, $cnt_subppn) = get_ns();
          push @{ $data{$lpn}{$ppn} }, (0..$cnt_subppn);
      }

      for my $lpn ( sort {$a <=> $b} keys %data ) {
          say "$lpn:";
          for my $ppn ( sort {$a <=> $b} keys %{ $data{$lpn} } ) {
              say "  $ppn:";
              for my $sub_ppn ( sort {$a <=> $b} @{ $data{$lpn}{$ppn} } ) {
                  say "    $sub_ppn";
              }
          }
      }

      sub get_ns {
          return (int(rand(10)), int(rand(10)), int(rand(10)));
      }
      Note that this adds multiple entries for the same "sub_ppn"; you could count them instead.

      Example output:

      0:
        7: 0 1 2 3 4 5 6 7 8 9
      2:
        2: 0 1 2
        9: 0 1 2 3 4
      3:
        5: 0 1 2 3
        8: 0 1 2 3 4 5 6 7 8
      4:
        3: 0 1 2 3 4 5 6 7 8
        7: 0 1
        9: 0 1 2
      5:
        2: 0 1 2 3 4 5
      6:
        8: 0 0 1 1 2 2 3 3 4 4 5 6 7 8 9
        9: 0 1 2 3 4 5 6 7 8
      7:
        7: 0 1 2 3 4 5 6 7
      8:
        4: 0 1 2 3 4 5 6
      9:
        3: 0 1

      Hope this helps!


      The way forward always starts with a minimal test.