in reply to Similarity searching

17 comes before 12 or just a typo?

hist:x
1  ##
4  ####
5  ####
17 ##########
12 #

I cannot use any known database engine

Really? Why not?

If true, the first thing I would do is write a short program to do a single pass over your 2e9 silly format files and output a single file formatted like so:

1: 1(3) 3(4) 5(7) 17(1) 21(1)
2: 1(2) 3(2) 17(5) 20(1) 22(2)
3: 3(1) 10(3) 12(1)
...

Then I could dump all those silly format files.
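
A minimal sketch of that single pass, in Python. It assumes each source file holds a "hist:NAME" header followed by label / bar-of-# pairs; the real file format isn't shown in this thread, so the parser and the names parse_histogram, format_record and convert are purely illustrative:

```python
def parse_histogram(text):
    """Parse 'hist:NAME' followed by alternating LABEL and '####' tokens.

    The input layout is an assumption, not the poster's actual format.
    """
    tokens = text.split()
    if not tokens or not tokens[0].startswith("hist:"):
        raise ValueError("missing 'hist:' header")
    name = tokens[0][len("hist:"):]
    body = tokens[1:]
    if len(body) % 2:
        raise ValueError("label without a bar")
    counts = {}
    for label, bar in zip(body[0::2], body[1::2]):
        if set(bar) != {"#"}:
            raise ValueError(f"bad bar: {bar!r}")
        counts[int(label)] = len(bar)  # bar length is the count
    return name, counts


def format_record(name, counts):
    """Render one histogram as 'NAME: label(count) ...', labels ascending."""
    pairs = " ".join(f"{label}({n})" for label, n in sorted(counts.items()))
    return f"{name}: {pairs}"


def convert(texts):
    """Yield one record line per input histogram, in a single pass."""
    for text in texts:
        yield format_record(*parse_histogram(text))


if __name__ == "__main__":
    sample = "hist:x\n1 ##\n4 ####\n5 ####\n17 " + "#" * 10 + "\n12 #\n"
    for line in convert([sample]):
        print(line)
```

Sorting the labels on output means an out-of-order entry like the 17/12 above would not matter downstream.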

Then I'd look to reformat that single file into some kind of consistent record format, but then I read this bit of your description:

each in real case scenario containing approx 300 columns and there is a maximum of 8000 possible column labels(values)) i thought i should create a consensus histogram from all subject ones. such that this histogram has all 25 columns (now i am again talking about my example) and each column having the maximum number of data points (this is computed from the subject set- if the max number of data points for column 1 is 100 then this how large column 1 in my consensus hist will be.)

And got completely lost in the number of columns and ranges of values for each column...


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^2: Similarity searching
by baxy77bax (Deacon) on Jan 25, 2014 at 16:05 UTC
    "17 comes before 12 or just a typo? "

    typo! they are always ordered.

    "If true, the first thing I would do is write a short program to do a single pass over your 2e9 silly format files and output a single file formatted like so:"

    yes, it is already in some "nicely" formatted style, but for the purposes of visualization i used histograms (i thought it would be easier to understand the problem).

    "And got completely lost in the number of columns and ranges of values for each column..."

    so each histogram can have up to 300 columns. let's say each column label is a number; then 300 does not imply that the columns are labeled from 1 to 300, because the label range is 1-8000. out of those 8000 labels each histogram has at most 300 different ones (so at most the # of different ways you can pick 300 out of 8000). the size of a column does not have a maximum value.

    i hope i clarified the problem a bit.

    cheers

      i hope i clarified the problem a bit.

      Lots :)

      But ... "the size of the column does not have a maximum value." Even if there is no set upper limit, there is obviously a maximum value as found within your dataset. How big is that number?

      (The exact value is of little importance here, but the scale of the value will be very informative.)


        yea i see, in my real data set the biggest number is 250 (so the number of # symbols never passes 250). so they are not that big
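
With labels no larger than 8000 and counts no larger than 250, each label/count pair would fit in three bytes (a 16-bit label plus an 8-bit count), so one possible "consistent record format" is a fixed-width binary record. A rough sketch in Python, assuming at most 300 pairs per histogram and zero-padding for unused slots (label 0 is free because labels start at 1). The layout and the names pack_record and unpack_record are illustrative only, not something anyone in the thread specified:

```python
import struct

MAX_PAIRS = 300                      # at most 300 distinct labels per histogram
PAIR = struct.Struct("<HB")          # 16-bit label (1..8000), 8-bit count
RECORD_SIZE = MAX_PAIRS * PAIR.size  # 300 * 3 = 900 bytes per histogram


def pack_record(counts):
    """Pack a {label: count} dict into one fixed-width, zero-padded record."""
    if len(counts) > MAX_PAIRS:
        raise ValueError("too many labels for one record")
    out = bytearray()
    for label, n in sorted(counts.items()):
        if not 1 <= label <= 8000:
            raise ValueError(f"label out of range: {label}")
        if not 1 <= n <= 255:
            raise ValueError(f"count does not fit in one byte: {n}")
        out += PAIR.pack(label, n)
    out += bytes(RECORD_SIZE - len(out))  # pad unused slots with zeros
    return bytes(out)


def unpack_record(record):
    """Inverse of pack_record; stops at the first zero (padding) label."""
    counts = {}
    for label, n in PAIR.iter_unpack(record):
        if label == 0:
            break
        counts[label] = n
    return counts
```

Fixed-width records make it possible to seek straight to histogram number i at byte offset i * RECORD_SIZE, at the cost of padding for histograms with fewer than 300 labels.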