
"17 comes before 12 or just a typo? "

Typo! They are always ordered.

"If true, the first thing I would do is write a short program to do a single pass over your 2e9 silly format files and output a single file formatted like so:"

Yes, it is already in a "nicely" formatted style, but for the purposes of visualization I used histograms (I thought it would be easier to understand the problem).

"And got completely lost in the number of columns and ranges of values for each column..."

So each histogram can have up to 300 columns. Let's say each column label is a number; then 300 does not imply that the columns are labeled from 1 to 300. The label range is 1-8000, and from those 8000 labels each histogram has at most 300 different labels (so the number of possible label sets is bounded by the number of ways you can pick up to 300 out of 8000). The size of the column does not have a maximum value.
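To make the description above concrete, here is a minimal sketch of one histogram as a sparse label-to-size mapping. It is only an illustration, not the actual file format: the names `MAX_LABEL`, `MAX_COLUMNS` and `make_histogram` are made up for this example, and it assumes labels are integers and that a column with size 0 is simply left out.

```python
# Limits taken from the description above; the names are hypothetical.
MAX_LABEL = 8000    # labels come from the range 1..8000
MAX_COLUMNS = 300   # at most 300 different labels per histogram


def make_histogram(pairs):
    """Build a sparse histogram {label: size} from (label, size) pairs.

    Assumption: a column with size 0 is simply absent, so sizes start at 1.
    No upper bound is enforced on the size.
    """
    hist = {}
    for label, size in pairs:
        if not 1 <= label <= MAX_LABEL:
            raise ValueError(f"label {label} is outside 1..{MAX_LABEL}")
        if size < 1:
            raise ValueError(f"column {label} has non-positive size {size}")
        if label in hist:
            raise ValueError(f"label {label} appears twice")
        hist[label] = size
    if len(hist) > MAX_COLUMNS:
        raise ValueError(f"{len(hist)} columns, at most {MAX_COLUMNS} allowed")
    # Columns are always ordered by label, so keep the mapping sorted.
    return dict(sorted(hist.items()))


example = make_histogram([(17, 1), (12, 3), (7950, 40)])
```

Here `example` holds three columns, ordered by label, out of the 8000 possible labels.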

I hope I clarified the problem a bit.

Cheers

Re^3: Similarity searching
by BrowserUk (Patriarch) on Jan 25, 2014 at 16:19 UTC
    I hope I clarified the problem a bit.

    Lots :)

    But ... "the size of the column does not have a maximum value." Even if there is no set upper limit, there is obviously a maximum value as found within your dataset. How big is that number?

    (The exact value is of little importance here, but the scale of the value will be very informative.)


      Yeah, I see. In my real data set the biggest number is 250 (so the number of # symbols never exceeds 250), so they are not that big.
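One thing that scale suggests, purely as a sketch and not something anyone in the thread has proposed: a size of at most 250 fits in one unsigned byte, and a label in 1-8000 fits in two, so each column could be stored in 3 bytes and a full 300-column histogram in at most 900 bytes. The record layout and the function names `pack_histogram` and `unpack_histogram` below are assumptions made up for this illustration.

```python
import struct

# Hypothetical record layout: little-endian unsigned 16-bit label followed by
# an unsigned 8-bit size, with no padding (3 bytes per column).
PAIR = struct.Struct("<HB")


def pack_histogram(hist):
    """Pack a {label: size} mapping into bytes, ordered by label.

    struct raises struct.error if a size exceeds 255 or a label exceeds 65535.
    """
    out = bytearray()
    for label, size in sorted(hist.items()):
        out += PAIR.pack(label, size)
    return bytes(out)


def unpack_histogram(blob):
    """Inverse of pack_histogram: rebuild the {label: size} mapping."""
    return {label: size for label, size in PAIR.iter_unpack(blob)}


hist = {12: 3, 17: 1, 7950: 250}
blob = pack_histogram(hist)
restored = unpack_histogram(blob)
```

Whether a packed layout like this is worth it depends on how the histograms are going to be compared, which is still open at this point in the thread.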