# for each page,
#   romanize anything you can romanize
#   remove subjective words if possible
#   canonicalize the page, strip formatting
#   remove all low-value words
#   dissolve the page into letters
#   form an absolute 26-space vector from remaining letters
#   do NOT normalize the 26-space vector
#   save the 26-space vector for this page
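
The steps above could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation. The word lists `SUBJECTIVE_WORDS` and `LOW_VALUE_WORDS` are hypothetical placeholders, since the notes do not say which words qualify. The `romanize` helper only strips diacritics from Latin letters, a rough stand-in for real romanization of other scripts. "Strip formatting" is assumed to mean removing markup-style tags, and "absolute" is read as raw letter counts. All function names are illustrative.

```python
import json
import re
import string
import unicodedata
from typing import Dict, List

# Hypothetical word lists: the notes do not define either set.
SUBJECTIVE_WORDS = {"good", "bad", "best", "worst", "beautiful", "ugly"}
LOW_VALUE_WORDS = {"a", "an", "the", "of", "and", "or", "to", "in", "is", "it"}


def romanize(text: str) -> str:
    """Rough stand-in for romanization: drop diacritics from Latin letters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def remove_words(text: str, words: set) -> str:
    """Delete every alphabetic token whose lowercase form is in `words`."""
    return re.sub(
        r"[A-Za-z]+",
        lambda m: "" if m.group(0).lower() in words else m.group(0),
        text,
    )


def canonicalize(text: str) -> str:
    """Strip markup-style tags (an assumption), lowercase, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text.lower()).strip()


def page_vector(page: str) -> List[int]:
    """Return raw (unnormalized) counts of the letters a-z left on the page."""
    text = romanize(page)
    text = remove_words(text, SUBJECTIVE_WORDS)
    text = canonicalize(text)
    text = remove_words(text, LOW_VALUE_WORDS)
    vector = [0] * 26
    for ch in text:  # dissolve the page into letters
        if ch in string.ascii_lowercase:
            vector[ord(ch) - ord("a")] += 1
    return vector  # deliberately NOT normalized


def build_vectors(pages: Dict[str, str]) -> Dict[str, List[int]]:
    """Compute one 26-entry vector per page, keyed by page id."""
    return {page_id: page_vector(text) for page_id, text in pages.items()}


def save_vectors(vectors: Dict[str, List[int]], path: str) -> None:
    """Save the vectors as JSON; the storage format is an assumption."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(vectors, fh)
```

For example, `page_vector("<b>The Café</b> is the BEST café!")` keeps only the letters of the two occurrences of "cafe", so the entries for `a`, `c`, `e`, and `f` are each 2 and every other entry is 0. Because the vector is left unnormalized, longer pages produce larger counts, which preserves page length as a signal.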