account for idioms

currently we treat our text as a flat list of tokens. in reality, most text consists of idioms interspersed with less predictable material.

see if we gain any benefit by trying to partition text into idioms and non-idioms.
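
a rough sketch of what such a partition could look like, to make the idea concrete. this is not an existing part of the project: `partition_idioms` and `min_count` are hypothetical names, and "idiom" is assumed here to mean a maximal run of tokens in which every adjacent pair occurs at least `min_count` times in the same token list. a real experiment would probably want a better score (e.g. conditional probability from the model itself) and counts from a larger corpus.

```python
from collections import Counter


def partition_idioms(tokens, min_count=2):
    """Split tokens into a list of (is_idiom, span) runs.

    Assumption (not from the original note): an idiom is a maximal run of
    tokens where every adjacent pair occurs at least min_count times in
    the given token list. Everything else is grouped as non-idiom.
    """
    bigrams = Counter(zip(tokens, tokens[1:]))
    spans = []
    i = 0
    while i < len(tokens):
        j = i
        # extend the run while the next adjacent pair is frequent enough
        while j + 1 < len(tokens) and bigrams[(tokens[j], tokens[j + 1])] >= min_count:
            j += 1
        span = tokens[i:j + 1]
        is_idiom = len(span) > 1
        if not is_idiom and spans and not spans[-1][0]:
            # merge consecutive non-idiom tokens into one span
            spans[-1][1].extend(span)
        else:
            spans.append((is_idiom, span))
        i = j + 1
    return spans


example_tokens = "by the way it rains and by the way it snows".split()
example_spans = partition_idioms(example_tokens)
```

with this toy input, the repeated phrase "by the way it" comes out as an idiom span and the remaining words as non-idiom spans. whether a split like this actually helps is the open question of this note.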