A framework for few-shot evaluation of language models.