Been away for a few weeks: in the meanwhile, my computers have been happily synthesizing.
Without going too long on the details: the sophisticated cephalopod in the illustration was generated by a VQGAN ("Vector Quantized Generative Adversarial Network"), as described in the recent paper "Taming Transformers," and guided by OpenAI's CLIP ("Contrastive Language–Image Pre-training"), as described in another recent paper, "Learning Transferable Visual Models From Natural Language Supervision."
The two schemes were promptly connected by math-savvy artists and explorers and shared via Python notebooks: a charge led by Katherine Crowson, aka RiversHaveWings.
Not-quite-tweet-sized explanation: Associations between images and their descriptions can be learned alongside the language itself (CLIP). Using such a bimodal model, an image generator (like VQGAN) can then search for images (err, arrays of pixels) that align with the perceived "meaning" of a text description. Thus: "elegant octopus serving tea" has the image above as one approximated "solution."
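For the curious, the search described above can be sketched in miniature. This is a toy, not the real pipeline: a random linear map stands in for the VQGAN generator, a fixed random vector stands in for CLIP's embedding of the prompt, and a finite-difference gradient stands in for backpropagation. The shapes, learning rate, and step count are all invented for illustration; only the idea is faithful: nudge a latent code so the generated thing's embedding aligns with the text's embedding.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (NOT real CLIP/VQGAN): a fixed "text embedding" for the
# prompt, and a random linear "generator" mapping latent code -> embedding.
D_LATENT, D_EMBED = 16, 32
text_embedding = rng.normal(size=D_EMBED)
generator = rng.normal(size=(D_EMBED, D_LATENT))

def cosine(a, b):
    # Alignment score between two embeddings, as CLIP uses.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

z = rng.normal(size=D_LATENT)          # latent code to optimize
initial = cosine(generator @ z, text_embedding)

# Gradient ascent on similarity. The real pipeline backpropagates through
# both networks; here we use crude finite differences on the toy model.
lr, eps = 0.1, 1e-5
for _ in range(200):
    base = cosine(generator @ z, text_embedding)
    grad = np.zeros_like(z)
    for i in range(D_LATENT):
        zp = z.copy()
        zp[i] += eps
        grad[i] = (cosine(generator @ zp, text_embedding) - base) / eps
    z += lr * grad

final = cosine(generator @ z, text_embedding)
print(f"similarity: {initial:.3f} -> {final:.3f}")
```

Each step moves the latent code so its "image" reads as slightly more octopus-serving-tea-like to the judge; swap in actual VQGAN and CLIP models (and real backprop) and you have the notebooks linked above.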
This is a very quick post. There are many, many other uses for these connections, and I'll be posting more of them soon.