Note: I built a web app around an image captioning model, as demoed later in this post. I wish I could release the app so others could try it, but given the currently closed nature of OpenAI GPT3 keys, the token usage limits for private beta users, and the process one has to go through to release anything publicly, I have decided to keep the app private for now.
I was recently fortunate enough to get beta access to the much-talked-about OpenAI GPT3 generative language model API. In the short span of 5-6 months there’s already an incredible amount of online literature, ranging from technical deep dives of the paper itself, Language Models are Few-Shot Learners, to every imaginable test of text generation (e.g., gwern).
If you haven’t heard the news yet or seen the many videos across the internet, here’s a quick rundown:
The largest GPT3 model has 175B parameters
Example of a GPT3 generated passage that humans had difficulty distinguishing
From reading the paper’s results (and any cursory test of the API itself), it’s easy to see that the model performs well on a variety of tasks where you simply feed it text and it outputs text in return. The paper tests a variety of common NLP benchmark tasks, flexing on both fine-tuned, task-specific SOTA models and human beings (apparently college students, on average, are not too great at SAT analogies).
Similarly, for the past 4 months hobbyists have tinkered extensively with applications of plain text generation, often in a creative context (e.g., AI Dungeon). VC-backed startups have been built around text summarization and generation. Someone tried to get GPT3 to tell them the meaning of life and the universe.
However, some people have found more interesting applications where the text GPT3 outputs ends up as something else entirely:
This idea of raw text → GPT-3 → some structured data → some tool that builds from the structured data seems like an incredible concept with amazing potential.
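As a concrete illustration, here is a minimal sketch of that pipeline. The GPT3 call is mocked with a canned completion so the sketch runs offline, and the prompt wording, JSON schema, and HTML "tool" stage are all invented for illustration:

```python
import json

def fake_gpt3_complete(prompt: str) -> str:
    """Stand-in for a real GPT3 API call; returns a canned completion
    so this sketch runs without an API key."""
    return '{"element": "button", "label": "Sign up", "color": "blue"}'

def text_to_structured(description: str) -> dict:
    """Raw text -> GPT3 -> structured data (here, a JSON spec)."""
    prompt = f"Description: {description}\nJSON:"
    completion = fake_gpt3_complete(prompt)
    return json.loads(completion)

def build_html(spec: dict) -> str:
    """The 'tool' stage: turn the structured data into an artifact."""
    return f'<button style="color:{spec["color"]}">{spec["label"]}</button>'

spec = text_to_structured("a blue sign-up button")
html = build_html(spec)
```

The interesting part is the middle hop: once the model's output is constrained to a machine-readable format, any downstream tool can consume it.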
Likewise, there have been experiments where the input into GPT3 itself was sourced from beyond text, but I haven’t found many great examples outside of voice-to-text. For example: Podcast with GPT3 - this is honestly a little gimmicky, but in a way you can think of the flow as voice → speech-to-text API → GPT3 API → text-to-speech → voice.
Naturally, this led me to think about how to incorporate images as an input - how could we give “vision” to a “blind” model? I came across a GPT2 project in which someone translated training-image pixels into a text representation, trained the model to recognize this structured data format, and generated new Pokemon sprites. Unfortunately GPT3 is not open sourced like GPT2, and as of yet there is no way to fine-tune it on such a custom representation of images. OK then, what if I somehow describe what is in the image, and use GPT3 to build on top of that description?
Examples from a popular image captioning training dataset
There is an entire body of literature and research on image captioning and other cross-modal representations related to vision + language tasks, though even today’s SOTA seems to be a bit hit or miss. Just for the sake of having a quick working model, I decided to spin up an existing repo of Self-critical Sequence Training for Image Captioning, CVPR17, though at the time of writing much more modern, transformer-based models exist, such as Microsoft OSCAR (Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks).
The approach is fairly straightforward: feed into GPT3 what the captioning model outputs. Presumably GPT3 will take a plain description and add some flair, depending on the seeded prompt. A couple of quick notes:
Illustrative flow of image through the captioning app
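The flow above can be sketched roughly as follows. Both the captioning model and the GPT3 call are mocked with canned outputs here (the canned flair line is borrowed from the app's own examples), and the seed-prompt wording is my own invention rather than the app's actual prompt:

```python
def caption_image(image_path: str) -> str:
    # Stand-in for the captioning model's output: a plain,
    # literal description of the image.
    return "a woman riding a wave on a surfboard"

def build_prompt(caption: str) -> str:
    # Hypothetical seed prompt; the real app's wording may differ.
    return (
        "Turn plain image descriptions into dramatic one-liners.\n\n"
        f"Description: {caption}\n"
        "Caption:"
    )

def mock_gpt3(prompt: str) -> str:
    # Canned completion standing in for the GPT3 API call.
    return "She knew that it would be her last surfing adventure."

def image_to_flair(image_path: str) -> str:
    # image -> captioning model -> seeded prompt -> GPT3 -> flair
    return mock_gpt3(build_prompt(caption_image(image_path)))
```

Swapping the seed prompt is the main lever: the same literal caption can be steered toward drama, humor, or deadpan commentary.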
I mocked up a quick web app to demonstrate a simple captioning engine. As seen from the demo, some examples worked better than others.
Here are a few additional images I ran through the app.
“They all face a huge problem, they don’t have balls”
“The woman is outraged that so many people are facing starvation while she can afford this luxury”
“It wasn’t even worth it”
“A young man sits in the dark bedroom of his autistic brother, who is sitting up and staring apprehensively at the computer”
“She knew that it would be her last surfing adventure”
From my couple weeks of experimentation, it does seem like GPT3’s primary source of power is scale: the result of compiling the largest corpus of text into the largest, most complex neural net. Fundamentally, it’s unclear how much “logic,” in a plain human-like sense, the model has actually retained. The idea of translating text into structured data for building purposes is a prime way to test the model’s capabilities (e.g., can you logically write code, do math, or do any task that requires more than pattern matching against a large set of examples?). It will be interesting to see where people take this notion, and hopefully I’ll get to do some more experimentation in a part II.
Special thanks: