OpenAI has recently taken over tech circles, surprising everyone with their code-generation models. A few weeks ago, I received access to Copilot, and I'm surprised by how much boilerplate it can predict. Because my day-to-day code is non-trivial, having Copilot write suggestions for it doesn't seem to help me much. On the flip side, Copilot shines in assisting me with research code. It can recite the torchvision ResNet from a one-word prompt, oftentimes even without any mention of ResNet (I imagine this is because ResNet may be the most recurring code on GitHub after `import torch`). So it seems safe to assume that, after acquiring a lot of users and a lot of engineering effort, Codex can be tuned to do some trivial tasks almost perfectly. So how does this technology translate into apps for the average user? What UI/UX design principles need to be established?
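To give a sense of what I mean, here's a hypothetical sketch of that kind of one-comment prompt and completion. The APIs are standard torch/torchvision, but the completion itself is only illustrative, not actual Copilot output:

```python
# A hypothetical illustration, not captured Copilot output.
# The "prompt" is just the one-line comment below; everything after it is the
# sort of boilerplate a code-generation model tends to fill in.

# resnet
import torch
from torchvision.models import resnet50

model = resnet50()                     # standard torchvision ResNet-50
model.eval()

dummy = torch.randn(1, 3, 224, 224)    # typical ImageNet-sized input
with torch.no_grad():
    logits = model(dummy)
print(logits.shape)                    # torch.Size([1, 1000])
```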
This post is partly inspired by the above Codex demo. As you can see, Codex generates code that is executed at runtime based on the user's natural language input. While this in itself is impressive, imagine all the apps powered by a GPT-* model turning some form of natural language into code that does something in the app to help you out - this introduces new challenges in terms of UX. In apps like these, streaming data is a first-class citizen for any direct changes to the app, and unlike current apps, the data returned might not always be perfect. For example, if "red" is for some reason inferred as "bread", there's an additional confirmation-and-correction delay, which makes the UX worse than just doing the task yourself. This introduces a paradigm where prediction latency and user expectations must work in synergy, and can be elegantly handled via strong design principles. This paradigm is slightly different from, say, Siri or Google Assistant, where the actions performed are essentially a smart way of executing code humans have already written. I predict this is the start of a transition to apps that stream natural language input to produce the desired output, with 5G's adoption growing now more than ever and complementing these scenarios.
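To make that confirmation-and-correction cost concrete, here's a minimal sketch of such a flow. `generate_code` is a stand-in for a Codex-style model call (not a real API), and `set_background` is a hypothetical app action; the point is the extra round trip the user pays whenever the prediction is wrong:

```python
# A hypothetical sketch of a confirm-before-execute loop for model-generated
# actions. generate_code() stands in for a Codex-like model call, and
# set_background() for an app action - both are made up for illustration.

from dataclasses import dataclass


@dataclass
class Suggestion:
    prompt: str   # the user's natural language request
    code: str     # code the model proposes to run


def generate_code(prompt: str) -> Suggestion:
    # Placeholder for the model call; a real app would stream the suggestion
    # back token by token so the user sees it forming instead of waiting.
    return Suggestion(prompt, 'set_background("red")')


def run_with_confirmation(prompt: str) -> None:
    suggestion = generate_code(prompt)
    print(f"About to run: {suggestion.code}")
    answer = input("Looks right? [y/N/edit] ").strip().lower()
    if answer == "y":
        exec(suggestion.code, {"set_background": print})   # sandbox this in reality
    elif answer == "edit":
        # This is the correction delay: a wrong prediction ("bread" instead
        # of "red") costs the user another round of typing.
        exec(input("Corrected code: "), {"set_background": print})
    else:
        print("Cancelled - faster to do it yourself this time.")


# run_with_confirmation("make the background red")
```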
So, how does this fit into Human-Computer Interaction (HCI) more generally? This brings me to another class of models: Perceiver IO by DeepMind. Perceiver IO introduces techniques to work with multi-modal inputs - language, images, audio, etc. - all leveraged by a single network given the right task and dataset at hand. Now imagine a model having the ability to generate outputs based on different sources of input, all of them working together to give a user what they want! What if the next Codex built on GPT-x looks at your face and, based on the natural language prompt, outputs code depending on your mood? (Personal note: I did get irritated when Copilot gave random output while I was still thinking.) While this is an absurd example, the possibilities of how a robot could perceive different inputs - human, environmental, etc. - all with one model are endless! A large-scale GPT with non-blocking multi-modal abilities could certainly have a much broader impact on human-robot interaction than today's limited human-edge-device interaction. Non-blocking in this context means the model can still perform some tasks while data from one of the modalities it supports is missing. HCI here would treat safety, ethics, etc. as primary focus points rather than just something in the background.
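Here's a rough sketch of what "non-blocking" could mean at the interface level, assuming made-up feature sizes and a trivial mean-pooling fusion (a Perceiver-IO-style model would instead cross-attend over a shared latent array). The point is simply that missing modalities are skipped rather than blocking the forward pass:

```python
# A rough sketch of a "non-blocking" multi-modal interface: the model accepts
# whichever modalities are present and skips the missing ones. Feature sizes
# and the mean-pooling fusion are illustrative assumptions only.

from typing import Optional
import torch
import torch.nn as nn


class NonBlockingMultiModal(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.text_enc = nn.Linear(300, dim)    # assumed text feature size
        self.image_enc = nn.Linear(2048, dim)  # assumed image feature size
        self.audio_enc = nn.Linear(128, dim)   # assumed audio feature size
        self.head = nn.Linear(dim, 10)         # some downstream task head

    def forward(
        self,
        text: Optional[torch.Tensor] = None,
        image: Optional[torch.Tensor] = None,
        audio: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        parts = []
        if text is not None:
            parts.append(self.text_enc(text))
        if image is not None:
            parts.append(self.image_enc(image))
        if audio is not None:
            parts.append(self.audio_enc(audio))
        if not parts:
            raise ValueError("need at least one modality")
        fused = torch.stack(parts).mean(dim=0)  # degrade gracefully, don't block
        return self.head(fused)


model = NonBlockingMultiModal()
out = model(text=torch.randn(1, 300))                               # image/audio missing: still works
out = model(text=torch.randn(1, 300), image=torch.randn(1, 2048))   # more modalities, same interface
```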
These models might just bring a revolution in the design principles around HCI and UI/UX, with future apps and developer tools that work directly with natural language input - and possibly more than just language - easing the entire experience!