I’m a multidisciplinary designer able to create things that transcend the natural boundaries of different technology domains. For example, I’m very interested in hardware connected software. Owning that whole stack results in a controlled and therefore elegant user experience.
I was approached by a client about creating a unique captioning system for his venue.
The client conceived the idea after seeing a captioning system at the Metropolitan Opera house in New York City. He wanted to have in-seat monitors that would display live captions of what the speaker was saying on stage in the venue. This functionality is much more complex than what the Met Titles system offers.
Through research, I discovered that Met Titles system was installed with a 14 million dollar budget. We were working with a little bit less resources… But I’m great at creating amazing experiences, even on a budget.
The system in the Met works great because the performance is scripted beforehand and the operators know what the titles are going to say. The operators just need to keep the captions in sync with the performance. What about when a performer isn’t following a script?
My job was to design a system for the venue with a complete custom software and hardware integration that would provide real time automatic captioning to multiple patrons without the need for a skilled transcriptionist. In other words, multiple displays need to be setup to show subtitles or captions of the spoken word, all in sync and without a trained specialist operating the system.
I tapped into every one of my skillsets to develop LiveTitles. The creation process was so complex, I won’t be able to delve into every detail in one article, but here’s a bit more about how I did it:
Early on I ruled out tiny displays. I also wanted full color to be an option in case we ever wanted to leverage the displays for different uses. So I drew up some concepts for special wooden cabinets to match the decor of the seats and surrounding architecture in the venue.
First, I needed to prove we could send audio out to a transcription service and receive back accurate text captions with extremely low latency. What’s the point of captions if they are too far behind the audio that’s being spoken? That would just create more confusion for the users we are trying to serve with this product.
I experimented with a number of transcription services but ultimately Google Cloud Speech provided the best results. Although fast and robust, there are some limitations currently including handling for some accents and dialects and blacklisting certain words (we created this functionality on a separate layer on our end.) We’re in the early stages of Audio processing by computers. Anyone using Siri knows these AU systems are not perfect, but with ideal environmental conditions, they tend to be very accurate. With Neural Networks as foundation of these systems, they are constantly learning and automatically improving from input by all users of the API.
However, we segmented this API connection out into its own component to keep things modular, so that it can be changed to a different Speech API offering easily in the future if needed in the event that a different transcription API takes the lead with better service, more options or other features that suit the needs of the product better.
An on site server sends house audio up to the cloud for live processing, then sends the results back to the venue’s displays. The displays need to be synchronized as much as possible so every patron would have the same experience. We don’t want different text at even slightly different times for users. Every user should have, and therefore every screen should present, the same experience.
So I experimented with using a master screen that all the other screens followed, but that wasn’t an ideal design. Fortunately the data we were transmitting was small and using Google Firebase the transmission was so fast and with such minimal latency, that we were able to use simultaneous connections from each display directly to the cloud server. Each of the screens individually could pull from the server several times per second, with no synchronization problems.
For his specific use case, some parts of the presentation are spoken on the fly, off the cuff, but others parts are read from a scripted format. While the transcriptions are quite accurate, by my tests about 90% under prime conditions, I thought it would be best if the system also employed an option similar to MetTitles where known texts can be loaded in and synced up with a speaker for 100% accuracy on the words. The only variable would be timing.
Next, I needed to design software that could control all this. I wanted to use a tablet to be the “key” to the system. Everything can be controlled from one simple iPad App.
Ultimately, I decided for parts of the performance that were already known, we would automatically load text scripts that could be loaded from a web based admin panel.
I designed an interface where a user could swipe with their finger over words on the screen. When a swipe action is performed across a word, specifically from left to right, the way words are read in English, then that word gets pushed to the remote displays instantly.
So in one mode, automatic transcription services are fast and responsive but maybe not 100% accurate, but in swipe mode, operated by a human, the text would be perfectly accurate with potential timing variables minimized the better the user is matching the speed of the speaker while swiping the words. Keeping this swipe action on time is not hard to do, even for new users.
But it was important to me that the system be able to switch between modes on the fly, without any perceptible difference to users of the system or operators besides a simple switch to control the mode setting.
A requirement from the client was that the system be as automatic as possible. Basically, he just wanted to push one button to start the system and push one to end it. Ideally the system could start automatically at set times, but I decided that letting a computer decide when to send private audio out to the internet automatically without human interaction could open up some problems, not the least of which could be privacy issues, and potential unknown hot mic issues. Since this system could be used at varying times, it would be best to require a human to turn it on and off with minimal interactions. Therefore I opted for a manual start switch.
As a user, I hate when things go wrong with touch screens. I can name countless products that are built incorrectly with tiny actions causing terrible results. Even the latest Twitter iOS App can knock you way out of threads by misinterpreting a vertical scroll as a horizontal swipe gesture, and pushing you back a few screens. The action is so minimal, such a light touch can cause drastic problems in the experience. It’s way overboard, totally broken and I can’t believe it hasn’t been complained about or fixed yet. While anecdotal and maybe not the biggest deal in the world, these little UI bugs are present in many more products than just Twitter, and really manifest themselves as bad UX. I always go out of my way to design things that avoid these issues. Especially when the mission is critical.
For example: A simple button to turn the system on and off with one tap could be disastrous in a real world scenario. Think about it, the whole premise of this system is to augment an experience for people who can’t fully enjoy it without this service. If the service goes down, it pretty much ruins the experience for the user. So what if the iPad controller app is accidentally tapped through normal usage or just by chance? We don’t want the system turning off by accident and interrupting service. Likewise we don’t want the system on with hot mics when not intended either. So I designed this switch. Drawing from Apple’s “slide to unlock” user interface element, it’s a slider switch to start up the system. I because some other UI controls also user a slider to toggle between 2 different modes, I wanted to avoid the use of a switch to turn off the system. Having 2 sliders on screen at the same time could not only cause user confusion but further perpetuate an accidental scenario where a slider is unintentionally executed.
So my design has the slider change to a tappable button. But the button doesn’t require just one tap to turn off, it requires 3 taps, in succession with a 9 second cool down. So if a user accidentally taps it, no problem, within 9 seconds the button will return to it’s normal UI state, but 3 intentional taps back to back and the system shuts down. By having the system detect 3 deliberate taps, we eliminate the possibility for many potential issues.
In the end, the system consists of a web interface for venue admins to control events and pre-scripted captions in the form of text scripts that can be uploaded or copied and pasted, an iPad App used by a venue operator that controls the functions of the system for On and Off and the optional swipeable pre-scripted mode, custom software for a server computer that serves audio into the cloud, a cloud server that facilitates the audio to text transcription, and a internal company admin interface to onboard additional venues.
I also designed the logo, branding and marketing for LiveTitles.
This logo went through very few iterations before I settled on it. Mostly I experimented with the angles on the display cabinet icon to get something that felt right but I knew I wanted a 2 color logo with positive and negative space evocative of some of the UI elements in the Apps and expressive of the simplicity the product offers. I remember settling on the font choice because of the clean lines and equidistant geometry. I tend to like “O’s” exactly circular in logos, for example. While there are no O’s in live titles Royal blue, I liked that this font – Sofia – has no serifs or terminals on letters like the lower case T Just nice round circular curves and right angles.
It looks a bit Microsoft “Windowsy” but in the sense that it’s clean and looks nice and official, I think that’s a success.
But wait, there’s more
But what if these LiveTitles are not needed by a patron in the seats and they find them distracting rather than helpful?
I designed this screen that pops up with 2 taps on the display that will hide the screen functionality with a wood style overlay that matches the wooden box, blending seamlessly with the surrounding texture.
The wood overlay image is calibrated to match the wooden display cabinet perfectly on a per venue basis. No matter the color of the display housing, this interface is designed to match perfectly.
There’s been a lot of push back on skeuomorphism since Apple unveiled the flat iOS 7 way back in 2012. But honestly, I think it has swung too far in the other direction. Skeuomorphic elements done with intent and in good taste still add to design. And in this case, serves to blend the UI right in with the custom manufactured wood display cabinets when necessary.
If you would like me to create a complex system that provides an easy experience for your users – with or without skeuomorphism – contact me!