An AI Agent Prototype for Assistance in Creating Video Compilations
Summary:
- AI-Powered Video Compilation: Veritone’s Digital Media Hub utilizes AI, including semantic search and Large Language Models (LLMs), to help users create compilation videos by generating storylines and search queries based on a user-provided vision.
- User-Friendly Interface: The prototype features an intuitive interface that allows users to interact with suggested search queries and select video segments from their library, facilitating a mix of traditional point-and-click and natural language interactions.
- Future Enhancements: Improvement plans include enabling iterative conversations with the LLM, automating search queries based on results, and potentially integrating text-to-speech for narration, which would enhance the overall user experience in video compilation.
Veritone’s Digital Media Hub (DMH) is an AI-powered media asset management and monetization solution, which many of Veritone’s customers in media and entertainment (and other sectors) use to store their massive collections of video material. A typical task for someone working with a large video library is to create a compilation video on a specific topic using clips from the library.
One key ingredient for solving this task is a powerful search engine, which has always been at the core of our DMH software and is currently being further improved with semantic search capabilities based on multi-modal embeddings. In short: finding what the user means rather than what they type, by going beyond simple keyword search and using, e.g., a combined image/text embedding model such as OpenAI’s CLIP or AWS Bedrock’s Titan.
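As a rough illustration of the idea (not the production DMH pipeline), here is a minimal sketch of embedding-based semantic search in Python, assuming keyframes have already been extracted from the library’s videos and using the open-source sentence-transformers CLIP wrapper as a stand-in for the embedding model:

```python
# Minimal sketch of embedding-based semantic search over video keyframes.
# Assumption: one keyframe image per video segment already exists on disk;
# the CLIP wrapper below stands in for whatever embedding model DMH uses.
from pathlib import Path

from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

# Embed the keyframes (hypothetical local files) into the joint image/text space.
keyframes = sorted(Path("keyframes").glob("*.jpg"))
frame_embeddings = model.encode([Image.open(p) for p in keyframes], convert_to_tensor=True)

# Embed the user's query text into the same space and rank keyframes by cosine similarity.
query = "Lion roaring in savanna"
query_embedding = model.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_embedding, frame_embeddings)[0]

for path, score in sorted(zip(keyframes, scores.tolist()), key=lambda x: -x[1])[:5]:
    print(f"{score:.3f}  {path.name}")
```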
However, this article focuses on another important aspect of this task—coming up with a storyline and good search terms to find suitable video clips to add to the compilation. That’s an excellent subproblem to tackle with the help of a Large Language Model (LLM), which is why the Veritone Labs team has created a prototype to explore this use case. Currently, the web prototype can do the following:
- The user begins by providing an “overall vision” for the compilation video with a text description of what it should be about.
- An LLM is invoked with this user-provided overall vision and instructions to create five parts of the video compilation (basically, the overall story) and suggest five search queries for finding suitable clips in the video library for each part.
- Using a DMH-like search user interface, the user then runs these LLM-suggested searches against the video library (or comes up with their own ideas) and selects video segments to include in the compilation.
- Optionally, an additional music or audio track can be uploaded.
- Finally, the prototype creates a new video clip by extracting the selected video segments from the respective source videos, concatenating them with cross-fades, and overlaying the music track. The user can preview or download the final compilation video.
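The prototype handles this final assembly step under the hood; the following sketch is only meant to illustrate what it roughly involves, using MoviePy (which is not necessarily what the prototype uses) and made-up file names, timestamps, and a one-second cross-fade:

```python
# Sketch of the assembly step: cut the selected segments, concatenate them
# with cross-fades, and lay a music track underneath. Illustration only;
# all file names and timestamps are invented.
from moviepy.editor import AudioFileClip, VideoFileClip, concatenate_videoclips

# (source file, start seconds, end seconds) for the segments the user selected
selections = [
    ("lion.mp4", 12.0, 18.5),
    ("elephants.mp4", 40.0, 47.0),
    ("penguins.mp4", 5.0, 11.0),
]

fade = 1.0  # cross-fade duration in seconds
clips = [VideoFileClip(path).subclip(start, end) for path, start, end in selections]
faded = [clips[0]] + [c.crossfadein(fade) for c in clips[1:]]

# Overlap consecutive clips by the fade duration so the cross-fades line up.
compilation = concatenate_videoclips(faded, method="compose", padding=-fade)

# Overlay the optional music track, trimmed to the compilation's length.
music = AudioFileClip("music.mp3").subclip(0, compilation.duration)
compilation = compilation.set_audio(music)

compilation.write_videofile("compilation.mp4", codec="libx264", audio_codec="aac")
```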
The “overall vision” for the compilation is meant to guide video selection and, therefore, can set the tone and feeling of the final video. For example, a vision statement might be: “Advertisement for a family beach vacation on Crete with videos from different times of the day.” Additional signals can be included to inform the video clip selection process, including video popularity, current events and trends, and a specification of the target audience. Let’s walk through an example.
Let’s say we’re in a hurry and want to leave almost all the creative/cognitive work to the LLM by providing an overall vision consisting of only a single word: “animals”. We then click on the button “Auto-generate Parts” to invoke the LLM.
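Conceptually, that button click triggers a single LLM call with the vision and some instructions. The following is a hypothetical sketch using the OpenAI Python SDK; the prototype’s actual prompt wording, model, provider, and JSON schema are not shown in this article, so the ones below are assumptions:

```python
# Hypothetical sketch of what the "Auto-generate Parts" button might trigger.
# Model name, prompt wording, and JSON schema are assumptions for illustration.
import json

from openai import OpenAI

client = OpenAI()

overall_vision = "animals"

prompt = f"""You are helping to plan a compilation video.
Overall vision: {overall_vision}

Suggest 5 parts for the compilation. For each part return a short title,
a one-sentence description, and 5 search queries for finding suitable clips
in a video library. Respond as JSON:
{{"parts": [{{"title": "...", "description": "...", "queries": ["...", "...", "...", "...", "..."]}}]}}"""

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},
)

parts = json.loads(response.choices[0].message.content)["parts"]
for i, part in enumerate(parts, start=1):
    print(f"Part {i}: {part['title']} - {part['description']}")
    for query in part["queries"]:
        print(f"  query: {query}")
```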
Under the hood, this is what the prototype receives back from the LLM:
Based on the overall vision of “animals,” I’ll suggest 5 ideas for parts of the compilation and provide search queries for each. Here are the ideas and queries:
| Part | Description | Queries |
| --- | --- | --- |
| 1 | Showcase powerful and impressive wild animals in their natural habitats. | Lion roaring in savanna; Elephant herd walking; Eagle soaring over mountains; Wolf howling in forest; Whale breaching ocean surface |
| 2 | Highlight adorable baby animals playing or interacting with their parents. | Kitten playing with yarn; Puppy learning to walk; Baby penguin waddling; Fawn nuzzling mother deer; Baby monkey clinging to mother |
| 3 | Feature surprising or lesser-known animal behaviors that might intrigue viewers. | Octopus camouflaging; Beaver building dam; Chameleon changing colors; Flamingos performing mating dance; Leafcutter ants carrying leaves |
| 4 | Show the diversity of animal habitats across the globe. | Polar bear on ice floe; Camel walking in desert; Toucan in rainforest canopy; Kangaroo hopping in Australian outback; Puffin on coastal cliff |
| 5 | Depict positive interactions between humans and animals. | Zookeeper feeding giraffe; Child petting friendly dog; Scuba diver swimming with dolphins; Farmer tending to cows; Falconer with bird of prey |
Nice! The LLM has come up with some useful ideas for our video compilation.
The LLM also returns its results as structured output (JSON; not shown above), which the prototype uses to render the web user interface, so the user only has to click the respective buttons to run the model’s suggested searches against their video collection:
In the above screenshot, we are looking at Part 1 of our compilation video under construction, where the LLM suggested the topic “Majestic Wild Animals” with the description “Showcase powerful and impressive wild animals in their natural habitats”, and we just ran the first suggested search query, “Lion roaring in savanna”. Of course, the search results depend heavily on the video material present in the library.
At this point, the user browses the search results and either selects one of the found video segments to use for Part 1, tries one of the other suggested search queries (by clicking on the button “Elephant herd walking”, for example), or enters their own search query in the input field. After we have found and selected a video segment for the first part of our compilation video, we move on to the next part (“Cute Baby Animals”), and so on.
Here’s a demo video of the above:
A mixed user interface and user experience
We’re quite excited about this mix of user interface and user experience: traditional point-and-click actions on a web page on the one hand and natural language on the other. Each has its strengths and weaknesses and is superior to the other in certain scenarios. Probably nobody wants to type “play the third video in the second row” into a chat input box when they could simply click on the video thumbnail instead. But for (semantic) search itself, as well as for the core process of the described prototype (creating the story and “populating” it with promising search queries to try), the full expressive power and flexibility of natural language and the impressive capabilities of today’s LLMs and multi-modal models can make a huge difference.
We plan to explore this concept a lot more, as we believe there is tremendous potential for use cases like this one and for all kinds of software products, like Veritone Digital Media Hub. It’s not enough to have a chat window on the side where you can also talk to an LLM. It’s about deep integration of an agentic model: one you can chat with, yes, but also one that is aware of “traditional” user interactions like button clicks as they take place, so it has access to the complete current context. And for interacting with the user, we need to give the agent richer possibilities than just sending chat messages; in the use case of this article, for example, it needs the capability to present video search results and ask the user to select something. From today’s perspective, it looks like amazing things can be built here, but there is a lot of work to be done.
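One way to picture such deep integration is as a two-way message protocol between the UI and the agent. The sketch below is purely conceptual, with made-up event and action names rather than the prototype’s actual design:

```python
# Conceptual sketch (not the prototype's actual design) of a two-way protocol:
# the UI reports "traditional" interactions as events, and the agent responds
# with richer actions than plain chat text.
from dataclasses import dataclass, field
from typing import Literal


@dataclass
class UIEvent:
    """Something the user did in the point-and-click interface."""
    kind: Literal["button_click", "segment_selected", "query_entered", "chat_message"]
    payload: dict = field(default_factory=dict)


@dataclass
class AgentAction:
    """Something the agent asks the UI to do, beyond sending chat text."""
    kind: Literal["say", "show_search_results", "ask_user_to_select", "update_part"]
    payload: dict = field(default_factory=dict)


def handle_event(event: UIEvent) -> list[AgentAction]:
    """Toy dispatcher: in a real system, the agent (an LLM with tools) would
    decide which actions to emit based on the full session context."""
    if event.kind == "segment_selected":
        return [
            AgentAction("say", {"text": "Great choice, moving on to the next part."}),
            AgentAction("update_part", {"part": event.payload.get("part")}),
        ]
    return [AgentAction("say", {"text": "Noted."})]
```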
Future work for the video compilation assistant
Carrying these last thoughts back to our video compilation assistant prototype, there is plenty of room for improvement. For example, a current drawback is that the (potentially quite detailed) user input at the beginning of the pipeline, the “overall vision”, is written once, and then the agent does its magic. It would be better to have more of a conversation with the agent, such as asking (in natural language) for specific adaptations after seeing the first results, and iterating several times rather than starting from scratch.
Instead of the current text input field for the “overall vision,” this would require a conversation chat window somewhere in the UI and a way of holistically representing the current state of the user session internally. This state representation would need to be interpreted and modified by both the traditional interface (say, via JavaScript) and the agent (essentially, an LLM), resulting in a smooth and efficient overall experience for the user, where both parts of the user interface change appropriately given the latest context transitions.
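As a rough illustration, such a shared session state might look something like the following; the field names and structure are assumptions for the sake of the sketch, not the prototype’s internals:

```python
# Hypothetical sketch of a holistic session state that both the traditional UI
# and the agent could read and modify; field names are illustrative only.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SelectedSegment:
    video_id: str
    start_seconds: float
    end_seconds: float


@dataclass
class CompilationPart:
    title: str
    description: str
    suggested_queries: list[str] = field(default_factory=list)
    selected_segment: Optional[SelectedSegment] = None


@dataclass
class SessionState:
    overall_vision: str = ""
    conversation: list[dict] = field(default_factory=list)  # chat turns with the agent
    parts: list[CompilationPart] = field(default_factory=list)
    music_track: Optional[str] = None

    def to_prompt_context(self) -> str:
        """Serialize the state so it can be handed to the LLM on every turn."""
        lines = [f"Overall vision: {self.overall_vision}"]
        for i, part in enumerate(self.parts, start=1):
            status = "segment selected" if part.selected_segment else "no segment yet"
            lines.append(f"Part {i}: {part.title} ({status})")
        return "\n".join(lines)
```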
Furthermore, the agent could do some more work for us under the hood than just generating search query terms; it could actually run the searches for us, look at the results, and decide on further actions. For example, if a search query returns too few or unfitting results, the agent could (without user interaction in between) think of new search queries, run those, etc., until it has found something useful to present to the user.
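A minimal sketch of such a retry loop, with hypothetical helpers standing in for the DMH search API and the LLM call:

```python
# Sketch of an agentic retry loop: run a suggested query, and if the library
# returns too few results, ask the LLM for an alternative query and try again.
# `search_library` and `suggest_alternative_query` are hypothetical helpers.
MIN_RESULTS = 3
MAX_ATTEMPTS = 4


def find_clips(query: str, search_library, suggest_alternative_query) -> list:
    attempts = [query]
    for _ in range(MAX_ATTEMPTS):
        results = search_library(attempts[-1])
        if len(results) >= MIN_RESULTS:
            return results  # good enough to present to the user
        # Not enough hits: let the LLM rephrase, given what was already tried.
        attempts.append(suggest_alternative_query(original=query, tried=attempts))
    return []  # give up and let the user take over
```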
Finally, given that generating text is a key strength of LLMs, we could ask the model to also create a narration script for our compilation video, synthesize that script with a text-to-speech solution like Veritone Voice, and add the resulting narration to the audio track of the final compilation video.
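A rough sketch of the last step of that idea, with a placeholder standing in for the actual text-to-speech call (the Veritone Voice interface is not covered here) and MoviePy again used only to illustrate the audio mixing:

```python
# Very rough sketch: synthesize a narration script with a TTS service and mix
# it under the compilation. `synthesize_speech` is a placeholder, NOT the
# Veritone Voice API; the script text and file names are made up.
from moviepy.editor import AudioFileClip, CompositeAudioClip, VideoFileClip


def synthesize_speech(text: str) -> str:
    """Placeholder: call your TTS provider here and return the audio file path."""
    raise NotImplementedError


narration_script = "From the savanna to the open ocean, the animal kingdom never fails to amaze..."
narration = AudioFileClip(synthesize_speech(narration_script))

video = VideoFileClip("compilation.mp4")
music = video.audio.volumex(0.3)  # duck the existing music track under the narration
video = video.set_audio(CompositeAudioClip([music, narration]))
video.write_videofile("compilation_narrated.mp4", codec="libx264", audio_codec="aac")
```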
Learn more about Digital Media Hub and Veritone’s other AI solutions for media and entertainment. You can also take advantage of our complimentary AI readiness assessment by visiting our custom AI solutions page for your use case.