The rise of multi-modal models like Gemini has ushered in a new era of LLMs that seamlessly integrate video, text, images, and other inputs. As we’ve been using Gemini at Decipher AI for session replay analysis, we’ve picked up several techniques that have significantly improved its performance for our use case.
In this post, we'll walk through those discoveries and share actionable tips to help you get the most out of multi-modal models in your own projects.
Video quality plays a crucial role in a model's ability to interpret visual content accurately. Some models, including Google Gemini, typically extract frames at a rate of 1 frame per second (FPS), while a normal screen recording or movie runs at 24 FPS or more. This works well for slower-paced content, but it can lead to significant loss of detail in fast-moving sequences, and that detail is exactly what the LLM needs to understand what's going on in the video.
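To make the trade-off concrete, here's a quick back-of-the-envelope sketch. The 1 FPS sampling rate comes from the point above; the clip length and recording frame rate are illustrative assumptions, not numbers from our pipeline.

```python
# Rough frame-budget arithmetic for a ~1 FPS sampler.
clip_seconds = 30      # assumed length of the session replay clip
source_fps = 24        # assumed recording frame rate
sampler_fps = 1        # frames the model extracts per second of video

source_frames = clip_seconds * source_fps    # 720 frames recorded
sampled_frames = clip_seconds * sampler_fps  # ~30 frames the model sees

# Any visual event shorter than 1 / sampler_fps seconds (here, 1 second),
# like a brief error toast or loading flicker, can fall entirely between
# two sampled frames and never reach the model at all.
print(f"{sampled_frames}/{source_frames} frames reach the model "
      f"({sampled_frames / source_frames:.1%})")
```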
Our solution? Slow down the videos. This simple adjustment allows the model to capture more frames during critical moments, resulting in more accurate analysis. For instance, when examining a user interaction video showcasing a product page load, reducing the speed can reveal subtle errors like delayed UI element rendering that might otherwise go unnoticed.
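Here's a minimal sketch of that preprocessing step using ffmpeg to re-time the video before sending it to the model. The function name, paths, and the 4x slowdown factor are illustrative assumptions; the idea is simply that stretching each second of source video over four seconds lets a ~1 FPS sampler capture roughly four frames per original second.

```python
import subprocess

def slow_down_video(input_path: str, output_path: str, factor: float = 4.0) -> None:
    """Re-time a video so a ~1 FPS frame sampler captures more detail.

    A factor of 4.0 stretches 1 second of source video into 4 seconds of
    output, so fast UI transitions span more sampled frames.
    """
    subprocess.run(
        [
            "ffmpeg",
            "-i", input_path,
            # setpts rescales each frame's presentation timestamp;
            # multiplying by `factor` slows playback by that factor.
            "-filter:v", f"setpts={factor}*PTS",
            # Drop the audio track; it isn't needed for visual analysis.
            "-an",
            output_path,
        ],
        check=True,
    )

# Hypothetical usage: slow a replay clip to quarter speed before upload.
slow_down_video("session_replay.mp4", "session_replay_slow.mp4", factor=4.0)
```

The trade-off is a longer video (and more frames for the model to process), so it's worth applying the slowdown selectively to the segments where fast-moving detail actually matters.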