Multimodal information refers to data that comes from multiple sensory inputs or modalities. In the human experience, this includes visual, auditory, tactile, olfactory, and gustatory inputs. In the digital realm, multimodal data can include text, images, audio, video, and other forms of structured and unstructured data.
Drawing inspiration from the human brain, AI systems like Mixpeek take a similar approach to multimodal understanding: each modality is handled by its own specialized processing pipeline (distributed processing), and the results are brought back together at retrieval time (holistic retrieval).
Users can configure which features to extract from videos, including transcriptions, visual descriptions, embeddings, scene detection, face recognition, object labeling, and structured JSON output.
These various features are extracted and indexed separately, allowing for efficient processing and storage of diverse data types.
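To make this concrete, here is a minimal sketch of what such a per-video extraction configuration could look like. The class, field, and function names below are assumptions made for illustration; they are not Mixpeek's actual API.

```python
from dataclasses import dataclass, field

# Hypothetical per-video extraction configuration; the field names are
# illustrative only, not Mixpeek's actual parameter names.
@dataclass
class ExtractionConfig:
    transcription: bool = True        # speech-to-text for the audio track
    visual_description: bool = True   # natural-language captions per scene
    embeddings: bool = True           # vector embeddings for similarity search
    scene_detection: bool = True      # split the video into shots/scenes
    face_recognition: bool = False    # identify known faces
    object_labeling: bool = True      # tag detected objects per frame
    output_schema: dict = field(default_factory=dict)  # desired structured JSON fields

def enabled_extractors(config: ExtractionConfig) -> list[str]:
    """Return the extractors to run; each one can then be processed and
    indexed independently of the others."""
    flags = {
        "transcription": config.transcription,
        "visual_description": config.visual_description,
        "embeddings": config.embeddings,
        "scene_detection": config.scene_detection,
        "face_recognition": config.face_recognition,
        "object_labeling": config.object_labeling,
    }
    return [name for name, enabled in flags.items() if enabled]

config = ExtractionConfig(face_recognition=True,
                          output_schema={"title": "string", "topics": "list"})
print(enabled_extractors(config))
```

Keeping each feature as an independent toggle is what allows the separate indexing described above: each extractor writes to its own index, so enabling or re-running one does not require reprocessing the others.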
Users can perform complex queries that span multiple modalities, such as searching for specific spoken phrases combined with visual elements.
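The sketch below illustrates the idea of such a cross-modal query by intersecting a text match on the audio modality with a label match on the visual modality. The toy index layout and function are assumptions for illustration and are not Mixpeek's internal data structures or query API.

```python
# Toy cross-modal query: find scenes whose transcript mentions a phrase AND
# whose visual labels include a given object. The index layout below is a
# simplification for illustration, not Mixpeek's internal structure.
scene_index = [
    {"scene_id": 1, "transcript": "welcome to the product launch", "labels": ["stage", "logo"]},
    {"scene_id": 2, "transcript": "let's look at the quarterly numbers", "labels": ["chart", "presenter"]},
    {"scene_id": 3, "transcript": "the quarterly numbers exceeded forecasts", "labels": ["chart", "audience"]},
]

def cross_modal_search(index, spoken_phrase: str, visual_label: str):
    """Return scenes that satisfy both the spoken-phrase condition (audio
    modality) and the visual-label condition (visual modality)."""
    return [
        scene for scene in index
        if spoken_phrase.lower() in scene["transcript"].lower()
        and visual_label.lower() in (label.lower() for label in scene["labels"])
    ]

# Scenes 2 and 3 match: the phrase is spoken and a chart is on screen.
print(cross_modal_search(scene_index, "quarterly numbers", "chart"))
```

In a production system the per-modality matches would come from the separate indexes built during extraction (text search over transcripts, vector similarity over embeddings, label filters over detected objects), with the final ranking combining scores across modalities.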