AnyModal is a modular and extensible framework for integrating diverse input modalities (e.g., images, audio) into large language models (LLMs). It enables seamless tokenization, encoding, and language generation using pre-trained models for various modalities.
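For a sense of how these pieces fit together, here is a minimal sketch: a pre-trained encoder extracts features from the non-text input, a small projector maps those features into the LLM's embedding space, and the LLM generates text conditioned on the projected features. The model choices and the `Projector` module below are illustrative assumptions, not the framework's actual API; the real wrapper class and its arguments are defined in anymodal.py and used in the demos.

```python
# Illustrative sketch of the wiring AnyModal is organised around; the actual
# wrapper class and its arguments live in anymodal.py and the demos.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer, ViTImageProcessor, ViTModel

# Pre-trained encoder for the extra modality (images here) and a pre-trained LLM.
image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224")
llm_tokenizer = AutoTokenizer.from_pretrained("gpt2")
llm = AutoModelForCausalLM.from_pretrained("gpt2")

# A projector maps encoder features into the LLM's embedding space so the
# projected "tokens" can be prepended to the text embeddings for generation.
class Projector(nn.Module):
    def __init__(self, encoder_dim: int, llm_embed_dim: int):
        super().__init__()
        self.proj = nn.Linear(encoder_dim, llm_embed_dim)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, num_patches, encoder_dim) -> (batch, num_patches, llm_embed_dim)
        return self.proj(features)

projector = Projector(
    encoder_dim=image_encoder.config.hidden_size,  # 768 for ViT-Base
    llm_embed_dim=llm.config.hidden_size,          # 768 for GPT-2
)
```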
The best way to get started with AnyModal is to read through the steps below and then explore the examples in the demos directory. Also, check out the anymodal.py file to understand the framework's core components.
Furthermore, you can modify the core components of AnyModal to suit your needs. Consider implementing functionality such as saving and loading models, or pushing the trained projectors and LoRA adapters to the Hugging Face Hub, as sketched below.
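As a starting point, a save/load and Hub-upload routine might look like the following. The `projector` here is a stand-in for your trained projection module, the file names and repo id are placeholders, and the upload uses `huggingface_hub`'s `HfApi.upload_file`.

```python
# Hypothetical sketch: the projector is a stand-in for your trained projection
# module, and the repo id below is a placeholder.
import torch
import torch.nn as nn
from huggingface_hub import HfApi

projector = nn.Linear(768, 768)  # stand-in for the trained projector

# Save the projector weights locally and reload them later.
torch.save(projector.state_dict(), "projector.pt")
projector.load_state_dict(torch.load("projector.pt", map_location="cpu"))

# A LoRA adapter trained with peft can be saved alongside it, e.g.
# lora_model.save_pretrained("lora_adapter")

# Push the saved file to the Hugging Face Hub (requires `huggingface-cli login`
# and an existing repo).
api = HfApi()
api.upload_file(
    path_or_fileobj="projector.pt",
    path_in_repo="projector.pt",
    repo_id="your-username/anymodal-projector",  # placeholder repo id
)
```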
Note that the demos are still a work in progress, and there is room for improvement. Have ideas for other AnyModal demos? Feel free to suggest them!
Contributions are highly welcome! Whether it's fixing bugs, improving documentation, or adding support for new input modalities, your help is appreciated. Here's how you can contribute: