# CLIP Prefix Caption - Conceptual Captions Model
An image captioning model based on CLIP and GPT-2, trained on the Conceptual Captions dataset.
## Model Details
- Model Type: CLIP Prefix Captioning
- Architecture: CLIP Vision Encoder + MLP Mapping Network + GPT-2 Text Decoder (see the sketch below)
- Dataset: Conceptual Captions
- Prefix Length: 10 tokens
- CLIP Model: ViT-B/32
- GPT-2 Model: gpt2
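
The pieces above fit together roughly as follows. This is a minimal PyTorch sketch of the prefix-mapping idea, assuming Hugging Face `transformers` and a simple two-layer MLP; the exact layer names and hidden sizes in the released checkpoint may differ.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel


class MLPMapper(nn.Module):
    """Maps one CLIP image embedding to a sequence of GPT-2 prefix embeddings."""

    def __init__(self, clip_dim=512, gpt_dim=768, prefix_length=10):
        super().__init__()
        self.prefix_length = prefix_length
        self.gpt_dim = gpt_dim
        # Illustrative two-layer MLP; the checkpoint's actual layer sizes may differ.
        hidden = (gpt_dim * prefix_length) // 2
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, gpt_dim * prefix_length),
        )

    def forward(self, clip_embedding):  # (batch, clip_dim)
        prefix = self.mlp(clip_embedding)  # (batch, gpt_dim * prefix_length)
        return prefix.view(-1, self.prefix_length, self.gpt_dim)


class ClipCaptionModel(nn.Module):
    """GPT-2 conditioned on a 10-token prefix produced from a CLIP ViT-B/32 embedding."""

    def __init__(self, prefix_length=10, clip_dim=512):
        super().__init__()
        self.gpt = GPT2LMHeadModel.from_pretrained("gpt2")
        gpt_dim = self.gpt.transformer.wte.weight.shape[1]  # 768 for "gpt2"
        self.mapper = MLPMapper(clip_dim, gpt_dim, prefix_length)

    def forward(self, clip_embedding, token_ids):
        prefix_embeds = self.mapper(clip_embedding)                      # (B, 10, 768)
        token_embeds = self.gpt.transformer.wte(token_ids)               # (B, T, 768)
        inputs_embeds = torch.cat([prefix_embeds, token_embeds], dim=1)  # (B, 10+T, 768)
        return self.gpt(inputs_embeds=inputs_embeds)
```

During training, only the caption tokens contribute to the language-modeling loss; the prefix simply conditions GPT-2 on the image.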
## Usage
See the test notebook for usage examples.
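
For orientation, the following is a hedged inference sketch. It assumes OpenAI's `clip` package, Hugging Face `transformers`, the `ClipCaptionModel` class sketched under Model Details, and a hypothetical image path `example.jpg`. If the checkpoint's parameter names do not match this sketch, use the model class from the test notebook (or the original ClipCap repository) instead.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image
from transformers import GPT2Tokenizer

# ClipCaptionModel (and its MLPMapper) are defined in the sketch above.
device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

model = ClipCaptionModel(prefix_length=10)
# Parameter names must match the module definition that produced the checkpoint.
model.load_state_dict(torch.load("model.pt", map_location=device))
model.to(device).eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    clip_embedding = clip_model.encode_image(image).float()
    prefix_embeds = model.mapper(clip_embedding)

    # Greedy decoding: repeatedly append the most likely next token.
    generated = prefix_embeds
    tokens = []
    for _ in range(40):
        logits = model.gpt(inputs_embeds=generated).logits[:, -1, :]
        next_token = logits.argmax(dim=-1)
        if next_token.item() == tokenizer.eos_token_id:
            break
        tokens.append(next_token.item())
        next_embed = model.gpt.transformer.wte(next_token).unsqueeze(1)
        generated = torch.cat([generated, next_embed], dim=1)

print(tokenizer.decode(tokens, skip_special_tokens=True))
```

Greedy decoding is used here for brevity; beam search usually gives slightly better captions.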
## Files
- `model.pt`: model checkpoint (a raw `state_dict`)
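
Because `model.pt` is a raw `state_dict` rather than a pickled module, it can be inspected directly with `torch.load` before attaching it to a model, for example:

```python
import torch

# The checkpoint must be loaded into a module with matching parameter names
# (e.g. the ClipCaptionModel sketch in Model Details, or the original ClipCap class).
state_dict = torch.load("model.pt", map_location="cpu")
print(f"{len(state_dict)} tensors")
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))
```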
## Citation
If you use this model, please cite:
```bibtex
@article{mokady2021clipcap,
  title={ClipCap: CLIP Prefix for Image Captioning},
  author={Mokady, Ron and Hertz, Amir and Bermano, Amit H},
  journal={arXiv preprint arXiv:2111.09734},
  year={2021}
}
```