Model Card - CLIP Zero-Shot
Overview
The CLIP (ViT-B/32) model is used off-the-shelf for zero-shot vibe matching.
It maps user-entered movie-review text and outfit images into a shared embedding space and ranks outfits by cosine similarity (vibe alignment).
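The snippet below is a minimal sketch of this ranking step using Hugging Face Transformers. The review text and image paths are illustrative placeholders, not files shipped with the repository.

```python
# Minimal sketch of zero-shot vibe ranking with CLIP (ViT-B/32).
# The review text and image paths below are illustrative placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

review = "A moody, rain-soaked neo-noir with razor-sharp tailoring."                 # placeholder
outfits = [Image.open(p).convert("RGB") for p in ["outfit_1.jpg", "outfit_2.jpg"]]   # placeholders

inputs = processor(text=[review], images=outfits, return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])

# L2-normalise the 512-D embeddings, then rank outfits by cosine similarity to the review.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
scores = (image_emb @ text_emb.T).squeeze(-1)   # one score per outfit
ranking = scores.argsort(descending=True)
print(ranking.tolist(), scores[ranking].tolist())
```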
Model Details
| Field | Description |
|---|---|
| Developed by | Bareethul Kader & Nada Khan |
| Framework | Hugging Face Transformers |
| Base Model | openai/clip-vit-base-patch32 |
| Repository | bareethulk/Forma |
| License | MIT (OpenAI CLIP) |
Intended Use
Direct Use
- Zero-shot text–image matching for outfit recommendations.
- Core engine of the Gradio demo app (a minimal wrapper is sketched below).
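The sketch below shows one way the ranker could be wrapped in a Gradio interface. It is an illustrative assumption about the app's structure, not the shipped Forma code, and the widget labels are placeholders.

```python
# Illustrative Gradio wrapper around the zero-shot ranker (not the shipped Forma app).
import gradio as gr
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_outfits(review_text, image_files):
    # Gradio may pass file objects or plain paths depending on version.
    paths = [f if isinstance(f, str) else f.name for f in image_files]
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(text=[review_text], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        t = model.get_text_features(input_ids=inputs["input_ids"],
                                    attention_mask=inputs["attention_mask"])
        v = model.get_image_features(pixel_values=inputs["pixel_values"])
    t = t / t.norm(dim=-1, keepdim=True)
    v = v / v.norm(dim=-1, keepdim=True)
    scores = (v @ t.T).squeeze(-1)
    order = scores.argsort(descending=True).tolist()
    # Gallery accepts (image, caption) pairs; the caption shows the similarity.
    return [(images[i], f"cosine similarity: {scores[i].item():.3f}") for i in order]

demo = gr.Interface(
    fn=rank_outfits,
    inputs=[gr.Textbox(label="Movie review"),
            gr.File(file_count="multiple", file_types=["image"], label="Outfit images")],
    outputs=gr.Gallery(label="Outfits ranked by vibe alignment"),
)

if __name__ == "__main__":
    demo.launch()
```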
Out-of-Scope Use
- Not intended for fine-grained, style-specific retrieval: the model is not fine-tuned for specific fashion styles.
- Not suitable for bias-sensitive applications: it may inherit biases from large-scale web training data.
Dataset
Evaluated on nadakandrew/closet_multimodal_v1, which provides paired image–text inputs for vibe ranking.
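Assuming the dataset is hosted on the Hugging Face Hub under that ID, it can be inspected with the `datasets` library; the splits and column names are not documented here, so the snippet only prints what is available.

```python
# Sketch: inspect the evaluation dataset (assumes it is a Hugging Face Hub dataset).
from datasets import load_dataset

ds = load_dataset("nadakandrew/closet_multimodal_v1")
print(ds)   # shows the available splits, column names, and sizes
```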
Evaluation Setup
- Mode: Zero-shot classification + ranking
- Metric Space: Cosine similarity (512-D)
- Results:
- Accuracy: 91 %
- Precision@5: 1.00
- NDCG@5: 0.96
- MRR: 0.95
Interpretation: CLIP outperforms the trained ResNet18 baseline (48 % accuracy) by a large margin, highlighting the power of pre-trained vision–language models for vibe alignment.
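For reference, the ranking metrics above can be computed as sketched below from a per-query relevance list ordered by descending cosine similarity; the relevance labels in the example are illustrative, not the actual evaluation data.

```python
# Sketch of the ranking metrics reported above, computed from a per-query list
# of binary relevance labels ordered by descending cosine similarity.
import math

def precision_at_k(relevance, k=5):
    return sum(relevance[:k]) / k

def ndcg_at_k(relevance, k=5):
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevance[:k]))
    ideal = sorted(relevance, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

def mrr(relevance):
    for i, rel in enumerate(relevance):
        if rel:
            return 1.0 / (i + 1)
    return 0.0

# One query: relevance of the outfits after sorting by similarity (illustrative values).
ranked_relevance = [1, 1, 0, 1, 0, 0]
print(precision_at_k(ranked_relevance), ndcg_at_k(ranked_relevance), mrr(ranked_relevance))
```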
Limitations / Ethical Notes
- May reproduce biases from web data.
- Does not capture deep emotional context behind reviews.
- Research / educational use only.