Model Card - CLIP Zero-Shot

Overview

The CLIP (ViT-B/32) model is used off-the-shelf for zero-shot vibe matching.
It maps user-entered movie-review text and outfit images into a shared embedding space and ranks outfits by cosine similarity (vibe alignment).
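
The matching pipeline can be sketched with the Transformers API as below. This is a minimal illustration rather than the app's actual code: the function name `rank_outfits` and the file-path inputs are assumptions.

```python
from PIL import Image
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

# Off-the-shelf checkpoint; no fine-tuning is applied.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def rank_outfits(review_text, image_paths):
    """Rank outfit images by cosine similarity to the review text's vibe."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    text_in = processor(text=[review_text], return_tensors="pt",
                        padding=True, truncation=True)
    image_in = processor(images=images, return_tensors="pt")

    # Project both modalities into the shared 512-D CLIP embedding space.
    text_emb = F.normalize(model.get_text_features(**text_in), dim=-1)
    image_emb = F.normalize(model.get_image_features(**image_in), dim=-1)

    # Cosine similarity of unit vectors = dot product; higher = closer vibe.
    scores = (image_emb @ text_emb.T).squeeze(-1)
    order = scores.argsort(descending=True).tolist()
    return [(image_paths[i], scores[i].item()) for i in order]
```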


Model Details

  • Developed by: Bareethul Kader & Nada Khan
  • Framework: Hugging Face Transformers
  • Base Model: openai/clip-vit-base-patch32
  • Repository: bareethulk/Forma
  • License: MIT (OpenAI CLIP)

Intended Use

Direct Use

  • Zero-shot text–image matching for outfit recommendations.
  • Core engine of the Gradio demo app (see the sketch after this list).
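
A hypothetical wiring of the ranking function from the Overview sketch into a Gradio interface is shown below; the component choices and the `recommend` helper are illustrative, and the real demo app may be organized differently.

```python
import gradio as gr

def recommend(review_text, files):
    # Depending on the Gradio version, uploaded files arrive either as
    # path strings or as objects exposing a .name attribute.
    paths = [f if isinstance(f, str) else f.name for f in files]
    ranked = rank_outfits(review_text, paths)  # sketched in the Overview above
    return {path: round(score, 3) for path, score in ranked}

demo = gr.Interface(
    fn=recommend,
    inputs=[
        gr.Textbox(label="Movie-review text"),
        gr.File(label="Outfit images", file_count="multiple", file_types=["image"]),
    ],
    outputs=gr.JSON(label="Outfits ranked by vibe similarity"),
)

if __name__ == "__main__":
    demo.launch()
```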

Out-of-Scope Use

  • Fine-grained styling tasks: the model is not fine-tuned for specific fashion styles.
  • Bias-sensitive applications: the model may inherit biases from the large-scale web data it was pre-trained on.

Dataset

Evaluation uses nadakandrew/closet_multimodal_v1, which provides paired image–text inputs for vibe ranking.
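
The split names and column layout of the dataset are not documented here, so the quick inspection below treats them as assumptions to verify.

```python
from datasets import load_dataset

# Load the evaluation data from the Hub and inspect its structure;
# split names and column names are not listed in this card.
ds = load_dataset("nadakandrew/closet_multimodal_v1")
print(ds)  # DatasetDict: available splits, features, and row counts
```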


Evaluation Setup

  • Mode: Zero-shot classification + ranking
  • Metric Space: Cosine similarity (512-D)
  • Results (the metric definitions are sketched after this list):
    • Accuracy: 91%
    • Precision@5: 1.00
    • NDCG@5: 0.96
    • MRR: 0.95
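
The numbers above come from the project's own evaluation; the sketch below only illustrates how Precision@5, NDCG@5, and MRR can be computed per query from binary relevance labels ordered by cosine similarity.

```python
import math

def precision_at_k(relevance, k=5):
    """Fraction of the top-k ranked items that are relevant (binary labels)."""
    return sum(relevance[:k]) / k

def ndcg_at_k(relevance, k=5):
    """Normalized discounted cumulative gain over the top-k ranked items."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevance[:k]))
    ideal = sorted(relevance, reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def mrr(relevance):
    """Reciprocal rank of the first relevant item in the ranking."""
    for i, rel in enumerate(relevance):
        if rel:
            return 1.0 / (i + 1)
    return 0.0

# Example: a ranking where the 1st and 3rd outfits are relevant to the vibe.
ranking = [1, 0, 1, 0, 0]
print(precision_at_k(ranking), ndcg_at_k(ranking), mrr(ranking))
```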

Interpretation: Zero-shot CLIP outperforms the trained ResNet18 baseline (48% accuracy) by a large margin, highlighting the power of pre-trained vision–language models for vibe alignment.


Limitations / Ethical Notes

  • May reproduce biases from web data.
  • Does not capture deep emotional context behind reviews.
  • Research / educational use only.