---
language: en
license: mit
tags:
- clip
- multimodal
- contrastive-learning
- cultural-heritage
- reevaluate
- information-retrieval
datasets:
- xuemduan/reevaluate-image-text-pairs
model-index:
- name: REEVALUATE CLIP Fine-tuned Models
  results:
  - task:
      type: image-text-retrieval
      name: Image-Text Retrieval
    dataset:
      name: Cultural Heritage Hybrid Dataset
      type: xuemduan/reevaluate-image-text-pairs
    metrics:
    - name: I2T R@1
      type: recall@1
      value: <TOBE_FILL_IN>
    - name: I2T R@5
      type: recall@5
      value: <TOBE_FILL_IN>
    - name: T2I R@1
      type: recall@1
      value: <TOBE_FILL_IN>
---
# Domain-Adaptive CLIP for Multimodal Retrieval

This repository hosts the fine-tuned CLIP (ViT-L/14) model used in Knowledge-Enhanced Multimodal Retrieval.
## 📦 Available Models

| Model | Description | Data Type |
|---|---|---|
| `reevaluate-clip` | Fine-tuned on images, query texts, and description texts | Image+Text |
## 🧾 Dataset

The models were trained and evaluated on the REEVALUATE Image-Text Pair Dataset, which contains 43,500 image–text pairs derived from Wikidata and Pilot Museums. Each artefact is described by:

- **Image**: the artefact image
- **Description text**: a BLIP-generated natural-language portion plus a metadata portion
- **Query text**: user query-like text

Dataset: [xuemduan/reevaluate-image-text-pairs](https://huggingface.co/datasets/xuemduan/reevaluate-image-text-pairs)
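
A minimal loading sketch with 🤗 Datasets, assuming the dataset is publicly accessible on the Hub; the split name and the per-example column names are printed rather than assumed, since this card does not specify them:

```python
from datasets import load_dataset

# Load the REEVALUATE image-text pair dataset from the Hugging Face Hub.
ds = load_dataset("xuemduan/reevaluate-image-text-pairs")

# Inspect the available splits and the columns of one example
# (e.g. image, description text, query text fields).
print(ds)
first_split = next(iter(ds))
print(ds[first_split][0].keys())
```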
## 🚀 Usage
```python
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("xuemduan/reevaluate-clip")
processor = CLIPProcessor.from_pretrained("xuemduan/reevaluate-clip")

image = Image.open("artefact.jpg")
text = "yellow flower paintings"

# Encode the image and the text with the fine-tuned CLIP encoders
with torch.no_grad():
    image_embeds = model.get_image_features(**processor(images=image, return_tensors="pt"))
    text_embeds = model.get_text_features(**processor(text=[text], return_tensors="pt"))

# Normalise the embeddings so the dot product equals cosine similarity
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)

similarity = image_embeds @ text_embeds.T
print(similarity)
```
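
For retrieval over more than one candidate, the same embeddings can be ranked by cosine similarity. The sketch below scores a single text query against a small gallery of images (text-to-image retrieval); the image file names are placeholders, not files shipped with this repository:

```python
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("xuemduan/reevaluate-clip")
processor = CLIPProcessor.from_pretrained("xuemduan/reevaluate-clip")

# Placeholder gallery paths; replace with your own artefact images.
image_paths = ["artefact_1.jpg", "artefact_2.jpg", "artefact_3.jpg"]
images = [Image.open(p) for p in image_paths]
query = "yellow flower paintings"

with torch.no_grad():
    image_embeds = model.get_image_features(**processor(images=images, return_tensors="pt"))
    text_embeds = model.get_text_features(**processor(text=[query], padding=True, return_tensors="pt"))

# Unit-normalise so the dot product is cosine similarity.
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)

# Rank the gallery images for the query.
scores = (text_embeds @ image_embeds.T).squeeze(0)
ranking = scores.argsort(descending=True)
for rank, idx in enumerate(ranking.tolist(), start=1):
    print(f"{rank}. {image_paths[idx]} (score={scores[idx]:.4f})")
```

Swapping the roles (one image embedding scored against many text embeddings) gives image-to-text retrieval in the same way.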