MIthun667/Vision-Language-Model — reverse-engineered prompt


Build me a Jupyter notebook project for meme classification that uses both the image and the text from each meme. I want it to use a Vision Transformer for the image side and a BERT-style encoder for the text side, then combine the two with a linear self-attentive fusion layer so it can learn how the words and the picture relate.
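To make the fusion idea concrete, here is a minimal, framework-free sketch of one way to read "linear self-attentive fusion": a single linear scorer rates each modality embedding, a softmax turns the two scores into attention weights, and the fused vector is the weighted sum. This reading is an assumption, as are the names `softmax`, `self_attentive_fusion`, `weights`, and `bias`. In the notebook the same arithmetic would run on batched tensors with learned parameters.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attentive_fusion(
    image_vec: Sequence[float],
    text_vec: Sequence[float],
    weights: Sequence[float],
    bias: float = 0.0,
) -> Tuple[List[float], List[float]]:
    """Fuse two same-sized embeddings with a linear attention scorer.

    Assumption: both embeddings were already projected to the same
    dimension, and `weights`/`bias` stand in for learned parameters.
    """
    if not (len(image_vec) == len(text_vec) == len(weights)):
        raise ValueError("image_vec, text_vec and weights must share one length")

    modalities = [image_vec, text_vec]
    # One scalar score per modality: score = w . h + b
    scores = [sum(w * x for w, x in zip(weights, vec)) + bias for vec in modalities]
    attn = softmax(scores)
    # Weighted sum of the modality vectors, computed per dimension.
    fused = [
        sum(a * vec[i] for a, vec in zip(attn, modalities))
        for i in range(len(weights))
    ]
    return fused, attn
```

Returning the attention weights alongside the fused vector makes it easy to chart how much each meme leans on the image versus the caption.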

The notebook should train and evaluate on multi-task labels such as sentiment, sarcasm, offensiveness, and prejudice. Please include clear data loading steps for meme images and captions, model training, validation, testing, saved metrics, and simple charts so I can understand how well it worked.
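As a rough illustration of the "saved metrics" part, the sketch below computes accuracy and macro-F1 separately for each task and serializes the result as JSON. The helper names (`macro_f1`, `evaluate_tasks`, `metrics_to_json`) and the dictionary layout are hypothetical. A real notebook would more likely rely on a metrics library, but the plain-Python version shows exactly what is being measured.

```python
import json
from typing import Dict, Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over every class seen in either list."""
    labels = sorted(set(y_true) | set(y_pred), key=str)
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores) if f1_scores else 0.0


def evaluate_tasks(
    y_true_by_task: Dict[str, Sequence[Hashable]],
    y_pred_by_task: Dict[str, Sequence[Hashable]],
) -> Dict[str, Dict[str, float]]:
    """Return accuracy and macro-F1 for each task (e.g. 'sentiment', 'sarcasm')."""
    results = {}
    for task, y_true in y_true_by_task.items():
        y_pred = y_pred_by_task[task]
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
        results[task] = {
            "accuracy": correct / len(y_true) if len(y_true) else 0.0,
            "macro_f1": macro_f1(y_true, y_pred),
        }
    return results


def metrics_to_json(metrics: Dict[str, Dict[str, float]]) -> str:
    """Serialize metrics so they can be written to a file and charted later."""
    return json.dumps(metrics, indent=2, sort_keys=True)
```

Writing the returned string to a file after each run keeps one record per experiment, which the charts can then read back.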

Also add comparison models such as image-only, text-only, and a few simple multimodal baselines, plus an ablation section where parts of the model can be turned off to compare results. Keep the code readable and organized, with comments explaining what each section does. Look up current docs online if you need to.
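One possible way to organize the baselines and ablations is a small list of experiment configurations with on/off switches, sketched below. The class `AblationConfig`, its field names, the variant names, and the choice of plain concatenation as the simple multimodal baseline are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AblationConfig:
    """Switches for one experiment run (hypothetical field names)."""
    name: str
    use_image: bool = True             # include the Vision Transformer branch
    use_text: bool = True              # include the BERT-style text branch
    use_attention_fusion: bool = True  # False means fall back to a simpler combination

    def __post_init__(self) -> None:
        if not (self.use_image or self.use_text):
            raise ValueError("at least one modality must be enabled")
        if self.use_attention_fusion and not (self.use_image and self.use_text):
            raise ValueError("attention fusion needs both modalities")


def build_experiment_grid() -> List[AblationConfig]:
    """Full model, single-modality baselines, and a concatenation baseline."""
    return [
        AblationConfig("full_model"),
        AblationConfig("image_only", use_text=False, use_attention_fusion=False),
        AblationConfig("text_only", use_image=False, use_attention_fusion=False),
        AblationConfig("concat_baseline", use_attention_fusion=False),
    ]
```

Looping over `build_experiment_grid()` and training one model per entry gives every variant the same data splits and metrics, so the comparison table falls out of a single loop.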
