Vis-IR: Unifying Search With Visualized Information Retrieval

News | Release Plan | Overview | License | Citation

News

2025-04-06 🚀🚀 The MVRB dataset is released on Hugging Face: MVRB

2025-04-02 🚀🚀 The VIRA dataset is released on Hugging Face: VIRA

2025-04-01 🚀🚀 The UniSE models are released on Hugging Face: UniSE-MLLM

2025-02-17 🎉🎉 Our paper is released: Any Information Is Just Worth One Single Screenshot: Unifying Search With Visualized Information Retrieval.

Release Plan

Overview

In this work, we formally define an emerging IR paradigm called Visualized Information Retrieval, or Vis-IR, in which multimodal information, such as text, images, tables, and charts, is jointly represented in a unified visual format, the screenshot, for use in a variety of retrieval applications. We make three key contributions to Vis-IR. First, we create VIRA (Vis-IR Aggregation), a large-scale dataset comprising a vast collection of screenshots from diverse sources, carefully curated into captioned and question-answer formats. Second, we develop UniSE (Universal Screenshot Embeddings), a family of retrieval models that enable screenshots to query or be queried across arbitrary data modalities. Finally, we construct MVRB (Massive Visualized IR Benchmark), a comprehensive benchmark covering a variety of task forms and application scenarios. Through extensive evaluations on MVRB, we highlight the deficiencies of existing multimodal retrievers and the substantial improvements made by UniSE.

Model Usage

Our code works well on transformers==4.45.2, and we recommend using this version.
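Because the code is tested against a specific transformers release, it can help to check the installed version before loading the model. The following is a minimal sketch using only the Python standard library; the helper name `version_status` is illustrative and not part of this repository.

```python
from importlib.metadata import PackageNotFoundError, version

RECOMMENDED = "4.45.2"  # the transformers version recommended in this README


def version_status(package, expected):
    """Return 'missing', 'match', or 'mismatch' for an installed package."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return "missing"
    return "match" if installed == expected else "mismatch"


status = version_status("transformers", RECOMMENDED)
if status != "match":
    print(f"transformers is {status}; this README recommends {RECOMMENDED}")
```

A mismatch does not necessarily mean the code will fail; it only signals that the environment differs from the one the authors recommend.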

1. UniSE-MLLM Models

import torch
from transformers import AutoModel

MODEL_NAME = "marsh123/UniSE-MLLM"
# You must set trust_remote_code=True to load this model.
model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)
model.set_processor(MODEL_NAME)

# Move the model to the GPU and switch to inference mode.
device = torch.device("cuda:0")
model = model.to(device)
model.eval()

with torch.no_grad():
    # Queries: each screenshot is paired with a text query and a task instruction.
    query_inputs = model.data_process(
        images=["./assets/query_1.png", "./assets/query_2.png"],
        text=["After a 17% drop, what is Nvidia's closing stock price?",
              "I would like to see a detailed and intuitive performance comparison between the two models."],
        q_or_c="query",
        task_instruction="Represent the given image with the given query."
    )
    # Candidates: screenshots only, one positive and one negative per query.
    candidate_inputs = model.data_process(
        images=["./assets/positive_1.jpeg", "./assets/neg_1.jpeg",
                "./assets/positive_2.jpeg", "./assets/neg_2.jpeg"],
        q_or_c="candidate"
    )
    query_embeddings = model(**query_inputs)
    candidate_embeddings = model(**candidate_inputs)
    # One row per query, one column per candidate.
    scores = torch.matmul(query_embeddings, candidate_embeddings.T)
    print(scores)
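
The `scores` matrix above has one row per query and one column per candidate. The following is a minimal sketch of turning such a matrix into a ranking, written with plain Python lists (for a tensor, `scores.tolist()` yields the same nested shape). The helper name `rank_candidates` and the numeric values are illustrative only and are not part of the UniSE code or its actual output.

```python
def rank_candidates(scores, top_k=1):
    """For each query row, return the indices of the top_k highest-scoring candidates."""
    rankings = []
    for row in scores:
        # Sort candidate indices by descending score.
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        rankings.append(order[:top_k])
    return rankings


# Made-up values, shaped like 2 queries x 4 candidates.
example_scores = [
    [0.71, 0.22, 0.15, 0.30],
    [0.18, 0.25, 0.64, 0.41],
]
top1 = rank_candidates(example_scores, top_k=1)
print(top1)
```

With the candidate order used in the usage example, a correct retrieval would rank index 0 (`positive_1.jpeg`) first for the first query and index 2 (`positive_2.jpeg`) first for the second.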

Performance on MVRB

MVRB is a comprehensive benchmark designed for retrieval tasks centered on screenshots. It includes four meta tasks: Screenshot Retrieval (SR), Composed Screenshot Retrieval (CSR), Screenshot QA (SQA), and Open-Vocabulary Classification (OVC). We evaluate three main types of retrievers on MVRB: OCR+Text Retrievers, General Multimodal Retrievers, and Screenshot Document Retrievers. Our proposed UniSE-MLLM achieves state-of-the-art (SOTA) performance on this benchmark.
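
Retrieval quality on tasks like these is often summarized with a top-k hit rate such as Recall@k. The sketch below is a generic illustration of that metric, assuming one positive candidate per query. It is not MVRB's official evaluation script, and the function name and toy data are hypothetical.

```python
def recall_at_k(ranked_ids, positive_ids, k=1):
    """Fraction of queries whose positive candidate appears in the top-k of its ranking."""
    if len(ranked_ids) != len(positive_ids):
        raise ValueError("need one ranking and one positive id per query")
    if not ranked_ids:
        return 0.0
    hits = sum(
        1 for ranking, pos in zip(ranked_ids, positive_ids) if pos in ranking[:k]
    )
    return hits / len(ranked_ids)


# Toy data: two queries, each with a full ranking over four candidates.
toy_rankings = [[0, 3, 1, 2], [3, 2, 1, 0]]
toy_positives = [0, 2]
print(recall_at_k(toy_rankings, toy_positives, k=1))
print(recall_at_k(toy_rankings, toy_positives, k=2))
```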

License

Vis-IR is licensed under the MIT License.

Citation

If you find this repository useful, please consider giving it a star ⭐ and citing our paper:

@article{liu2025any,
  title={Any Information Is Just Worth One Single Screenshot: Unifying Search With Visualized Information Retrieval},
  author={Liu, Ze and Liang, Zhengyang and Zhou, Junjie and Liu, Zheng and Lian, Defu},
  journal={arXiv preprint arXiv:2502.11431},
  year={2025}
}
