May 25, 2025
What We Learned at ICLR2025
We've collected some of the most interesting papers from ICLR 2025, featuring TIPS, FlexPrefill, zero-shot re-rankers, SVD-LLM, Hymba, and more.
ICLR 2025 is one of the largest and most influential machine learning conferences in the world, standing alongside NeurIPS and ICML as the three premier venues for high-impact AI research. This year marked a historic milestone as ICLR was held in Asia for the first time, taking place at the Singapore EXPO from April 24-28. The timing couldn't have been more perfect: just months after the "DeepSeek moment" in late January 2025 that sent shockwaves through Silicon Valley and demonstrated China's rapidly advancing AI research. Combined with the new China-Singapore 30-day mutual visa exemption agreement that took effect in February 2024, we witnessed an unprecedented surge in Chinese participation at the conference.
This year, our team was excited to make the trip to Singapore, with Sedigheh Eslami, Andreas Koukounas, Wang Feng, and CEO Han Xiao presenting three papers that showcase our latest work on jina-clip-v2 and ReaderLM-v2 for better search. While the rest of the AI world seems locked in an arms race for bigger and bigger models, we decided to swim against the tide, proving that smaller, smarter models can punch well above their weight when you get the design right.
So grab your coffee, get comfortable, and let's explore some of the ICLR research we found interesting, beginning with our own take on why small can be mighty.
Mitigate the Gap: Improving Cross-Modal Alignment in CLIP
CLIP models excel at image-text tasks but suffer from a "modality gap"—image and text embeddings cluster in separate regions, limiting performance. This work, led by our intern Sedigheh Eslami during her PhD at Hasso Plattner Institute, tackles this fundamental issue.
We discovered that simple vector translation breaks embedding structure. Instead, AlignCLIP uses shared encoder parameters with semantically-regularized separation objectives. This dual approach successfully reduces the modality gap while improving performance across zero-shot and fine-tuning tasks.
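To make the "gap" concrete, here is a minimal sketch of the metric most modality-gap studies report: the Euclidean distance between the centroids of L2-normalized image and text embeddings. The function name and toy data below are ours, not from the paper.

```python
import torch
import torch.nn.functional as F

def modality_gap(image_emb: torch.Tensor, text_emb: torch.Tensor) -> float:
    """Euclidean distance between the centroids of L2-normalized image and
    text embeddings, a common proxy for the modality gap."""
    img = F.normalize(image_emb, dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    return (img.mean(dim=0) - txt.mean(dim=0)).norm().item()

# Toy check: two well-separated clusters produce a large gap value.
print(modality_gap(torch.randn(1000, 512) + 1.0, torch.randn(1000, 512) - 1.0))
```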
Takeaways:
Modality gap is a critical CLIP performance bottleneck
jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images
This is the paper behind jina-clip-v2, a multilingual multimodal embedding model that supports both text-only and crossmodal tasks using a multi-task, multi-stage contrastive learning approach. The model combines a text encoder (Jina XLM-RoBERTa, 561M parameters) and a vision encoder (EVA02-L14, 304M parameters) for 865M total parameters. We train on multilingual texts from 29 non-English languages and visually-rich documents, employing Matryoshka Representation Learning for flexible embedding dimensionality.
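Matryoshka Representation Learning is what makes the dimensionality flexible at inference time: you keep the leading dimensions and re-normalize. A minimal sketch, where the random array stands in for real jina-clip-v2 outputs:

```python
import numpy as np

def truncate_embeddings(emb: np.ndarray, dim: int = 256) -> np.ndarray:
    """Keep the first `dim` Matryoshka dimensions and re-normalize so that
    cosine similarity still behaves on the shortened vectors."""
    cut = emb[:, :dim]
    return cut / np.linalg.norm(cut, axis=1, keepdims=True)

full = np.random.randn(8, 1024)             # stand-in for 1024-d jina-clip-v2 vectors
small = truncate_embeddings(full, dim=256)  # roughly 4x smaller index
```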
Takeaways:
Mixing image-text and text-text data in single batches with shared temperature parameters performs worse than separate training due to modality information asymmetry.
Training for crossmodal alignment inherently compromises pure text embedding quality, showing a fundamental trade-off.
Cutting embeddings from 1,024 to 256 dimensions causes less than 1% performance loss, revealing massive inefficiency in high-dimensional representations.
ReaderLM-V2: Small Language Model for HTML to Markdown and JSON
This is the paper behind ReaderLM-v2, a compact 1.5-billion-parameter language model designed for efficient web content extraction. The model processes documents up to 512K tokens, transforming messy HTML into clean Markdown or JSON. Our approach pairs a three-stage data synthesis pipeline (DRAFT-REFINE-CRITIQUE), which generates high-quality training data through iterative refinement, with a unified training framework spanning continuous pre-training, supervised fine-tuning, direct preference optimization, and self-play iterative tuning. ReaderLM-v2 outperforms GPT-4o and other larger models by 15-20% on benchmarks, particularly excelling at documents exceeding 100K tokens, while maintaining significantly lower computational requirements.
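If you'd rather try the model than read the paper, a minimal sketch with Hugging Face transformers looks roughly like this. We assume the published jinaai/ReaderLM-v2 checkpoint and its built-in chat template; the prompt wording here is illustrative, so check the model card for the recommended one.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jinaai/ReaderLM-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

html = "<html><body><h1>Hello</h1><p>ICLR 2025 notes.</p></body></html>"
messages = [{"role": "user", "content": f"Convert this HTML to Markdown:\n\n{html}"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```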
Takeaways:
A 1.5B parameter model outperforms GPT-4o and 32B models by 15-20% on HTML extraction, proving that task-specific fine-tuning trumps raw scale for domain expertise.
The model generates its own training data in Stage 4 "self-play," creating better datasets than human-curated ones and continuously improving performance through recursive feedback.
The model suffered from catastrophic token repetition during training, but adding contrastive loss to encourage discriminative representations completely eliminated this degeneration issue.
TIPS: Text-Image Pretraining with Spatial Awareness
Vision-language models trained with contrastive learning excel at global image-text alignment but fail at dense spatial understanding tasks. TIPS combines contrastive learning with masked image modeling and uses synthetically generated captions that encode spatial relationships, creating embeddings suitable for both dense and global understanding without task-specific fine-tuning. The approach demonstrates how spatial awareness can be incorporated into embedding models for better document understanding and multimodal retrieval applications.
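The core recipe is easy to express: a CLIP-style contrastive term on global embeddings plus a masked-image-modeling term on patches. The sketch below is our simplification; the loss weighting and the exact reconstruction target are assumptions, not the paper's values.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over L2-normalized global embeddings (CLIP-style)."""
    img, txt = F.normalize(img_emb, dim=-1), F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(len(img), device=img.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

def masked_image_loss(pred_patches, target_patches, mask):
    """Reconstruct only the masked patches (mask: 1 = masked)."""
    per_patch = F.mse_loss(pred_patches, target_patches, reduction="none").mean(-1)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)

def tips_style_loss(img_emb, txt_emb, pred, target, mask, lam=1.0):
    # lam is a hypothetical weighting between the two objectives
    return contrastive_loss(img_emb, txt_emb) + lam * masked_image_loss(pred, target, mask)
```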
Takeaways:
Synthetic captions with spatial descriptions provide richer training signals than noisy web captions for learning spatially-aware representations
Combining contrastive image-text learning with self-supervised objectives bridges the gap between global and dense understanding
Off-the-shelf performance on diverse tasks eliminates the need for specialized fine-tuning across different vision applications
Cut Cross-Entropy: Memory-Efficient Loss Computation for Large Vocabularies
Cross-entropy computation dominates memory usage in large vocabulary language models, requiring materialization of logit matrices proportional to batch_size × vocabulary_size. CCE reformulates the calculation to compute only necessary components on-the-fly using custom CUDA kernels, reducing memory consumption from gigabytes to megabytes while maintaining identical training dynamics. This enables training of embedding and reranking models with larger vocabularies on limited hardware, particularly beneficial for multilingual and domain-specific applications.
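The underlying identity is simply loss = logsumexp(logits) - logit_of_correct_token, and only the log-sum-exp touches the full vocabulary, so it can be accumulated in chunks. The real CCE implementation fuses this into custom kernels; the pure-PyTorch sketch below only illustrates the memory argument.

```python
import torch

def chunked_cross_entropy(hidden, classifier, targets, chunk_size=8192):
    """Cross-entropy without materializing the full [tokens, vocab] logit matrix.
    hidden: [N, d], classifier: [vocab, d], targets: [N]."""
    # Logit of the correct token: one gather plus a dot product per token.
    correct = (hidden * classifier[targets]).sum(-1)                 # [N]
    # Running log-sum-exp over vocabulary chunks.
    lse = torch.full_like(correct, float("-inf"))
    for start in range(0, classifier.size(0), chunk_size):
        block = hidden @ classifier[start:start + chunk_size].t()    # [N, chunk]
        lse = torch.logaddexp(lse, torch.logsumexp(block, dim=-1))
    return (lse - correct).mean()
```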
Takeaways:
Cross-entropy loss computation can consume 90% of training memory for large vocabulary models, becoming the primary bottleneck
On-the-fly computation of log-sum-exp terms eliminates the need to materialize full logit matrices without mathematical approximations
FlexPrefill: Context-Aware Sparse Attention for Long Sequences
Long-sequence transformer inference suffers from quadratic attention complexity. FlexPrefill dynamically determines sparse attention patterns per head using Jensen-Shannon divergence and adaptively allocates computational budget based on cumulative attention scores, achieving significant speedups with minimal accuracy loss across diverse content types. The method enables efficient processing of long documents for search and retrieval systems, allowing smaller language models to handle extended contexts for better document understanding.
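A rough sketch of the budget-allocation half of the idea: estimate how much attention the latest queries pay to each key block, then keep, per head, only the smallest set of blocks that covers a target share of the attention mass. The block size, coverage threshold, and single-query-block estimate are our simplifications (the paper also uses a Jensen-Shannon test to choose between pattern types).

```python
import torch

def select_key_blocks(q, k, block_size=64, coverage=0.95):
    """Per head, keep the smallest set of key blocks whose estimated attention
    mass for the latest queries reaches `coverage`.
    q, k: [heads, seq, head_dim]; seq is assumed divisible by block_size."""
    scores = (q[:, -block_size:] @ k.transpose(-1, -2)) / q.size(-1) ** 0.5
    probs = scores.softmax(-1).mean(-2)                          # [heads, seq_k]
    block_mass = probs.unflatten(-1, (-1, block_size)).sum(-1)   # [heads, n_blocks]
    mass, order = block_mass.sort(-1, descending=True)
    n_keep = (mass.cumsum(-1) < coverage).sum(-1) + 1            # blocks kept per head
    return [order[h, :n_keep[h]].tolist() for h in range(q.size(0))]
```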
Takeaways:
Dynamic sparse attention patterns adapted to content type outperform fixed sparsity strategies across different input characteristics
Per-head adaptive budget allocation based on attention score accumulation optimizes computation distribution in real-time
Context-aware sparsity achieves 13.7× speedup with 0.1% accuracy loss while requiring no model retraining
Effective Post-Training Embedding Compression via Temperature Control
Temperature scaling in contrastive learning significantly influences the intrinsic dimensionality of learned embeddings, with lower temperatures producing more compressible representations. The paper demonstrates that temperature aggregation methods can reduce embedding dimensions by an order of magnitude while maintaining retrieval performance, revealing the trade-off between clustering effectiveness and retrieval accuracy. This enables efficient deployment of dense retrieval systems where memory constraints are critical for production applications.
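Temperature is just the divisor on the similarity logits in the contrastive loss, and intrinsic dimensionality can be eyeballed with a PCA-style variance count. Both functions below are rough stand-ins for the paper's exact loss and estimator.

```python
import torch
import torch.nn.functional as F

def info_nce(query, doc, temperature: float):
    """InfoNCE with an explicit temperature; lower values sharpen the softmax
    and, per the paper, yield embeddings with lower intrinsic dimensionality."""
    q, d = F.normalize(query, dim=-1), F.normalize(doc, dim=-1)
    logits = q @ d.t() / temperature
    return F.cross_entropy(logits, torch.arange(len(q), device=q.device))

def effective_dim(emb, threshold=0.95):
    """Crude intrinsic-dimensionality proxy: number of principal components
    needed to explain `threshold` of the variance."""
    s = torch.linalg.svdvals(emb - emb.mean(0))
    ratio = (s ** 2).cumsum(0) / (s ** 2).sum()
    return int((ratio < threshold).sum().item()) + 1
```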
Takeaways:
Lower temperature values in contrastive training produce embeddings with lower intrinsic dimensionality that compress more effectively
Temperature aggregation techniques achieve 10× compression ratios with minimal quality degradation across retrieval tasks
Systematic control of temperature during training provides a direct mechanism for optimizing the compression-performance trade-off
Attention in Large Language Models Yields Efficient Zero-Shot Re-Rankers
In-Context Re-ranking (ICR) leverages attention pattern changes in LLMs to perform document re-ranking without text generation, reducing computational complexity from O(N log N) to O(1). The method aggregates attention weights across layers and heads to compute relevance scores, with content-free query calibration to mitigate LLM biases. This approach enables efficient re-ranking with open-weight models, eliminating the need for specialized fine-tuning or expensive generation processes.
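Scoring a document from attention alone can be sketched with any open-weight causal LM: run the prompt once, average attention over layers and heads, and sum what the final query token pays to the document's token span. The model name, the aggregation (a simple mean), and the span handling below are our simplifications of ICR.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any open-weight causal LM works; this model name is just a placeholder.
model_id = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation="eager")

def attention_score(prompt: str, doc_span: tuple) -> float:
    """Attention (averaged over layers and heads, last query token) paid to the
    tokens inside `doc_span`, a simplified stand-in for ICR's aggregation."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        attn = model(ids, output_attentions=True).attentions   # L tensors of [1, H, T, T]
    per_token = torch.stack(attn).mean(dim=(0, 2))[0, -1]       # [T]
    return per_token[doc_span[0]:doc_span[1]].sum().item()

# ICR's calibration step: subtract the score obtained with a content-free query
# (e.g. "N/A") to offset positional and length biases.
```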
Takeaways:
Attention patterns in LLMs contain sufficient signals for effective document re-ranking without requiring text generation
Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization
Traditional DPO suffers from weak correlations between chosen and rejected responses in preference pairs, limiting alignment effectiveness. BMC addresses this by synthesizing pseudo-preferred responses that interpolate between winning and losing responses, then applies token-level correlation modeling using policy model confidence. The two-phase approach first bridges preference pairs through targeted modifications, then models fine-grained correlations during training to improve learning signal quality.
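The token-level half of the method can be sketched as a DPO loss whose per-token log-ratios are weighted, for example by the policy's own confidence. The shapes, the weighting scheme, and beta below are simplified assumptions, not the paper's exact formulation.

```python
import torch.nn.functional as F

def weighted_dpo_loss(pi_logps_w, ref_logps_w, pi_logps_l, ref_logps_l,
                      w_weights, l_weights, beta=0.1):
    """DPO objective with per-token weights on the policy/reference log-ratios.
    All *_logps are per-token log-probabilities of shape [seq_len]."""
    chosen = ((pi_logps_w - ref_logps_w) * w_weights).sum()
    rejected = ((pi_logps_l - ref_logps_l) * l_weights).sum()
    return -F.logsigmoid(beta * (chosen - rejected))
```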
Takeaways:
Weak correlations between chosen and rejected responses in preference data significantly limit DPO effectiveness for model alignment
Synthesizing pseudo-preferred responses as interpolations between preference pairs provides richer learning signals for optimization
Token-level correlation modeling using policy confidence dynamically weights training signals to capture nuanced variations in preference data
TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer
Knowledge distillation faces challenges from capacity gaps, mode averaging, and mode collapse when transferring knowledge between large and small models. TAID introduces a dynamic intermediate teacher that interpolates between student and teacher distributions, gradually adapting the target distribution based on training progress. This approach prevents mode collapse through theoretical guarantees and achieves superior performance across various model sizes, enabling development of compact yet capable language models.
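In code, the intermediate teacher is just a convex mixture of the (detached) student distribution and the teacher distribution, with the mixing weight moving toward the teacher as training progresses. The linear schedule here is a placeholder; the paper adapts it based on training dynamics.

```python
import torch.nn.functional as F

def taid_style_loss(student_logits, teacher_logits, progress: float):
    """Distill toward an interpolated teacher that drifts from the detached
    student toward the real teacher as training progresses (progress in [0, 1]).
    A simplified sketch of TAID's time-dependent intermediate distribution."""
    alpha = progress                                   # placeholder schedule
    student_p = F.softmax(student_logits.detach(), dim=-1)
    teacher_p = F.softmax(teacher_logits, dim=-1)
    target = (1 - alpha) * student_p + alpha * teacher_p
    log_q = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(log_q, target, reduction="batchmean")
```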
Takeaways:
Dynamic intermediate teachers that adapt during training provide smoother learning trajectories compared to fixed teacher distillation
TAID prevents mode collapse through adaptive interpolation while balancing knowledge transfer across different capacity gaps
The method enables training of state-of-the-art compact models without requiring specialized architectures or extensive hyperparameter tuning
SVD-LLM: Truncation-Aware Singular Value Decomposition for Large Language Model Compression
Existing SVD-based compression methods fail to account for input activations during approximation and lack post-truncation fine-tuning. SVD-LLM incorporates truncation-aware data whitening that considers activation distributions and applies LoRA-based fine-tuning after compression. The method establishes theoretical connections between singular values and compression loss, enabling more principled compression decisions that outperform structured pruning and quantization approaches.
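The whitening step is compact enough to sketch: factor the calibration activations' Gram matrix with a Cholesky decomposition, run truncated SVD on the whitened weight, then undo the whitening. The jitter term and the dense inverse are readability simplifications; in practice the factors are kept separate and followed by LoRA fine-tuning.

```python
import torch

def svd_llm_compress(W, X, rank):
    """Truncation-aware SVD sketch.
    W: [out, in] weight; X: [n_samples, in] calibration activations."""
    gram = X.t() @ X + 1e-6 * torch.eye(X.size(1), dtype=X.dtype)  # jitter for stability
    S = torch.linalg.cholesky(gram)                                # gram = S @ S.T
    U, sigma, Vh = torch.linalg.svd(W @ S, full_matrices=False)
    low_rank = U[:, :rank] @ torch.diag(sigma[:rank]) @ Vh[:rank]
    # Truncating singular values of W @ S directly controls the
    # activation-weighted error ||(W - W') X^T||_F.
    return low_rank @ torch.linalg.inv(S)                          # compressed W
```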
Takeaways:
Truncation-aware data whitening that accounts for input activations significantly improves SVD compression effectiveness over activation-agnostic methods
Post-compression LoRA fine-tuning compensates for accuracy degradation while maintaining the benefits of low-rank factorization
Theoretical analysis linking singular values to compression loss enables principled truncation decisions that outperform heuristic approaches
See What You Are Told: Visual Attention Sink in Large Multimodal Models
Large multimodal models exhibit a phenomenon called "visual attention sink" where they consistently allocate high attention weights to specific visual tokens that are irrelevant to corresponding text tokens. These irrelevant visual tokens emerge from massive activation in specific hidden state dimensions, similar to attention sinks in language models. The Visual Attention Redistribution (VAR) method identifies image-centric attention heads and redistributes attention budget from sink tokens to meaningful visual content, improving performance across vision-language tasks without requiring additional training.
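A simplified version of the two steps: flag visual tokens whose activations in the known "massive" dimensions blow up, then zero their attention and hand the recovered mass proportionally to the remaining visual tokens. The dimension indices, threshold, and proportional redistribution rule are illustrative assumptions.

```python
import torch

def find_visual_sinks(hidden, sink_dims, tau=20.0):
    """Flag visual tokens whose activation magnitude in the 'massive'
    dimensions exceeds tau times the overall average.
    hidden: [n_visual_tokens, d]; sink_dims: list of dimension indices."""
    magnitude = hidden[:, sink_dims].abs().mean(-1)
    return magnitude > tau * hidden.abs().mean()

def redistribute_attention(attn, visual_idx, sink_mask):
    """Zero attention on sink tokens and give the recovered mass to the other
    visual tokens in proportion to their current attention.
    attn: [..., seq_len]; visual_idx: positions of visual tokens in the sequence."""
    attn = attn.clone()
    sinks, rest = visual_idx[sink_mask], visual_idx[~sink_mask]
    recovered = attn[..., sinks].sum(-1, keepdim=True)
    attn[..., sinks] = 0.0
    share = attn[..., rest] / attn[..., rest].sum(-1, keepdim=True).clamp(min=1e-6)
    attn[..., rest] += recovered * share
    return attn
```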
Takeaways:
Visual sink tokens can be identified by extreme activation magnitudes in fixed dimensions inherited from base language models
Removing visual sink tokens does not hurt model performance even though those tokens receive high attention weights, indicating wasted computational resources
VAR redistributes attention from sink tokens to meaningful visual content, improving performance on general vision-language, hallucination reduction, and vision-centric tasks
Towards Semantic Equivalence of Tokenization in Multimodal LLM
Traditional vision tokenization methods in multimodal LLMs fragment visual input using fixed patches, corrupting semantic integrity and leading to poor vision-language alignment. SeTok (Semantic-Equivalent Vision Tokenizer) addresses this through dynamic clustering that groups visual features into coherent semantic units, with token count adapting to image complexity. The system uses dual training objectives: contrastive loss for semantic alignment with language and reconstruction loss to preserve pixel-level details for image reconstruction.
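The clustering idea is easy to prototype with off-the-shelf tools: let a distance threshold, rather than a fixed k, decide how many clusters an image gets, then mean-pool each cluster into one token. SeTok uses its own clustering algorithm and learned objectives; this sketch only mirrors the "token count follows image complexity" behavior.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def semantic_tokens(patch_features: np.ndarray, distance_threshold: float = 1.0):
    """Group patch features into a content-dependent number of clusters and
    mean-pool each cluster into one 'semantic token'."""
    labels = AgglomerativeClustering(
        n_clusters=None, distance_threshold=distance_threshold
    ).fit_predict(patch_features)
    return np.stack([patch_features[labels == c].mean(0) for c in np.unique(labels)])

patches = np.random.randn(196, 768).astype(np.float32)  # 14x14 ViT patch features
tokens = semantic_tokens(patches)                        # token count varies per image
```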
Takeaways:
Fixed-patch tokenization disrupts visual semantic integrity by fragmenting objects across arbitrary patch boundaries
Dynamic clustering algorithms can adaptively determine optimal token counts based on image semantic complexity rather than fixed grid structures
Dual objective training balances semantic alignment with language while preserving sufficient visual detail for reconstruction tasks
Hymba: A Hybrid-head Architecture for Small Language Models
Hymba introduces a hybrid-head architecture that combines transformer attention mechanisms with state space models (SSMs) in parallel within each layer, enabling simultaneous high-resolution recall and efficient context summarization. The architecture incorporates learnable meta tokens, cross-layer key-value sharing, and partial sliding window attention to achieve compact cache sizes. Hymba-1.5B surpasses all sub-2B models and outperforms Llama-3.2-3B while achieving 11.67× cache reduction and 3.49× throughput improvement.
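A toy version of the parallel hybrid block: attention and a (heavily simplified) recurrent SSM branch see the same input, and their normalized outputs are averaged. Real Hymba uses Mamba heads, learned meta tokens, and learned fusion weights; everything below is a structural sketch only.

```python
import torch
import torch.nn as nn

class HybridHeadBlock(nn.Module):
    """Toy parallel hybrid block echoing Hymba's idea: attention for recall,
    a recurrent state branch for cheap summarization, fused after per-branch
    normalization."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.decay = nn.Parameter(torch.rand(dim))   # per-channel state decay
        self.in_proj = nn.Linear(dim, dim)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_s = nn.LayerNorm(dim)

    def forward(self, x):                            # x: [batch, seq, dim]
        causal = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool), 1)
        attn_out, _ = self.attn(x, x, x, attn_mask=causal)
        u, state, states = self.in_proj(x), torch.zeros_like(x[:, 0]), []
        for t in range(x.size(1)):                   # naive recurrent scan
            state = torch.sigmoid(self.decay) * state + u[:, t]
            states.append(state)
        ssm_out = torch.stack(states, dim=1)
        return (self.norm_a(attn_out) + self.norm_s(ssm_out)) / 2

block = HybridHeadBlock(dim=256)
out = block(torch.randn(2, 32, 256))                 # [batch, seq, dim]
```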
Takeaways:
Parallel hybrid-head architecture outperforms sequential stacking of attention and SSM components by enabling simultaneous processing of complementary mechanisms
Learnable meta tokens act as compressed world knowledge and alleviate the "forced-to-attend" burden of softmax attention mechanisms
Cross-layer key-value sharing and sliding window attention optimizations achieve dramatic cache size reductions without sacrificing performance