May 25, 2025
What We Learned at ICLR2025
We've collected some of the most interesting papers from ICLR 2025, featuring TIPS, FlexPrefill, zero-shot re-rankers, SVD-LLM, Hymba, and more.
ICLR 2025 is one of the largest and most influential machine learning conferences in the world, standing alongside NeurIPS and ICML as the three premier venues for high-impact AI research. This year marked a historic milestone as ICLR was held in Asia for the first time, taking place at the Singapore EXPO from April 24-28. The timing couldn't have been more perfect: just months after the "DeepSeek moment" in late January 2025 that sent shockwaves through Silicon Valley and demonstrated China's rapidly advancing AI research. Combined with the new China-Singapore 30-day mutual visa exemption agreement that took effect in February 2024, we witnessed an unprecedented surge in Chinese participation at the conference.
This year, our team was excited to make the trip to Singapore, with Sedigheh Eslami, Andreas Koukounas, Wang Feng, and CEO Han Xiao presenting three papers that showcase our latest work on jina-clip-v2 and ReaderLM-v2 for better search. While the rest of the AI world seems locked in an arms race for bigger and bigger models, we decided to swim against the tide, proving that smaller, smarter models can punch well above their weight when you get the design right.
So grab your coffee, get comfortable, and let's explore some of the ICLR research we found interesting, beginning with our own take on why small can be mighty.
Mitigate the Gap: Improving Cross-Modal Alignment in CLIP
CLIP models excel at image-text tasks but suffer from a "modality gap"—image and text embeddings cluster in separate regions, limiting performance. This work, led by our intern Sedigheh Eslami during her PhD at Hasso Plattner Institute, tackles this fundamental issue.
We discovered that simple vector translation breaks embedding structure. Instead, AlignCLIP uses shared encoder parameters with semantically-regularized separation objectives. This dual approach successfully reduces the modality gap while improving performance across zero-shot and fine-tuning tasks.
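To make the "gap" concrete, here is a minimal sketch of the metric most modality-gap studies report: the Euclidean distance between the centroids of L2-normalized image and text embeddings. The function name and toy data below are ours, not from the paper.

```python
import torch
import torch.nn.functional as F

def modality_gap(image_emb: torch.Tensor, text_emb: torch.Tensor) -> float:
    """Euclidean distance between the centroids of L2-normalized image and
    text embeddings, a common proxy for the modality gap."""
    img = F.normalize(image_emb, dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    return (img.mean(dim=0) - txt.mean(dim=0)).norm().item()

# Toy check: two well-separated clusters produce a large gap value.
print(modality_gap(torch.randn(1000, 512) + 1.0, torch.randn(1000, 512) - 1.0))
```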
Takeaways:
Modality gap is a critical CLIP performance bottleneck
jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images
This is the paper behind jina-clip-v2, a multilingual multimodal embedding model that supports both text-only and crossmodal tasks using a multi-task, multi-stage contrastive learning approach. The model combines a text encoder (Jina XLM-RoBERTa, 561M parameters) and a vision encoder (EVA02-L14, 304M parameters) for 865M total parameters. We train on multilingual texts from 29 non-English languages and visually-rich documents, employing Matryoshka Representation Learning for flexible embedding dimensionality.
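Matryoshka Representation Learning is what makes the dimensionality flexible at inference time: you keep the leading dimensions and re-normalize. A minimal sketch, where the random array stands in for real jina-clip-v2 outputs:

```python
import numpy as np

def truncate_embeddings(emb: np.ndarray, dim: int = 256) -> np.ndarray:
    """Keep the first `dim` Matryoshka dimensions and re-normalize so that
    cosine similarity still behaves on the shortened vectors."""
    cut = emb[:, :dim]
    return cut / np.linalg.norm(cut, axis=1, keepdims=True)

full = np.random.randn(8, 1024)             # stand-in for 1024-d jina-clip-v2 vectors
small = truncate_embeddings(full, dim=256)  # roughly 4x smaller index
```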
Takeaways:
Mixing image-text and text-text data in single batches with shared temperature parameters performs worse than separate training due to modality information asymmetry.
Training for crossmodal alignment inherently compromises pure text embedding quality, showing a fundamental trade-off.
Cutting embeddings from 1,024 to 256 dimensions causes less than 1% performance loss, revealing massive inefficiency in high-dimensional representations.
ReaderLM-V2: Small Language Model for HTML to Markdown and JSON
This is the paper behind ReaderLM-v2, a compact 1.5-billion-parameter language model designed for efficient web content extraction. The model processes documents up to 512K tokens, transforming messy HTML into clean Markdown or JSON. Our approach pairs a three-stage data synthesis pipeline (DRAFT-REFINE-CRITIQUE), which generates high-quality training data through iterative refinement, with a unified training framework spanning continuous pre-training, supervised fine-tuning, direct preference optimization, and self-play iterative tuning. ReaderLM-v2 outperforms GPT-4o and other larger models by 15-20% on benchmarks, particularly excelling at documents exceeding 100K tokens, while maintaining significantly lower computational requirements.
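If you'd rather try the model than read the paper, a minimal sketch with Hugging Face transformers looks roughly like this. We assume the published jinaai/ReaderLM-v2 checkpoint and its built-in chat template; the prompt wording here is illustrative, so check the model card for the recommended one.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jinaai/ReaderLM-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

html = "<html><body><h1>Hello</h1><p>ICLR 2025 notes.</p></body></html>"
messages = [{"role": "user", "content": f"Convert this HTML to Markdown:\n\n{html}"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```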
Takeaways:
A 1.5B parameter model outperforms GPT-4o and 32B models by 15-20% on HTML extraction, proving that task-specific fine-tuning trumps raw scale for domain expertise.
The model generates its own training data in Stage 4 "self-play," creating better datasets than human-curated ones and continuously improving performance through recursive feedback.
The model suffered from catastrophic token repetition during training, but adding contrastive loss to encourage discriminative representations completely eliminated this degeneration issue.
TIPS: Text-Image Pretraining with Spatial Awareness
Vision-language models trained with contrastive learning excel at global image-text alignment but fail at dense spatial understanding tasks. TIPS combines contrastive learning with masked image modeling and uses synthetically generated captions that encode spatial relationships, creating embeddings suitable for both dense and global understanding without task-specific fine-tuning. The approach demonstrates how spatial awareness can be incorporated into embedding models for better document understanding and multimodal retrieval applications.
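The core recipe is easy to express: a CLIP-style contrastive term on global embeddings plus a masked-image-modeling term on patches. The sketch below is our simplification; the loss weighting and the exact reconstruction target are assumptions, not the paper's values.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over L2-normalized global embeddings (CLIP-style)."""
    img, txt = F.normalize(img_emb, dim=-1), F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(len(img), device=img.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

def masked_image_loss(pred_patches, target_patches, mask):
    """Reconstruct only the masked patches (mask: 1 = masked)."""
    per_patch = F.mse_loss(pred_patches, target_patches, reduction="none").mean(-1)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)

def tips_style_loss(img_emb, txt_emb, pred, target, mask, lam=1.0):
    # lam is a hypothetical weighting between the two objectives
    return contrastive_loss(img_emb, txt_emb) + lam * masked_image_loss(pred, target, mask)
```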
Takeaways:
Synthetic captions with spatial descriptions provide richer training signals than noisy web captions for learning spatially-aware representations
Combining contrastive image-text learning with self-supervised objectives bridges the gap between global and dense understanding
Off-the-shelf performance on diverse tasks eliminates the need for specialized fine-tuning across different vision applications
Cut Cross-Entropy: Memory-Efficient Loss Computation for Large Vocabularies
Cross-entropy computation dominates memory usage in large vocabulary language models, requiring materialization of logit matrices proportional to batch_size × vocabulary_size. CCE reformulates the calculation to compute only necessary components on-the-fly using custom CUDA kernels, reducing memory consumption from gigabytes to megabytes while maintaining identical training dynamics. This enables training of embedding and reranking models with larger vocabularies on limited hardware, particularly beneficial for multilingual and domain-specific applications.
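The underlying identity is simply loss = logsumexp(logits) - logit_of_correct_token, and only the log-sum-exp touches the full vocabulary, so it can be accumulated in chunks. The real CCE implementation fuses this into custom kernels; the pure-PyTorch sketch below only illustrates the memory argument.

```python
import torch

def chunked_cross_entropy(hidden, classifier, targets, chunk_size=8192):
    """Cross-entropy without materializing the full [tokens, vocab] logit matrix.
    hidden: [N, d], classifier: [vocab, d], targets: [N]."""
    # Logit of the correct token: one gather plus a dot product per token.
    correct = (hidden * classifier[targets]).sum(-1)                 # [N]
    # Running log-sum-exp over vocabulary chunks.
    lse = torch.full_like(correct, float("-inf"))
    for start in range(0, classifier.size(0), chunk_size):
        block = hidden @ classifier[start:start + chunk_size].t()    # [N, chunk]
        lse = torch.logaddexp(lse, torch.logsumexp(block, dim=-1))
    return (lse - correct).mean()
```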
Takeaways:
Cross-entropy loss computation can consume 90% of training memory for large vocabulary models, becoming the primary bottleneck
On-the-fly computation of log-sum-exp terms eliminates the need to materialize full logit matrices without mathematical approximations
FlexPrefill: Context-Aware Sparse Attention for Long Sequences
Long-sequence transformer inference suffers from quadratic attention complexity. FlexPrefill dynamically determines sparse attention patterns per head using Jensen-Shannon divergence and adaptively allocates computational budget based on cumulative attention scores, achieving significant speedups with minimal accuracy loss across diverse content types. The method enables efficient processing of long documents for search and retrieval systems, allowing smaller language models to handle extended contexts for better document understanding.
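A rough sketch of the budget-allocation half of the idea: estimate how much attention the latest queries pay to each key block, then keep, per head, only the smallest set of blocks that covers a target share of the attention mass. The block size, coverage threshold, and single-query-block estimate are our simplifications (the paper also uses a Jensen-Shannon test to choose between pattern types).

```python
import torch

def select_key_blocks(q, k, block_size=64, coverage=0.95):
    """Per head, keep the smallest set of key blocks whose estimated attention
    mass for the latest queries reaches `coverage`.
    q, k: [heads, seq, head_dim]; seq is assumed divisible by block_size."""
    scores = (q[:, -block_size:] @ k.transpose(-1, -2)) / q.size(-1) ** 0.5
    probs = scores.softmax(-1).mean(-2)                          # [heads, seq_k]
    block_mass = probs.unflatten(-1, (-1, block_size)).sum(-1)   # [heads, n_blocks]
    mass, order = block_mass.sort(-1, descending=True)
    n_keep = (mass.cumsum(-1) < coverage).sum(-1) + 1            # blocks kept per head
    return [order[h, :n_keep[h]].tolist() for h in range(q.size(0))]
```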
Takeaways:
Dynamic sparse attention patterns adapted to content type outperform fixed sparsity strategies across different input characteristics
Per-head adaptive budget allocation based on attention score accumulation optimizes computation distribution in real-time
Context-aware sparsity achieves 13.7× speedup with 0.1% accuracy loss while requiring no model retraining
Effective Post-Training Embedding Compression via Temperature Control
Temperature scaling in contrastive learning significantly influences the intrinsic dimensionality of learned embeddings, with lower temperatures producing more compressible representations. The paper demonstrates that temperature aggregation methods can reduce embedding dimensions by an order of magnitude while maintaining retrieval performance, revealing the trade-off between clustering effectiveness and retrieval accuracy. This enables efficient deployment of dense retrieval systems where memory constraints are critical for production applications.
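Temperature is just the divisor on the similarity logits in the contrastive loss, and intrinsic dimensionality can be eyeballed with a PCA-style variance count. Both functions below are rough stand-ins for the paper's exact loss and estimator.

```python
import torch
import torch.nn.functional as F

def info_nce(query, doc, temperature: float):
    """InfoNCE with an explicit temperature; lower values sharpen the softmax
    and, per the paper, yield embeddings with lower intrinsic dimensionality."""
    q, d = F.normalize(query, dim=-1), F.normalize(doc, dim=-1)
    logits = q @ d.t() / temperature
    return F.cross_entropy(logits, torch.arange(len(q), device=q.device))

def effective_dim(emb, threshold=0.95):
    """Crude intrinsic-dimensionality proxy: number of principal components
    needed to explain `threshold` of the variance."""
    s = torch.linalg.svdvals(emb - emb.mean(0))
    ratio = (s ** 2).cumsum(0) / (s ** 2).sum()
    return int((ratio < threshold).sum().item()) + 1
```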
Takeaways:
Lower temperature values in contrastive training produce embeddings with lower intrinsic dimensionality that compress more effectively
Temperature aggregation techniques achieve 10× compression ratios with minimal quality degradation across retrieval tasks
Systematic control of temperature during training provides a direct mechanism for optimizing the compression-performance trade-off
Attention in Large Language Models Yields Efficient Zero-Shot Re-Rankers
In-Context Re-ranking (ICR) leverages attention pattern changes in LLMs to perform document re-ranking without text generation, reducing computational complexity from O(N log N) to O(1). The method aggregates attention weights across layers and heads to compute relevance scores, with content-free query calibration to mitigate LLM biases. This approach enables efficient re-ranking with open-weight models, eliminating the need for specialized fine-tuning or expensive generation processes.
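Scoring a document from attention alone can be sketched with any open-weight causal LM: run the prompt once, average attention over layers and heads, and sum what the final query token pays to the document's token span. The model name, the aggregation (a simple mean), and the span handling below are our simplifications of ICR.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any open-weight causal LM works; this model name is just a placeholder.
model_id = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation="eager")

def attention_score(prompt: str, doc_span: tuple) -> float:
    """Attention (averaged over layers and heads, last query token) paid to the
    tokens inside `doc_span`, a simplified stand-in for ICR's aggregation."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        attn = model(ids, output_attentions=True).attentions   # L tensors of [1, H, T, T]
    per_token = torch.stack(attn).mean(dim=(0, 2))[0, -1]       # [T]
    return per_token[doc_span[0]:doc_span[1]].sum().item()

# ICR's calibration step: subtract the score obtained with a content-free query
# (e.g. "N/A") to offset positional and length biases.
```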
Takeaways:
Attention patterns in LLMs contain sufficient signals for effective document re-ranking without requiring text generation
Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization
Traditional DPO suffers from weak correlations between chosen and rejected responses in preference pairs, limiting alignment effectiveness. BMC addresses this by synthesizing pseudo-preferred responses that interpolate between winning and losing responses, then applies token-level correlation modeling using policy model confidence. The two-phase approach first bridges preference pairs through targeted modifications, then models fine-grained correlations during training to improve learning signal quality.
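The token-level half of the method can be sketched as a DPO loss whose per-token log-ratios are weighted, for example by the policy's own confidence. The shapes, the weighting scheme, and beta below are simplified assumptions, not the paper's exact formulation.

```python
import torch.nn.functional as F

def weighted_dpo_loss(pi_logps_w, ref_logps_w, pi_logps_l, ref_logps_l,
                      w_weights, l_weights, beta=0.1):
    """DPO objective with per-token weights on the policy/reference log-ratios.
    All *_logps are per-token log-probabilities of shape [seq_len]."""
    chosen = ((pi_logps_w - ref_logps_w) * w_weights).sum()
    rejected = ((pi_logps_l - ref_logps_l) * l_weights).sum()
    return -F.logsigmoid(beta * (chosen - rejected))
```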
Takeaways:
Weak correlations between chosen and rejected responses in preference data significantly limit DPO effectiveness for model alignment
Synthesizing pseudo-preferred responses as interpolations between preference pairs provides richer learning signals for optimization
Token-level correlation modeling using policy confidence dynamically weights training signals to capture nuanced variations in preference data
TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer
Knowledge distillation faces challenges from capacity gaps, mode averaging, and mode collapse when transferring knowledge between large and small models. TAID introduces a dynamic intermediate teacher that interpolates between student and teacher distributions, gradually adapting the target distribution based on training progress. This approach prevents mode collapse through theoretical guarantees and achieves superior performance across various model sizes, enabling development of compact yet capable language models.
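In code, the intermediate teacher is just a convex mixture of the (detached) student distribution and the teacher distribution, with the mixing weight moving toward the teacher as training progresses. The linear schedule here is a placeholder; the paper adapts it based on training dynamics.

```python
import torch.nn.functional as F

def taid_style_loss(student_logits, teacher_logits, progress: float):
    """Distill toward an interpolated teacher that drifts from the detached
    student toward the real teacher as training progresses (progress in [0, 1]).
    A simplified sketch of TAID's time-dependent intermediate distribution."""
    alpha = progress                                   # placeholder schedule
    student_p = F.softmax(student_logits.detach(), dim=-1)
    teacher_p = F.softmax(teacher_logits, dim=-1)
    target = (1 - alpha) * student_p + alpha * teacher_p
    log_q = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(log_q, target, reduction="batchmean")
```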
Takeaways:
Dynamic intermediate teachers that adapt during training provide smoother learning trajectories compared to fixed teacher distillation
TAID prevents mode collapse through adaptive interpolation while balancing knowledge transfer across different capacity gaps
The method enables training of state-of-the-art compact models without requiring specialized architectures or extensive hyperparameter tuning
SVD-LLM: Truncation-Aware Singular Value Decomposition for Large Language Model Compression
Existing SVD-based compression methods fail to account for input activations during approximation and lack post-truncation fine-tuning. SVD-LLM incorporates truncation-aware data whitening that considers activation distributions and applies LoRA-based fine-tuning after compression. The method establishes theoretical connections between singular values and compression loss, enabling more principled compression decisions that outperform structured pruning and quantization approaches.
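The whitening step is compact enough to sketch: factor the calibration activations' Gram matrix with a Cholesky decomposition, run truncated SVD on the whitened weight, then undo the whitening. The jitter term and the dense inverse are readability simplifications; in practice the factors are kept separate and followed by LoRA fine-tuning.

```python
import torch

def svd_llm_compress(W, X, rank):
    """Truncation-aware SVD sketch.
    W: [out, in] weight; X: [n_samples, in] calibration activations."""
    gram = X.t() @ X + 1e-6 * torch.eye(X.size(1), dtype=X.dtype)  # jitter for stability
    S = torch.linalg.cholesky(gram)                                # gram = S @ S.T
    U, sigma, Vh = torch.linalg.svd(W @ S, full_matrices=False)
    low_rank = U[:, :rank] @ torch.diag(sigma[:rank]) @ Vh[:rank]
    # Truncating singular values of W @ S directly controls the
    # activation-weighted error ||(W - W') X^T||_F.
    return low_rank @ torch.linalg.inv(S)                          # compressed W
```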
Takeaways:
Truncation-aware data whitening that accounts for input activations significantly improves SVD compression effectiveness over activation-agnostic methods
Post-compression LoRA fine-tuning compensates for accuracy degradation while maintaining the benefits of low-rank factorization
Theoretical analysis linking singular values to compression loss enables principled truncation decisions that outperform heuristic approaches
See What You Are Told: Visual Attention Sink in Large Multimodal Models
Large multimodal models exhibit a phenomenon called "visual attention sink" where they consistently allocate high attention weights to specific visual tokens that are irrelevant to corresponding text tokens. These irrelevant visual tokens emerge from massive activation in specific hidden state dimensions, similar to attention sinks in language models. The Visual Attention Redistribution (VAR) method identifies image-centric attention heads and redistributes attention budget from sink tokens to meaningful visual content, improving performance across vision-language tasks without requiring additional training.
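A simplified version of the two steps: flag visual tokens whose activations in the known "massive" dimensions blow up, then zero their attention and hand the recovered mass proportionally to the remaining visual tokens. The dimension indices, threshold, and proportional redistribution rule are illustrative assumptions.

```python
import torch

def find_visual_sinks(hidden, sink_dims, tau=20.0):
    """Flag visual tokens whose activation magnitude in the 'massive'
    dimensions exceeds tau times the overall average.
    hidden: [n_visual_tokens, d]; sink_dims: list of dimension indices."""
    magnitude = hidden[:, sink_dims].abs().mean(-1)
    return magnitude > tau * hidden.abs().mean()

def redistribute_attention(attn, visual_idx, sink_mask):
    """Zero attention on sink tokens and give the recovered mass to the other
    visual tokens in proportion to their current attention.
    attn: [..., seq_len]; visual_idx: positions of visual tokens in the sequence."""
    attn = attn.clone()
    sinks, rest = visual_idx[sink_mask], visual_idx[~sink_mask]
    recovered = attn[..., sinks].sum(-1, keepdim=True)
    attn[..., sinks] = 0.0
    share = attn[..., rest] / attn[..., rest].sum(-1, keepdim=True).clamp(min=1e-6)
    attn[..., rest] += recovered * share
    return attn
```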
Takeaways:
Visual sink tokens can be identified by extreme activation magnitudes in fixed dimensions inherited from base language models
Removing visual sink tokens does not hurt model performance even though those tokens receive high attention weights, indicating wasted computational resources
VAR redistributes attention from sink tokens to meaningful visual content, improving performance on general vision-language, hallucination reduction, and vision-centric tasks
Towards Semantic Equivalence of Tokenization in Multimodal LLM
Traditional vision tokenization methods in multimodal LLMs fragment visual input using fixed patches, corrupting semantic integrity and leading to poor vision-language alignment. SeTok (Semantic-Equivalent Vision Tokenizer) addresses this through dynamic clustering that groups visual features into coherent semantic units, with token count adapting to image complexity. The system uses dual training objectives: contrastive loss for semantic alignment with language and reconstruction loss to preserve pixel-level details for image reconstruction.
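The clustering idea is easy to prototype with off-the-shelf tools: let a distance threshold, rather than a fixed k, decide how many clusters an image gets, then mean-pool each cluster into one token. SeTok uses its own clustering algorithm and learned objectives; this sketch only mirrors the "token count follows image complexity" behavior.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def semantic_tokens(patch_features: np.ndarray, distance_threshold: float = 1.0):
    """Group patch features into a content-dependent number of clusters and
    mean-pool each cluster into one 'semantic token'."""
    labels = AgglomerativeClustering(
        n_clusters=None, distance_threshold=distance_threshold
    ).fit_predict(patch_features)
    return np.stack([patch_features[labels == c].mean(0) for c in np.unique(labels)])

patches = np.random.randn(196, 768).astype(np.float32)  # 14x14 ViT patch features
tokens = semantic_tokens(patches)                        # token count varies per image
```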
Takeaways:
Fixed-patch tokenization disrupts visual semantic integrity by fragmenting objects across arbitrary patch boundaries
Dynamic clustering algorithms can adaptively determine optimal token counts based on image semantic complexity rather than fixed grid structures
Dual objective training balances semantic alignment with language while preserving sufficient visual detail for reconstruction tasks
Hymba: A Hybrid-head Architecture for Small Language Models
Hymba introduces a hybrid-head architecture that combines transformer attention mechanisms with state space models (SSMs) in parallel within each layer, enabling simultaneous high-resolution recall and efficient context summarization. The architecture incorporates learnable meta tokens, cross-layer key-value sharing, and partial sliding window attention to achieve compact cache sizes. Hymba-1.5B surpasses all sub-2B models and outperforms Llama-3.2-3B while achieving 11.67× cache reduction and 3.49× throughput improvement.
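A toy version of the parallel hybrid block: attention and a (heavily simplified) recurrent SSM branch see the same input, and their normalized outputs are averaged. Real Hymba uses Mamba heads, learned meta tokens, and learned fusion weights; everything below is a structural sketch only.

```python
import torch
import torch.nn as nn

class HybridHeadBlock(nn.Module):
    """Toy parallel hybrid block echoing Hymba's idea: attention for recall,
    a recurrent state branch for cheap summarization, fused after per-branch
    normalization."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.decay = nn.Parameter(torch.rand(dim))   # per-channel state decay
        self.in_proj = nn.Linear(dim, dim)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_s = nn.LayerNorm(dim)

    def forward(self, x):                            # x: [batch, seq, dim]
        causal = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool), 1)
        attn_out, _ = self.attn(x, x, x, attn_mask=causal)
        u, state, states = self.in_proj(x), torch.zeros_like(x[:, 0]), []
        for t in range(x.size(1)):                   # naive recurrent scan
            state = torch.sigmoid(self.decay) * state + u[:, t]
            states.append(state)
        ssm_out = torch.stack(states, dim=1)
        return (self.norm_a(attn_out) + self.norm_s(ssm_out)) / 2

block = HybridHeadBlock(dim=256)
out = block(torch.randn(2, 32, 256))                 # [batch, seq, dim]
```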
Takeaways:
Parallel hybrid-head architecture outperforms sequential stacking of attention and SSM components by enabling simultaneous processing of complementary mechanisms
Learnable meta tokens act as compressed world knowledge and alleviate the "forced-to-attend" burden of softmax attention mechanisms
Cross-layer key-value sharing and sliding window attention optimizations achieve dramatic cache size reductions without sacrificing performance