Tech blog
June 30, 2025

Quantization-Aware Training of jina-embeddings-v4

Quantization gives you smaller embeddings. We show that fine-tuning for quantization can make them lossless as well.
Andrei Ungureanu, Scott Martens, Bo Wang • 8 minutes read

Quantization is a widely used technique for addressing scaling problems in AI. The name makes it sound complicated, but it’s just rounding numbers off to make them take up less space. This means smaller embedding vectors that take up less memory and storage space, and faster information retrieval because it takes less time to compare vectors. Quantization is a purely numerical technique that doesn’t care what kind of data your model processes or what use cases you have, so it can bring improvements without requiring lots of expensive domain knowledge.

One might expect the good old cliché about trade-offs to apply: nothing comes for free, and we must sacrifice some precision. In this article, we’ll show you a way to make quantization lossless via quantization-aware training (QAT). This technique is used in jina-embeddings-v4 to provide the smaller embeddings required in space-critical applications.

Overview of Quantization Techniques

Model quantization usually means one of four things:

  • Post-training quantization (PTQ)
  • Training for quantized embedding outputs (Output QAT)
  • Training for fully quantized models (Full QAT)
  • Distilling a new quantized model from an existing unquantized one

Post-training quantization (PTQ) accepts the trained embedding model as is and doesn’t modify it in any way. It’s just a matter of throwing away the least significant digits of the floating point values produced by the model. We just round the numbers off, and sometimes scale them to a range.

Output QAT means fine-tuning the embedding model to produce optimal reduced-precision vectors. This means modifying the model, but it doesn’t change the precision of the model’s weights, and therefore doesn’t reduce its size. Just the output vector size is reduced.

Full QAT begins with a fully trained, full-precision model and lowers the precision of the model weights, then fine-tunes the performance of this modified model. This produces a significantly smaller model as well as smaller embeddings, at the price of doing some fine-tuning.

Distillation is the process of training a new model to match the performance of an existing one. This means creating a new model that’s designed from scratch as quantized, and then using the existing model to generate as much training data as needed to train it until it performs as closely as possible to the existing model.

The benefits of these four approaches are summarized in the table below:

| Approach | More Compact Embeddings? | Requires Training? | Model Compression? | Faster Inference? |
|---|---|---|---|---|
| PTQ | ✓ | ❌ | ❌ | ❌ |
| Output QAT | ✓ | ✓ | ❌ | ❌ |
| Full QAT | ✓ | ✓ | ✓ | ✓ |
| Distillation (to a smaller model) | ✓ | ✓ | ✓ | ✓ |

All four produce more compact embeddings, but other than PTQ, all require some additional training, while only Full QAT and Distillation produce new, faster models. Full QAT and Distillation are much more expensive to implement because they require a great deal more training than Output QAT.

In this article, we’re only going to look at PTQ and Output QAT, which don’t change the size or speed of the embedding model.

Experimental Setup

For these experiments, our baseline model is jina-embeddings-v4 with the retrieval adapter, which produces 32-bit-precision floating-point (FP32) vectors in 2048 dimensions. Each embedding is therefore 8192 bytes, or 8 kB, in size.

We studied several experimental conditions using query-document retrieval benchmark tasks from the NanoBEIR benchmark suite. The retrieval process uses cosine similarity between vectors to find and rank the documents that best match queries.

  • Baseline — The performance of jina-embeddings-v4 embedding vectors without any quantization. These experiments all used a beta version of the model, and the release performance is somewhat better.
  • PTQ — We quantized the output vectors to binary vectors without changing the model.
  • Output QAT — We quantized the output vectors and applied fine-tuning to the retrieval adapter to improve its performance under quantized conditions.
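
For reference, the cosine-similarity ranking used in these benchmarks can be sketched in a few lines; the function and variable names below are ours, not part of the NanoBEIR tooling.

```python
import numpy as np

def rank_documents(query_vec: np.ndarray, doc_vecs: np.ndarray) -> np.ndarray:
    """Rank documents by cosine similarity to a query (best match first)."""
    # Normalizing both sides turns the dot product into cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                   # one similarity score per document
    return np.argsort(-scores)       # indices sorted from most to least similar
```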

Quantization Levels

Figure 1: Comparison of post-quantization embedding sizes.

We experimented with four different levels of quantization.

  • 8-bit integers — FP32 values are reduced to integers in the range -128 to 127, shrinking embeddings 4-fold to 2048 bytes.
  • 4-bit integers — Same as for 8-bit integers, but we map to the range from -8 to 7, reducing vector sizes by a factor of 8, to 1024 bytes.
  • Trinary Quantization — All values are mapped to one of three values: -1, 0, 1. Optimally stored, this takes about 1.6 bits per dimension, reducing the size of embedding vectors roughly 20-fold to approximately 410 bytes.
  • Binary Quantization — We convert FP32 scalar values to one bit each using the torch.sign function, which yields just two values, each taking one bit to store. This reduces 2048-dimensional embedding vectors from 8192 bytes to 256 bytes, a 32-fold reduction.
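
The storage arithmetic above follows directly from the dimension count; here is a quick back-of-the-envelope check in Python, assuming 2048-dimensional vectors as produced by jina-embeddings-v4.

```python
import math

DIMS = 2048
sizes_in_bytes = {
    "fp32":    DIMS * 4,                   # 8192 bytes (baseline)
    "int8":    DIMS * 1,                   # 2048 bytes, 4-fold smaller
    "int4":    DIMS // 2,                  # 1024 bytes, 8-fold smaller
    "trinary": math.ceil(DIMS * 1.6 / 8),  # ~410 bytes at ~1.6 bits/dim (log2(3) ≈ 1.58)
    "binary":  DIMS // 8,                  #  256 bytes, 32-fold smaller
}
print(sizes_in_bytes)
```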

Scaling

For binary quantization, the rule is very simple: if a vector value is zero or positive, it maps to 1. Otherwise, it maps to -1.

Figure 2: Binary Quantization. All negative values become -1, all others 1.
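
A minimal PyTorch sketch of that rule follows, assuming a batch of FP32 embeddings; the packing helper is our illustration of how the ±1 values end up costing one bit each in storage.

```python
import numpy as np
import torch

def binarize(embeddings: torch.Tensor) -> torch.Tensor:
    """Map each FP32 component to -1 or +1: negatives -> -1, everything else -> +1."""
    out = torch.ones_like(embeddings)   # like torch.sign(), with exact zeros folded into +1
    out[embeddings < 0] = -1.0
    return out

def pack_for_storage(binary: torch.Tensor) -> np.ndarray:
    """Pack ±1 values into bits: a 2048-dimensional vector becomes 256 bytes."""
    bits = (binary > 0).to(torch.uint8).cpu().numpy()
    return np.packbits(bits, axis=-1)
```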

For the other quantization scenarios, we normalized the values to a range and then rounded to the nearest value allowed by the level of quantization. Embedding vectors consist of scalar values between -∞ and +∞ (or, in practice, really big positive and negative numbers). We use two numbers, max and min, to scale the values for quantization.

For trinary quantization, we take each vector component v and translate it as follows:

  • if v ≥ max, v becomes 1.
  • if v ≤ min, v becomes -1.
  • if min < v < max, v becomes 0.
Figure 3: Trinary Quantization. An interval is defined and values within it become 0. All lower values become -1, and all higher ones 1.
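
Here is that trinary rule as a short sketch, assuming max and min have already been chosen (how we choose them is covered later in this section).

```python
import torch

def quantize_trinary(v: torch.Tensor, min_val: float, max_val: float) -> torch.Tensor:
    """Map each component to -1, 0, or +1 using the thresholds described above."""
    out = torch.zeros_like(v)      # anything strictly between min_val and max_val stays 0
    out[v >= max_val] = 1.0
    out[v <= min_val] = -1.0
    return out
```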

For 4-bit integers:

  • if v ≥ max, v becomes 7.
  • if v ≤ min, v becomes -8.
  • if min < v < max, v becomes 16 * (v - min) / (max - min) - 8, rounded to the nearest integer. This scales the value to the range [-8, 7].
Figure 4: 4-bit Quantization. An interval is defined and all values are normalized to the defined range [-8,7].

For 8-bit integers:

  • if v ≥ max, v becomes 127.
  • if v ≤ min, v becomes -128.
  • if min < v < max, v becomes 256 * (v - min) / (max - min) - 128, rounded to the nearest integer. This scales the value to the range [-128, 127].
Figure 5: 8-bit Quantization. An interval is defined and all values are normalized to the defined range [-128,127].
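
The 4-bit and 8-bit rules are the same affine mapping with different target ranges, so a single sketch covers both. This is our formulation of the formulas above, not code from our actual pipeline.

```python
import torch

def quantize_int(v: torch.Tensor, min_val: float, max_val: float, bits: int) -> torch.Tensor:
    """Affine quantization to signed integers: bits=4 -> [-8, 7], bits=8 -> [-128, 127]."""
    levels = 2 ** bits                           # 16 for 4-bit, 256 for 8-bit
    low, high = -(levels // 2), levels // 2 - 1
    scaled = levels * (v - min_val) / (max_val - min_val) + low
    # Rounding gives the "nearest integer" step; clamping handles values outside [min, max].
    return torch.round(scaled).clamp(low, high)
```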

To calculate max and min, we used two approaches:

  • Min/Max — We processed our data in batches, and for each batch, we identified the highest and lowest vector components, setting max to the highest and min to the lowest.
  • Rolling averaging over batches — For each batch, we calculated the average and standard deviation of the vector components. We maintained a moving average of both as we processed the batches. If avg is the current moving average of the batch averages, and std is the current moving average of the standard deviations, then for each batch:

max = avg + std
min = avg - std
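
Both strategies can be sketched as follows. The momentum constant used for the moving averages is an illustrative choice, not a value we are reporting from the experiments.

```python
import torch

def minmax_bounds(batch: torch.Tensor) -> tuple[float, float]:
    """Min/Max strategy: take min and max straight from the current batch."""
    return batch.min().item(), batch.max().item()

class RollingBounds:
    """Rolling-average strategy: maintain moving averages of the batch mean and std."""

    def __init__(self, momentum: float = 0.1):   # 0.1 is an assumed momentum
        self.momentum = momentum
        self.avg = None
        self.std = None

    def update(self, batch: torch.Tensor) -> tuple[float, float]:
        mean, std = batch.mean().item(), batch.std().item()
        if self.avg is None:                      # first batch initializes the averages
            self.avg, self.std = mean, std
        else:
            m = self.momentum
            self.avg = (1 - m) * self.avg + m * mean
            self.std = (1 - m) * self.std + m * std
        return self.avg - self.std, self.avg + self.std   # (min, max) = (avg - std, avg + std)
```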

QAT Fine-Tuning

For the PTQ experiments, we used the model as is and quantized the embeddings it produced using the methods described above.

For the Output QAT, we fine-tuned the model using straight-through estimation. This means that we reverse the quantization process, restoring the full precision to the values, before calculating the loss (i.e., error), and then we use that loss metric to fine-tune the model.

We fine-tuned in each case for 10,000 steps, saving a checkpoint every 500 steps. We then retained the checkpoint with the highest score on the NanoBEIR benchmark.
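
In PyTorch, straight-through estimation is typically implemented with a small trick like the sketch below: quantized values are used in the forward pass, while gradients flow back as if the quantization step were the identity. The loss function named in the usage comment is a placeholder, not our actual training objective.

```python
import torch

def straight_through(v: torch.Tensor, quantize) -> torch.Tensor:
    """Forward pass sees quantize(v); backward pass treats the step as the identity."""
    q = quantize(v)
    # detach() blocks gradients through the non-differentiable quantization,
    # so the loss gradient reaches the model weights unchanged.
    return v + (q - v).detach()

# Hypothetical use in one fine-tuning step:
# emb = model(batch)                          # full-precision embeddings
# emb_q = straight_through(emb, binarize)     # quantized forward, identity backward
# loss = retrieval_loss(emb_q, labels)        # placeholder loss
# loss.backward()
```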

Asymmetric Quantization

PTQ and Output QAT reduce the size of the embedding vectors, but don’t reduce model size or inference speed; all the savings are in the size of the stored document embeddings and retrieval speed.

We therefore tested both quantizing the query vectors and leaving them unquantized at retrieval time; either way, the size of the stored document embeddings is the same.
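
In the binary case, asymmetric retrieval looks like the following sketch: documents are stored as ±1 vectors, and the query is either quantized the same way or left in full precision. The function names are ours.

```python
import torch

def scores_symmetric(query: torch.Tensor, docs_pm1: torch.Tensor) -> torch.Tensor:
    """Quantized queries: binarize the query, then score it against ±1 document vectors."""
    q = torch.ones_like(query)
    q[query < 0] = -1.0
    return docs_pm1 @ q

def scores_asymmetric(query: torch.Tensor, docs_pm1: torch.Tensor) -> torch.Tensor:
    """Unquantized queries: keep the FP32 query, preserving more of its information."""
    return docs_pm1 @ (query / query.norm())
```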

Results

We tested nine conditions in total, summarized in the tables below:

| Condition Name | Fine-Tuning | Quantization Level | Scaling Strategy | Quantized Queries |
|---|---|---|---|---|
| Baseline | ❌ | n/a | n/a | n/a |
| PTQ Binary | ❌ | Binary | n/a | ✓ |
| PTQ Binary Docs Only | ❌ | Binary | n/a | ❌ |
| QAT Binary | ✓ | Binary | n/a | ✓ |
| QAT Binary Docs Only | ✓ | Binary | n/a | ❌ |
| QAT Trinary | ✓ | Trinary | Rolling Average | ✓ |
| QAT 4-bits | ✓ | 4-bits | Rolling Average | ✓ |
| QAT 8-bits | ✓ | 8-bits | Rolling Average | ✓ |
| QAT 8-bits Min/Max | ✓ | 8-bits | Min/Max | ✓ |

Table 2: Experimental conditions

| Condition Name | Average Score | Difference from Baseline |
|---|---|---|
| Baseline | 60.10 | n/a |
| PTQ Binary | 58.33 | -1.78 |
| PTQ Binary Docs Only | 59.08 | -1.02 |
| QAT Binary | 59.22 | -0.89 |
| QAT Binary Docs Only | 60.81 | +0.70 |
| QAT Trinary | 59.49 | -0.62 |
| QAT 4-bits | 61.73 | +1.62 |
| QAT 8-bits | 61.67 | +1.56 |
| QAT 8-bits Min/Max | 61.29 | +1.19 |

Table 3: Average score (in % correct) for each condition over the twelve NanoBEIR benchmarks.

You can see from the table above that fine-tuning for quantization improves scores. The only difference between the PTQ Binary and QAT Binary conditions is fine-tuning, and the difference in score is significant. Similarly, we see an almost 2% improvement in scores between the PTQ Binary Docs Only and QAT Binary Docs Only conditions, which are only distinguished by the same fine-tuning.

Unsurprisingly, we also see that scores generally improve the less we quantize, with 4-bit quantization scoring better than trinary, and trinary better than binary. However, going further to 8-bits doesn’t appear to have improved anything.

We only tested leaving queries unquantized in binary cases, but this appears to improve performance.

Finally, our tests suggest that the rolling average scaling method outperforms the simplistic min/max approach.

Conclusion

Quantization offers important operational advantages for embedding models, significantly reducing the size of embedding vectors and accelerating information retrieval. While simple post-training quantization (PTQ) provides immediate benefits in memory and storage, our experiments demonstrate that quantization-aware training (QAT) significantly mitigates the otherwise inevitable precision losses: fine-tuning consistently yielded better scores.

The degree of quantization directly impacts performance, which is what you would expect from a method based on reducing the precision of values. Less aggressive quantization (e.g., 4-bit) generally outperforms more aggressive methods (e.g., binary), but surprisingly, there was no significant difference in performance between 8-bit and 4-bit quantization. It would seem that until you reach some threshold of imprecision, there is little difference between greater and lesser quantization.

Scaling strategies are also significant, with the rolling average method showing superior results compared to a fixed min/max approach. Using scaling values that are relative to the data appears to work significantly better and merits further exploration.

Quantization can get you more out of your embedding models for less. Although this article doesn’t cover every option for quantization, it explores two that are easily accessible, and they have real benefits to offer. We’re working to refine and improve quantization strategies to further reduce users' costs, and expect to release binary support for jina-embeddings-v4 in the near future.
