Tech blog
June 30, 2025

Quantization-Aware Training of jina-embeddings-v4

Quantization gives you smaller embeddings. We show that fine-tuning for quantization can make them lossless as well.
Andrei Ungureanu, Scott Martens, Bo Wang • 8 minutes read

Quantization is a widely used technique for addressing scaling problems in AI. The name makes it sound complicated, but it’s just rounding numbers off to make them take up less space. This means smaller embedding vectors that take up less memory and storage space, and faster information retrieval because it takes less time to compare vectors. Quantization is a purely numerical technique that doesn’t care what kind of data your model processes or what use cases you have, so it can bring improvements without requiring lots of expensive domain knowledge.

One might expect the good old cliché about trade-offs to apply: nothing comes for free, and we must sacrifice some precision. In this article, we’ll show you a way to make quantization lossless via quantization-aware training (QAT). This technique is used in jina-embeddings-v4 to provide the smaller embeddings required in space-critical applications.

Overview of Quantization Techniques

Model quantization usually means one of four things:

  • Post-training quantization (PTQ)
  • Training for quantized embedding outputs (Output QAT)
  • Training for fully quantized models (Full QAT)
  • Distilling a new quantized model from an existing unquantized one

Post-training quantization (PTQ) accepts the trained embedding model as is and doesn’t modify it in any way. It’s just a matter of throwing away the least significant digits of the floating point values produced by the model. We just round the numbers off, and sometimes scale them to a range.

Output QAT means fine-tuning the embedding model to produce optimal reduced-precision vectors. This means modifying the model, but it doesn’t change the precision of the model’s weights, and therefore doesn’t reduce its size. Just the output vector size is reduced.

Full QAT begins with a fully trained, full-precision model and lowers the precision of the model weights, then fine-tunes the performance of this modified model. This produces a significantly smaller model as well as smaller embeddings, at the price of doing some fine-tuning.

Distillation is the process of training a new model to match the performance of an existing one. This means creating a new model that’s designed from scratch as quantized, and then using the existing model to generate as much training data as needed to train it until it performs as closely as possible to the existing model.

The benefits of these four approaches are summarized in the table below:

| Approach | More Compact Embeddings? | Requires Training? | Model Compression? | Faster Inference? |
|---|---|---|---|---|
| PTQ | ✓ | ❌ | ❌ | ❌ |
| Output QAT | ✓ | ✓ | ❌ | ❌ |
| Full QAT | ✓ | ✓ | ✓ | ✓ |
| Distillation (to a smaller model) | ✓ | ✓ | ✓ | ✓ |

All four produce more compact embeddings, but other than PTQ, all require some additional training, while only Full QAT and Distillation produce new, faster models. Full QAT and Distillation are much more expensive to implement because they require a great deal more training than Output QAT.

In this article, we’re only going to look at PTQ and Output QAT, which don’t change the size or speed of the embedding model.

Experimental Setup

For these experiments, our baseline model is jina-embeddings-v4 with the retrieval adapter, which produces 32-bit-precision floating-point (FP32) vectors in 2048 dimensions. Each embedding is therefore 8192 bytes, or 8 kB, in size.

We studied several experimental conditions using query-document retrieval benchmark tasks from the NanoBEIR benchmark suite. The retrieval process uses cosine similarity between vectors to find and rank the documents that best match queries.

  • Baseline — The performance of jina-embeddings-v4 embedding vectors without any quantization. These experiments all used a beta version of the model, and the release performance is somewhat better.
  • PTQ — We quantized the output vectors to binary vectors without changing the model.
  • Output QAT — We quantized the output vectors and applied fine-tuning to the retrieval adapter to improve its performance under quantized conditions.
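
For reference, the cosine-similarity ranking used in these benchmarks can be sketched in a few lines; the function and variable names below are ours, not part of the NanoBEIR tooling.

```python
import numpy as np

def rank_documents(query_vec: np.ndarray, doc_vecs: np.ndarray) -> np.ndarray:
    """Rank documents by cosine similarity to a query (best match first)."""
    # Normalizing both sides turns the dot product into cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                   # one similarity score per document
    return np.argsort(-scores)       # indices sorted from most to least similar
```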

Quantization Levels

Figure 1: Comparison of post-quantization embedding sizes.

We experimented with four different levels of quantization.

  • 8-bit integers — FP32 values are reduced to integers in the range -128 to 127, shrinking embeddings 4-fold to 2048 bytes.
  • 4-bit integers — Same as for 8-bit integers, but we map to the range from -8 to 7, reducing vector sizes by a factor of 8, to 1024 bytes.
  • Trinary Quantization — All values are mapped to one of three values: -1, 0, 1. Optimally stored, this takes about 1.6 bits per dimension, reducing the size of embedding vectors roughly 20-fold to approximately 410 bytes.
  • Binary Quantization — We convert FP32 scalar values to one bit each using the torch.sign function, which yields just two values, each taking one bit to store. This reduces 2048-dimensional embedding vectors from 8192 bytes to 256 bytes, a 32-fold reduction.
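
The storage arithmetic above follows directly from the dimension count; here is a quick back-of-the-envelope check in Python, assuming 2048-dimensional vectors as produced by jina-embeddings-v4.

```python
import math

DIMS = 2048
sizes_in_bytes = {
    "fp32":    DIMS * 4,                   # 8192 bytes (baseline)
    "int8":    DIMS * 1,                   # 2048 bytes, 4-fold smaller
    "int4":    DIMS // 2,                  # 1024 bytes, 8-fold smaller
    "trinary": math.ceil(DIMS * 1.6 / 8),  # ~410 bytes at ~1.6 bits/dim (log2(3) ≈ 1.58)
    "binary":  DIMS // 8,                  #  256 bytes, 32-fold smaller
}
print(sizes_in_bytes)
```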

Scaling

For binary quantization, the rule is very simple: if a vector value is zero or positive, it maps to 1. Otherwise, it maps to -1.

Figure 2: Binary Quantization. All negative values become -1, all others 1.
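
A minimal PyTorch sketch of that rule follows, assuming a batch of FP32 embeddings; the packing helper is our illustration of how the ±1 values end up costing one bit each in storage.

```python
import numpy as np
import torch

def binarize(embeddings: torch.Tensor) -> torch.Tensor:
    """Map each FP32 component to -1 or +1: negatives -> -1, everything else -> +1."""
    out = torch.ones_like(embeddings)   # like torch.sign(), with exact zeros folded into +1
    out[embeddings < 0] = -1.0
    return out

def pack_for_storage(binary: torch.Tensor) -> np.ndarray:
    """Pack ±1 values into bits: a 2048-dimensional vector becomes 256 bytes."""
    bits = (binary > 0).to(torch.uint8).cpu().numpy()
    return np.packbits(bits, axis=-1)
```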

For the other quantization scenarios, we normalized the values to a range and then rounded to the nearest value allowed by the level of quantization. Embedding vectors consist of scalar values between -∞ and +∞ (or, in practice, really big positive and negative numbers). We use two numbers, max and min, to scale the values for quantization.

For trinary quantization, we take each vector component v and translate it as follows:

  • if v ≥ max, v becomes 1.
  • if v ≤ min, v becomes -1.
  • if min < v < max, v becomes 0.
Figure 3: Trinary Quantization. An interval is defined and values within it become 0. All lower values become -1, and all higher ones 1.
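
Here is that trinary rule as a short sketch, assuming max and min have already been chosen (how we choose them is covered later in this section).

```python
import torch

def quantize_trinary(v: torch.Tensor, min_val: float, max_val: float) -> torch.Tensor:
    """Map each component to -1, 0, or +1 using the thresholds described above."""
    out = torch.zeros_like(v)      # anything strictly between min_val and max_val stays 0
    out[v >= max_val] = 1.0
    out[v <= min_val] = -1.0
    return out
```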

For 4-bit integers:

  • if v ≥ max, v becomes 7.
  • if v ≤ min, v becomes -8.
  • if min < v < max, v becomes 16 * (v - min) / (max - min) - 8, rounded to the nearest integer. This scales the value to the range [-8, 7].
Figure 4: 4-bit Quantization. An interval is defined and all values are normalized to the defined range [-8,7].

For 8-bit integers:

  • if v ≥ max, v becomes 127.
  • if v ≤ min, v becomes -128.
  • if min < v < max, v becomes 256 * (v - min) / (max - min) - 128, rounded to the nearest integer. This scales the value to the range [-128, 127].
Figure 5: 8-bit Quantization. An interval is defined and all values are normalized to the defined range [-128,127].
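
The 4-bit and 8-bit rules are the same affine mapping with different target ranges, so a single sketch covers both. This is our formulation of the formulas above, not code from our actual pipeline.

```python
import torch

def quantize_int(v: torch.Tensor, min_val: float, max_val: float, bits: int) -> torch.Tensor:
    """Affine quantization to signed integers: bits=4 -> [-8, 7], bits=8 -> [-128, 127]."""
    levels = 2 ** bits                           # 16 for 4-bit, 256 for 8-bit
    low, high = -(levels // 2), levels // 2 - 1
    scaled = levels * (v - min_val) / (max_val - min_val) + low
    # Rounding gives the "nearest integer" step; clamping handles values outside [min, max].
    return torch.round(scaled).clamp(low, high)
```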

To calculate max and min, we used two approaches:

  • Min/Max — We processed our data in batches, and for each batch, we identified the highest and lowest vector components, setting max to the highest and min to the lowest.
  • Rolling averaging over batches — For each batch, we calculated the average and standard deviation of the vector components. We maintained a moving average of both as we processed the batches. If avg is the current moving average of the batch averages, and std is the current moving average of the standard deviations, then for each batch:

max = avg + std
min = avg - std
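
Both strategies can be sketched as follows. The momentum constant used for the moving averages is an illustrative choice, not a value we are reporting from the experiments.

```python
import torch

def minmax_bounds(batch: torch.Tensor) -> tuple[float, float]:
    """Min/Max strategy: take min and max straight from the current batch."""
    return batch.min().item(), batch.max().item()

class RollingBounds:
    """Rolling-average strategy: maintain moving averages of the batch mean and std."""

    def __init__(self, momentum: float = 0.1):   # 0.1 is an assumed momentum
        self.momentum = momentum
        self.avg = None
        self.std = None

    def update(self, batch: torch.Tensor) -> tuple[float, float]:
        mean, std = batch.mean().item(), batch.std().item()
        if self.avg is None:                      # first batch initializes the averages
            self.avg, self.std = mean, std
        else:
            m = self.momentum
            self.avg = (1 - m) * self.avg + m * mean
            self.std = (1 - m) * self.std + m * std
        return self.avg - self.std, self.avg + self.std   # (min, max) = (avg - std, avg + std)
```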

QAT Fine-Tuning

For the PTQ experiments, we used the model as is and quantized the embeddings it produced using the methods described above.

For the Output QAT, we fine-tuned the model using straight-through estimation. This means that we reverse the quantization process, restoring the full precision to the values, before calculating the loss (i.e., error), and then we use that loss metric to fine-tune the model.

We fine-tuned in each case for 10,000 steps, saving a checkpoint every 500 steps. We then retained the checkpoint with the highest score on the NanoBEIR benchmark.
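
In PyTorch, straight-through estimation is typically implemented with a small trick like the sketch below: quantized values are used in the forward pass, while gradients flow back as if the quantization step were the identity. The loss function named in the usage comment is a placeholder, not our actual training objective.

```python
import torch

def straight_through(v: torch.Tensor, quantize) -> torch.Tensor:
    """Forward pass sees quantize(v); backward pass treats the step as the identity."""
    q = quantize(v)
    # detach() blocks gradients through the non-differentiable quantization,
    # so the loss gradient reaches the model weights unchanged.
    return v + (q - v).detach()

# Hypothetical use in one fine-tuning step:
# emb = model(batch)                          # full-precision embeddings
# emb_q = straight_through(emb, binarize)     # quantized forward, identity backward
# loss = retrieval_loss(emb_q, labels)        # placeholder loss
# loss.backward()
```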

Asymmetric Quantization

PTQ and Output QAT reduce the size of the embedding vectors, but don’t reduce model size or inference speed; all the savings are in the size of the stored document embeddings and retrieval speed.

We therefore tested both quantizing the query vectors and leaving them unquantized at retrieval time; either way, the size of the stored document embeddings is the same.
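
In the binary case, asymmetric retrieval looks like the following sketch: documents are stored as ±1 vectors, and the query is either quantized the same way or left in full precision. The function names are ours.

```python
import torch

def scores_symmetric(query: torch.Tensor, docs_pm1: torch.Tensor) -> torch.Tensor:
    """Quantized queries: binarize the query, then score it against ±1 document vectors."""
    q = torch.ones_like(query)
    q[query < 0] = -1.0
    return docs_pm1 @ q

def scores_asymmetric(query: torch.Tensor, docs_pm1: torch.Tensor) -> torch.Tensor:
    """Unquantized queries: keep the FP32 query, preserving more of its information."""
    return docs_pm1 @ (query / query.norm())
```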

Results

We tested nine conditions in total, summarized in the tables below:

| Condition Name | Fine-Tuning | Quantization Level | Scaling Strategy | Quantized Queries |
|---|---|---|---|---|
| Baseline | ❌ | n/a | n/a | n/a |
| PTQ Binary | ❌ | Binary | n/a | ✓ |
| PTQ Binary Docs Only | ❌ | Binary | n/a | ❌ |
| QAT Binary | ✓ | Binary | n/a | ✓ |
| QAT Binary Docs Only | ✓ | Binary | n/a | ❌ |
| QAT Trinary | ✓ | Trinary | Rolling Average | ✓ |
| QAT 4-bits | ✓ | 4-bits | Rolling Average | ✓ |
| QAT 8-bits | ✓ | 8-bits | Rolling Average | ✓ |
| QAT 8-bits Min/Max | ✓ | 8-bits | Min/Max | ✓ |

Table 2: Experimental conditions

| Condition Name | Average Score | Difference from Baseline |
|---|---|---|
| Baseline | 60.10 | n/a |
| PTQ Binary | 58.33 | -1.78 |
| PTQ Binary Docs Only | 59.08 | -1.02 |
| QAT Binary | 59.22 | -0.89 |
| QAT Binary Docs Only | 60.81 | +0.70 |
| QAT Trinary | 59.49 | -0.62 |
| QAT 4-bits | 61.73 | +1.62 |
| QAT 8-bits | 61.67 | +1.56 |
| QAT 8-bits Min/Max | 61.29 | +1.19 |

Table 3: Average score (in % correct) for each condition over the twelve NanoBEIR benchmarks.

You can see from the table above that fine-tuning for quantization improves scores. The only difference between the PTQ Binary and QAT Binary conditions is fine-tuning, and the difference in score is significant. Similarly, we see an almost 2% improvement in scores between the PTQ Binary Docs Only and QAT Binary Docs Only conditions, which are only distinguished by the same fine-tuning.

Unsurprisingly, we also see that scores generally improve the less we quantize, with 4-bit quantization scoring better than trinary, and trinary better than binary. However, going further to 8-bits doesn’t appear to have improved anything.

We only tested leaving queries unquantized in binary cases, but this appears to improve performance.

Finally, our tests suggest that the rolling average scaling method outperforms the simplistic min/max approach.

Conclusion

Quantization offers important operational advantages for embedding models, significantly reducing the size of embedding vectors and accelerating information retrieval. While simple post-training quantization (PTQ) provides immediate benefits in memory and storage, our experiments demonstrate that quantization-aware training (QAT) significantly mitigates the otherwise inevitable precision losses: fine-tuning consistently yielded better scores.

The degree of quantization directly impacts performance, which is what you would expect from a method based on reducing the precision of values. Less aggressive quantization (e.g., 4-bit) generally outperforms more aggressive methods (e.g., binary), but surprisingly, there was no significant difference in performance between 8-bit and 4-bit quantization. It would seem that until you reach some threshold of imprecision, there is little difference between greater and lesser quantization.

Scaling strategies are also significant, with the rolling average method showing superior results compared to a fixed min/max approach. Using scaling values that are relative to the data appears to work significantly better and merits further exploration.

Quantization can get you more out of your embedding models for less. Although this article doesn’t cover every option for quantization, it explores two that are easily accessible, and they have real benefits to offer. We’re working to refine and improve quantization strategies to further reduce users' costs, and expect to release binary support for jina-embeddings-v4 in the near future.
