Change log
- Feb 10: Add Elephant in the room.
- Feb 09: Move back from al-folio to Quarto :).
For a long time, traditional ML models focused on only one modality, such as text, image, or audio. Humans, in contrast, deal with multiple modalities at the same time: the human experience is multimodal. We see, hear, smell, and touch. The Internet and many modern applications are also multimodal. We rarely see webpages that contain only text or only images, and we hardly watch any video without sound. This explosion of Internet data provides vast resources that we can use to train advanced AI models.
While the ML community has considered many different modalities, this blog post focuses on vision and text. Vision-Language Models (VLMs) can be categorized in several ways, for example by their training paradigms, input/output modalities, or pre-training objectives. Following Bordes et al. (2024) [1], one way of categorizing VLMs is by their training paradigms: 1) contrastive; 2) masking; 3) generative; and 4) using pre-trained backbones. These paradigms are not mutually exclusive; there has been work that combines two or more of them. Among the most popular VLMs, VILA [2] and its variants are a family of models that focus on efficiency while maintaining impressive performance.
VILA
Given the explosion of LLMs in recent years, it is natural to extend them to work with the visual modality (e.g., images and videos). VILA [2], a VLM introduced by NVIDIA and MIT, follows this approach by integrating visual and textual information, and it is pre-trained according to recipes derived from the authors' empirical observations. At its core, VILA consists of three main components: 1) a visual encoder, 2) an LLM, and 3) a projector. The vision backbone is the CLIP visual encoder [3], and the LLM is Llama-2-7B or Llama-2-13B [4]. The projector maps image embeddings and text embeddings into the same space; it can be as simple as a linear layer or an MLP, or as complicated as a Transformer block. VILA takes visual and text inputs and generates text outputs. This section outlines several design choices, grounded in empirical observation, that improve the model's performance.
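To make the three-component layout concrete, below is a minimal PyTorch-style sketch of how a VILA-like model could wire the vision encoder, projector, and LLM together. The class, argument names, and dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    """Illustrative VILA-style wiring: vision encoder -> projector -> LLM (a sketch)."""

    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder             # e.g., a CLIP ViT returning patch features
        self.projector = nn.Linear(vision_dim, llm_dim)  # simplest choice: a single linear layer
        self.llm = llm                                   # a decoder-only LLM that accepts input embeddings

    def forward(self, images, text_embeds):
        # Encode images into patch tokens: (batch, n_patches, vision_dim)
        vision_tokens = self.vision_encoder(images)
        # Map visual tokens into the LLM embedding space
        vision_embeds = self.projector(vision_tokens)
        # Prepend visual embeddings to the text embeddings and run the LLM
        inputs = torch.cat([vision_embeds, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```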
Full fine-tuning or visual prompt tuning?
The two common methods used to augment LLMs with visual inputs are to 1) fine-tune the LLM together with the visual tokens (referred to as (full) fine-tuning) or 2) train only the visual input projector while freezing the weights of the LLM (often known as (visual) prompt tuning). The prompt-tuning approach treats the visual input projector as a way to “prompt” the LLM with visual embeddings while keeping the text-only model intact: the projector acts as a “translator” between the vision encoder and the text-only LLM, effectively serving as the “prompt.”
The choice between the two approaches comes with different tradeoffs. Fine-tuning is essential for the VLM to inherit the LLM's in-context learning ability, but it comes with the risk of catastrophic forgetting, where the model's performance degrades on text-only tasks. Prompt tuning avoids this problem, but it leads to poorer performance and weaker in-context learning capabilities.
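In code, the difference between the two regimes is simply which parameter groups receive gradients. Here is a hedged PyTorch-style sketch, assuming a model with `vision_encoder`, `projector`, and `llm` submodules as in the sketch above:

```python
def set_trainable(model, mode="full_finetune"):
    """Toggle which parts of a VLM receive gradients; a sketch, not the VILA training code."""
    for p in model.parameters():
        p.requires_grad = False

    if mode == "prompt_tuning":
        # Only the projector is trained; the text-only LLM stays frozen,
        # so its language ability is untouched.
        for p in model.projector.parameters():
            p.requires_grad = True
    elif mode == "full_finetune":
        # Projector and LLM are both updated; risks catastrophic forgetting
        # on text-only tasks but better inherits in-context learning.
        for p in model.projector.parameters():
            p.requires_grad = True
        for p in model.llm.parameters():
            p.requires_grad = True
    return model
```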
Setting | PreT | SFT | Projector | OKVQA (0s) | OKVQA (4s) | TextVQA (0s) | TextVQA (4s) | COCO (0s) | COCO (4s) | Flickr (0s) | Flickr (4s) | Average (0s) | Average (4s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
a) | ✗ | ✗ | Transformer | 10.4 | 19.2 | 14.8 | 23.1 | 17.4 | 60.2 | 11.0 | 47.4 | 13.4 | 37.5 |
b) | ✗ | ✓ | Transformer | 47.1 | 47.7 | 37.2 | 36.6 | 109.4 | 88.0 | 73.6 | 58.1 | 66.8 | 57.6 |
c) | ✓ | ✓ | Transformer | 44.8 | 49.8 | 38.5 | 38.8 | 112.3 | 113.5 | 71.5 | 72.9 | 66.8 | 68.8 |
d) | ✓ | ✓ | Linear | 45.2 | 50.3 | 39.7 | 40.2 | 115.7 | 118.5 | 74.2 | 74.7 | 68.7 | 70.9 |
Full fine-tuning is important. In the VILA paper, the authors experimented with different settings a)–d), where settings a) and b) use prompt tuning (the LLM is frozen) and settings c) and d) use fine-tuning (the LLM is updated). Table 1 shows that freezing the LLM during pre-training does not hurt zero-shot accuracy, but it does reduce the in-context learning ability. Another observation from this experiment is that using a simple projector (a linear layer) leads to better performance. The hypothesis is that a simple projector forces the LLM to learn more from the visual tokens, so the model generalizes better.
The benefit of the full fine-tuning approach, however, depends on the size of the downstream dataset, as shown by Cheng et al. (2024) in Figure 1.

Training Data
Training data is one of the most important aspects of training VLMs. Common practices for preprocessing image-text data are to 1) remove duplicates, 2) balance the labels, 3) prune pairs whose captions do not match the image, and 4) improve the captions with other LLMs.
The two most common types of data used for training VLMs include interleaved image-text data and image-text pair data (i.e. the images and their corresponding captions). In the VILA paper, the authors train the model with different types of data to analyze their influence on the model’s performance.
Interleaved or image-text pair data?
The VILA paper shows that interleaved image-text data like MMC4 [6] is more effective for training VLMs because it prevents catastrophic forgetting, which would degrade the model's text capability. For example, pre-training on COYO [7], an image-text pair dataset, degrades text-only accuracy (MMLU) by 17.2%, while training on the interleaved MMC4 dataset degrades text-only accuracy by only 5%. This degradation gap shrinks further when using a larger LLM. A similar conclusion appears in the Flamingo paper [8], where interleaved data is found to better preserve the model's text-only capability, as shown in Table 2.
Ablated Setting | Step time | COCO CIDEr ↑ | OKVQA top1 ↑ | VQAv2 top1 ↑ | MSVDQA top1 ↑ | VATEX CIDEr ↑ | Overall score ↑ |
---|---|---|---|---|---|---|---|
Flamingo-3B model | 1.74s | 86.5 | 42.1 | 55.8 | 36.3 | 53.4 | 70.7 |
w/o Video-Text pairs | 1.42s | 84.2 | 43.0 | 53.9 | 34.5 | 46.0 | 67.3 |
w/o Image-Text pairs | 0.95s | 66.3 | 39.2 | 51.6 | 32.0 | 41.6 | 60.9 |
Image-Text pairs → LAION | 1.74s | 79.5 | 41.4 | 53.5 | 33.9 | 47.6 | 66.4 |
w/o M3W | 1.02s | 54.1 | 36.5 | 52.7 | 31.4 | 23.5 | 53.4 |
Given this observation, the authors propose joint supervised fine-tuning (joint SFT), which blends text-only instruction data with visual instruction-tuning data to preserve the model's textual ability. According to the ablation study, this training technique reduces the degradation in text-only accuracy while improving visual-language accuracy at the same time, as shown in Table 3.
PT Data | SFT Data | VLM Acc. (avg, 0-shot / 4-shot) | MMLU Acc. |
---|---|---|---|
Llama-2 | - | - | 46.0% |
MMC4 | Visual | 68.7% / 70.9% | 40.7% (-5.3%) |
MMC4+COYO | Visual | 69.0% / 71.3% | 40.2% (-5.8%) |
Llama-2 | Text | - | 51.2% |
MMC4 | Vis.+Text | 71.0% / 72.1% | 51.4% (+0.2%) |
MMC4+COYO | Vis.+Text | 72.3% / 73.6% | 50.9% (-0.3%) |
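Conceptually, joint SFT just mixes text-only instruction samples into the visual instruction-tuning set before shuffling. A small Python sketch of that blending step follows; the mixing ratio is an illustrative assumption, not the value used in the paper.

```python
import random

def blend_sft_data(visual_sft, text_sft, text_ratio=0.2, seed=0):
    """Mix text-only instruction samples into the visual SFT set.

    A sketch of the joint-SFT idea; the actual blending ratio and sampling
    strategy used for VILA may differ."""
    rng = random.Random(seed)
    n_text = int(len(visual_sft) * text_ratio)
    mixed = list(visual_sft) + rng.sample(list(text_sft), k=min(n_text, len(text_sft)))
    rng.shuffle(mixed)
    return mixed
```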
Image resolution or number of tokens?
TL;DR: Image resolution
In VILA experiments, as shown in Table 4, increasing the image resolution from 224 to 336 raises TextVQA accuracy from 41.6% to 49.8%. The tradeoff is that a higher-resolution image produces more tokens per image (336×336 corresponds to 576 tokens/image) and a higher computational cost, which is even more problematic for video understanding given the limited context length.
Resolution | Projector | #Tokens | OKVQA | TextVQA | COCO |
---|---|---|---|---|---|
224 | linear | 256 | 49.9% | 41.6% | 116.0 |
336 | linear | 576 | 49.7% | 49.8% | 117.7 |
336 | downsample | 144 | 49.3% | 45.6% | 115.7 |
However, the raw resolution matters more than the number of visual tokens per image. The authors' solution is to reduce the number of visual tokens with a downsample projector that concatenates every 2×2 group of tokens into a single token and uses a linear layer to fuse the information. This downsample projector improves TextVQA accuracy from 41.6% to 45.6%, compared to 49.8% when naively feeding all tokens of the 336×336 image.
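The downsample projector described above can be sketched in a few lines of PyTorch: group each 2×2 neighborhood of visual tokens, concatenate along the channel dimension, and fuse with a linear layer. The dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DownsampleProjector(nn.Module):
    """Concatenate each 2x2 neighborhood of visual tokens and fuse with a linear layer.

    A sketch of the idea described above; layer sizes are illustrative."""

    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.fuse = nn.Linear(vision_dim * 4, llm_dim)

    def forward(self, tokens):
        # tokens: (batch, H*W, C) with H == W (e.g., 24x24 = 576 tokens for a 336x336 input)
        b, n, c = tokens.shape
        h = w = int(n ** 0.5)
        x = tokens.view(b, h, w, c)
        # Group 2x2 spatial neighbors -> (batch, H/2, W/2, 4*C)
        x = x.view(b, h // 2, 2, w // 2, 2, c).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(b, (h // 2) * (w // 2), 4 * c)
        return self.fuse(x)  # 576 tokens -> 144 tokens
```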
Data quality or quantity?
TL;DR: Data quality
In VILA experiments, scaling up the dataset size from 25M to 50M samples does not give much benefit, but adding 1M high-quality samples improves the benchmark results. Data quality is quantified using CLIPScore [9].
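As a rough illustration of CLIPScore-style filtering, the sketch below keeps only image-text pairs whose CLIP image-text cosine similarity exceeds a threshold, assuming the Hugging Face `transformers` CLIP interface; the threshold value is an arbitrary placeholder, not the paper's setting.

```python
import torch

def clipscore_filter(pairs, clip_model, clip_processor, threshold=0.25):
    """Keep image-text pairs whose CLIP image-text cosine similarity is high.

    A sketch of CLIPScore-style quality filtering; the threshold is illustrative."""
    kept = []
    for image, caption in pairs:
        inputs = clip_processor(text=[caption], images=[image],
                                return_tensors="pt", padding=True)
        with torch.no_grad():
            out = clip_model(**inputs)
        # Normalize the projected embeddings and take their dot product.
        img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        score = (img * txt).sum(-1).item()
        if score >= threshold:
            kept.append((image, caption, score))
    return kept
```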
Dataset | Dataset size | # samples seen | Architecture | Train compute (MACs) | ImageNet accuracy |
---|---|---|---|---|---|
OpenAI’s WIT | 0.4B | 13B | ViT-L/14 | 1.1 × 10²¹ | 75.5% |
LAION-400M | 0.4B | 13B | ViT-L/14 | 1.1 × 10²¹ | 72.8% |
LAION-2B | 2.3B | 13B | ViT-L/14 | 1.1 × 10²¹ | 73.1% |
LAION-2B | 2.3B | 34B | ViT-H/14 | 6.5 × 10²¹ | 78.0% |
LAION-2B | 2.3B | 34B | ViT-g/14 | 9.9 × 10²¹ | 78.5% |
DataComp-1B | 1.4B | 13B | ViT-L/14 | 1.1 × 10²¹ | 79.2% |
Other benchmarks such as DataComp [10] reach the same conclusion: data quality matters more than quantity when training VLMs. Table 5, from the DataComp paper, shows the effect of data quality on the ImageNet classification task.
VILA\(^2\)
Data curation for training vision-language models (VLMs) is both costly and challenging, often limited by closed-source model APIs or reliance on noisy web-crawled data. When data is sourced from the Internet, its quality cannot be reliably ensured. One problem with Internet-crawled public image-text datasets such as MMC4 [6] and COYO [7] is that the captions are often short, and the human descriptions of the images often lack the semantic detail that the LLM needs. The table below shows the average caption length: roughly, the longer the captions, the better the model performs on the VQAv2 benchmark. Note that the MMC4 dataset is not re-captioned because of its interleaved nature.
 | MMC4 | COYO | COYO-VILA₁ | COYO-VILA₂ | COYO-VILA₃ | COYO-VILA₄ |
---|---|---|---|---|---|---|
Avg #Words | 17.1 ± 25.0 | 11.9 ± 9.0 | 101.2 ± 43.0 | 117.1 ± 49.1 | 126.77 ± 50.10 | 125.9 ± 51.2 |
VQAv2 | N.A. | 61.6 | 62.5 | 63.5 | 63.7 | 63.6 |
Motivated by this, VILA\(^2\) introduces a VLM augmentation scheme with two steps, self-augment and specialist-augment, to improve data quality. At a high level, the captions of the images in the COYO dataset are first rewritten using the VILA model itself.
This approach to rewriting image captions is similar to the BLIP model [12], with two main differences. First, BLIP uses an additional filtering step to preprocess the data. Second, in VILA\(^2\), the caption-rewriting process is repeated three times, yielding the intermediate models dubbed VILA\(_1\), VILA\(_2\), VILA\(_3\), and so on; this process is called self-augmenting. In BLIP, by contrast, the captions are rewritten only once with a pre-trained image-grounded text decoder. Both BLIP and VILA\(^2\) blend human-annotated data with the synthetic data, and the VILA\(^2\) paper shows improved benchmark results from this mixture. One problem is that the model's performance is bounded by the captioner's ability: by the fourth self-augmentation iteration (see Table 7), performance no longer increases.
Method | VQAv2 | GQA | SQA-I | VQA-T | POPE | LLaVA-W | MM-Vet | MMMU |
---|---|---|---|---|---|---|---|---|
VILA₀ (baseline) | 79.6 | 62.4 | 68.4 | 61.6 | 84.2 | 68.4 | 34.5 | 33.8 |
VILA₁ | 80.0 | 63.2 | 71.0 | 62.5 | 84.6 | 72.2 | 34.8 | 35.8 |
VILA₂ | 80.8 | 63.5 | 71.5 | 63.5 | 84.7 | 71.2 | 34.9 | 35.2 |
VILA₃ | 80.7 | 63.5 | 71.5 | 63.7 | 84.5 | 72.3 | 35.5 | 35.5 |
VILA₄ | 80.7 | 63.4 | 71.2 | 63.6 | 85.0 | 72.3 | 35.5 | 35.0 |
VILA₃ + Spatial Specialist | 81.1 | 62.8 | 72.9 | 65.0 | 85.0 | 71.4 | 37.1 | 36.8 |
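At its core, the self-augment loop alternates between re-captioning the pre-training images with the current model and re-training on the rewritten data. The sketch below captures only that control flow; `recaption` and `pretrain` are placeholder callables standing in for the actual captioning prompt and training recipe.

```python
def self_augment(model, image_text_pairs, recaption, pretrain, rounds=3):
    """Sketch of the VILA^2 self-augment loop (not the authors' code).

    recaption(model, image) -> str : rewrite one caption with the current VLM
    pretrain(model, pairs)  -> model: re-run pre-training on the new captions
    """
    for r in range(rounds):
        # 1) Rewrite the short/noisy web captions with the current model (VILA_r).
        rewritten = [(img, recaption(model, img)) for img, _ in image_text_pairs]
        # 2) Blend with human-annotated data inside `pretrain` and train VILA_{r+1}.
        model = pretrain(model, rewritten)
    return model
```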
To push past this bound, the authors propose the specialist-augment method, which improves the model's capabilities by training it on specialist tasks, namely 1) spatial understanding, 2) grounding, and 3) OCR, using different datasets.
The table below shows the ablation results for the different specialist-augmentation methods. Overall, specialist augmentation improves the model's performance on several benchmarks.
Method | VQAv2 | GQA | VQA-T | POPE | SEED-I | MME | MM-Vet | MMMU |
---|---|---|---|---|---|---|---|---|
10% MMC4-core + 10% COYO-25M + ShareGPT4V-Pretrain | ||||||||
Original Caption | 81.4 | 63.8 | 65.2 | 85.5 | 70.6 | 1472.5 | 34.0 | 31.8 |
+Spatial Specialist | 81.9↑0.5 | 64.1↑0.3 | 66.0↑0.8 | 85.9↑0.4 | 71.8↑1.2 | 1476.5↑4.0 | 36.7↑2.7 | 32.5↑0.7 |
+OCR Specialist | 81.8↑0.4 | 64.0↑0.2 | 65.3↑0.1 | 86.4↑0.9 | 72.1↑1.5 | 1500.2↑27.7 | 34.3↑0.3 | 32.1↑0.3 |
+Grounding Specialist | 81.8↑0.4 | 64.0↑0.2 | 65.1↓0.1 | 86.7↑1.2 | 71.0↑0.4 | 1536.4↑63.9 | 37.5↑3.5 | 32.6↑0.8 |
MMC4-core + COYO-25M + ShareGPT4V-Pretrain | ||||||||
Original Caption | 82.2 | 63.9 | 66.7 | 86.5 | 71.2 | 1518.2 | 42.6 | 33.4 |
+All 3 Specialists | 83.0↑0.8 | 64.7↑0.8 | 70.9↑4.2 | 86.4↓0.1 | 74.0↑2.8 | 1656.2↑142 | 44.7↑2.1 | 35.8↑2.4 |
VILA\(^2\) is trained with the common three-stage align-pretrain-SFT paradigm: the projector is first initialized, vision-language pre-training is then performed on both interleaved image-text data and image-text pair data (with the rewritten captions), and finally visual instruction tuning is applied. VILA\(^2\) ranks among the most performant open-source VLMs; the figure below compares its performance across various benchmarks. The authors also show that data quality matters more than computation: training VILA\(^2\) for one additional epoch does not improve performance, while increasing data quality (by rewriting the captions using VILA itself) boosts performance significantly.
LongVILA
LongVILA [13] addresses the challenge of processing long contexts on both the pipeline and the system side. On the pipeline side, LongVILA introduces two new training stages: long-context extension and long video supervised fine-tuning. On the system side, it proposes long-context Multi-Modal Sequence Parallelism (MM-SP) to parallelize video training and inference and better utilize GPU compute.
LongVILA has five training stages: 1) multimodal alignment; 2) large-scale pre-training; 3) short supervised fine-tuning; 4) context extension for the LLM; and 5) long supervised fine-tuning.
The first three stages are similar to those of the VILA model:
- In stage 1, only the multimodal projector is trained.
- In stage 2, the vision encoder is frozen while the LLM and the multimodal projector are trained. The VILA\(^2\) recipe is applied to improve dataset quality.
- In stage 3, the model is fine-tuned on instruction-following data containing images and short videos.
- In stage 4, the LLM is further pre-trained on a text-only dataset to extend its context length, progressively increasing from 8,192 to 262,144 tokens.
- In stage 5, the proposed MM-SP system is used for long supervised fine-tuning to enhance the instruction-following ability; all parameters are trainable in this stage.
Training data
LongVILA adopts the VILA\(^2\) method to first re-caption the images in the COYO dataset using the VILA model itself. In the supervised fine-tuning stage, YouCook2 and ShareGPTVideo are used to train the model for short-video comprehension.
In addition, a new dataset of over 15,000 videos is introduced for long-video instruction following in stage 5 of training.
Multi-modal Sequence Parallelism
Previous methods for processing long contexts have two limitations: modality heterogeneity and networking heterogeneity. Regarding modality heterogeneity, text tokens are easy and fast to process, while visual tokens (from images or video frames) are more complex and require more resources; this can lead to imbalanced GPU workloads, where some GPUs do much more work than others. Regarding networking heterogeneity, GPUs within the same machine communicate very fast over NVLink, but GPUs across different machines communicate more slowly over InfiniBand; this speed difference can create bottlenecks that slow down the system.
The two key techniques adopted to address these challenges are 2D-Attention [14] and two-stage sharding, which distribute the work more evenly across GPUs. The 2D-attention mechanism optimizes communication by organizing GPUs into a two-dimensional mesh: intra-node communication splits the data along the attention-head dimension, while inter-node communication transfers partitioned sequence chunks between GPUs. This structure keeps computation efficient and balanced.
To handle modality heterogeneity, the two-stage sharding strategy first distributes images evenly across GPUs for the image encoding stage. Then, in the second stage, it aggregates vision and text inputs, distributing tokens evenly for language modeling. This approach balances workloads across GPUs and maintains consistency using techniques like dummy token padding. For inference, the authors implement a system where all GPUs operate concurrently, avoiding bottlenecks in traditional methods like HuggingFace Pipeline Parallelism. This allows for efficient use of memory and computation, enabling the handling of longer sequences and faster processing.
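A toy sketch of the two-stage idea is shown below: first spread the images of a batch evenly across ranks for vision encoding, then pool all resulting tokens and re-split them into equal chunks for the LLM. This ignores the real system's padding, head-dimension splitting, and communication details.

```python
def two_stage_shard(samples, world_size):
    """Toy illustration of two-stage sharding (not the MM-SP implementation).

    Each sample is assumed to be a dict with a "tokens" list produced by the
    vision encoder."""
    # Stage 1: round-robin image assignment so every rank encodes roughly the same count.
    image_shards = [samples[r::world_size] for r in range(world_size)]

    # Pretend each rank encoded its images and we gathered all resulting tokens.
    all_tokens = [tok for shard in image_shards for s in shard for tok in s["tokens"]]

    # Stage 2: split the concatenated token sequence evenly for language modeling.
    chunk = max(1, -(-len(all_tokens) // world_size))  # ceil division
    token_shards = [all_tokens[i:i + chunk] for i in range(0, len(all_tokens), chunk)]
    return image_shards, token_shards
```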
VILA-U: A Unified VLM
At its core, VILA-U [15] is built on two main principles: 1) an end-to-end autoregressive VLM whose discrete visual tokens are aligned with textual inputs (instead of being trained only to reconstruct the images); and 2) autoregressive image generators can produce images of quality comparable to diffusion models, as long as they are trained on a high-quality image-text dataset.
While generative VLMs are not new, previous works often offload the generation task to a separate model, either a diffusion model [16, 17] or a combination of VQ-GAN and an LLM [18]. The key novelty in VILA-U is its unified approach to visual and text generation: it adopts next-token prediction for both text and visual content. Visual inputs are tokenized into discrete tokens and concatenated with textual tokens into a multi-modal token sequence; the visual tokens are discretized using residual quantization. Since both text and visual tokens are discrete, the LLM can be trained with the standard next-token prediction task. The authors use special tokens: `<image_start>` and `<image_end>` for images, and `<video_start>` and `<video_end>` for videos, with video tokens formed by concatenating image tokens.
This unified token sequence is then processed during training with a single next-token prediction objective, allowing the model to learn both text and visual generation in an integrated fashion.
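The unified sequence is straightforward to assemble once the visual tokens are discrete: wrap the image codes in the special tokens and append them to the text ids. A minimal sketch, assuming `special_ids` is a lookup from the special-token strings to ids in the model's vocabulary:

```python
def build_multimodal_sequence(text_ids, image_code_ids, special_ids):
    """Concatenate text ids and discretized visual code ids into one sequence.

    A sketch of the layout described above; the id mapping is an assumption."""
    seq = list(text_ids)                       # e.g., the text prompt
    seq.append(special_ids["<image_start>"])
    seq.extend(image_code_ids)                 # discrete codes from the vision tower
    seq.append(special_ids["<image_end>"])
    return seq
```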
Residual Vector Quantization
The visual inputs are first tokenized into discrete tokens, as shown in the figure below. These tokens are then concatenated with the textual tokens to form a multimodal token sequence, and the full sequence is used for the next-token prediction training objective.
To efficiently discretize the visual input tokens, the authors adopt a Residual Vector Quantization (RVQ) method as proposed in RQ-VAE. Given an input vector \(\bf{z}\), we want to represent it as a sequence of \(D\) discrete codes:
\[ RQ(z; C, D) = (k_1, \cdots, k_D) \in [K]^D \]
where \(C\) is a codebook containing \(K\) pre-defined vectors (also referred to as codes), \(D\) is the number of quantization steps, and \(k_d\) is the code selected from the codebook at step \(d\). RVQ breaks the vector \(\mathbf{z}\) into residuals across \(D\) steps: at each step \(d\), it first computes the code \(k_d\) by finding the codebook entry closest to the current residual:
\[ k_d = Q(r_{d-1}, C) = \underset{k \in [K]}{\text{argmin}} \|r_{d-1} - e(k)\|^2, \]
where \(e(k)\) is the embedding of the code \(k\), and \(r_{d-1}\) is the current residual. Then, we would update the residual as follows:
\[ r_d = r_{d-1} - e(k_d) \]
This step is repeated for \(d=1,2, \cdots, D\) steps, starting with \(r_0=z\). Finally, the approximated vector \(\hat{\bf{z}}\) of \(\bf{z}\) is computed as:
\[ \mathbf{\hat{z}} = \sum_{i=1}^D e(k_i) \]
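The RVQ procedure above maps directly to a short loop: at each depth, pick the nearest codebook entry to the current residual, subtract it, and accumulate the reconstruction. A minimal PyTorch sketch for a single vector (not the RQ-VAE implementation):

```python
import torch

def residual_quantize(z, codebook, depth):
    """Residual vector quantization of one vector, following the equations above.

    z:        (dim,) input vector
    codebook: (K, dim) embedding table e(k)
    depth:    number of quantization steps D
    Returns the selected code indices and the reconstruction z_hat."""
    codes = []
    residual = z.clone()
    z_hat = torch.zeros_like(z)
    for _ in range(depth):
        # k_d = argmin_k || r_{d-1} - e(k) ||^2
        dists = ((codebook - residual) ** 2).sum(dim=-1)
        k = torch.argmin(dists).item()
        codes.append(k)
        # r_d = r_{d-1} - e(k_d); accumulate the approximation z_hat
        residual = residual - codebook[k]
        z_hat = z_hat + codebook[k]
    return codes, z_hat
```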
Generative Pre-training
To achieve text and discrete visual token alignment, the authors propose using text-image contrastive loss along with VQ-based image reconstruction loss to equip the model with image generation capabilities.
However, the text-image contrastive loss requires high-level semantic understanding, while the reconstruction loss requires low-level features. Naively training the vision tower from scratch to optimize both objectives simultaneously therefore leads to conflicting goals; in fact, training the vision tower this way yields only 5% zero-shot accuracy on ImageNet classification. The adopted approach is to first initialize the vision tower with CLIP weights for text-image alignment, then freeze the text encoder and train only the visual encoder with both the contrastive and reconstruction losses:
\[ \mathcal{L}_{\text{total}} = w_{\text{contra}} \mathcal{L}_{\text{contra}} + w_{\text{recon}} \mathcal{L}_{\text{recon}} \]
Several other training approaches were attempted, but each faces significant challenges. Loading pre-trained CLIP weights into the text encoder and RQ-VAE weights into the vision encoder and decoder, while training the rest from scratch, fails because the vision encoder then lacks pre-trained CLIP alignment. Freezing the vision encoder prevents it from learning the low-level features needed for reconstruction. Making the text encoder trainable leads to instability in the early stages, as chaotic quantized features and the contrastive loss disrupt its weights and slow down training overall.
Since both visual tokens and text tokens are discrete, we can train our LLM with the general language modeling next-token prediction objective. However, due to the use of residual quantization for visual tokens, the training objectives for text and visual tokens differ slightly. For text tokens, the negative log-likelihood loss is calculated as:
\[ \mathcal{L}_{\text{text}} = -\sum_{i=1}^T \log P_\theta(y_i \mid y_{\lt i}), \]
where \(T\) is the length of the multi-modal sequence, and \(i\) only counts positions where a text token appears.
For visual tokens, residual quantization introduces a depth-stacked structure of codes at each visual position \(j\). Specifically, given the code embedding \(h_j\) generated by the LLM for visual tokens at position \(j\), the depth transformer autoregressively predicts \(D\) residual tokens \((k_{j1}, \cdots, k_{jD})\). During training, the input of the depth transformer \(v_{jd}\) at depth \(d\) is defined as the sum of the code embeddings of up to depth \(d-1\) for \(d>1\), such that \(v_{jd} = \sum_{d'=1}^{d-1} e(k_{jd'})\) and \(v_{j1} = h_j\). Thus, the depth transformer predicts the next code for a finer estimation of the feature \(\hat{z}_j\) based on the previous estimations up to \(d-1\). The negative log-likelihood loss for visual tokens is: \[ \mathcal{L}_{\text{visual}} = -\sum_{j=1}^T \sum_{d=1}^D \log P_\delta(k_{jd} \mid k_{j,\lt d}), \]
where \(T\) is the length of the multi-modal sequence, and \(j\) only counts when a visual token appears at position \(j\). During the multi-modal pre-training, the weights of the depth transformer are randomly initialized and updated together with the LLM.
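The depth-transformer inputs \(v_{jd}\) are just running sums of the already-predicted code embeddings at position \(j\), seeded with \(h_j\). A small sketch of that construction, with the tensor shapes as assumptions:

```python
import torch

def depth_transformer_inputs(h_j, code_embeds):
    """Build the depth-transformer inputs v_{j1..D} for one visual position j.

    h_j:         (dim,) code embedding produced by the LLM at position j
    code_embeds: list of D-1 embeddings e(k_{j1}), ..., e(k_{j,D-1}) already predicted
    Returns [v_{j1}, ..., v_{jD}] as defined above. A sketch only."""
    inputs = [h_j]                        # v_{j1} = h_j
    running = torch.zeros_like(h_j)
    for e_k in code_embeds:
        running = running + e_k           # sum of e(k_{jd'}) for d' <= d-1
        inputs.append(running.clone())    # v_{jd} = sum_{d'=1}^{d-1} e(k_{jd'})
    return inputs
```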
Model | Pretrained Weights | Resolution | Shape of Code | rFID↓ | Top-1 Accuracy↑ |
---|---|---|---|---|---|
VQ-GAN | – | 256 × 256 | 16 × 16 | 4.98 | – |
RQ-VAE | – | 256 × 256 | 8 × 8 × 4 | 3.20 | – |
RQ-VAE | – | 256 × 256 | 16 × 16 × 4 | 1.30 | – |
VILA-U | SigLIP-Large | 256 × 256 | 16 × 16 × 4 | 1.80 | 73.3 |
VILA-U | SigLIP-SO400M | 384 × 384 | 27 × 27 × 16 | 1.25 | 78.0 |
NVILA
NVILA [19] is a model built on top of VILA that improves training efficiency on both the algorithm and system sides while maintaining the model's performance. In the paper, the authors propose data pruning and FP8 mixed-precision training.
Scale-and-compress scheme
Scaling preserves the visual detail, while compression reduces the computation. The compressed visual tokens are expected to be information-dense, since visual tokens are often found to be redundant [20, 21]. In this paper, visual tokens are scaled both spatially and temporally: spatial scaling increases the image resolution, while temporal scaling increases the number of frames sampled from the video.
Visual token compression
One technique used for spatial scaling of the visual tokens is \(S^2\) [22]. Each image is first resized to several resolutions (448×448, 896×896, and 1344×1344), and each resized image is then split into 448×448 tiles. The feature maps of the tiles from the vision backbone are then concatenated. However, resizing every image into a square can distort it.
Model | Spatial Pooling | #Tokens/Tile | #Tiles/Image | AI2D | DocVQA | TextVQA | Image-10 |
---|---|---|---|---|---|---|---|
Baseline (VILA-1.5) | 2×2 | 256 (16×16) | 1 | 87.0 | 61.3 | 67.5 | 61.2 |
Scale (Dynamic-S2) | 2×2 | 256 (16×16) | 9-12 | 90.1 | 91.1 | 77.0 | 71.5 |
Scale + Compress | 3×3 | 121 (11×11) | 1-12 | 87.4 | 82.3 | 74.1 | 67.1 |
Scale + Compress + VEP | 3×3 | 121 (11×11) | 1-12 | 89.8 | 88.8 | 76.1 | 70.8 |
Alternative Designs | |||||||
TokenLearner | – | 121 | 1-12 | 90.0 | 86.5 | 75.6 | 69.8 |
Perceiver Resampler | – | 121 | 1-12 | 76.8 | 71.8 | 65.3 | 59.4 |
Therefore, Dynamic-\(S^2\) is proposed to resize the image to the resolution closest to the original whose sides are divisible by 448. With Dynamic-\(S^2\), model performance improves by up to 30% on text-heavy tasks, as shown in Table 10.
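The resolution-selection step of Dynamic-\(S^2\) can be sketched as a small search over tile grids: pick the grid of 448-pixel tiles whose aspect ratio best matches the input image. The tile budget and selection rule below are illustrative assumptions, not necessarily NVILA's exact logic.

```python
def dynamic_s2_target(width, height, tile=448, max_tiles=12):
    """Pick a target resolution whose sides are multiples of `tile` and whose
    aspect ratio is closest to the original image.

    A sketch of the Dynamic-S^2 idea as described above."""
    best, best_err = None, float("inf")
    for nw in range(1, max_tiles + 1):
        for nh in range(1, max_tiles + 1):
            if nw * nh > max_tiles:
                continue
            # Compare the candidate grid's aspect ratio with the image's.
            err = abs((nw * tile) / (nh * tile) - width / height)
            if err < best_err:
                best, best_err = (nw * tile, nh * tile), err
    return best  # e.g., (1344, 896) for a 3:2 landscape image
```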
Efficient training
The NVILA training process consists of five stages: projector initialization, visual encoder pre-training, token processor pre-training, image instruction-tuning, and video instruction-tuning. Stages 1, 3, and 4 are similar to VILA, while stage 2 focuses on maintaining model accuracy after visual token compression, and stage 5 improves the ability to process long videos. Like LongVILA, NVILA aims to improve training efficiency on both the algorithm and the system side, using data pruning with DeltaLoss [23] and FP8 mixed-precision training. Given that the dataset used to train NVILA contains over 100M samples, data pruning is essential; DeltaLoss prunes the data in an unsupervised way and is defined as:
\[ D' = \bigcup_{i=1}^{k} \operatorname{top-K} \left\{ \log \frac{p_{\text{large}}(x)}{p_{\text{small}}(x)} \;\middle|\; x \in D_i \right\} \]
Here, \(D_i\) is the \(i\)-th subset of the fine-tuning dataset and \(D'\) is the pruned training set. According to the ablation study, pruning 50% of the data is a safe threshold: it barely hurts accuracy while speeding up training by about \(2\times\).
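The pruning rule can be sketched as a per-subset top-K selection on the log-ratio above. In the snippet below, `p_large` and `p_small` are assumed to return a sample's likelihood under a large and a small reference model; the 50% keep ratio follows the ablation.

```python
import heapq
import math

def deltaloss_prune(subsets, p_large, p_small, keep_ratio=0.5):
    """Keep, per subset, the samples with the largest log p_large(x)/p_small(x).

    A sketch of the DeltaLoss criterion; the likelihood functions are assumptions."""
    pruned = []
    for subset in subsets:
        k = int(len(subset) * keep_ratio)
        scored = [(math.log(p_large(x) / p_small(x)), idx) for idx, x in enumerate(subset)]
        top = heapq.nlargest(k, scored)        # ranked by the DeltaLoss score
        pruned.extend(subset[idx] for _, idx in top)
    return pruned
```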
FP8 training
FP8 precision offers significant benefits in computational and memory efficiency, supported by new NVIDIA GPUs like the H100 and B200. It accelerates training by performing matrix multiplications in FP8 and quantizing gradients, weights, and momentum, reducing memory usage and communication overhead.
GC (grad. checkpointing) | Batch size | Throughput | MMMU | Video-MME |
---|---|---|---|---|
✗ | 4 | 199.2 (1.0×) | 47.9 | 52.9 |
✗ | 16 | 390.1 (2.0×) | 47.0 | 53.0 |
✓ | 30 | 491.7 (2.5×) | 47.8 | 53.1 |
✓ | 36 | 579.9 (2.9×) | 47.7 | 53.0 |
For NVILA, FP8 allows a larger batch size, resulting in a 2× speedup. In VLM training, where sequence lengths vary widely (e.g., video tokens vs. image tokens), FP8 helps optimize performance; even when gradient checkpointing is used, FP8 still provides a 1.2× speedup over BF16 training. Overall, NVILA is found to rival top-tier proprietary models such as GPT-4o and Gemini 1.5 Pro, and sometimes even surpasses them.
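As a general illustration of how FP8 matrix multiplications are typically enabled on such GPUs, here is a hedged sketch using NVIDIA's Transformer Engine; this is not NVILA's actual training setup, and the recipe settings are illustrative.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# FP8 scaling recipe; the hyperparameters here are illustrative defaults.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(16, 4096, device="cuda", requires_grad=True)

# Matrix multiplications inside this context run in FP8 on supported GPUs (e.g., H100).
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)

# Loss computation and backward pass happen outside the FP8 region.
loss = y.float().sum()
loss.backward()
```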
Acknowledgments
Thank you TMA Solutions for providing the necessary compute resources for VILA1.5-3B model inference.