The VILA Model Family

A class of Vision-Language models designed with efficiency in mind.
Tags: vlm
Published: October 17, 2024 | Modified: February 10, 2024



For a long time, traditional ML models focused on a single modality, such as text, images, or audio. Humans, by contrast, handle multiple modalities at once: the human experience is multimodal. We see, hear, smell, and touch. The Internet and many modern applications are multimodal as well; we rarely encounter webpages that contain only text or only images, and we hardly ever watch videos without sound. This explosion of multimodal Internet data provides vast resources for training advanced AI models.

A sample video description generated by the VILA1.5-3b model running on Jetson Orin NX.

While the ML community has explored many modalities, this blog post focuses on the vision and text modalities. Vision-Language Models (VLMs) can be categorized in several ways, for example by their training paradigms, input/output modalities, or pre-training objectives. Following Bordes et al. (2024) [1], one way of categorizing VLMs is by their training paradigm: 1) contrastive; 2) masking; 3) generative; and 4) using pre-trained backbones. These paradigms are not mutually exclusive, and there has been work that combines two or more of them. Among the most popular VLMs, VILA [2] and its variants are a family of models that focus on efficiency while maintaining impressive performance.

VILA


Given the explosion of LLMs in recent years, it is natural to extend them to handle the visual modality (e.g., images and videos). VILA [2] is a VLM introduced by NVIDIA and MIT that takes this approach, integrating visual and textual information, and it is pre-trained following the recipes the authors derive from their empirical observations. At its core, VILA consists of three main components: 1) a visual encoder, 2) an LLM, and 3) a projector. The vision backbone is the CLIP visual encoder [3], and the LLM is Llama-2-7B or Llama-2-13B [4]. The projector maps image embeddings and text embeddings onto the same space; it can be as simple as a linear layer or an MLP, or as complex as a Transformer block. VILA takes visual and text inputs and generates text outputs. This section outlines several design choices that improve the model's performance based on empirical observations.
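To make the architecture concrete, here is a minimal PyTorch-style sketch of how the three components could be wired together. The class and argument names are illustrative placeholders rather than the actual VILA implementation, which uses a CLIP ViT encoder and a Llama-2 backbone.

```python
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    """Illustrative wiring of a VILA-style VLM: vision encoder -> projector -> LLM."""
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder             # e.g. a CLIP ViT (placeholder)
        self.projector = nn.Linear(vision_dim, llm_dim)  # simple linear projector
        self.llm = llm                                   # decoder-only LLM (placeholder)

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor):
        # 1) Encode images into a sequence of visual features.
        visual_feats = self.vision_encoder(images)       # (B, N_vis, vision_dim)
        # 2) Project the visual features into the LLM's token-embedding space.
        visual_tokens = self.projector(visual_feats)     # (B, N_vis, llm_dim)
        # 3) Concatenate visual and text token embeddings and run the LLM.
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)            # HuggingFace-style call; adapt as needed
```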

Full fine-tuning or visual prompt tuning?

The two common methods for augmenting LLMs with visual inputs are to 1) fine-tune the LLM together with the visual tokens (referred to as (full) fine-tuning), or 2) train only the visual input projector while freezing the weights of the LLM (often known as (visual) prompt tuning). The prompt-tuning approach treats the visual input projector as a way to “prompt” the LLM with visual embeddings while keeping the text-only model intact. It can be thought of as a “translator” between the vision encoder and the text-only LLM, effectively acting as the “prompt.”
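In code, the difference between the two regimes mostly comes down to which parameter groups receive gradients. A hedged sketch, assuming a model object with `vision_encoder`, `projector`, and `llm` attributes as in the toy class above:

```python
def set_trainable(model, mode: str):
    """Toggle between visual prompt tuning and full fine-tuning (illustrative only).

    mode="prompt_tuning": freeze the LLM and train only the projector,
                          which then acts as the visual "prompt".
    mode="full_finetune": update the LLM together with the projector.
    """
    for p in model.llm.parameters():
        p.requires_grad = (mode == "full_finetune")
    for p in model.projector.parameters():
        p.requires_grad = True     # the projector is trained in both regimes
    for p in model.vision_encoder.parameters():
        p.requires_grad = False    # the vision backbone stays frozen in this sketch
```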

The choice between the two approaches involves different tradeoffs. Fine-tuning is essential for the VLM to inherit the LLM's in-context learning ability, but it carries the risk of catastrophic forgetting, where the model's performance degrades on text-only tasks. Prompt tuning avoids this problem; however, it leads to poorer performance and weaker in-context learning capabilities.

Table 1: Performance comparison of different LLM training settings with varying projector types and fine-tuning strategies. 0s denotes the zero-shot and 4s the 4-shot setting. The "LLM (pre-training)" column indicates whether the LLM is frozen or updated during pre-training.

| Setting | LLM (pre-training) | Projector | OKVQA (0s) | OKVQA (4s) | TextVQA (0s) | TextVQA (4s) | COCO (0s) | COCO (4s) | Flickr (0s) | Flickr (4s) | Average (0s) | Average (4s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| a) | frozen | Transformer | 10.4 | 19.2 | 14.8 | 23.1 | 17.4 | 60.2 | 11.0 | 47.4 | 13.4 | 37.5 |
| b) | frozen | Transformer | 47.1 | 47.7 | 37.2 | 36.6 | 109.4 | 88.0 | 73.6 | 58.1 | 66.8 | 57.6 |
| c) | updated | Transformer | 44.8 | 49.8 | 38.5 | 38.8 | 112.3 | 113.5 | 71.5 | 72.9 | 66.8 | 68.8 |
| d) | updated | Linear | 45.2 | 50.3 | 39.7 | 40.2 | 115.7 | 118.5 | 74.2 | 74.7 | 68.7 | 70.9 |

Full fine-tuning is important. In the VILA paper, the authors ran experiments with four settings (a–d): settings a) and b) use prompt tuning (the LLM is frozen during pre-training), while settings c) and d) fine-tune the LLM. Table 1 shows that freezing the LLM during pre-training does not hurt zero-shot accuracy, but it reduces the model's in-context learning ability. Another observation from this experiment is that a simple projector (a linear layer) leads to better performance. The hypothesis is that a simple projector forces the LLM to learn more from the visual tokens, so the model generalizes better.

Left. Visual prompt tuning freezes the LLM's weights in its \(N\) layers \(L_1, L_2, \cdots, L_N\), updating only a subset of parameters, which leads to degraded in-context learning ability. Middle. Full fine-tuning updates the weights of the LLM in every layer. Right. Bar chart comparing their performance in the 0-shot (0s) and 4-shot (4s) settings. Figure adapted from Han et al. (2024) [5]; bar chart based on the empirical results of Lin et al. (2024) [2].

The benefit of the full fine-tuning approach, however, depends on the size of the downstream dataset, as shown by Han et al. (2024) in Figure 1.

Figure 1: VTAB-1k analysis across Natural (left), Specialized (middle), and Structured (right) tasks. FT (solid) vs. VPT (dotted) accuracy shows FT outperforming VPT in 9/12 cases as dataset size increases.

Training Data

Training data is one of the most important aspects of training VLMs¹. Common practices for preprocessing image-text data include 1) removing duplicates, 2) balancing the labels, 3) pruning (removing pairs whose captions do not match the image), and 4) improving the captions with other LLMs.

The two most common types of data used for training VLMs include interleaved image-text data and image-text pair data (i.e. the images and their corresponding captions). In the VILA paper, the authors train the model with different types of data to analyze their influence on the model’s performance.

Interleaved or image-text pair data?

Figure 2. An example of the difference between interleaved image-text data (top) and image-text pair data (bottom).

The VILA paper shows that interleaved image-text data such as MMC4 [6] is more effective for training VLMs because it mitigates catastrophic forgetting of the model's text capability. For example, pre-training on COYO [7], an image-text pair dataset, degrades text-only accuracy (MMLU) by 17.2%, whereas pre-training on the interleaved MMC4 dataset degrades it by only 5%. This degradation gap can be further reduced by using a larger LLM. A similar conclusion is drawn in the Flamingo paper [8], where interleaved data is found to better preserve the model's text-only capability, as shown in Table 2.

Table 2: Ablation study on pre-training data for the Flamingo-3B model. Without training on M3W, an interleaved image-text dataset, the model's overall performance drops significantly. Table source: Alayrac et al. (2022) [8].
| Ablated setting | Step time | COCO CIDEr ↑ | OKVQA top-1 ↑ | VQAv2 top-1 ↑ | MSVDQA top-1 ↑ | VATEX CIDEr ↑ | Overall score ↑ |
|---|---|---|---|---|---|---|---|
| Flamingo-3B model | 1.74s | 86.5 | 42.1 | 55.8 | 36.3 | 53.4 | 70.7 |
| w/o Video-Text pairs | 1.42s | 84.2 | 43.0 | 53.9 | 34.5 | 46.0 | 67.3 |
| w/o Image-Text pairs | 0.95s | 66.3 | 39.2 | 51.6 | 32.0 | 41.6 | 60.9 |
| Image-Text pairs → LAION | 1.74s | 79.5 | 41.4 | 53.5 | 33.9 | 47.6 | 66.4 |
| w/o M3W | 1.02s | 54.1 | 36.5 | 52.7 | 31.4 | 23.5 | 53.4 |

Given this observation, the authors propose Joint Supervised Fine-tuning (Joint SFT), which blends text-only instruction data with visual instruction-tuning data to preserve the model's textual ability. According to the ablation study, this training technique reduces the degradation in text-only accuracy while improving visual-language accuracy at the same time, as shown in Table 3.

Table 3: Ablation study for Joint SFT. Adding text-only instruction-tuning data to SFT reduces the model's accuracy drop on the MMLU benchmark. Table source: Lin et al. (2024) [2].
| PT Data | SFT Data | VLM Acc. avg (0s / 4s) | MMLU Acc. |
|---|---|---|---|
| Llama-2 | – | – | 46.0% |
| MMC4 | Visual | 68.7% / 70.9% | 40.7% (−5.3%) |
| MMC4+COYO | Visual | 69.0% / 71.3% | 40.2% (−5.8%) |
| Llama-2 | Text | – | 51.2% |
| MMC4 | Vis.+Text | 71.0% / 72.1% | 51.4% (+0.2%) |
| MMC4+COYO | Vis.+Text | 72.3% / 73.6% | 50.9% (−0.3%) |

Image resolution or number of tokens?

TL;DR: Image resolution

In the VILA experiments, as shown in Table 4, increasing the image resolution from 224 to 336 raises TextVQA accuracy from 41.6% to 49.8%. The tradeoff is that a higher-resolution image produces more tokens per image (336×336 corresponds to 576 tokens/image) and a higher computational cost, which is even more problematic for video understanding given the limited context length.

Table 4: Ablation study for image resolution. Table Source: Lin et al., (2024) [2].
| Resolution | Projector | #Tokens | OKVQA | TextVQA | COCO |
|---|---|---|---|---|---|
| 224 | linear | 256 | 49.9% | 41.6% | 116.0 |
| 336 | linear | 576 | 49.7% | 49.8% | 117.7 |
| 336 | downsample | 144 | 49.3% | 45.6% | 115.7 |

The raw resolution matters more than the number of visual tokens per image, however. The authors' solution is to reduce the number of visual tokens with a downsample projector that concatenates every 2×2 block of tokens into a single token and uses a linear layer to fuse the information. With this downsample projector, TextVQA accuracy improves from 41.6% to 45.6%, compared with 49.8% when naively feeding the 336×336 image with all 576 tokens.
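A minimal sketch of such a downsample projector (the layer sizes and names are assumptions for illustration, not the official implementation): it groups each 2×2 neighborhood of visual tokens, concatenates their features, and fuses them with a linear layer, reducing 576 tokens to 144.

```python
import torch
import torch.nn as nn

class DownsampleProjector(nn.Module):
    """Concatenate every 2x2 block of visual tokens and fuse them with a linear layer."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.fuse = nn.Linear(4 * vision_dim, llm_dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, H*W, C) with H = W, e.g. 24x24 = 576 tokens for a 336x336 image
        B, N, C = tokens.shape
        H = W = int(N ** 0.5)
        x = tokens.view(B, H, W, C)
        # Group 2x2 neighborhoods: (B, H/2, 2, W/2, 2, C) -> (B, H/2, W/2, 2, 2, C)
        x = x.view(B, H // 2, 2, W // 2, 2, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(B, (H // 2) * (W // 2), 4 * C)   # concatenate the 4 neighbors
        return self.fuse(x)                            # (B, N/4, llm_dim), e.g. 576 -> 144

proj = DownsampleProjector()
out = proj(torch.randn(1, 576, 1024))
print(out.shape)   # torch.Size([1, 144, 4096])
```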

Data quality or quantity?

TL;DR: Data quality

In the VILA experiments, scaling up the dataset from 25M to 50M samples does not bring much benefit, whereas adding 1M high-quality samples improves the benchmark results. Data quality is quantified using CLIPScore [9].
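As a rough illustration of CLIPScore-based filtering (a sketch only; the model choice, threshold, and lack of batching are assumptions), each image-caption pair can be scored by the cosine similarity of its CLIP embeddings, keeping only the highest-scoring pairs:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_score(image, caption: str) -> float:
    """CLIPScore ~= 2.5 * max(cos(image_emb, text_emb), 0), following Hessel et al. [9]."""
    inputs = processor(text=[caption], images=[image], return_tensors="pt",
                       padding=True, truncation=True)
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    cos = torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()
    return 2.5 * max(cos, 0.0)

# Keep only pairs above a (hypothetical) quality threshold:
# high_quality = [(img, cap) for img, cap in pairs if clip_score(img, cap) > 0.7]
```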

Table 5: Effect of data quality on ImageNet accuracy. Training on DataComp-1B achieves the highest accuracy while using only about 61% as much data as LAION-2B.
| Dataset | Dataset size | # samples seen | Architecture | Train compute (MACs) | ImageNet accuracy |
|---|---|---|---|---|---|
| OpenAI's WIT | 0.4B | 13B | ViT-L/14 | 1.1 × 10²¹ | 75.5% |
| LAION-400M | 0.4B | 13B | ViT-L/14 | 1.1 × 10²¹ | 72.8% |
| LAION-2B | 2.3B | 13B | ViT-L/14 | 1.1 × 10²¹ | 73.1% |
| LAION-2B | 2.3B | 34B | ViT-H/14 | 6.5 × 10²¹ | 78.0% |
| LAION-2B | 2.3B | 34B | ViT-g/14 | 9.9 × 10²¹ | 78.5% |
| DataComp-1B | 1.4B | 13B | ViT-L/14 | 1.1 × 10²¹ | 79.2% |

Other benchmarks, such as DataComp [10], reach the same conclusion that data quality matters more than quantity when training VLMs. Table 5 from the DataComp paper shows the effect of data quality on the ImageNet classification task.

VILA\(^2\)


Data curation for training vision-language models (VLMs) is both costly and challenging, often limited by closed-source model APIs or reliance on noisy web-crawled data. When data is sourced from the Internet, its quality cannot be reliably ensured. One problem with web-crawled public image-text datasets such as MMC4 [6] and COYO [7] is that the captions associated with the images are often short, and human descriptions of images often lack the semantic detail that is essential for the LLM. Table 6 shows the average caption length; roughly speaking, the longer the captions, the better the model performs on the VQAv2 benchmark. Note that the MMC4 dataset is not re-captioned because of its interleaved nature.

Table 6: The effect of caption length on model performance on the VQAv2 benchmark. Only the COYO dataset is re-captioned; MMC4 is not, due to its interleaved nature. Table source: Fang et al. (2024) [11].
| | MMC4 | COYO | COYO-VILA₁ | COYO-VILA₂ | COYO-VILA₃ | COYO-VILA₄ |
|---|---|---|---|---|---|---|
| Avg #Words | 17.1 ± 25.0 | 11.9 ± 9.0 | 101.2 ± 43.0 | 117.1 ± 49.1 | 126.77 ± 50.10 | 125.9 ± 51.2 |
| VQAv2 | N.A. | 61.6 | 62.5 | 63.5 | 63.7 | 63.6 |

Motivated by this, VILA\(^2\) is proposed: a VLM augmentation scheme with self-augment and specialist-augment steps to improve data quality. At a high level, the captions of the images in the COYO dataset are first rewritten using the VILA model.

This approach to rewriting image captions is similar to the BLIP model [12], with two main differences. First, BLIP uses an additional filtering step to preprocess the data. Second, in VILA\(^2\), the caption-rewriting process is repeated three times, producing the intermediate models dubbed VILA\(_1\), VILA\(_2\), VILA\(_3\), and so on; this process is called self-augmenting. In BLIP, by contrast, the captions are rewritten only once with a pre-trained image-grounded text decoder. Both BLIP and VILA\(^2\) blend human-annotated data with the synthetic data, and the VILA\(^2\) paper shows improved benchmarks when mixing the two. One problem is that the model's performance is bounded by the ability of the captioner: by the fourth iteration of self-augmenting (Table 7), there is no further increase in performance.
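Conceptually, the self-augmenting loop can be sketched as follows. This is pseudocode-style Python: `train_vlm` and `recaption` are hypothetical helpers standing in for full training and caption-generation pipelines, not functions from the VILA\(^2\) codebase.

```python
def self_augment(base_data, human_sft_data, train_vlm, recaption, num_rounds: int = 3):
    """Iteratively retrain the VLM on captions it rewrote itself (VILA_1, VILA_2, ...).

    base_data:      list of (image, caption) pairs, e.g. COYO with its short captions
    human_sft_data: human-annotated instruction-tuning data blended in at every round
    train_vlm:      callable(data, sft_data) -> model   (hypothetical)
    recaption:      callable(model, image) -> str       (hypothetical)
    """
    data = base_data
    model = train_vlm(data, human_sft_data)          # VILA_0 baseline
    for _ in range(num_rounds):
        # Use the current model to rewrite the caption of every image-text pair.
        data = [(img, recaption(model, img)) for img, _ in data]
        # Retrain on the synthetic captions blended with human-annotated SFT data.
        model = train_vlm(data, human_sft_data)      # yields VILA_1, VILA_2, ...
    return model
```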

Table 7: Performance of VILA\(_i\) on different benchmarks after the \(i\)-th iteration. By the fourth iteration, there is no further increase in performance, suggesting the need for a more specialized augmentation scheme. Table source: Fang et al. (2024) [11].
| Method | VQAv2 | GQA | SQA\(^I\) | VQA\(^T\) | POPE | LLaVA\(^W\) | MM-Vet | MMMU |
|---|---|---|---|---|---|---|---|---|
| VILA\(_0\) (baseline) | 79.6 | 62.4 | 68.4 | 61.6 | 84.2 | 68.4 | 34.5 | 33.8 |
| VILA\(_1\) | 80.0 | 63.2 | 71.0 | 62.5 | 84.6 | 72.2 | 34.8 | 35.8 |
| VILA\(_2\) | 80.8 | 63.5 | 71.5 | 63.5 | 84.7 | 71.2 | 34.9 | 35.2 |
| VILA\(_3\) | 80.7 | 63.5 | 71.5 | 63.7 | 84.5 | 72.3 | 35.5 | 35.5 |
| VILA\(_4\) | 80.7 | 63.4 | 71.2 | 63.6 | 85.0 | 72.3 | 35.5 | 35.0 |
| VILA\(_3\) + Spatial Specialist | 81.1 | 62.8 | 72.9 | 65.0 | 85.0 | 71.4 | 37.1 | 36.8 |

Thus, the authors propose the specialist-augment method, which further improves the model's capabilities by training it on specialist tasks, including 1) spatial understanding, 2) grounding, and 3) OCR, using different specialist datasets.

An overview of specialist augmentation methods for VILA\(^2\). Figure from [11]

The table below shows the ablation results for the different specialist-augmentation methods. Overall, specialist augmentation improves the model's performance on several benchmarks.

| Method | VQAv2 | GQA | VQA\(^T\) | POPE | SEED-I | MME | MM-Vet | MMMU |
|---|---|---|---|---|---|---|---|---|
| *10% MMC4-core + 10% COYO-25M + ShareGPT4V-Pretrain* | | | | | | | | |
| Original Caption | 81.4 | 63.8 | 65.2 | 85.5 | 70.6 | 1472.5 | 34.0 | 31.8 |
| + Spatial Specialist | 81.9 ↑0.5 | 64.1 ↑0.3 | 66.0 ↑0.8 | 85.9 ↑0.4 | 71.8 ↑1.2 | 1476.5 ↑4.0 | 36.7 ↑2.7 | 32.5 ↑0.7 |
| + OCR Specialist | 81.8 ↑0.4 | 64.0 ↑0.2 | 65.3 ↑0.1 | 86.4 ↑0.9 | 72.1 ↑1.5 | 1500.2 ↑27.7 | 34.3 ↑0.3 | 32.1 ↑0.3 |
| + Grounding Specialist | 81.8 ↑0.4 | 64.0 ↑0.2 | 65.1 ↓0.1 | 86.7 ↑1.2 | 71.0 ↑0.4 | 1536.4 ↑63.9 | 37.5 ↑3.5 | 32.6 ↑0.8 |
| *MMC4-core + COYO-25M + ShareGPT4V-Pretrain* | | | | | | | | |
| Original Caption | 82.2 | 63.9 | 66.7 | 86.5 | 71.2 | 1518.2 | 42.6 | 33.4 |
| + All 3 Specialists | 83.0 ↑0.8 | 64.7 ↑0.8 | 70.9 ↑4.2 | 86.4 ↓0.1 | 74.0 ↑2.8 | 1656.2 ↑142 | 44.7 ↑2.1 | 35.8 ↑2.4 |

VILA\(^2\) is trained with the common three-stage align-pretrain-SFT paradigm: the projector is initialized first, vision-language pre-training is then performed on both interleaved image-text data and image-text pairs (with the rewritten captions), and finally visual instruction tuning is applied. VILA\(^2\) ranks among the most performant open-source VLMs; the figure below compares its performance across various benchmarks. The authors also show that data quality matters more than computation: training VILA\(^2\) for one additional epoch does not improve performance, whereas increasing data quality (by rewriting the captions using VILA itself) significantly boosts performance compared with training for one more epoch.

LongVILA


LongVILA [13] addresses the challenge of processing long contexts on both the pipeline and the system side. On the pipeline side, LongVILA introduces two new training stages: long-context extension and long video supervised fine-tuning. On the system side, it proposes long-context Multi-Modal Sequence Parallelism (MM-SP) to parallelize long-video training and inference and better utilize GPU compute.

Overview of LongVILA training stages. Figure from [13]

LongVILA has five training stages: 1) multimodal alignment; 2) large-scale pre-training; 3) short supervised fine-tuning; 4) context extension for the LLM; and 5) long supervised fine-tuning.

The first three stages are similar to those of the VILA model:

  • In stage 1, only the multimodal projector is trained.

  • In stage 2, the vision encoder is frozen while the LLM and the multimodal projector are trained. The VILA\(^2\) recipe is applied to improve dataset quality.

  • In stage 3, the model is fine-tuned with instruction-following data with images and short video datasets.

  • In stage 4, the LLM is pre-trained on a text-only dataset to extend its context length, progressively increasing the context from 8,192 to 262,144 tokens (see the sketch after this list).

  • In stage 5, the proposed MM-SP system is used to enhance the instruction-following ability, and all of the parameters are trainable in this stage.
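A hedged sketch of how the stage-4 context schedule could be expressed in code. Only the start and end lengths (8,192 and 262,144) come from the paper; the intermediate steps and the RoPE position-interpolation mechanism shown here are illustrative assumptions.

```python
BASE_CONTEXT = 8_192
# Placeholder schedule: the post only states the range 8,192 -> 262,144 tokens.
TARGET_CONTEXTS = [8_192, 65_536, 262_144]

def rope_scaling_factor(target_len: int, base_len: int = BASE_CONTEXT) -> float:
    """Linear RoPE position-interpolation factor for a longer context window.

    Position interpolation is one common way to extend an LLM's context; whether
    LongVILA uses exactly this mechanism is not specified in this post.
    """
    return target_len / base_len

for ctx in TARGET_CONTEXTS:
    factor = rope_scaling_factor(ctx)
    # e.g. pass {"rope_scaling": {"type": "linear", "factor": factor}} to the model config
    print(f"context={ctx:>7} -> RoPE interpolation factor {factor:.1f}")
```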

Training data

LongVILA adopts the VILA\(^2\) method to first re-caption the images in the COYO dataset using the VILA model itself. In the supervised fine-tuning stage, YouCook2 and ShareGPTVideo are used to train the model for short-video comprehension.

Proportion of data categories in LongVILA dataset. Figure from [13].

In addition, a new dataset of over 15,000 videos is introduced for long-video instruction following in stage 5 of the training process.

Multi-modal Sequence Parallelism

Previous methods for processing long contexts have two limitations: modality heterogeneity and network heterogeneity. Regarding modality heterogeneity, text tokens are easy and fast to process, while visual tokens (from images or video frames) are more numerous and require more resources; this can lead to an imbalance in GPU workloads, where some GPUs do much more work than others. Regarding network heterogeneity, GPUs within the same machine communicate very fast over NVLink, while GPUs on different machines communicate more slowly over InfiniBand; this speed difference can create bottlenecks that slow down the whole system.

Overview of the Multimodal Sequence Parallelism method. Figure from [13].

The two key techniques adopted to address these challenges are 2D-Attention [14] and two-stage sharding, which distribute the work more evenly across GPUs. The 2D-attention mechanism optimizes communication by organizing GPUs into a two-dimensional mesh: intra-node communication is handled by splitting data along the attention-head dimension, while inter-node communication transfers partitioned sequence chunks between GPUs. This structure ensures efficient and balanced computation.

To handle modality heterogeneity, the two-stage sharding strategy first distributes images evenly across GPUs for the image encoding stage. Then, in the second stage, it aggregates vision and text inputs, distributing tokens evenly for language modeling. This approach balances workloads across GPUs and maintains consistency using techniques like dummy token padding. For inference, the authors implement a system where all GPUs operate concurrently, avoiding bottlenecks in traditional methods like HuggingFace Pipeline Parallelism. This allows for efficient use of memory and computation, enabling the handling of longer sequences and faster processing.
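The workload-balancing idea behind the second sharding stage can be illustrated with a tiny greedy scheduler. This is a simplified sketch only; the real MM-SP system also handles attention-head sharding, dummy-token padding, and inter-node communication.

```python
from typing import List

def balance_tokens(sample_token_counts: List[int], num_gpus: int) -> List[List[int]]:
    """Greedily assign samples (by token count) to GPUs so total token loads stay balanced."""
    loads = [0] * num_gpus
    assignment: List[List[int]] = [[] for _ in range(num_gpus)]
    # Largest samples first; always give the next sample to the least-loaded GPU.
    for idx in sorted(range(len(sample_token_counts)),
                      key=lambda i: sample_token_counts[i], reverse=True):
        gpu = loads.index(min(loads))
        assignment[gpu].append(idx)
        loads[gpu] += sample_token_counts[idx]
    return assignment

# Example: mixing long video samples (many visual tokens) with short text samples.
print(balance_tokens([4096, 256, 576, 8192, 128, 2048], num_gpus=2))
```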

VILA-U: A Unified VLM


At its core, VILA-U [15] is built on two main principles: 1) an end-to-end autoregressive VLM can generate images aligned with the textual input (rather than merely reconstructing images); and 2) an autoregressive image generator can produce images of comparable quality to diffusion models, as long as it is trained on a high-quality image-text dataset.

Overall architecture of VILA-U. The model is capable of both visual understanding and generation. Figure from [15].

While developing a generative VLM is not new, previous works often offload the generation task to a separate model, either a diffusion model [16, 17] or a combination of VQ-GAN and an LLM [18]. The key novelty of VILA-U is its unified approach to visual and text generation: it adopts next-token prediction for both text and visual content. Visual inputs are tokenized into discrete tokens and concatenated with textual tokens into a multi-modal token sequence; the visual tokens are discretized using residual quantization. Since both text and visual tokens are discrete, the LLM can be trained with the next-token prediction task. The authors use special tokens: <image_start> and <image_end> for images, and <video_start> and <video_end> for videos, with video tokens formed by concatenating image tokens.

This unified token sequence is then processed during training with a single next-token prediction objective, allowing the model to learn both text and visual generation in an integrated fashion.
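A toy sketch of how such a unified sequence might be assembled. The special-token names follow the paper, but the token IDs and helper function are illustrative assumptions.

```python
from typing import List

SPECIAL = {"<image_start>": 32000, "<image_end>": 32001,
           "<video_start>": 32002, "<video_end>": 32003}   # placeholder token IDs

def build_sequence(text_ids: List[int], image_token_ids: List[List[int]]) -> List[int]:
    """Interleave discrete visual tokens with text tokens into one training sequence."""
    seq: List[int] = []
    for img in image_token_ids:
        seq += [SPECIAL["<image_start>"], *img, SPECIAL["<image_end>"]]
    seq += text_ids
    return seq

# One image quantized into a (toy) handful of discrete RQ codes, followed by a caption.
print(build_sequence(text_ids=[5, 17, 9], image_token_ids=[[40001, 40002, 40003]]))
```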

Residual Vector Quantization

The visual inputs are first tokenized into discrete tokens, as shown in the figure below. These tokens are then concatenated with the textual tokens, forming a multi-modal token sequence, and the whole sequence is used to train the model on the next-token prediction task.

Overall architecture of the vision tower in VILA-U. Figure from [15]

To efficiently discretize the visual tokens, the authors adopt the Residual Vector Quantization (RVQ) method proposed in RQ-VAE. Given an input vector \(\bf{z}\), we want to represent it as a sequence of \(D\) discrete codes:

\[ RQ(z; C, D) = (k_1, \cdots, k_D) \in [K]^D \]

where \(C\) is a codebook containing \(K\) pre-defined vectors (also referred to as codes), \(D\) is the number of quantization steps, and \(k_d\) is the code selected from the codebook at step \(d\). RVQ decomposes the vector \(\bf{z}\) into residuals over \(D\) steps: at each step \(d\), it first computes the code \(k_d\) by finding the codebook entry closest to the current residual:

\[ k_d = Q(r_{d-1}, C) = \underset{k \in [K]}{\text{argmin}} \|r_{d-1} - e(k)\|^2, \]

where \(e(k)\) is the embedding of the code \(k\), and \(r_{d-1}\) is the current residual. Then, we would update the residual as follows:

\[ r_d = r_{d-1} - e(k_d) \]

This step is repeated for \(d = 1, 2, \cdots, D\), starting with \(r_0 = z\). Finally, the approximation \(\hat{\bf{z}}\) of \(\bf{z}\) is computed as:

\[ \mathbf{\hat{z}} = \sum_{i=1}^D e(k_i) \]
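The residual quantization loop maps directly onto code. Below is a minimal PyTorch sketch of RQ for a single vector, following the definitions above; the random codebook and toy sizes are purely for illustration.

```python
import torch

def residual_quantize(z: torch.Tensor, codebook: torch.Tensor, D: int):
    """Residual vector quantization of a single vector z.

    codebook: (K, dim) tensor of code embeddings e(k). Returns the D selected
    code indices (k_1, ..., k_D) and the reconstruction z_hat = sum_d e(k_d).
    """
    residual = z.clone()                   # r_0 = z
    codes, z_hat = [], torch.zeros_like(z)
    for _ in range(D):
        # k_d = argmin_k || r_{d-1} - e(k) ||^2
        dists = ((residual.unsqueeze(0) - codebook) ** 2).sum(dim=-1)
        k = int(torch.argmin(dists))
        codes.append(k)
        z_hat = z_hat + codebook[k]
        residual = residual - codebook[k]  # r_d = r_{d-1} - e(k_d)
    return codes, z_hat

codebook = torch.randn(512, 16)            # K = 512 codes of dimension 16 (toy values)
z = torch.randn(16)
codes, z_hat = residual_quantize(z, codebook, D=4)
print(codes, torch.norm(z - z_hat))        # the residual error shrinks as D grows
```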

Generative Pre-training

To achieve text and discrete visual token alignment, the authors propose using text-image contrastive loss along with VQ-based image reconstruction loss to equip the model with image generation capabilities.

However, while the text-image contrastive loss requires high-level semantic understanding, the reconstruction loss requires low-level features. Naively training the vision tower from scratch to optimize both objectives simultaneously therefore leads to conflicting goals; in fact, a vision tower trained this way reaches only 5% accuracy on zero-shot ImageNet classification. The adopted approach is to first initialize the vision tower with CLIP weights for text-image alignment, then freeze the text encoder and train only the visual encoder with both the contrastive and reconstruction losses:

\[ \mathcal{L}_{\text{total}} = w_{\text{contra}} \mathcal{L}_{\text{contra}} + w_{\text{recon}} \mathcal{L}_{\text{recon}} \]

Various training approaches were attempted, but each faces significant challenges. Loading pre-trained CLIP weights into the text encoder and RQ-VAE weights into the vision encoder and decoder, while training the rest of the model from scratch, fails due to the absence of pre-trained CLIP weights for the vision encoder. Another approach, freezing the vision encoder, prevents the model from learning the crucial low-level features needed for reconstruction. Finally, making the text encoder trainable leads to instability in the early stages, as chaotic quantized features and the contrastive loss disrupt its weights, slowing down training overall.

Since both visual tokens and text tokens are discrete, we can train our LLM with the general language modeling next-token prediction objective. However, due to the use of residual quantization for visual tokens, the training objectives for text and visual tokens differ slightly. For text tokens, the negative log-likelihood loss is calculated as:

\[ \mathcal{L}_{\text{text}} = -\sum_{i=1}^T \log P_\theta(y_i \mid y_{\lt i}), \]

where \(T\) is the length of the multi-modal sequence and the sum runs only over positions \(i\) where a text token appears.

For visual tokens, residual quantization introduces a depth-stacked structure of codes at each visual position \(j\). Specifically, given the code embedding \(h_j\) generated by the LLM for visual tokens at position \(j\), the depth transformer autoregressively predicts \(D\) residual tokens \((k_{j1}, \cdots, k_{jD})\). During training, the input of the depth transformer \(v_{jd}\) at depth \(d\) is defined as the sum of the code embeddings of up to depth \(d-1\) for \(d>1\), such that \(v_{jd} = \sum_{d'=1}^{d-1} e(k_{jd'})\) and \(v_{j1} = h_j\). Thus, the depth transformer predicts the next code for a finer estimation of the feature \(\hat{z}_j\) based on the previous estimations up to \(d-1\). The negative log-likelihood loss for visual tokens is: \[ \mathcal{L}_{\text{visual}} = -\sum_{j=1}^T \sum_{d=1}^D \log P_\delta(k_{jd} \mid k_{j,\lt d}), \]

where \(T\) is the length of the multi-modal sequence and the sum runs only over positions \(j\) where a visual token appears. During multi-modal pre-training, the weights of the depth transformer are randomly initialized and updated together with the LLM.
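A small sketch of how the depth-transformer inputs \(v_{jd}\) can be built from the definitions above. The shapes, the embedding table, and the toy code IDs are illustrative; the actual depth-transformer architecture is not reproduced here.

```python
import torch

def depth_inputs(h_j: torch.Tensor, code_ids: torch.Tensor,
                 code_embed: torch.nn.Embedding) -> torch.Tensor:
    """Build v_{j1..jD}: v_{j1} = h_j and v_{jd} = sum of code embeddings up to depth d-1."""
    D = code_ids.shape[0]
    e = code_embed(code_ids)                           # (D, dim) embeddings e(k_{jd})
    prefix = torch.cumsum(e, dim=0)                    # prefix sums over depth
    v = [h_j] + [prefix[d - 1] for d in range(1, D)]   # v_{j1}, v_{j2}, ..., v_{jD}
    return torch.stack(v)                              # (D, dim)

embed = torch.nn.Embedding(512, 64)                    # toy codebook-embedding table
v = depth_inputs(torch.randn(64), torch.tensor([3, 41, 7, 255]), embed)
print(v.shape)                                         # torch.Size([4, 64]): one input per depth
```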

Table: Reconstruction FID (rFID, lower is better) and ImageNet top-1 accuracy of the VILA-U vision tower compared with VQ-GAN and RQ-VAE tokenizers. Table source: Wu et al. (2024) [15].

| Model | Pretrained weights | Resolution | Shape of code | rFID ↓ | Top-1 accuracy ↑ |
|---|---|---|---|---|---|
| VQ-GAN | – | 256 × 256 | 16 × 16 | 4.98 | – |
| RQ-VAE | – | 256 × 256 | 8 × 8 × 4 | 3.20 | – |
| RQ-VAE | – | 256 × 256 | 16 × 16 × 4 | 1.30 | – |
| VILA-U | SigLIP-Large | 256 × 256 | 16 × 16 × 4 | 1.80 | 73.3 |
| VILA-U | SigLIP-SO400M | 384 × 384 | 27 × 27 × 16 | 1.25 | 78.0 |

NVILA

NVILA [19] is a model built on top of VILA to improve training efficiency on both the algorithm and the system side while maintaining the model's performance. In the paper, the authors propose using data pruning and FP8 mixed-precision training.

Scale-and-compress scheme

Scaling preserves the visual details, while compression reduces the computation. The compressed visual tokens are expected to be information-dense, since visual tokens are often found to be redundant [20, 21]. In this paper, visual tokens are scaled in two ways, spatially and temporally: spatial scaling increases the image resolution, and temporal scaling increases the number of frames sampled from the video.

Visual token compression

The spatial scaling of visual tokens builds on \(S^2\) [22]. Each image is first resized to several resolutions (448×448, 896×896, and 1344×1344); each resized image is then split into tiles of 448×448, and the tile feature maps from the vision backbone are concatenated. However, resizing every image to a square can introduce distortion.

Table 10: Ablation of visual token scaling and compression designs in NVILA. Table source: Liu et al. (2024) [19].

| Model | Spatial pooling | #Tokens/tile | #Tiles/image | AI2D | DocVQA | TextVQA | Image-10 |
|---|---|---|---|---|---|---|---|
| Baseline (VILA-1.5) | 2×2 | 256 (16×16) | 1 | 87.0 | 61.3 | 67.5 | 61.2 |
| Scale (Dynamic-S2) | 2×2 | 256 (16×16) | 9–12 | 90.1 | 91.1 | 77.0 | 71.5 |
| Scale + Compress | 3×3 | 121 (11×11) | 1–12 | 87.4 | 82.3 | 74.1 | 67.1 |
| Scale + Compress + VEP | 3×3 | 121 (11×11) | 1–12 | 89.8 | 88.8 | 76.1 | 70.8 |
| *Alternative designs* | | | | | | | |
| TokenLearner | – | 121 | 1–12 | 90.0 | 86.5 | 75.6 | 69.8 |
| Perceiver Resampler | – | 121 | 1–12 | 76.8 | 71.8 | 65.3 | 59.4 |

Therefore, Dynamic-\(S^2\) is proposed: it resizes the image to the resolution closest to the original that is divisible by 448. With Dynamic-\(S^2\), model performance improves by up to 30% on text-heavy tasks, as shown in Table 10.
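A rough sketch of the tiling arithmetic (PIL-based; the 448-pixel tile size follows the paper, while the rounding policy shown here, snapping each side to the nearest multiple of 448, is a simplification of Dynamic-\(S^2\)):

```python
from PIL import Image

TILE = 448

def dynamic_resize_and_tile(img: Image.Image):
    """Resize to the nearest resolution divisible by the tile size, then cut into tiles."""
    w, h = img.size
    # Round each side to the nearest multiple of TILE (at least one tile per side).
    new_w = max(TILE, round(w / TILE) * TILE)
    new_h = max(TILE, round(h / TILE) * TILE)
    img = img.resize((new_w, new_h))
    tiles = []
    for top in range(0, new_h, TILE):
        for left in range(0, new_w, TILE):
            tiles.append(img.crop((left, top, left + TILE, top + TILE)))
    return tiles   # each tile is encoded separately; their feature maps are concatenated

tiles = dynamic_resize_and_tile(Image.new("RGB", (1100, 650)))
print(len(tiles))   # a 1100x650 image is resized to 896x448 -> 2 tiles
```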

Efficient training

Overall architecture of the NVILA model. Figure from [19].

The NVILA training process consists of five stages: projector initialization, visual encoder pre-training, token processor pre-training, image instruction tuning, and video instruction tuning. Stages 1, 3, and 4 are similar to VILA, while stage 2 focuses on maintaining model accuracy after visual token compression, and stage 5 improves the ability to process long videos. Similar to LongVILA, NVILA aims to improve training efficiency on both the algorithm and the system side, using data pruning with DeltaLoss [23] and the FP8 mixed-precision technique. Given that the dataset used to train NVILA contains over 100M samples, data pruning is essential. In NVILA, DeltaLoss is used to prune the data in an unsupervised way and is defined as:

\[ D' = \bigcup_{i=1}^{k} \operatorname{top-K} \left\{ \log \frac{p_{\text{large}}(x)}{p_{\text{small}}(x)} \,\Big|\, x \in D_i \right\} \]

Here, \(D_i\) is the \(i\)-th subset of the fine-tuning dataset, \(p_{\text{large}}\) and \(p_{\text{small}}\) are the probabilities assigned by a larger and a smaller reference model, and \(D'\) is the pruned training set. According to the ablation study, pruning 50% of the data is a safe threshold: it barely hurts accuracy while speeding up training by roughly 2×.
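The pruning criterion can be sketched as follows, assuming per-example log-probabilities from the small and large reference models are already available; the helper name and data layout are hypothetical.

```python
import heapq
from typing import Dict, List

def deltaloss_prune(subsets: List[List[str]],
                    logp_large: Dict[str, float],
                    logp_small: Dict[str, float],
                    keep_ratio: float = 0.5) -> List[str]:
    """Keep the top-K examples per subset, ranked by log p_large(x) - log p_small(x)."""
    kept: List[str] = []
    for subset in subsets:
        k = max(1, int(len(subset) * keep_ratio))
        scores = {x: logp_large[x] - logp_small[x] for x in subset}
        kept += heapq.nlargest(k, scores, key=scores.get)
    return kept

# Toy usage with two subsets of example IDs and made-up log-probabilities.
subsets = [["a", "b", "c", "d"], ["e", "f"]]
lp_large = {"a": -1.0, "b": -2.0, "c": -0.5, "d": -3.0, "e": -1.2, "f": -0.9}
lp_small = {"a": -1.5, "b": -1.0, "c": -2.0, "d": -2.5, "e": -1.0, "f": -2.0}
print(deltaloss_prune(subsets, lp_large, lp_small))   # ['c', 'a', 'f']
```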

FP8 training

FP8 precision offers significant benefits in computational and memory efficiency, supported by new NVIDIA GPUs like the H100 and B200. It accelerates training by performing matrix multiplications in FP8 and quantizing gradients, weights, and momentum, reducing memory usage and communication overhead.
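As an illustration only (not NVILA's actual training code), FP8 matrix multiplications can be enabled with NVIDIA's Transformer Engine, which wraps supported layers in an FP8 autocast context. The recipe settings below are assumptions, and running this requires an FP8-capable GPU (e.g., Hopper).

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Delayed-scaling recipe; the margin and format choices here are illustrative defaults.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

layer = te.Linear(4096, 4096, bias=True).cuda()   # Transformer Engine linear layer
x = torch.randn(16, 4096, device="cuda", requires_grad=True)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)          # the matmul runs in FP8 inside the autocast region
y.sum().backward()        # gradients flow back through the FP8 layer
```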

Table: Throughput and accuracy of NVILA FP8 training under different batch sizes (BS) and gradient-checkpointing (GC) settings. Table source: Liu et al. (2024) [19].

| GC | BS | Throughput | MMMU | Video-MME |
|---|---|---|---|---|
| | 4 | 199.2 (1.0×) | 47.9 | 52.9 |
| | 16 | 390.1 (2.0×) | 47.0 | 53.0 |
| | 30 | 491.7 (2.5×) | 47.8 | 53.1 |
| | 36 | 579.9 (2.9×) | 47.7 | 53.0 |

For NVILA, FP8 allows larger batch sizes, resulting in a 2× speedup. In VLM training, where sequence lengths vary widely (e.g., video tokens vs. image tokens), FP8 helps optimize performance. When gradient checkpointing is used, FP8 still provides a 1.2× speedup over BF16 training. Overall, NVILA is found to rival top-tier proprietary models such as GPT-4o and Gemini 1.5 Pro, and it sometimes even surpasses them.

Acknowledgments

Thanks to TMA Solutions for providing the compute resources for VILA1.5-3B model inference.

References

1.
Bordes F, Pang RY, Ajay A, et al (2024) An introduction to vision-language modeling
2.
Lin J, Yin H, Ping W, Molchanov P, Shoeybi M, Han S (2024) VILA: On pre-training for visual language models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR). pp 26689–26699
3.
Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: International conference on machine learning. PMLR
4.
Touvron H, Martin L, Stone K, et al (2023) Llama 2: Open foundation and fine-tuned chat models. arXiv
5.
Han C, Wang Q, Cui Y, et al (2024) Facing the Elephant in the Room: Visual Prompt Tuning or Full finetuning? In: International Conference on Learning Representations (ICLR)
6.
Zhu W, Hessel J, Awadalla A, et al (2023) Multimodal C4: An open, Billion-scale Corpus of Images Interleaved with Text. arXiv
7.
Byeon M, Park B, Kim H, et al (2022) COYO-700M: Image-text pair dataset
8.
Alayrac J-B, Donahue J, Luc P, et al (2022) Flamingo: A visual language model for few-shot learning. In: Oh AH, Agarwal A, Belgrave D, Cho K (eds) Advances in neural information processing systems
9.
Hessel J, Holtzman A, Forbes M, Bras RL, Choi Y (2021) CLIPScore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718
10.
Gadre SY, Ilharco G, Fang A, et al (2023) DataComp: In search of the next generation of multimodal datasets. In: Advances in neural information processing systems, datasets and benchmarks track
11.
Fang Y, Zhu L, Lu Y, et al (2024) VILA\(^2\): VILA augmented VILA. arXiv preprint arXiv:2407.17453
12.
Li J, Li D, Xiong C, Hoi SCH (2022) BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International conference on machine learning (ICML). pp 12888–12900
13.
Xue F, Chen Y, Li D, et al (2024) LongVILA: Scaling long-context visual language models for long videos. arXiv preprint arXiv:2408.10188
14.
Gu D, Sun P, Hu Q, et al (2024) LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism. arXiv
15.
Wu Y, Zhang Z, Chen J, et al (2024) VILA-U: A Unified Foundation Model Integrating Visual Understanding and Generation. arXiv
16.
Sun Q, Yu Q, Cui Y, et al (2024) Emu: Generative Pretraining in Multimodality. In: The Twelfth International Conference on Learning Representations
17.
Jin Y, Xu K, Xu K, et al (2024) Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization. The Twelfth International Conference on Learning Representations abs/2309.04669
18.
Yu L, Shi B, Pasunuru R, et al (2023) Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning. arXiv abs/2309.02591
19.
Liu Z, Zhu L, Shi B, et al (2024) NVILA: Efficient Frontier Visual Language Models. arXiv
20.
Liu T, Shi L, Hong R, Hu Y, Yin Q, Zhang L (2024) Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model. arXiv
21.
Xing L, Huang Q, Dong X, et al (2024) PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction. arXiv
22.
Shi B, Wu Z, Mao M, Wang X, Darrell T (2025) When Do We Not Need Larger Vision Models? In: European Conference on Computer Vision (ECCV). Springer Nature Switzerland, pp 444–462
23.
Gu Y, Zhou H, Meng F, Zhou J, Huang M (2024) MiniPLM: Knowledge Distillation for Pre-Training Language Models. arXiv

Footnotes

  1. See my blog post on training data for VLM.↩︎