Libraries

How to use nvidia/CUDA-Autocomplete with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="nvidia/CUDA-Autocomplete")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nvidia/CUDA-Autocomplete")
# The network architecture is Qwen2ForCausalLM, so the causal-LM auto class applies
model = AutoModelForCausalLM.from_pretrained("nvidia/CUDA-Autocomplete")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use nvidia/CUDA-Autocomplete with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "nvidia/CUDA-Autocomplete"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/CUDA-Autocomplete",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
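
The curl call above uses the chat endpoint, while the model card describes prefix/suffix code completion. The following is a minimal Python sketch of a completion-style client under stated assumptions: the server runs on vLLM's default port 8000, the raw-text `/v1/completions` endpoint of the OpenAI-compatible API is used, and the model accepts the fill-in-the-middle tokens of its Qwen2.5-Coder base model. The helper names `build_completion_request` and `complete` are hypothetical, and the sampling settings are illustrative only.

```python
import json
import urllib.request

# Assumptions: server started with `vllm serve` on the default port 8000, and
# the model accepts Qwen2.5-Coder-style FIM tokens in a raw completion prompt.
BASE_URL = "http://localhost:8000/v1"
MODEL = "nvidia/CUDA-Autocomplete"


def build_completion_request(prefix: str, suffix: str, max_tokens: int = 64) -> dict:
    """Build the JSON body for the OpenAI-compatible /v1/completions endpoint."""
    prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"
    return {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,  # illustrative cap, not a documented value
        "temperature": 0.0,        # greedy decoding for repeatable completions
    }


def complete(prefix: str, suffix: str) -> str:
    """Send the request to a running server and return the generated text."""
    body = json.dumps(build_completion_request(prefix, suffix)).encode("utf-8")
    req = urllib.request.Request(
        f"{BASE_URL}/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]


# Usage (requires the vLLM server above to be running):
#   print(complete("int i = ", ";\n"))
```

Changing `BASE_URL` to `http://localhost:30000/v1` should point the same sketch at the SGLang server, assuming it exposes the same completions route.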

Use Docker

docker model run hf.co/nvidia/CUDA-Autocomplete

SGLang

How to use nvidia/CUDA-Autocomplete with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "nvidia/CUDA-Autocomplete" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/CUDA-Autocomplete",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "nvidia/CUDA-Autocomplete" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/CUDA-Autocomplete",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use nvidia/CUDA-Autocomplete with Docker Model Runner:
```
docker model run hf.co/nvidia/CUDA-Autocomplete
```

Model Overview

NVIDIA CUDA Autocomplete is a fine-tuned version of Qwen/Qwen2.5-Coder-7B enhanced for CUDA code completion. The model takes two strings of code context as input, the prefix (code before the cursor) and the suffix (code after the cursor), and outputs several lines of code that logically continue the prefix. By analyzing the surrounding code structure, variable names, and CUDA-specific patterns, the model predicts the most likely next line of code, enabling intelligent autocomplete functionality for general programming and CUDA development in the Nsight Copilot extension for VSCode and Cursor.

This model is ready for commercial/non-commercial use.

License/Terms of Use

Use of this model is governed by the NVIDIA Open Model License Agreement.

Additional Information: Qwen2.5-Coder-7B is released under the Apache License, Version 2.0.

Deployment Geography

Global

Use Case

This model is intended to be used for code completion in the Nsight Copilot extension for VSCode / Cursor.

Release Date

Hugging Face: 06/09/2026 via https://huggingface.co/nvidia/CUDA-Autocomplete

Reference(s)

Qwen2.5-Coder paper
Qwen2.5-Coder blog
Qwen2.5-Coder GitHub repository

Model Architecture

Architecture Type: Transformer
Network Architecture: Qwen2ForCausalLM
This model was developed based on Qwen/Qwen2.5-Coder-7B.
Number of model parameters: 7B (7*10^9)
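
As a rough back-of-the-envelope check, the parameter count gives a lower bound on the GPU memory needed for the weights. The sketch below assumes BF16 weights (2 bytes per parameter, the tensor type listed for this model's Safetensors files) and ignores the KV cache, activations, and runtime overhead; `weight_memory_bytes` is a hypothetical helper name.

```python
def weight_memory_bytes(num_params: float, bytes_per_param: int = 2) -> float:
    """Estimate memory for the model weights alone (no KV cache or activations)."""
    return num_params * bytes_per_param


params = 7e9  # "7B (7*10^9)" as stated in the model card
total = weight_memory_bytes(params)  # BF16 stores each parameter in 2 bytes
print(f"{total / 1e9:.1f} GB ({total / 2**30:.1f} GiB)")
```

Under these assumptions the weights alone come to about 14 GB, so actual serving needs more than that.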

Input

Input Type(s): Code
Input Format(s): String of code (meant for prefix code and suffix code)
Input Parameters: One-Dimensional (1D)
Other Properties Related to Input:

Context Window: The model processes sequential code text with prefix and suffix context
Encoding: UTF-8 text encoding
Input Structure: Fill-in-the-middle (FIM) format with prefix and suffix tokens
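
The FIM input structure listed above can be sketched as follows. This is a minimal sketch that assumes the model keeps the fill-in-the-middle special tokens of its base model, Qwen2.5-Coder (`<|fim_prefix|>`, `<|fim_suffix|>`, `<|fim_middle|>`), in prefix-suffix-middle order; the model card does not spell out the exact prompt layout, and `build_fim_prompt` is a hypothetical helper name.

```python
# Assumption: FIM special tokens inherited from the Qwen2.5-Coder base model.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Arrange the code before and after the cursor in prefix-suffix-middle order."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


# Code before the cursor (prefix) and after the cursor (suffix) in a CUDA kernel.
prefix = (
    "__global__ void add(const float* a, const float* b, float* c, int n) {\n"
    "    int i = "
)
suffix = "\n    if (i < n) c[i] = a[i] + b[i];\n}\n"

prompt = build_fim_prompt(prefix, suffix)
print(prompt)
```

The model would then be expected to generate the text that belongs at the cursor, directly after the `<|fim_middle|>` token.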

Output

Output Type(s): Code
Output Format: String
Output Parameters: One-Dimensional (1D)
Other Properties Related to Output:

Output Length: Single line of code completion
Generation Method: Autoregressive token-by-token generation
Encoding: UTF-8 text encoding
Output Structure: Sequential code text that continues from the input prefix
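
Because generation is autoregressive, a client that wants exactly one line may need to trim the raw output itself. The sketch below shows one possible post-processing step; `first_line` is a hypothetical helper, and nothing in the model card says the Nsight Copilot extension works this way.

```python
def first_line(completion: str) -> str:
    """Return the completion up to, but not including, the first line break."""
    # Split once on "\n" and drop a trailing "\r" so Windows line endings work too.
    return completion.split("\n", 1)[0].rstrip("\r")


raw = "blockIdx.x * blockDim.x + threadIdx.x;\n    if (i < n) {"
print(first_line(raw))
```

If the raw completion begins with a line break, this helper returns an empty string; a caller could strip leading line breaks first if that behavior is not wanted.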

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration

Runtime Engine(s): vLLM
Supported Hardware Microarchitecture Compatibility:

H100
DGX Spark

Supported Operating System(s): Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Model Version(s)

v0.3.0

Training, Testing, and Evaluation Datasets

Training Dataset

Source: Subset of bigcode/the-stack-v2 and synthetic CUDA data generated with OSS models such as GPT-OSS 120B
Data Modality: Text
Text Training Data Size: ~700,000 samples
Data Collection Method by dataset: Hybrid: Automated, Synthetic
Labeling Method by dataset: Not Applicable
Properties (Quantity, Dataset Descriptions, Sensor(s)): ~700,000 samples. Text modality (source code). Content includes open-source CUDA and general programming code collected from permissive-licensed repositories, as well as machine-generated synthetic CUDA code produced by OSS models. Primarily English-language code with CUDA-specific constructs and APIs. No sensor data involved.

Testing Dataset

Source: NVIDIA Internal Data
Data Collection Method by dataset: Automated
Labeling Method by dataset: Not Applicable
Properties (Quantity, Dataset Descriptions, Sensor(s)): 2,156 samples. Text modality (source code). Content consists of internal proprietary CUDA and HPC library code (e.g., cuDNN, cuda-hpc) parsed from internal GitLab repositories. Code is CUDA-specific with domain-specific APIs and patterns. No sensor data involved.

Evaluation Dataset

Source: Subset of bigcode/the-stack-v2
Data Collection Method by dataset: Automated
Labeling Method by dataset: Not Applicable
Properties (Quantity, Dataset Descriptions, Sensor(s)): ~33,000 samples. Each sample corresponds to a single source code file. Text modality (source code). Content includes open-source code collected from permissive-licensed repositories. CUDA and general programming code in English. No sensor data involved.

Inference

Acceleration Engine: vLLM

Test Hardware:

H100
DGX Spark

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.

Safetensors

Model size: 8B params
Tensor type: BF16

Model tree for nvidia/CUDA-Autocomplete

Base model: Qwen/Qwen2.5-7B
Finetuned: Qwen/Qwen2.5-Coder-7B
Finetuned (98): this model

Paper for nvidia/CUDA-Autocomplete

Qwen2.5-Coder Technical Report (arXiv:2409.12186, published Sep 18, 2024)