Model Overview
NVIDIA CUDA Autocomplete is a fine-tuned version of Qwen/Qwen2.5-Coder-7B enhanced for CUDA code completion. The model takes two strings of code context as input, the prefix (code before the cursor) and the suffix (code after the cursor), and outputs code that logically continues the prefix. By analyzing the surrounding code structure, variable names, and CUDA-specific patterns, the model predicts the most likely next line of code, enabling intelligent autocomplete for general programming and CUDA development in the Nsight Copilot extension for VSCode and Cursor.
This model is ready for commercial/non-commercial use.
License/Terms of Use
Use of this model is governed by the NVIDIA Open Model License Agreement.
Additional Information: Qwen2.5-Coder-7B is released under the Apache License, Version 2.0.
Deployment Geography
Global
Use Case
This model is intended to be used for code completion in the Nsight Copilot extension for VSCode / Cursor.
Release Date
Hugging Face: 06/09/2026 via https://huggingface.co/nvidia/CUDA-Autocomplete
Reference(s)
- Qwen2.5-Coder paper
- Qwen2.5-Coder blog
- Qwen2.5-Coder GitHub repository
Model Architecture
Architecture Type: Transformer
Network Architecture: Qwen2ForCausalLM
This model was developed based on Qwen/Qwen2.5-Coder-7B.
Number of model parameters: 7B (7*10^9)
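As a rough illustration of what the stated parameter count implies for deployment, the sketch below estimates the memory needed for the weights alone at a few common precisions. The card does not state the checkpoint's data type, so the precisions listed are assumptions, and activations, KV cache, and runtime overhead are not included.

```python
# Back-of-the-envelope estimate of weight memory only.
# PARAMS comes from the card; the precisions are assumptions for illustration.
PARAMS = 7e9

BYTES_PER_PARAM = {"fp32": 4, "bf16/fp16": 2, "int8": 1}


def weight_gb(params: float, bytes_per_param: int) -> float:
    """Return the weight footprint in gigabytes (10^9 bytes)."""
    return params * bytes_per_param / 1e9


for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name}: ~{weight_gb(PARAMS, nbytes):.0f} GB")
```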
Input
Input Type(s): Code
Input Format(s): String of code (meant for prefix code and suffix code)
Input Parameters: One-Dimensional (1D)
Other Properties Related to Input:
- Context Window: The model processes sequential code text with prefix and suffix context
- Encoding: UTF-8 text encoding
- Input Structure: Fill-in-the-middle (FIM) format with prefix and suffix tokens
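The fill-in-the-middle input structure described above can be sketched as a small prompt builder. The special-token strings below are the ones used by the base Qwen2.5-Coder models; this card does not confirm that the fine-tuned model keeps the same tokens or ordering, so treat them as an assumption. The helper name `build_fim_prompt` is hypothetical.

```python
# Sketch only: token names follow the base Qwen2.5-Coder FIM convention and
# are assumed (not confirmed by this card) to apply to the fine-tuned model.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Arrange prefix and suffix so the model generates the missing middle."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


# Code before the cursor (prefix) and after the cursor (suffix).
prefix = "__global__ void add(const float* a, const float* b, float* c, int n) {\n    int i = "
suffix = "\n    if (i < n) c[i] = a[i] + b[i];\n}\n"

prompt = build_fim_prompt(prefix, suffix)
```

The resulting string is what would be sent to the model as a plain completion prompt; the generated text is the code that belongs at the cursor position.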
Output
Output Type(s): Code
Output Format: String
Output Parameters: One-Dimensional (1D)
Other Properties Related to Output:
- Output Length: Single line of code completion
- Generation Method: Autoregressive token-by-token generation
- Encoding: UTF-8 text encoding
- Output Structure: Sequential code text that continues from the input prefix
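Because the output is described as a single line of completion, a client may want to guard against a raw generation that runs past the first line. The helper below is a minimal sketch of such post-processing; `first_line` is a hypothetical name and is not part of the model or the Nsight Copilot extension.

```python
def first_line(completion: str) -> str:
    """Return the first non-empty line of a raw completion.

    Leading whitespace is kept because indentation matters in code;
    only trailing whitespace is trimmed.
    """
    for line in completion.splitlines():
        if line.strip():
            return line.rstrip()
    return ""


raw = "blockIdx.x * blockDim.x + threadIdx.x;\n    if (i < n) {"
suggestion = first_line(raw)
```

An alternative is to pass a newline stop sequence to the inference engine so that generation ends at the line break instead of being trimmed afterwards.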
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Software Integration
Runtime Engine(s): vLLM
Supported Hardware Microarchitecture Compatibility:
- H100
- DGX Spark
Supported Operating System(s): Linux
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
Model Version(s)
v0.3.0
Training, Testing, and Evaluation Datasets
Training Dataset
- Source: Subset of bigcode/the-stack-v2 and CUDA data synthetically generated with OSS models such as GPT-OSS 120B
- Data Modality: Text
- Text Training Data Size: ~700,000 samples
- Data Collection Method by dataset: Hybrid: Automated, Synthetic
- Labeling Method by dataset: Not Applicable
- Properties (Quantity, Dataset Descriptions, Sensor(s)): ~700,000 samples. Text modality (source code). Content includes open-source CUDA and general programming code collected from permissively licensed repositories, as well as machine-generated synthetic CUDA code produced by OSS models. Primarily English-language code with CUDA-specific constructs and APIs. No sensor data involved.
Testing Dataset
- Source: NVIDIA Internal Data
- Data Collection Method by dataset: Automated
- Labeling Method by dataset: Not Applicable
- Properties (Quantity, Dataset Descriptions, Sensor(s)): 2,156 samples. Text modality (source code). Content consists of internal proprietary CUDA and HPC library code (e.g., cuDNN, cuda-hpc) parsed from internal GitLab repositories. Code is CUDA-specific with domain-specific APIs and patterns. No sensor data involved.
Evaluation Dataset
- Source: Subset of bigcode/the-stack-v2
- Data Collection Method by dataset: Automated
- Labeling Method by dataset: Not Applicable
- Properties (Quantity, Dataset Descriptions, Sensor(s)): ~33,000 samples. Each sample corresponds to a single source code file. Text modality (source code). Content includes open-source code collected from permissively licensed repositories. CUDA and general programming code in English. No sensor data involved.
Inference
Acceleration Engine: vLLM
Test Hardware:
- H100
- DGX Spark
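For readers who want to try the model through vLLM, the following is a minimal client sketch, not an official recipe. It assumes a vLLM server is already running locally and exposing its OpenAI-compatible completions endpoint on port 8000, and that the base Qwen2.5-Coder FIM tokens apply to this model. The URL, the sampling settings, and the function names `build_request` and `complete` are illustrative assumptions.

```python
import json
import urllib.request

# Assumptions: local vLLM server with the OpenAI-compatible API on port 8000,
# and FIM special tokens inherited from the base Qwen2.5-Coder model.
BASE_URL = "http://localhost:8000/v1/completions"
MODEL = "nvidia/CUDA-Autocomplete"


def build_request(prefix: str, suffix: str, max_tokens: int = 64) -> dict:
    """Build a completions payload that asks for one line at the cursor."""
    prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"
    return {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy decoding for repeatable suggestions
        "stop": ["\n"],      # end generation at the first line break
    }


def complete(prefix: str, suffix: str, url: str = BASE_URL) -> str:
    """Send the request to the server and return the generated text."""
    body = json.dumps(build_request(prefix, suffix)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["text"]
```

With a server running, `complete("int i = ", ";\n")` would return the suggested text for the cursor position. A plain completions request is used here rather than a chat request because the card describes the input as prefix and suffix code, not conversational messages.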
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.