Instructions for using prithivMLmods/MiniCPM-V-4.6-abliterated-MAX with libraries and local apps.

Libraries

How to use prithivMLmods/MiniCPM-V-4.6-abliterated-MAX with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="prithivMLmods/MiniCPM-V-4.6-abliterated-MAX")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
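# Keep the result so it can be inspected; the bare call discards it outside a notebook
result = pipe(text=messages)
print(result)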

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("prithivMLmods/MiniCPM-V-4.6-abliterated-MAX")
model = AutoModelForImageTextToText.from_pretrained("prithivMLmods/MiniCPM-V-4.6-abliterated-MAX")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use prithivMLmods/MiniCPM-V-4.6-abliterated-MAX with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "prithivMLmods/MiniCPM-V-4.6-abliterated-MAX"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "prithivMLmods/MiniCPM-V-4.6-abliterated-MAX",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
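
The same request can be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_payload` and `chat` are hypothetical, and the default base URL assumes the vLLM port used in the curl call above.

```python
import json
import urllib.request

MODEL_ID = "prithivMLmods/MiniCPM-V-4.6-abliterated-MAX"


def build_chat_payload(prompt: str, image_url: str, model: str = MODEL_ID) -> dict:
    """Build the same JSON body as the curl example: one user turn with text and an image."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def chat(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to /v1/chat/completions and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_chat_payload(
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)
# With the server running: reply = chat(payload)
```

Keeping payload construction separate from the network call makes the request body easy to inspect or reuse against a different port.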

Use Docker

docker model run hf.co/prithivMLmods/MiniCPM-V-4.6-abliterated-MAX

SGLang

How to use prithivMLmods/MiniCPM-V-4.6-abliterated-MAX with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "prithivMLmods/MiniCPM-V-4.6-abliterated-MAX" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "prithivMLmods/MiniCPM-V-4.6-abliterated-MAX",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
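
Because the endpoint is OpenAI-compatible, the generated text is expected under `choices[0].message.content` in the JSON reply. Here is a small sketch of pulling it out; `extract_reply` is a hypothetical helper, and `sample_response` is a hand-written illustration of the response shape, not real output from this model.

```python
import json


def extract_reply(response_body: str) -> str:
    """Return the assistant text from an OpenAI-style chat completion JSON body."""
    data = json.loads(response_body)
    choices = data.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]


# Hand-written example of the response shape (illustrative only).
sample_response = json.dumps(
    {
        "object": "chat.completion",
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": "<model reply>"},
                "finish_reason": "stop",
            }
        ],
    }
)
print(extract_reply(sample_response))
```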

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "prithivMLmods/MiniCPM-V-4.6-abliterated-MAX" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "prithivMLmods/MiniCPM-V-4.6-abliterated-MAX",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use prithivMLmods/MiniCPM-V-4.6-abliterated-MAX with Docker Model Runner:
```
docker model run hf.co/prithivMLmods/MiniCPM-V-4.6-abliterated-MAX
```
Browse Quantizations to use this model in llama.cpp, Ollama, LM Studio, or any compatible app.

MiniCPM-V-4.6-abliterated-MAX

MiniCPM-V-4.6-abliterated-MAX is an optimized release built on top of huihui-ai/Huihui-MiniCPM-V-4.6-abliterated. This version updates the shard sizes, optimizes the repository layout, and improves compatibility with the latest Transformers releases, while preserving the capabilities of the original model. The result is a compact, efficient multimodal language model for image, video, and text understanding, with streamlined deployment and inference workflows.

This model is intended for research and learning purposes only. Any content generated by it is used at the user’s own risk. The authors and hosting page disclaim any liability for outputs produced by this model. Users are responsible for ensuring safe, ethical, and lawful usage.

Evals

.eval_results: harm_bench_score.yaml

The evaluation was conducted using 2,000 random harmful test prompts to measure the refusal behavior of the language model. The self-reported evaluations provided here are intended only to give an overview of the model. Scores may vary depending on the benchmark and the evaluation strategy used.
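
For readers who want to compute a comparable number on their own prompts, a refusal rate is the fraction of responses classified as refusals. The sketch below uses a keyword heuristic as the classifier; the heuristic, the marker list, and the function names are assumptions for illustration, not the method used to produce the reported score.

```python
from typing import Iterable

# Hypothetical refusal markers; a real evaluation would likely use a stronger classifier.
REFUSAL_MARKERS = (
    "i can't",
    "i cannot",
    "i won't",
    "i'm sorry",
    "i am unable",
)


def is_refusal(response: str) -> bool:
    """Treat a response as a refusal if it contains any marker phrase (case-insensitive)."""
    lowered = response.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def refusal_rate(responses: Iterable[str]) -> float:
    """Fraction of responses flagged as refusals; 0.0 for an empty input."""
    items = list(responses)
    if not items:
        return 0.0
    return sum(is_refusal(r) for r in items) / len(items)


# Made-up responses, for illustration only.
demo_responses = [
    "I'm sorry, but I can't help with that.",
    "Here is a summary of the image.",
    "I cannot assist with this request.",
    "Sure, the steps are as follows.",
]
print(f"refusal rate: {refusal_rate(demo_responses):.2f}")
```

Keyword matching is cheap but misses partial refusals and flags polite non-refusals, which is one reason scores can differ between evaluation strategies.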

Key Highlights

Latest Transformers Compatibility: Re-sharded and optimized for improved compatibility with recent Transformers releases.
Optimized Model Sharding: Updated shard sizes for improved repository handling, downloading, and deployment efficiency.
Streamlined Inference Experience: Optimized packaging and repository structure for smoother loading and inference workflows.
Efficient Multimodal Architecture: Built on openbmb/MiniCPM-V-4.6, combining SigLIP2-400M vision encoding with Qwen3.5-0.8B language capabilities for compact yet powerful multimodal understanding.
Image & Video Understanding: Supports advanced reasoning across text, images, and videos with efficient deployment on edge and mobile-class hardware.
262K Long Context Support: Optimized for extremely long multimodal contexts across text, image, and video inputs.
Research-Friendly Distribution: Designed to simplify experimentation, evaluation, and local deployment workflows.
High-Efficiency Deployment: Suitable for local inference, lightweight multimodal applications, and research experimentation on consumer-grade GPUs.

Base Model Signatures:

This model has been re-sharded and optimized for the latest Transformers version from the base model: https://huggingface.co/huihui-ai/Huihui-MiniCPM-V-4.6-abliterated.
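
To make "re-sharding" concrete: a checkpoint is split into files by packing tensors until a size limit is reached. The sketch below is a simplified greedy packer written for illustration; the tensor names and sizes are made up, and it is not the procedure used to produce this repository. In Transformers, the corresponding setting is the `max_shard_size` argument of `save_pretrained`.

```python
from typing import Dict, List


def plan_shards(tensor_bytes: Dict[str, int], max_shard_bytes: int) -> List[List[str]]:
    """Greedily pack tensors, in order, into shards no larger than max_shard_bytes.

    A tensor larger than the limit gets a shard of its own.
    """
    shards: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in tensor_bytes.items():
        # Start a new shard if adding this tensor would exceed the limit.
        if current and current_size + size > max_shard_bytes:
            shards.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        shards.append(current)
    return shards


# Made-up tensor sizes in bytes, for illustration only.
example = {"embed": 400, "layer.0": 300, "layer.1": 300, "head": 500}
print(plan_shards(example, max_shard_bytes=700))
```

A smaller limit yields more, smaller files, which can make interrupted downloads cheaper to resume; a larger limit yields fewer files to manage.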

Quick Start with Transformers

pip install transformers==5.8.0 gradio==6.14.0

import gc
import time
from threading import Thread

import gradio as gr
import torch
from PIL import Image

from transformers import (
    MiniCPMV4_6ForConditionalGeneration,
    AutoProcessor,
    TextIteratorStreamer,
)

MAX_MAX_NEW_TOKENS = 4096
DEFAULT_MAX_NEW_TOKENS = 1024
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

MODEL_ID = "prithivMLmods/MiniCPM-V-4.6-abliterated-MAX"
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = MiniCPMV4_6ForConditionalGeneration.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to(device).eval()


def generate(
    image: Image.Image,
    text: str,
    max_new_tokens: int = DEFAULT_MAX_NEW_TOKENS,
    temperature: float = 0.6,
    top_p: float = 0.9,
    top_k: int = 50,
    repetition_penalty: float = 1.2,
):
    if image is None:
        yield "[ERROR] Please upload an image."
        return
    if not text or not text.strip():
        yield "[ERROR] Please enter your instruction."
        return

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": text},
            ],
        }
    ]
    prompt_full = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    inputs = processor(
        text=[prompt_full],
        images=[image],
        return_tensors="pt",
        padding=True,
    ).to(device)

    streamer = TextIteratorStreamer(
        processor.tokenizer if hasattr(processor, "tokenizer") else processor,
        skip_prompt=True,
        skip_special_tokens=True,
    )

    generation_error = {"error": None}
    generation_kwargs = {
        **inputs,
        "streamer": streamer,
        "max_new_tokens": int(max_new_tokens),
        "do_sample": True,
        "temperature": float(temperature),
        "top_p": float(top_p),
        "top_k": int(top_k),
        "repetition_penalty": float(repetition_penalty),
    }

    def _run():
        try:
            model.generate(**generation_kwargs)
        except Exception as e:
            generation_error["error"] = e
            try:
                streamer.end()
            except Exception:
                pass

    thread = Thread(target=_run, daemon=True)
    thread.start()

    buffer = ""
    for new_text in streamer:
        buffer += new_text
        time.sleep(0.01)
        yield buffer

    thread.join(timeout=1.0)

    # Free cached memory before the final yields so cleanup also runs on the error path.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

    if generation_error["error"] is not None:
        err = f"[ERROR] {generation_error['error']}"
        yield (buffer + "\n\n" + err) if buffer.strip() else err
        return

    if not buffer.strip():
        yield "[ERROR] No output was generated."
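
The function above runs `model.generate` in a background thread and yields the growing text as the streamer produces it. The following stand-in reproduces that producer/consumer pattern with only the standard library, so the control flow can be tried without loading the model. `fake_generate` and `stream_text` are hypothetical names, and a `queue.Queue` takes the place of `TextIteratorStreamer`.

```python
import queue
from threading import Thread
from typing import Iterator, List

_END = object()  # sentinel marking the end of the stream


def fake_generate(chunks: List[str], out: "queue.Queue") -> None:
    """Stand-in for model.generate: push text chunks, then signal completion."""
    try:
        for chunk in chunks:
            out.put(chunk)
    finally:
        out.put(_END)  # always unblock the consumer, even if an error occurs


def stream_text(chunks: List[str]) -> Iterator[str]:
    """Yield the accumulated text after each chunk, like the generator above."""
    out: "queue.Queue" = queue.Queue()
    worker = Thread(target=fake_generate, args=(chunks, out), daemon=True)
    worker.start()

    buffer = ""
    while True:
        item = out.get()  # blocks until the worker produces something
        if item is _END:
            break
        buffer += item
        yield buffer
    worker.join(timeout=1.0)


for partial in stream_text(["Hello", ", ", "world"]):
    print(partial)
```

Yielding the whole buffer rather than each fragment suits UI frameworks that replace the displayed text on every update.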

Base Model Information

openbmb/MiniCPM-V-4.6 is a 1.3B-parameter dense multimodal language model developed by OpenBMB (Tsinghua NLP + ModelBest). It is built using SigLIP2-400M for visual encoding and Qwen3.5-0.8B as the language backbone, optimized for efficient multimodal understanding on edge and mobile hardware while supporting long-context reasoning across text, image, and video modalities.
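
As a rough guide to what the parameter count means for memory, the weights alone take about (number of parameters) × (bytes per parameter). The sketch below works that out for 1.3B parameters. It ignores activations, the KV cache, and framework overhead, so real usage will be higher; the helper name is hypothetical.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str = "bfloat16") -> float:
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("float32", "bfloat16", "int8"):
    print(f"{dtype}: ~{weight_memory_gb(1.3e9, dtype):.1f} GB")
```

In bfloat16, the dtype used in the Quick Start code, this comes to roughly 2.6 GB for the weights.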

Intended Use

Multimodal Research: Studying multimodal reasoning, perception, and instruction-following behavior across text, image, and video inputs.
Model Evaluation: Benchmarking and analyzing multimodal language models under a variety of testing conditions.
Edge & Local AI Deployment: Running compact multimodal AI systems efficiently on consumer hardware and edge devices.
Research Prototyping: Experimenting with efficient multimodal transformer architectures and deployment workflows.

Limitations & Risks

Important Note: This model inherits the behavior and characteristics of its base model.

Potential Hallucinations: Multimodal reasoning may occasionally produce inaccurate or fabricated interpretations.
User Responsibility: Outputs should be reviewed before use in critical or high-stakes applications.
Multimodal Limitations: Performance may vary depending on image quality, video complexity, prompt design, and context length.
Deployment Considerations: While optimized for efficiency, high-resolution image and video inference may still require substantial VRAM and optimized runtimes depending on workload complexity.