2026-05-29
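Tinkering with a 0.1B small model
Setup: AutoDL RTX 2080 Ti (22GB) ×2 | local laptop with MX130 (2GB), CPU only
Goal: run the full Omni (speech + vision + text) on the server, and run the Thinker for text-only Q&A locally
Last updated: 2026-05-29
Contents
- 1. Deploying the full Omni on an AutoDL server
- 2. Test inference
- 3. Web demo + SSH tunnel
- 4. Extracting the Thinker weights (for local use)
- 5. CPU text-only inference on the local laptop
- 6. The no-hassle option: one-command run with Ollama
- Appendix: pitfalls
1. Deploying the full Omni on an AutoDL server
1.1 Clone the repository and install dependencies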
git clone --depth 1 https://github.com/jingyaogong/minimind-o
cd minimind-o
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
1.2 Download the model weights (ModelScope)
⚠️ Do not use git clone for the weights: it leaves you with 131-byte LFS pointer files. ModelScope's modelscope download fetches the real files directly.
# Speech recognition
modelscope download --model gongjy/SenseVoiceSmall --local_dir ./model/SenseVoiceSmall
# Vision encoder
modelscope download --model gongjy/siglip2-base-p32-256-ve --local_dir ./model/siglip2-base-p32-256-ve
# Audio codec (the core of the Talker)
modelscope download --model gongjy/mimi --local_dir ./model/mimi
# Speaker recognition
modelscope download --model gongjy/campplus --local_dir ./model/campplus
# Released MiniMind-O weights
modelscope download --model gongjy/minimind-3o-pytorch --local_dir ./out
1.3 Fix the CUDA version mismatch
The AutoDL instance ships CUDA 12.4, but pip install -r requirements.txt may pull in a PyTorch build compiled against CUDA 13.x. In that case PyTorch must be reinstalled:
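Before reinstalling, it can help to confirm which CUDA version the installed wheel was built against. The following is a minimal sketch, not part of the original workflow: the helper names wheel_tag and describe_torch are hypothetical, and the only torch attributes used are torch.__version__, torch.version.cuda, and torch.cuda.is_available().

```python
def wheel_tag(cuda_version: str) -> str:
    """Map a CUDA version like '12.4' to a wheel index suffix like 'cu124'."""
    major, minor = cuda_version.split(".")[:2]
    return f"cu{major}{minor}"


def describe_torch() -> str:
    """Report which CUDA version the installed torch wheel targets, if any."""
    try:
        import torch
    except (ImportError, OSError) as exc:
        # A missing CUDA runtime library typically surfaces here at import time.
        return f"torch could not be imported: {exc}"
    built_for = torch.version.cuda  # None for CPU-only wheels
    if built_for is None:
        return f"torch {torch.__version__} is a CPU-only build"
    return (f"torch {torch.__version__} was built for CUDA {built_for} "
            f"(index suffix {wheel_tag(built_for)}), "
            f"cuda available: {torch.cuda.is_available()}")


print(describe_torch())
```

If the reported CUDA version does not start with 12, the wheel does not match the CUDA 12.4 instance described here.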
pip uninstall torch torchaudio torchvision -y
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124
1.4 Install ffmpeg
apt-get update && apt-get install -y ffmpeg
2. Test inference
# Full Omni inference (speech input → text output)
python eval_omni.py --load_from model --weight sft_omni
3. Web demo + SSH tunnel
3.1 Download the HuggingFace-format tokenizer/config
modelscope download --model gongjy/minimind-3o --local_dir /root/minimind-o/minimind-3o
cp -r minimind-3o ./scripts/minimind-3o
3.2 Start the web demo
cd /root/minimind-o/scripts
python web_demo_omni.py --port 7860
Pick a port other than 8888, which Jupyter Lab already occupies.
3.3 Access through a local SSH tunnel
# Run this on your local machine (Windows/Linux/macOS all work)
ssh -CNg -L 7860:127.0.0.1:7860 root@<autodl-ip> -p <ssh-port>
Then open http://127.0.0.1:7860 in a browser to reach the WebUI.
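If the page does not load, a quick way to tell whether the tunnel itself is up is to test the local port. This is a small sketch using only the standard library; port_open is a hypothetical helper name, and 7860 is the local end of the tunnel from the ssh command above.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


print("tunnel reachable:", port_open("127.0.0.1", 7860))
```

A False result on the laptop usually means the ssh command is not running; True only shows that something is listening on the port, not that the web demo behind it is healthy.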
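4. Extracting the Thinker weights (for local use)
The full Omni bundles the Thinker (LLM), the Talker (speech), and Vision; it is large and needs a GPU. The local laptop's MX130 has only 2GB of VRAM and Compute Capability 5.0, which recent PyTorch 2.x GPU builds no longer support, so it can only run text-only inference on the CPU.
extract_thinker_v2_fix.py strips the Thinker out of the .pth weights.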
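Before filtering, it is worth looking at how the checkpoint keys are grouped, since the extraction depends on prefixes such as llm. and talker. The sketch below counts keys by prefix; prefix_counts is a hypothetical helper, and the key names in example_keys are made up for illustration (the real ones come from the loaded checkpoint's keys).

```python
from collections import Counter


def prefix_counts(keys, depth: int = 1) -> Counter:
    """Count state-dict keys by their first `depth` dot-separated components."""
    return Counter(".".join(k.split(".")[:depth]) for k in keys)


# Hypothetical key names, for illustration only. On the server, pass
# torch.load("out/sft_omni_768.pth", map_location="cpu").keys() instead.
example_keys = [
    "llm.embed_tokens.weight",
    "llm.layers.0.self_attn.q_proj.weight",
    "llm.layers.0.mlp.up_proj.weight",
    "talker.layers.0.self_attn.q_proj.weight",
    "vision_proj.weight",
]

for prefix, n in sorted(prefix_counts(example_keys).items()):
    print(f"{prefix:12s} {n}")
```

If the real checkpoint shows prefixes other than the ones the script skips or strips, the filter list in the script needs adjusting.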
4.1 Run the extraction script on the server
# extract_thinker_v2_fix.py (run on AutoDL)
import torch
import os, json, shutil

SFT_WEIGHT = "out/sft_omni_768.pth"  # ⚠️ run ls out/ first to confirm the file name
THINKER_OUT = "./thinker_hf"

print("[1/4] Loading SFT weights...")
ckpt = torch.load(SFT_WEIGHT, map_location="cpu", weights_only=True)
print(f"  total keys: {len(ckpt)}")

print("\n[2/4] Extracting the Thinker and remapping key names...")
thinker_ckpt = {}
talker_skip = 0
for k, v in ckpt.items():
    # Skip Talker/audio/vision keys
    if any(x in k.lower() for x in ["talker", "audio_proj", "vision_proj", "mimi", "spk"]):
        talker_skip += 1
        continue
    new_k = k
    if k.startswith("llm."):
        new_k = k[4:]  # drop the llm. prefix
    if not new_k.startswith("model."):
        new_k = f"model.{new_k}"  # add the model. prefix
    thinker_ckpt[new_k] = v
print(f"  Thinker params: {len(thinker_ckpt)} keys")
print(f"  skipped: {talker_skip} keys")

print("\n[3/4] Saving...")
os.makedirs(THINKER_OUT, exist_ok=True)
torch.save(thinker_ckpt, os.path.join(THINKER_OUT, "pytorch_model.bin"))
# Write config.json
config = {
    "architectures": ["MiniMindForCausalLM"],
    "model_type": "minimind",
    "hidden_size": 768, "num_hidden_layers": 8, "vocab_size": 6400,
    "bos_token_id": 1, "eos_token_id": 2,
    "num_attention_heads": 8, "num_key_value_heads": 4,
    "head_dim": 96, "hidden_act": "silu",
    "intermediate_size": 2432, "max_position_embeddings": 32768,
    "rms_norm_eps": 1e-06, "rope_theta": 1000000.0,
    "use_moe": False, "flash_attn": True,
    "dropout": 0.0, "tie_word_embeddings": True,
    "dtype": "float32", "transformers_version": "4.57.6"
}
with open(os.path.join(THINKER_OUT, "config.json"), "w") as f:
    json.dump(config, f, indent=2)

print("\n[4/4] Copying tokenizer files...")
OMNI_DIR = "./minimind-3o"
for fn in ["tokenizer.json", "tokenizer_config.json", "special_tokens_map.json",
           "generation_config.json", "chat_template.jinja"]:
    src = os.path.join(OMNI_DIR, fn)
    if os.path.exists(src):
        shutil.copy(src, os.path.join(THINKER_OUT, fn))
print(f"\n✅ Done! Output: {os.path.abspath(THINKER_OUT)}")
4.2 Download to the local machine
Download the whole thinker_hf/ directory to the laptop (scp, the AutoDL file manager, or SFTP all work) and put it in the project directory.
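To make sure the copy on the laptop matches what the server produced, one option is to print a size and SHA-256 for every file on both sides and compare the listings. This is an optional sketch using only the standard library; sha256_of and manifest are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large weight files are not read into memory at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def manifest(directory: str) -> dict:
    """Map each file's relative path to its (size in bytes, sha256 hex digest)."""
    root = Path(directory)
    return {p.relative_to(root).as_posix(): (p.stat().st_size, sha256_of(p))
            for p in sorted(root.rglob("*")) if p.is_file()}


# Run in the directory that contains thinker_hf/, once on each machine.
if Path("thinker_hf").is_dir():
    for name, (size, digest) in manifest("thinker_hf").items():
        print(f"{digest[:16]}  {size:>12d}  {name}")
```

Identical listings on the server and the laptop mean the transfer was complete; a tokenizer.json of only 131 bytes points back to the LFS pointer problem.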
5. CPU text-only inference on the local laptop
5.1 Install the CPU build of PyTorch
# The laptop's MX130 is too old, so install the CPU build
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install transformers -i https://pypi.tuna.tsinghua.edu.cn/simple
5.2 Pure PyTorch inference script
⚠️ This script avoids transformers.from_pretrained() to sidestep the HF cache, auto_map, trust_remote_code, and related pitfalls.
"""
thinker_infer_v4.py: pure PyTorch + the official tokenizer, no HF cache issues
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
import math, os
# ─── Load the tokenizer ──────────────────────────────────
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("thinker_hf/tokenizer.json")
vocab_size = tokenizer.get_vocab_size()
# Falls back to the config.json ids when the lookup returns None (or 0)
BOS_ID = tokenizer.token_to_id("<|beginoftext|>") or 1
EOS_ID = tokenizer.token_to_id("<|endoftext|>") or 2
print(f"Vocab: {vocab_size}, BOS={BOS_ID}, EOS={EOS_ID}")

# ─── Model definition (matches the MiniMind architecture) ─
def precompute_freqs_cis(dim, end=32768, rope_base=1e6):
    # One rotation frequency per channel pair, one angle per position
    freqs = 1.0 / (rope_base ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(end, dtype=torch.float32)
    freqs = torch.outer(t, freqs)
    return torch.cos(freqs).unsqueeze(0), torch.sin(freqs).unsqueeze(0)

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps, self.weight = eps, nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight

class Attention(nn.Module):
    def __init__(self, hidden=768, n_heads=8, n_kv=4, head_dim=96):
        super().__init__()
        self.n_heads, self.n_kv, self.head_dim = n_heads, n_kv, head_dim
        self.q_proj = nn.Linear(hidden, n_heads * head_dim, bias=False)
        self.k_proj = nn.Linear(hidden, n_kv * head_dim, bias=False)
        self.v_proj = nn.Linear(hidden, n_kv * head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * head_dim, hidden, bias=False)
        self.q_norm = RMSNorm(head_dim)
        self.k_norm = RMSNorm(head_dim)

    def forward(self, x, cos, sin, kv_cache=None):
        b, s, _ = x.shape
        q = self.q_proj(x).view(b, s, self.n_heads, self.head_dim)
        k = self.k_proj(x).view(b, s, self.n_kv, self.head_dim)
        v = self.v_proj(x).view(b, s, self.n_kv, self.head_dim)
        q, k = self.q_norm(q), self.k_norm(k)
        # RoPE: rotate each pair of adjacent channels by the position angle
        cos, sin = cos.unsqueeze(2), sin.unsqueeze(2)
        q_r = q.view(b, s, self.n_heads, self.head_dim // 2, 2)
        k_r = k.view(b, s, self.n_kv, self.head_dim // 2, 2)
        q_r = torch.stack([q_r[..., 0] * cos - q_r[..., 1] * sin,
                           q_r[..., 0] * sin + q_r[..., 1] * cos], -1).flatten(3)
        k_r = torch.stack([k_r[..., 0] * cos - k_r[..., 1] * sin,
                           k_r[..., 0] * sin + k_r[..., 1] * cos], -1).flatten(3)
        q, k = q_r.transpose(1, 2), k_r.transpose(1, 2)  # (b, heads, s, head_dim)
        v = v.transpose(1, 2)
        if kv_cache is not None:
            # Append the new keys/values to the cached ones along the sequence axis
            k = torch.cat([kv_cache[0], k], dim=2)
            v = torch.cat([kv_cache[1], v], dim=2)
        present = (k, v)
        if self.n_kv != self.n_heads:
            # Grouped-query attention: repeat each KV head to match the query heads
            k = k.repeat_interleave(self.n_heads // self.n_kv, dim=1)
            v = v.repeat_interleave(self.n_heads // self.n_kv, dim=1)
        attn = F.scaled_dot_product_attention(
            q, k, v, is_causal=(kv_cache is None),
            scale=1.0 / math.sqrt(self.head_dim)
        )
        return self.o_proj(attn.transpose(1, 2).reshape(b, s, -1)), present

class FeedForward(nn.Module):
    def __init__(self, hidden=768, intermediate=2432):
        super().__init__()
        self.gate_proj = nn.Linear(hidden, intermediate, bias=False)
        self.up_proj = nn.Linear(hidden, intermediate, bias=False)
        self.down_proj = nn.Linear(intermediate, hidden, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

class MiniMindBlock(nn.Module):
    def __init__(self, hidden=768, n_heads=8, n_kv=4, head_dim=96, intermediate=2432):
        super().__init__()
        self.input_layernorm = RMSNorm(hidden)
        self.post_attention_layernorm = RMSNorm(hidden)
        self.self_attn = Attention(hidden, n_heads, n_kv, head_dim)
        self.mlp = FeedForward(hidden, intermediate)

    def forward(self, x, cos, sin, kv_cache=None):
        h, present = self.self_attn(self.input_layernorm(x), cos, sin, kv_cache)
        x = x + h
        x = x + self.mlp(self.post_attention_layernorm(x))
        return x, present

class MiniMindThinker(nn.Module):
    def __init__(self, vocab_size=6400, hidden=768, n_layers=8, n_heads=8,
                 n_kv=4, head_dim=96, intermediate=2432, max_len=32768):
        super().__init__()
        self.n_layers = n_layers
        self.head_dim = head_dim
        self.max_len = max_len
        self.embed_tokens = nn.Embedding(vocab_size, hidden)
        self.dropout = nn.Dropout(0)
        self.layers = nn.ModuleList([
            MiniMindBlock(hidden, n_heads, n_kv, head_dim, intermediate)
            for _ in range(n_layers)
        ])
        self.norm = RMSNorm(hidden)
        self.lm_head = nn.Linear(hidden, vocab_size, bias=False)
        cos, sin = precompute_freqs_cis(dim=head_dim, end=max_len)
        self.register_buffer("freqs_cos", cos, persistent=False)
        self.register_buffer("freqs_sin", sin, persistent=False)

    def generate(self, input_ids, temperature=0.7, top_p=0.9, top_k=50,
                 max_new_tokens=256, eos_id=2):
        device = input_ids.device
        kv = [None] * self.n_layers
        start_pos = 0
        out_tokens = []
        for _ in range(max_new_tokens):
            # First step feeds the whole prompt; later steps feed only the new token
            seq = input_ids[:, start_pos:] if start_pos > 0 else input_ids
            s = seq.shape[1]
            cos = self.freqs_cos[:, start_pos:start_pos + s, :].to(device)
            sin = self.freqs_sin[:, start_pos:start_pos + s, :].to(device)
            h = self.dropout(self.embed_tokens(seq))
            for i, layer in enumerate(self.layers):
                h, kv[i] = layer(h, cos, sin, kv[i])
            h = self.norm(h)
            logits = self.lm_head(h[:, -1, :]) / temperature
            if top_k > 0:
                # Keep only the top_k highest logits
                thresh = torch.topk(logits, top_k)[0][:, -1:]
                logits[logits < thresh] = -float('inf')
            if top_p < 1.0:
                # Nucleus sampling: drop the tail once cumulative probability exceeds top_p
                sorted_logits, sorted_idx = torch.sort(logits, descending=True)
                cum = torch.cumsum(F.softmax(sorted_logits, -1), -1)
                mask = cum > top_p
                mask[:, 1:] = mask[:, :-1].clone()
                mask[:, 0] = False
                logits[:, sorted_idx[0][mask[0]]] = -float('inf')
            probs = F.softmax(logits, -1)
            nt = torch.multinomial(probs, 1)
            tid = nt.item()
            input_ids = torch.cat([input_ids, nt], dim=1)
            start_pos += s
            out_tokens.append(tid)
            if tid == eos_id:
                break
        return out_tokens

# ─── Load weights ────────────────────────────────────────
print("Loading weights...")
ckpt = torch.load("thinker_hf/pytorch_model.bin", map_location="cpu", weights_only=True)
state = {}
for k, v in ckpt.items():
    if k.startswith("model."):
        state[k[6:]] = v  # drop the model. prefix (MiniMindThinker is built directly)
    elif k == "lm_head.weight":
        state["lm_head.weight"] = v
# Tied embedding: lm_head may not have been saved
if "lm_head.weight" not in state and "embed_tokens.weight" in state:
    state["lm_head.weight"] = state["embed_tokens.weight"]
model = MiniMindThinker(vocab_size=vocab_size)
result = model.load_state_dict(state, strict=False)
# With strict=False a key mismatch is silent, so report it explicitly
print(f"  missing keys: {len(result.missing_keys)}, unexpected keys: {len(result.unexpected_keys)}")
model.eval()
pcount = sum(p.numel() for p in model.parameters()) / 1e6
print(f"Model: {pcount:.1f}M params (CPU)\n")

# ─── Chat loop ───────────────────────────────────────────
def chat(prompt, max_tokens=256, temperature=0.7):
    text = f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
    ids = tokenizer.encode(text).ids
    print(f"\n{'='*50}")
    print(f"💬 {prompt}")
    print(f"  input tokens: {len(ids)}")
    print(f"{'='*50}")
    input_ids = torch.tensor([ids], dtype=torch.long)
    with torch.no_grad():
        out_ids = model.generate(input_ids, max_new_tokens=max_tokens,
                                 temperature=temperature, eos_id=EOS_ID)
    response = tokenizer.decode(out_ids, skip_special_tokens=True)
    print(f"🤖 {response.strip()}\n")

print("=" * 50)
print("  MiniMind Thinker: Let's go!")
print(f"  {pcount:.0f}M params | CPU")
print("=" * 50)
while True:
    try:
        p = input("\n👤 You: ").strip()
    except (EOFError, KeyboardInterrupt):
        print("\nBye!")
        break
    if not p:
        continue
    if p.lower() in ("quit", "exit", "q"):
        print("Bye!")
        break
    chat(p)
5.3 Run
python thinker_infer_v4.py
You should see a MiniMind Thinker of roughly 60-70M parameters running pure CPU inference, responsive enough for conversation on the laptop.
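The 60-70M figure can be checked by hand from the values in config.json. The arithmetic below is a sketch that assumes the checkpoint really matches that config; it counts the same tensors the inference script defines (bias-free projections, per-head q/k norms, two norms per block).

```python
hidden, n_layers, vocab = 768, 8, 6400
n_heads, n_kv, head_dim, intermediate = 8, 4, 96, 2432

embed = vocab * hidden
attn = (hidden * n_heads * head_dim      # q_proj
        + 2 * hidden * n_kv * head_dim   # k_proj and v_proj
        + n_heads * head_dim * hidden    # o_proj
        + 2 * head_dim)                  # q_norm and k_norm weights
mlp = 3 * hidden * intermediate          # gate_proj, up_proj, down_proj
norms = 2 * hidden                       # two RMSNorm weights per block
per_layer = attn + mlp + norms

tied_total = embed + n_layers * per_layer + hidden  # + final norm
untied_total = tied_total + vocab * hidden          # lm_head as a separate matrix

print(f"{tied_total / 1e6:.1f}M with tied embeddings")
print(f"{untied_total / 1e6:.1f}M with a separate lm_head")
```

The script builds lm_head as its own nn.Linear, so its printed count corresponds to the second, larger number.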
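6. The no-hassle option: one-command run with Ollama
If you would rather skip all the server setup, weight extraction, and PyTorch version matching above, just use Ollama.
MiniMind officially provides a quantized GGUF model that runs locally with a single command:
6.1 Install Ollama
- Windows: https://ollama.com/download/windows
- Linux: curl -fsSL https://ollama.com/install.sh | sh
- macOS: https://ollama.com/download/mac
6.2 One-command run
# MiniMind-3 base version (~0.1B, the best fit for a low-end laptop)
ollama run jingyaogong/minimind-3
GGUF source: https://huggingface.co/jingyaogong/minimind-3-gguf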
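Once the model has been pulled, it can also be called from a script through Ollama's local HTTP API instead of the interactive prompt. The sketch below assumes Ollama is listening on its default address, http://localhost:11434, and uses the /api/generate endpoint with streaming turned off; build_request and ask are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint


def build_request(prompt: str, model: str = "jingyaogong/minimind-3") -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the prompt and return the generated text from the JSON reply."""
    with urllib.request.urlopen(build_request(prompt), timeout=120) as resp:
        return json.loads(resp.read())["response"]


# Requires a running Ollama server with the model already pulled:
# print(ask("Hello"))
```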
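6.3 Comparison
| Option | Model | Inference speed | Deployment effort | Capabilities |
|---|---|---|---|---|
| Ollama | MiniMind-3 GGUF (Q4_K_M) | Fast (llama.cpp optimizations) | ⭐ One command | Text only |
| Extracted Thinker | MiniMind-O SFT Thinker (FP32) | Slower (pure PyTorch on CPU) | ⭐⭐⭐ Needs a server and scripts | Text only (SFT version, better conversation quality) |
| Full Omni | MiniMind-O Full | Needs a GPU | ⭐⭐⭐⭐⭐ | Speech + vision + text |
6.4 If the laptop cannot handle Ollama
The MX130 has only 2GB of VRAM, but the Q4_K_M quantization of MiniMind-3 is about 150MB, so CPU inference works without trouble. Ollama falls back to the CPU automatically.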
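Appendix: pitfalls
| Problem | Cause | Fix |
|---|---|---|
| tokenizer.json is only 131 bytes | git clone fetched the LFS pointer | Use modelscope download instead |
| libcudart.so.13 not found | PyTorch built for CUDA 13.x vs. system CUDA 12.4 | pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124 |
| WebUI on port 8888 will not open | Jupyter Lab already occupies 8888 | Use --port 7860 or another port |
| Random output after loading weights | The model. prefix is missing, so every key is reported MISSING | Add the model. prefix during extraction |
| RuntimeError: Boolean value of Tensor | freqs_cos[0,0] returns more than one value | Use .all() for the check |
| HF cached an old version of the model code | transformers caches it under ~/.cache/ | Delete the cache directory |
| sft_omni.pth not found | On AutoDL the file is named sft_omni_768.pth | Run ls out/ first to confirm the file name |
| MX130 cannot use GPU PyTorch | Compute Capability 5.0, not supported by recent PyTorch 2.x GPU builds | Install the CPU build with --index-url https://download.pytorch.org/whl/cpu |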
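The first pitfall in the table can be caught automatically. A Git LFS pointer is a tiny text file whose first line is the LFS spec header, so a short scan over the model directories finds any stubs. This is a sketch using only the standard library; is_lfs_pointer and find_lfs_pointers are hypothetical helper names, and the directory list simply mirrors the paths used earlier in this post.

```python
from pathlib import Path

LFS_MAGIC = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """True if the file is a small stub that starts with the Git LFS header."""
    if path.stat().st_size > 1024:  # pointer files are tiny (131 bytes in the case above)
        return False
    with open(path, "rb") as f:
        return f.read(len(LFS_MAGIC)) == LFS_MAGIC


def find_lfs_pointers(directory: str) -> list:
    """Return every file under the directory that is an LFS pointer stub."""
    root = Path(directory)
    return [p for p in sorted(root.rglob("*")) if p.is_file() and is_lfs_pointer(p)]


for model_dir in ("./model", "./out", "./minimind-3o"):
    if Path(model_dir).is_dir():
        for p in find_lfs_pointers(model_dir):
            print("LFS pointer, not real data:", p)
```

No output means every file under those directories is real data rather than a pointer.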
Happy hacking! 🚀