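Tinkering with a 0.1B Small Model

Setup: AutoDL RTX 2080 Ti (22GB) ×2 | local laptop with an MX130 (2GB), CPU only
Goal: run the full Omni model (speech + vision + text) on the server, and run the Thinker for text-only Q&A locally
Last updated: 2026-05-29

1. Deploying the Full Omni Model on an AutoDL Server

1.1 Cloning the Repository & Installing Dependencies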

git clone --depth 1 https://github.com/jingyaogong/minimind-o
cd minimind-o
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

1.2 Downloading the Model Weights (ModelScope)

⚠️ Do not fetch the weights with git clone: you will get 131-byte LFS pointer files instead of the real data. ModelScope's modelscope download fetches the real files directly.

# Speech recognition
modelscope download --model gongjy/SenseVoiceSmall --local_dir ./model/SenseVoiceSmall

# Vision encoder
modelscope download --model gongjy/siglip2-base-p32-256-ve --local_dir ./model/siglip2-base-p32-256-ve

# Audio codec (the core of the Talker)
modelscope download --model gongjy/mimi --local_dir ./model/mimi

# Speaker recognition
modelscope download --model gongjy/campplus --local_dir ./model/campplus

# MiniMind-O release weights
modelscope download --model gongjy/minimind-3o-pytorch --local_dir ./out
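
One way to confirm that the downloads are real files rather than Git LFS pointers is to look at the first bytes of each file. The sketch below assumes the standard pointer format, whose first line is "version https://git-lfs.github.com/spec/v1"; the helper names is_lfs_pointer and find_lfs_pointers are illustrative and not part of any tool used here.

```python
from pathlib import Path

# First line of every Git LFS pointer file.
LFS_MAGIC = b"version https://git-lfs.github.com/spec/v1"

def is_lfs_pointer(path):
    """Return True if the file looks like an LFS pointer instead of real data."""
    p = Path(path)
    # Pointer files are tiny text stubs, so anything large is real data.
    if not p.is_file() or p.stat().st_size > 1024:
        return False
    with p.open("rb") as f:
        return f.read(len(LFS_MAGIC)) == LFS_MAGIC

def find_lfs_pointers(root):
    """List every pointer stub under root (for example ./model or ./out)."""
    return sorted(str(p) for p in Path(root).rglob("*") if is_lfs_pointer(p))
```

Calling find_lfs_pointers("./model") and find_lfs_pointers("./out") after the downloads should return empty lists.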

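1.3 Fixing the CUDA Version Mismatch

The AutoDL instance comes with CUDA 12.4, but pip install -r requirements.txt may install a PyTorch build compiled for CUDA 13.x. In that case PyTorch must be reinstalled: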

pip uninstall torch torchaudio torchvision -y
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124
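
After the reinstall, it is worth checking which CUDA version the installed wheel targets. The sketch below assumes the wheel came from download.pytorch.org, where the version string carries a local tag such as +cu124; the version strings in the comments are illustrative, and cuda_tag_to_version and report are hypothetical helper names.

```python
import re

def cuda_tag_to_version(torch_version):
    """Map a wheel version such as '2.5.1+cu124' to '12.4'; None for CPU or untagged builds."""
    m = re.search(r"\+cu(\d+)$", torch_version)
    if not m:
        return None
    digits = m.group(1)
    # The last digit is the minor version, the rest is the major version.
    return f"{digits[:-1]}.{digits[-1]}"

def report():
    """Print what the installed PyTorch was built against."""
    try:
        import torch
    except ImportError:
        print("torch is not installed")
        return
    print("torch:", torch.__version__)
    print("built for CUDA:", torch.version.cuda)
    print("GPU usable:", torch.cuda.is_available())
```

On the AutoDL instance, report() should show a CUDA 12.4 build and a usable GPU once the mismatch is fixed.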

1.4 Installing ffmpeg

apt-get update && apt-get install -y ffmpeg
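
Before moving on to inference, a small preflight check can catch a missing ffmpeg binary or an incomplete download. This is a sketch under the assumption that the directory layout matches the download commands in section 1.2; preflight and REQUIRED_DIRS are illustrative names.

```python
import shutil
from pathlib import Path

# Directories created by the download commands in section 1.2.
REQUIRED_DIRS = [
    "model/SenseVoiceSmall",
    "model/siglip2-base-p32-256-ve",
    "model/mimi",
    "model/campplus",
    "out",
]

def preflight(repo_root=".", tools=("ffmpeg",)):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    for tool in tools:
        if shutil.which(tool) is None:
            problems.append(f"missing executable: {tool}")
    root = Path(repo_root)
    for rel in REQUIRED_DIRS:
        d = root / rel
        # A directory that exists but is empty usually means a failed download.
        if not d.is_dir() or not any(d.iterdir()):
            problems.append(f"missing or empty directory: {rel}")
    return problems
```

Run preflight() from the minimind-o directory and fix whatever it lists before starting inference.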

2. Testing Inference

# Full Omni inference (speech input → text output)
python eval_omni.py --load_from model --weight sft_omni

3. Web Demo + SSH Tunnel

3.1 Downloading the HuggingFace-Format tokenizer/config

modelscope download --model gongjy/minimind-3o --local_dir /root/minimind-o/minimind-3o
cp -r /root/minimind-o/minimind-3o /root/minimind-o/scripts/minimind-3o

3.2 Starting the Web Demo

cd /root/minimind-o/scripts
python web_demo_omni.py --port 7860

Avoid port 8888, which Jupyter Lab already occupies.

3.3 Accessing It Through a Local SSH Tunnel

# Run this on your local machine (Windows/Linux/Mac all work)
ssh -CNg -L 7860:127.0.0.1:7860 root@<autodl-ip> -p <ssh-port>

Then open http://127.0.0.1:7860 in a browser to reach the WebUI.
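
If the page does not load, it helps to confirm that the tunnel is actually forwarding before debugging the demo itself. A minimal sketch using only the standard library; port_open is an illustrative helper name, and the defaults mirror the port used above.

```python
import socket

def port_open(host="127.0.0.1", port=7860, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unreachable: the tunnel is not forwarding.
        return False
```

If port_open() returns False on the local machine while the ssh command is running, check that web_demo_omni.py is still listening on the server side.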

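4. Extracting the Thinker Weights (for Local Use)

The full Omni model bundles the Thinker (LLM), the Talker (speech), and Vision; it is large and needs a GPU. The local laptop's MX130 has only 2GB of VRAM (Compute Capability 5.0), which PyTorch 2.x no longer supports, so it can only run text-only inference on the CPU.

Use extract_thinker_v2_fix.py to strip the Thinker out of the .pth weights.

4.1 Running the Extraction Script on the Server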

# extract_thinker_v2_fix.py (run on AutoDL)
import torch
import os, json, shutil

SFT_WEIGHT = "out/sft_omni_768.pth"   # ⚠️ run ls out/ first to confirm the filename
THINKER_OUT = "./thinker_hf"

print("[1/4] Loading SFT weights...")
ckpt = torch.load(SFT_WEIGHT, map_location="cpu", weights_only=True)
print(f"  Total keys: {len(ckpt)}")

print("\n[2/4] Extracting the Thinker and remapping key names...")
thinker_ckpt = {}
talker_skip = 0

for k, v in ckpt.items():
    # Skip Talker/audio/vision keys
    if any(x in k.lower() for x in ["talker", "audio_proj", "vision_proj", "mimi", "spk"]):
        talker_skip += 1
        continue
    new_k = k
    if k.startswith("llm."):
        new_k = k[4:]                        # strip the llm. prefix
    if not new_k.startswith("model."):
        new_k = f"model.{new_k}"             # add the model. prefix
    thinker_ckpt[new_k] = v

print(f"  Thinker parameters: {len(thinker_ckpt)} keys")
print(f"  Skipped: {talker_skip} keys")

print(f"\n[3/4] Saving...")
os.makedirs(THINKER_OUT, exist_ok=True)
torch.save(thinker_ckpt, os.path.join(THINKER_OUT, "pytorch_model.bin"))

# Write config.json
config = {
    "architectures": ["MiniMindForCausalLM"],
    "model_type": "minimind",
    "hidden_size": 768, "num_hidden_layers": 8, "vocab_size": 6400,
    "bos_token_id": 1, "eos_token_id": 2,
    "num_attention_heads": 8, "num_key_value_heads": 4,
    "head_dim": 96, "hidden_act": "silu",
    "intermediate_size": 2432, "max_position_embeddings": 32768,
    "rms_norm_eps": 1e-06, "rope_theta": 1000000.0,
    "use_moe": False, "flash_attn": True,
    "dropout": 0.0, "tie_word_embeddings": True,
    "dtype": "float32", "transformers_version": "4.57.6"
}
with open(os.path.join(THINKER_OUT, "config.json"), "w") as f:
    json.dump(config, f, indent=2)

# Copy tokenizer files
print("\n[4/4] Copying tokenizer files...")
OMNI_DIR = "./minimind-3o"
for fn in ["tokenizer.json", "tokenizer_config.json", "special_tokens_map.json",
           "generation_config.json", "chat_template.jinja"]:
    src = os.path.join(OMNI_DIR, fn)
    if os.path.exists(src):
        shutil.copy(src, os.path.join(THINKER_OUT, fn))

print(f"\n✅ Done! Output: {os.path.abspath(THINKER_OUT)}")
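
The renaming rule in the extraction loop is easier to check on a few names than on a full checkpoint. The sketch below restates that rule as a standalone function; the key names in examples are hypothetical and only illustrate the three cases (rename, keep, skip), so compare them with the real keys printed from your own checkpoint.

```python
SKIP_MARKERS = ("talker", "audio_proj", "vision_proj", "mimi", "spk")

def remap_key(key):
    """Return the Thinker-side name for a checkpoint key, or None if it is skipped."""
    if any(marker in key.lower() for marker in SKIP_MARKERS):
        return None
    if key.startswith("llm."):
        key = key[len("llm."):]      # strip the llm. prefix
    if not key.startswith("model."):
        key = f"model.{key}"         # add the model. prefix
    return key

# Hypothetical key names, for illustration only.
examples = [
    "llm.layers.0.self_attn.q_proj.weight",
    "model.embed_tokens.weight",
    "talker.layers.0.mlp.up_proj.weight",
    "vision_proj.weight",
]
for k in examples:
    print(f"{k!r:45} -> {remap_key(k)!r}")
```

Note that the skip list matches substrings anywhere in the name, so any Thinker key that happened to contain one of the markers would be dropped as well.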

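4.2 Downloading to the Local Machine

Download the entire thinker_hf/ directory to the laptop (scp, the AutoDL file manager, or SFTP all work) and place it in the project directory.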
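
A truncated transfer tends to surface later as a confusing load error, so comparing checksums on both machines can save time. This is an optional sketch, not part of the original workflow; sha256_of and manifest are illustrative names.

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large weights never sit fully in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def manifest(directory):
    """Map each file's relative path to its SHA-256 digest."""
    root = Path(directory)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*")) if p.is_file()
    }
```

Run manifest("thinker_hf") on the server and on the laptop; the two dictionaries should be identical.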

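5. Text-Only CPU Inference on the Local Laptop

5.1 Installing the CPU Build of PyTorch

# The laptop's MX130 is too old, so install the CPU build.
# Pass only one index: -i is shorthand for --index-url, so a trailing -i would override the CPU wheel index.
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install transformers -i https://pypi.tuna.tsinghua.edu.cn/simple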

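5.2 Pure PyTorch Inference Script

⚠️ Do not use transformers' from_pretrained() here: that sidesteps a pile of pitfalls around the HF cache, auto_map, trust_remote_code, and so on.

"""
thinker_infer_v4.py: pure PyTorch + the official tokenizer, no HF cache issues
"""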
import torch
import torch.nn as nn
import torch.nn.functional as F
import math, os

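# ─── Load tokenizer ──────────────────────────────────────
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_file("thinker_hf/tokenizer.json")

vocab_size = tokenizer.get_vocab_size()

def token_id(token, default):
    # Compare against None explicitly: "or" would also discard a valid id of 0
    tid = tokenizer.token_to_id(token)
    return default if tid is None else tid

# Use the same markers as the chat template below; defaults follow config.json
BOS_ID = token_id("<|im_start|>", 1)
EOS_ID = token_id("<|im_end|>", 2)
print(f"Vocab: {vocab_size}, BOS={BOS_ID}, EOS={EOS_ID}")

# ─── Model definition (matches the MiniMind architecture) ──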

def precompute_freqs_cis(dim, end=32768, rope_base=1e6):
    freqs = 1.0 / (rope_base ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(end, dtype=torch.float32)
    freqs = torch.outer(t, freqs)
    return torch.cos(freqs).unsqueeze(0), torch.sin(freqs).unsqueeze(0)

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps, self.weight = eps, nn.Parameter(torch.ones(dim))
    def forward(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight

class Attention(nn.Module):
    def __init__(self, hidden=768, n_heads=8, n_kv=4, head_dim=96):
        super().__init__()
        self.n_heads, self.n_kv, self.head_dim = n_heads, n_kv, head_dim
        self.q_proj = nn.Linear(hidden, n_heads * head_dim, bias=False)
        self.k_proj = nn.Linear(hidden, n_kv * head_dim, bias=False)
        self.v_proj = nn.Linear(hidden, n_kv * head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * head_dim, hidden, bias=False)
        self.q_norm = RMSNorm(head_dim)
        self.k_norm = RMSNorm(head_dim)

    def forward(self, x, cos, sin, kv_cache=None):
        b, s, _ = x.shape
        q = self.q_proj(x).view(b, s, self.n_heads, self.head_dim)
        k = self.k_proj(x).view(b, s, self.n_kv, self.head_dim)
        v = self.v_proj(x).view(b, s, self.n_kv, self.head_dim)
        q, k = self.q_norm(q), self.k_norm(k)

        # RoPE applied to interleaved (even, odd) pairs; the pairing must match the checkpoint's training code
        cos, sin = cos.unsqueeze(2), sin.unsqueeze(2)
        q_r = q.view(b, s, self.n_heads, self.head_dim // 2, 2)
        k_r = k.view(b, s, self.n_kv, self.head_dim // 2, 2)
        q_r = torch.stack([q_r[..., 0] * cos - q_r[..., 1] * sin,
                           q_r[..., 0] * sin + q_r[..., 1] * cos], -1).flatten(3)
        k_r = torch.stack([k_r[..., 0] * cos - k_r[..., 1] * sin,
                           k_r[..., 0] * sin + k_r[..., 1] * cos], -1).flatten(3)
        q, k = q_r.transpose(1, 2), k_r.transpose(1, 2)
        v = v.transpose(1, 2)

        if kv_cache is not None:
            k = torch.cat([kv_cache[0], k], dim=2)
            v = torch.cat([kv_cache[1], v], dim=2)
        present = (k, v)

        if self.n_kv != self.n_heads:
            k = k.repeat_interleave(self.n_heads // self.n_kv, dim=1)
            v = v.repeat_interleave(self.n_heads // self.n_kv, dim=1)

        attn = F.scaled_dot_product_attention(
            q, k, v, is_causal=(kv_cache is None),
            scale=1.0 / math.sqrt(self.head_dim)
        )
        return self.o_proj(attn.transpose(1, 2).reshape(b, s, -1)), present

class FeedForward(nn.Module):
    def __init__(self, hidden=768, intermediate=2432):
        super().__init__()
        self.gate_proj = nn.Linear(hidden, intermediate, bias=False)
        self.up_proj   = nn.Linear(hidden, intermediate, bias=False)
        self.down_proj = nn.Linear(intermediate, hidden, bias=False)
    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

class MiniMindBlock(nn.Module):
    def __init__(self, hidden=768, n_heads=8, n_kv=4, head_dim=96, intermediate=2432):
        super().__init__()
        self.input_layernorm = RMSNorm(hidden)
        self.post_attention_layernorm = RMSNorm(hidden)
        self.self_attn = Attention(hidden, n_heads, n_kv, head_dim)
        self.mlp = FeedForward(hidden, intermediate)
    def forward(self, x, cos, sin, kv_cache=None):
        h, present = self.self_attn(self.input_layernorm(x), cos, sin, kv_cache)
        x = x + h
        x = x + self.mlp(self.post_attention_layernorm(x))
        return x, present

class MiniMindThinker(nn.Module):
    def __init__(self, vocab_size=6400, hidden=768, n_layers=8, n_heads=8,
                 n_kv=4, head_dim=96, intermediate=2432, max_len=32768):
        super().__init__()
        self.n_layers = n_layers
        self.head_dim = head_dim
        self.max_len = max_len
        self.embed_tokens = nn.Embedding(vocab_size, hidden)
        self.dropout = nn.Dropout(0)
        self.layers = nn.ModuleList([
            MiniMindBlock(hidden, n_heads, n_kv, head_dim, intermediate)
            for _ in range(n_layers)
        ])
        self.norm = RMSNorm(hidden)
        self.lm_head = nn.Linear(hidden, vocab_size, bias=False)
        cos, sin = precompute_freqs_cis(dim=head_dim, end=max_len)
        self.register_buffer("freqs_cos", cos, persistent=False)
        self.register_buffer("freqs_sin", sin, persistent=False)

    def generate(self, input_ids, temperature=0.7, top_p=0.9, top_k=50,
                 max_new_tokens=256, eos_id=2):
        device = input_ids.device
        kv = [None] * self.n_layers
        start_pos = 0
        out_tokens = []

        for _ in range(max_new_tokens):
            seq = input_ids[:, start_pos:] if start_pos > 0 else input_ids
            s = seq.shape[1]
            cos = self.freqs_cos[:, start_pos:start_pos + s, :].to(device)
            sin = self.freqs_sin[:, start_pos:start_pos + s, :].to(device)

            h = self.dropout(self.embed_tokens(seq))
            for i, layer in enumerate(self.layers):
                h, kv[i] = layer(h, cos, sin, kv[i])

            h = self.norm(h)
            logits = self.lm_head(h[:, -1, :]) / temperature

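            if top_k > 0:
                # Keep only the top_k highest logits
                thresh = torch.topk(logits, top_k)[0][:, -1:]
                logits[logits < thresh] = -float('inf')
            if top_p < 1.0:
                # Nucleus sampling: drop the tail once cumulative probability exceeds top_p
                sorted_logits, sorted_idx = torch.sort(logits, descending=True)
                cum = torch.cumsum(F.softmax(sorted_logits, -1), -1)
                mask = cum > top_p
                mask[:, 1:] = mask[:, :-1].clone()   # shift right so the crossing token is kept
                mask[:, 0] = False                   # always keep the most likely token
                logits[:, sorted_idx[0][mask[0]]] = -float('inf')   # assumes batch size 1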

            probs = F.softmax(logits, -1)
            nt = torch.multinomial(probs, 1)
            tid = nt.item()

            input_ids = torch.cat([input_ids, nt], dim=1)
            start_pos += s
            out_tokens.append(tid)

            if tid == eos_id:
                break
        return out_tokens


# ─── Load weights ───────────────────────────────────────
print("Loading weights...")
ckpt = torch.load("thinker_hf/pytorch_model.bin", map_location="cpu", weights_only=True)

state = {}
for k, v in ckpt.items():
    if k.startswith("model."):
        state[k[6:]] = v          # strip the model. prefix (MiniMindThinker is built directly)
    elif k == "lm_head.weight":
        state["lm_head.weight"] = v

# Tied embeddings: lm_head may not have been saved
if "lm_head.weight" not in state and "embed_tokens.weight" in state:
    state["lm_head.weight"] = state["embed_tokens.weight"]

model = MiniMindThinker(vocab_size=vocab_size)
# strict=False hides name mismatches, so report them instead of failing silently
result = model.load_state_dict(state, strict=False)
if result.missing_keys or result.unexpected_keys:
    print(f"  missing keys: {result.missing_keys}")
    print(f"  unexpected keys: {result.unexpected_keys}")
model.eval()

pcount = sum(p.numel() for p in model.parameters()) / 1e6
print(f"Model: {pcount:.1f}M params (CPU)\n")


# ─── Chat loop ──────────────────────────────────────────
def chat(prompt, max_tokens=256, temperature=0.7):
    text = f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
    ids = tokenizer.encode(text).ids
    print(f"\n{'='*50}")
    print(f"💬 {prompt}")
    print(f"  input tokens: {len(ids)}")
    print(f"{'='*50}")

    input_ids = torch.tensor([ids], dtype=torch.long)
    with torch.no_grad():
        out_ids = model.generate(input_ids, max_new_tokens=max_tokens,
                                 temperature=temperature, eos_id=EOS_ID)
    response = tokenizer.decode(out_ids, skip_special_tokens=True)
    print(f"🤖 {response.strip()}\n")


print("=" * 50)
print("  MiniMind Thinker — Let's go!")
print(f"  {pcount:.0f}M params | CPU")
print("=" * 50)

while True:
    try:
        p = input("\n👤 You: ").strip()
    except (EOFError, KeyboardInterrupt):
        print("\nBye!")
        break
    if not p: continue
    if p.lower() in ("quit", "exit", "q"):
        print("Bye!"); break
    chat(p)
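
The top-k and top-p steps inside generate are the least obvious part of the script. The sketch below restates the same filtering in plain Python on a list of logits, so the effect of each setting can be inspected without PyTorch. The name surviving_tokens is illustrative, and one detail differs: ties at the top-k cutoff are resolved by sort order here, whereas the tensor version keeps every logit equal to the threshold.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def surviving_tokens(logits, top_k=50, top_p=0.9):
    """Return the token ids that remain candidates after top-k, then top-p."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]                        # keep the top_k highest logits
    if top_p < 1.0:
        probs = softmax([logits[i] for i in order])  # renormalize over the survivors
        kept, cum = [], 0.0
        for i, p in zip(order, probs):
            kept.append(i)                           # the token that crosses top_p is kept
            cum += p
            if cum > top_p:
                break
        order = kept
    return order

# Probabilities 0.5, 0.3, 0.15, 0.05 expressed as logits.
logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
print(surviving_tokens(logits, top_k=0, top_p=0.7))   # -> [0, 1]
```

With top_p=0.7 the first token alone covers only 0.5, so the second token (cumulative 0.8) is kept as the one that crosses the threshold, and everything after it is dropped.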

5.3 Running It

python thinker_infer_v4.py

You should see a MiniMind Thinker with roughly 60-70M parameters running pure CPU inference, and it chats smoothly on the laptop.
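
The 60-70M figure can be checked by hand from the layer shapes. The sketch below counts parameters for the architecture defined in the script in section 5.2, using the values from config.json; thinker_param_count is an illustrative name, and the count assumes the layers are exactly those in the script (bias-free linear layers plus RMSNorm weights).

```python
def thinker_param_count(vocab=6400, hidden=768, n_layers=8, n_heads=8,
                        n_kv=4, head_dim=96, intermediate=2432, tied=False):
    attn = (hidden * n_heads * head_dim        # q_proj
            + 2 * hidden * n_kv * head_dim     # k_proj + v_proj
            + n_heads * head_dim * hidden      # o_proj
            + 2 * head_dim)                    # q_norm + k_norm
    mlp = 3 * hidden * intermediate            # gate_proj, up_proj, down_proj
    block = attn + mlp + 2 * hidden            # plus the two RMSNorm weights
    embed = vocab * hidden
    head = 0 if tied else vocab * hidden       # the script keeps a separate lm_head
    return n_layers * block + hidden + embed + head

print(f"{thinker_param_count() / 1e6:.1f}M")            # 68.8M, separate lm_head
print(f"{thinker_param_count(tied=True) / 1e6:.1f}M")   # 63.9M, tied weights counted once
```

The script reports the larger number because it loads the tied embedding into a separate lm_head tensor, so those 4.9M weights are counted twice.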

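6. The No-Hassle Option: One-Command Ollama

If you would rather skip all the server setup, weight extraction, and PyTorch version matching above, just use Ollama.

MiniMind officially provides GGUF quantized models that run locally with a single command:

6.1 Installing Ollama

Windows: https://ollama.com/download/windows
Linux: curl -fsSL https://ollama.com/install.sh | sh
macOS: https://ollama.com/download/mac

6.2 One-Command Run

# MiniMind-3 base model (~0.1B, best suited to low-spec laptops)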
ollama run jingyaogong/minimind-3

GGUF source: https://huggingface.co/jingyaogong/minimind-3-gguf
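
Once the model has been pulled, it can also be queried from a script through Ollama's local HTTP API instead of the interactive prompt. The sketch below assumes Ollama is running on its default port 11434 and that the model name matches the ollama run command above; build_request and ask are illustrative helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"   # Ollama's default local endpoint

def build_request(prompt, model="jingyaogong/minimind-3"):
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt, timeout=120):
    """Send the prompt and return the generated text (needs a running Ollama)."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

Setting "stream" to False makes the server return a single JSON object, which keeps the client code short at the cost of waiting for the full answer.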

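6.3 Comparison

Option	Model	Inference speed	Deployment difficulty	Capabilities
Ollama	MiniMind-3 GGUF (Q4_K_M)	Fast (llama.cpp optimized)	⭐ One command	Text only
Extracted Thinker	MiniMind-O SFT Thinker (FP32)	Slower (pure PyTorch on CPU)	⭐⭐⭐ Needs a server + scripts	Text only (SFT version, better conversation quality)
Full Omni	MiniMind-O Full	Needs a GPU	⭐⭐⭐⭐⭐	Speech + vision + text

6.4 If the Laptop Struggles with Ollama

The MX130 has only 2GB of VRAM, but the Q4_K_M quantized MiniMind-3 is about 150MB, so CPU-only inference is no problem at all. Ollama automatically falls back to the CPU.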

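Appendix: Pitfalls

Problem	Cause	Fix
`tokenizer.json` is only 131 bytes	`git clone` fetched an LFS pointer	Use `modelscope download` instead
`libcudart.so.13 not found`	PyTorch built for CUDA 13.x vs. system CUDA 12.4	`pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124`
WebUI will not open on port 8888	Jupyter Lab already occupies 8888	Use `--port 7860` or another port
Random output after loading weights	Missing `model.` prefix, so every key is MISSING	Add the `model.` prefix during extraction
`RuntimeError: Boolean value of Tensor`	`freqs_cos[0,0]` returns multiple values	Use `.all()` for the check
HF cached an old version of the model code	transformers caches under `~/.cache/`	Delete the cache directory
`sft_omni.pth` not found	On AutoDL the file is named `sft_omni_768.pth`	Run `ls out/` first to confirm the filename
MX130 cannot use GPU PyTorch	Compute Capability 5.0, unsupported by PyTorch 2.x	Install the CPU build with `--index-url https://download.pytorch.org/whl/cpu`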
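
The "random output" pitfall happens because load_state_dict(strict=False) does not raise on name mismatches: it loads what matches and leaves everything else at its random initialization. The sketch below mimics the missing/unexpected report in plain Python; the key names are hypothetical and diff_keys is an illustrative helper.

```python
def diff_keys(expected, provided):
    """Mimic what load_state_dict(strict=False) reports: (missing, unexpected)."""
    expected, provided = set(expected), set(provided)
    return sorted(expected - provided), sorted(provided - expected)

# Hypothetical names: the model expects a model. prefix, the checkpoint lacks it.
expected = ["model.embed_tokens.weight", "model.norm.weight"]
provided = ["embed_tokens.weight", "norm.weight"]

missing, unexpected = diff_keys(expected, provided)
print("missing:", missing)
print("unexpected:", unexpected)
```

When every expected key shows up as missing and every provided key as unexpected, the problem is almost always a prefix mismatch rather than a corrupted file.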

Happy hacking! 🚀
