news 2026/5/9 14:15:51

CANN Kernel Authoring Guide

张小明

Front-end Development Engineer


Kernel Authoring Playbook

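[Free download link] cannbot-skills: CANNBot is a family of agents that raise development efficiency for CANN development; this repository provides reusable Skills modules for them. Project address: https://gitcode.com/cann/cannbot-skills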

Use this playbook when writing a new kernel or substantially rewriting an existing kernel body. Keep the path short, tool-first, and incremental.

Step 0 (prerequisite): settle the contract first

Before anything below, do:

  • agent/playbooks/clarify-first.md

If the contract is still ambiguous, stop and ask the user before you start coding.
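Settling the contract amounts to writing down exactly what the PyTorch reference accepts and returns. A minimal sketch, assuming the contract can be reduced to symbolic shapes plus dtype names: `KernelContract` is a hypothetical helper, not part of cannbot-skills, and it only checks that concrete shapes bind consistently to the symbols (dtypes are recorded, not enforced).

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class KernelContract:
    """Hypothetical record of the PyTorch contract a kernel must match."""
    name: str
    input_shapes: Dict[str, Tuple[str, ...]]   # symbolic shapes, e.g. {"a": ("M", "K")}
    input_dtypes: Dict[str, str]               # recorded for reference, not checked here
    output_shape: Tuple[str, ...]
    output_dtype: str

    def bind(self, shapes: Dict[str, Tuple[int, ...]]) -> Dict[str, int]:
        """Resolve symbolic dims from concrete input shapes; raise on any mismatch."""
        dims: Dict[str, int] = {}
        for arg, symbols in self.input_shapes.items():
            concrete = shapes[arg]
            if len(concrete) != len(symbols):
                raise ValueError(f"{arg}: expected rank {len(symbols)}, got {len(concrete)}")
            for sym, size in zip(symbols, concrete):
                # First occurrence fixes the symbol; later occurrences must agree.
                if dims.setdefault(sym, size) != size:
                    raise ValueError(f"{arg}: dim {sym} is {size}, expected {dims[sym]}")
        return dims

    def output_for(self, dims: Dict[str, int]) -> Tuple[int, ...]:
        return tuple(dims[sym] for sym in self.output_shape)


if __name__ == "__main__":
    # Example contract: C = A @ B with A: (M, K), B: (K, N), fp16 in and out.
    matmul = KernelContract(
        name="matmul",
        input_shapes={"a": ("M", "K"), "b": ("K", "N")},
        input_dtypes={"a": "float16", "b": "float16"},
        output_shape=("M", "N"),
        output_dtype="float16",
    )
    dims = matmul.bind({"a": (128, 64), "b": (64, 256)})
    print(dims, matmul.output_for(dims))  # {'M': 128, 'K': 64, 'N': 256} (128, 256)
```

Writing the contract this way makes an ambiguity visible early: if you cannot fill in a field, that is the question to ask the user.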

Goal

Produce a kernel that:

  • matches the exact PyTorch contract
  • uses a justified topology
  • stays within device, sync, and precision rules
  • is validated stage by stage instead of guessed into existence

1. Fast path: reach for tools before long reference docs

For most new kernels, the cheapest route is:

  1. shortlist examples with the selector tool
  2. pick the topology and print a fresh scaffold
  3. estimate tile/core split only when the shape is non-trivial
  4. open one source example only after the selector or the index has narrowed the choice down
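For step 3, the core arithmetic of a tile/core split is small enough to check by hand. The sketch below is hypothetical: it assumes a round-robin assignment of output tiles to cores and a simple model in which every output tile reloads its whole A row-block and B column-block from GM. It is not what `estimate_matmul_datamove.py` actually computes.

```python
import math
from typing import Dict, List, Tuple


def split_tiles(m: int, n: int, tm: int, tn: int, cores: int) -> List[List[Tuple[int, int]]]:
    """Hypothetical: assign the output tiles (i, j) of an m x n result to cores round-robin."""
    tiles_m = math.ceil(m / tm)
    tiles_n = math.ceil(n / tn)
    tiles = [(i, j) for i in range(tiles_m) for j in range(tiles_n)]
    return [tiles[c::cores] for c in range(cores)]


def gm_traffic(m: int, n: int, k: int, tm: int, tn: int) -> Dict[str, int]:
    """Elements moved to/from GM if each output tile reloads its A row-block and B column-block.

    Tail tiles move only their valid rows/columns, so under this model the sums are exact:
    all of A is read once per tile column, all of B once per tile row, and C is written once.
    """
    tiles_m = math.ceil(m / tm)
    tiles_n = math.ceil(n / tn)
    return {"a_read": tiles_n * m * k, "b_read": tiles_m * k * n, "c_write": m * n}


if __name__ == "__main__":
    per_core = split_tiles(300, 200, 128, 128, cores=4)
    print([len(t) for t in per_core])  # tiles per core: [2, 2, 1, 1]
    print(gm_traffic(300, 200, 64, 128, 128))
```

The uneven `[2, 2, 1, 1]` split is the kind of result that makes a shape "non-trivial": six tiles on four cores leave two cores idle for the second round.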

Recommended commands:

conda run -n torch210npu python agent/scripts/select_kernel_example.py --query "qk softmax pv" --topology 'cube->vec->cube->vec' --limit 3 --catalog
conda run -n torch210npu python agent/scripts/gen_kernel_skeleton.py --name preview_kernel --topology 'cube->vec' --print
conda run -n torch210npu python agent/scripts/estimate_matmul_datamove.py --help

If the selector tool is not enough, fall back to:

  1. agent/references/examples/kernel-index.md
  2. one matching agent/references/examples/kernel-catalog.md entry
  3. one source file under agent/example/kernels/

2. Minimal read map by scenario

If you need raw values before choosing a deeper route, jump straight to one focused facts page:

  • agent/references/facts-device-runtime.md for device caps, pipe pairs, and mutex signatures
  • agent/references/facts-authoring.md for hard rules, DBuff formulas, and a2 bridge reminders
  • agent/references/facts-simulator-opexec.md for simulator / shape_bindings / OpExec gotchas

Then branch to only one focused follow-up when possible.

| Situation | Read next | Open more only if... |
| --- | --- | --- |
| pure cube matmul or layout rewrite | agent/references/patterns/cube-only.md | split choice, unusual layout, or family tile is still unclear |
| a5 kernel with any vec-side stage | agent/references/constraints/a5-device.md | @vf() is not enough and you need micro / sort specifics |
| a2 cube -> vec | agent/references/patterns/a2-cube-vec.md | reduction, tail, or workspace behavior is still unclear |
| a2 cube -> vec -> cube | agent/references/patterns/a2-cube-vec-cube.md | delayed stage ownership still feels ambiguous |
| a2 normalized online softmax (score -> p -> pv -> final divide) | agent/references/patterns/a2-cube-vec-cube-vec-softmax.md | you hit a special failure mode that page does not already cover |
| sync / counter warning | agent/references/constraints/autosync.md or agent/references/constraints/counters.md | the warning persists after you traced the producer / consumer lifetime |
| lowering or simulator behavior looks wrong | agent/references/code-paths.md | the problem is clearly in runtime / parser behavior, not the kernel math |
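The a2 online-softmax row names the stage order score -> p -> pv -> final divide. As a math-level reference only, the sketch below runs the standard online-softmax recurrence for one query row in plain Python: each key block updates a running max, rescales the partial denominator and the partial p·v accumulator, and the division happens once at the end. The function names are hypothetical, and the sketch says nothing about how the pattern page assigns these steps to cube and vec stages.

```python
import math
from typing import List, Sequence


def online_softmax_pv(scores: Sequence[float], v: Sequence[Sequence[float]], block: int) -> List[float]:
    """softmax(scores) @ v for one query row, consuming keys `block` at a time.

    scores: length-L logits for one query; v: L value rows of width D.
    """
    width = len(v[0])
    running_max = -math.inf
    running_sum = 0.0           # sum of exp(score - running_max) seen so far
    acc = [0.0] * width         # unnormalized sum of p * v seen so far
    for start in range(0, len(scores), block):
        blk = scores[start:start + block]
        new_max = max(running_max, max(blk))
        # Rescale earlier partials to the new max (exp(-inf) == 0.0 on the first block).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for offset, s in enumerate(blk):
            p = math.exp(s - new_max)                                   # score -> p
            running_sum += p
            acc = [a + p * x for a, x in zip(acc, v[start + offset])]   # p -> pv
        running_max = new_max
    return [a / running_sum for a in acc]                               # final divide


def softmax_pv_reference(scores: Sequence[float], v: Sequence[Sequence[float]]) -> List[float]:
    """Direct two-pass version used to check the online one."""
    m = max(scores)
    e = [math.exp(s - m) for s in scores]
    total = sum(e)
    return [sum(ei * row[j] for ei, row in zip(e, v)) / total for j in range(len(v[0]))]


if __name__ == "__main__":
    scores = [0.5, -1.0, 2.0, 0.0, 3.0]
    v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0], [0.5, 0.5]]
    want = softmax_pv_reference(scores, v)
    for block in (1, 2, 3, 5):
        got = online_softmax_pv(scores, v, block)
        print(block, max(abs(g - w) for g, w in zip(got, want)))
```

Because the result must not depend on the block size, sweeping `block` is a cheap first check before any device code exists.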

3. Authoring rules that still matter on every kernel

  • Keep the file layout simple:
    1. imports and constants
    2. helper @vf / @func blocks
    3. main @kernel
    4. visible __main__ validation story
  • Prefer normal Python if / elif / else inside kernels; the repository rewrites them before instruction capture.
  • On a5, vec-side math belongs in @vf() / micro; do not copy an a2-style direct vec body into an a5 kernel.
  • Keep local buffers full-tile sized. Use valid_m / valid_n only at GM read/write boundaries unless a focused constraint file says otherwise.
  • auto_sync() is same-side ordering only. Cross-side ownership changes still need CvMutex or VcMutex.
  • Keep matmul accumulation in float unless the contract or an established family says otherwise.
  • If scalar dimensions may alias at runtime, add shape_bindings= at the OpExec(...)(...) call site before rewriting the kernel body.
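The full-tile rule can be pictured with a plain-Python model in which nested lists stand in for GM and every local buffer is allocated at the full `tm x tn` size; `valid_m` and `valid_n` are computed and used only in the load and store helpers. All helper names here are hypothetical, and none of this is easyasc API.

```python
from typing import List

Matrix = List[List[float]]


def load_tile(gm: Matrix, row0: int, col0: int, tm: int, tn: int) -> Matrix:
    """Copy a tile from GM into a full tm x tn local buffer, zero-padding the tail."""
    valid_m = min(tm, len(gm) - row0)
    valid_n = min(tn, len(gm[0]) - col0)
    local = [[0.0] * tn for _ in range(tm)]      # always full-tile sized
    for r in range(valid_m):                     # valid_m / valid_n used only at this boundary
        local[r][:valid_n] = gm[row0 + r][col0:col0 + valid_n]
    return local


def store_tile(gm: Matrix, local: Matrix, row0: int, col0: int) -> None:
    """Write back only the valid region of a full-tile local buffer."""
    tm, tn = len(local), len(local[0])
    valid_m = min(tm, len(gm) - row0)
    valid_n = min(tn, len(gm[0]) - col0)
    for r in range(valid_m):
        gm[row0 + r][col0:col0 + valid_n] = local[r][:valid_n]


def scale_by_tiles(gm_in: Matrix, alpha: float, tm: int, tn: int) -> Matrix:
    """Elementwise alpha * x, computed tile by tile on full-size local buffers."""
    rows, cols = len(gm_in), len(gm_in[0])
    gm_out = [[0.0] * cols for _ in range(rows)]
    for row0 in range(0, rows, tm):
        for col0 in range(0, cols, tn):
            local = load_tile(gm_in, row0, col0, tm, tn)
            # Compute over the whole tile; padded lanes hold zeros and are never stored.
            local = [[alpha * x for x in row] for row in local]
            store_tile(gm_out, local, row0, col0)
    return gm_out


if __name__ == "__main__":
    x = [[float(r * 7 + c) for c in range(7)] for r in range(5)]  # 5 x 7: tails in both dims
    print(scale_by_tiles(x, 2.0, 4, 4) == [[2.0 * v for v in row] for row in x])  # True
```

Zero padding is harmless for the elementwise scale here, but a reduction over padded lanes (a row max, for example) would need a neutral fill value such as -inf or an explicit mask.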

4. Implementation loop

Use the same small loop for every serious kernel:

  1. build the smallest honest slice of the formula
  2. validate one aligned case
  3. add the next stage or tail path
  4. validate again
  5. only then optimize or fuse further
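The loop can be sketched as stage functions that are each checked against a plain reference before the next one is layered on. In a real kernel the stage functions would launch device code and the reference would be the PyTorch expression kept in `__main__`; here both sides are plain-Python stand-ins with hypothetical names, so only the control flow carries over.

```python
import math
from typing import List

Matrix = List[List[float]]


def allclose(a: Matrix, b: Matrix, atol: float = 1e-6) -> bool:
    return len(a) == len(b) and all(
        len(ra) == len(rb) and all(abs(x - y) <= atol for x, y in zip(ra, rb))
        for ra, rb in zip(a, b)
    )


# Reference math, written as plainly as possible.
def ref_scores(q: Matrix, k: Matrix) -> Matrix:
    return [[sum(a * b for a, b in zip(qr, kr)) for kr in k] for qr in q]


def ref_softmax(s: Matrix) -> Matrix:
    out = []
    for row in s:
        m = max(row)
        e = [math.exp(x - m) for x in row]
        out.append([x / sum(e) for x in e])
    return out


# Hypothetical stand-ins for kernel stages.
def stage_scores(q: Matrix, k: Matrix) -> Matrix:
    return [[sum(qr[d] * kr[d] for d in range(len(qr))) for kr in k] for qr in q]


def stage_softmax(s: Matrix) -> Matrix:
    # Different formulation on purpose: exp(x - logsumexp(row)).
    out = []
    for row in s:
        m = max(row)
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        out.append([math.exp(x - lse) for x in row])
    return out


def validate(name: str, got: Matrix, want: Matrix) -> None:
    if not allclose(got, want):
        raise AssertionError(f"stage {name!r} diverged from the reference")
    print(f"{name}: ok")


if __name__ == "__main__":
    q = [[0.1, 0.2], [0.3, -0.4]]
    k = [[1.0, 0.5], [-0.5, 2.0], [0.0, 1.0]]
    s = stage_scores(q, k)                          # steps 1-2: smallest slice, one case
    validate("scores", s, ref_scores(q, k))
    validate("scores+softmax", stage_softmax(s),    # steps 3-4: add a stage, check again
             ref_softmax(ref_scores(q, k)))
```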

Practical rules:

  • keep the PyTorch reference inline in __main__
  • keep one aligned case and one tail case visible
  • for large fused kernels, keep standalone stage runners alive until the merged version passes
  • study examples for legal patterns, not for copy-paste
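One way to keep "one aligned case and one tail case" honest is to derive both shapes from the tile sizes, so the tail case is guaranteed to leave a partial last tile in every tiled dimension. The helpers and the `(128, 256)` tile below are hypothetical placeholders.

```python
from typing import Dict, Tuple

Shape = Tuple[int, ...]


def validation_cases(tile: Shape, tiles: int = 2) -> Dict[str, Shape]:
    """One aligned shape and one tail shape for a kernel tiled by `tile`.

    aligned: each dim is an exact multiple of its tile size.
    tail:    each dim leaves a partial last tile of t // 2 elements, so the
             valid-region (tail) path is exercised in every dimension.
    """
    if any(t < 2 for t in tile):
        raise ValueError("tile sizes must be >= 2 to produce a partial tail tile")
    aligned = tuple(tiles * t for t in tile)
    tail = tuple(tiles * t + t // 2 for t in tile)
    return {"aligned": aligned, "tail": tail}


def tail_sizes(shape: Shape, tile: Shape) -> Shape:
    """Valid extent of the last tile in each dim (equal to the tile size when aligned)."""
    return tuple(d % t or t for d, t in zip(shape, tile))


if __name__ == "__main__":
    tile = (128, 256)  # hypothetical (tm, tn)
    for name, shape in validation_cases(tile).items():
        print(name, shape, "last-tile valid sizes:", tail_sizes(shape, tile))
```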

5. When to leave the fast path

Open lower-level implementation files only when the smaller guidance layers stop being enough. Typical escalation path:

  1. easyasc/stub_functions/
  2. easyasc/parser/
  3. easyasc/simulator_v2/

Do not treat passing output with unresolved warnings as good enough. Warnings usually mean the model is incomplete.
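If the toolchain surfaces its sync or counter diagnostics through Python's `warnings` module (an assumption; it may log them as plain text instead), a validation run can be made to fail on them with the standard library alone. `run_strict` and the fake simulator call below are hypothetical; the same effect is available for a whole script via `python -W error`.

```python
import warnings
from typing import Callable, TypeVar

T = TypeVar("T")


def run_strict(fn: Callable[[], T]) -> T:
    """Run one validation step and turn any Python warning it emits into an exception."""
    with warnings.catch_warnings():
        warnings.simplefilter("error")  # every warning raised inside now propagates as an error
        return fn()


def fake_run_with_sync_warning() -> int:
    # Hypothetical stand-in for a simulator run whose output looks correct.
    warnings.warn("counter released before last consumer", RuntimeWarning)
    return 42


if __name__ == "__main__":
    try:
        run_strict(fake_run_with_sync_warning)
    except RuntimeWarning as exc:
        print("validation failed on warning:", exc)
```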

Authorship statement: parts of this article were generated with AI assistance (AIGC) and are for reference only.
