PyTorch basics: RuntimeError: Expected all tensors to be on the same device

海南风 posted on 2025-7-6 22:42:00

<h2 id="pytorch基础问题runtimeerror-expected-all-tensors-to-be-on-the-same-device">PyTorch basics: RuntimeError: Expected all tensors to be on the same device</h2>
<h3 id="introduction">Introduction</h3>
<p>Today I asked <em><strong>Claude 4 Sonnet</strong></em> to write reinforcement learning training code for Nogo, and it failed right away with:</p>
<blockquote>
<pre><code>RuntimeError: Expected all tensors to be on the same device
</code></pre>
</blockquote>
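<p>I don't usually pay much attention to details. To build the habit of not depending on LLMs, from now on I'll write a blog post for each error I run into.</p>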
<h3 id="main-part">Main part</h3>
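<p>The error <code>RuntimeError: Expected all tensors to be on the same device</code> usually appears when tensors that live on different devices (for example, the CPU and a GPU) are combined in a single operation:</p>
<pre><code class="language-python">import torch

# One tensor on the CPU (the values are arbitrary example data)
a = torch.tensor([1, 2, 3])
# One tensor on the GPU (assumes CUDA is available)
b = torch.tensor([4, 5, 6]).to("cuda")

# Adding them raises the error
c = a + b  # RuntimeError: Expected all tensors to be on the same device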
</code></pre>
<p>The fix is to move them onto the same device with <code>.to()</code>, <code>.cuda()</code>, or a similar method.</p>
<h4 id="torchdevice">torch.device</h4>
<p>Continuing with the code above:</p>
<pre><code class="language-python">print(a.device)
print(b.device)
# Output:
# cpu
# cuda:0
</code></pre>
<p>Selecting a device:</p>
<pre><code class="language-python">device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
</code></pre>
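<p>Without an index, <code>"cuda"</code> refers to the current CUDA device, which by default is the first GPU, <code>cuda:0</code>.</p>
<p>You can also name a specific GPU:</p>
<pre><code># Put a tensor on the first GPU (index 0); the values are arbitrary example data
a = torch.tensor([1, 2, 3]).to("cuda:0")
# Put a tensor on the second GPU (index 1)
b = torch.tensor([4, 5, 6]).to("cuda:1")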
</code></pre>
<p>Of course, on my laptop this is bound to fail, since it has no GPU with index 1:</p>
<blockquote>
<pre><code>RuntimeError: CUDA error: invalid device ordinal
</code></pre>
</blockquote>
<p>On a multi-GPU machine you can print how many GPUs are visible:</p>
<pre><code class="language-python">print(torch.cuda.device_count())
</code></pre>
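<p>The device count can also guard against the <code>invalid device ordinal</code> error. The following is a minimal sketch, not part of the original post; the helper name <code>pick_device</code> is hypothetical, and falling back to the CPU is only one possible policy.</p>

```python
import torch


def pick_device(index: int = 0) -> torch.device:
    """Return cuda:<index> if that GPU exists, otherwise fall back to the CPU.

    Hypothetical helper for illustration; it is not a PyTorch API.
    """
    if torch.cuda.is_available() and 0 <= index < torch.cuda.device_count():
        return torch.device(f"cuda:{index}")
    return torch.device("cpu")


# Ask for the second GPU; on a single-GPU or CPU-only machine this yields the CPU.
device = pick_device(1)
x = torch.zeros(3, device=device)
print(x.device)
```

<p>Whether a silent fallback is desirable depends on the job; in a real training script, raising an error might be the safer choice.</p>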
<h4 id="torchtensorto">torch.tensor().to()</h4>
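<p><code>torch.tensor().to()</code> is how PyTorch moves a tensor to a given device (such as the CPU or a GPU). It returns the tensor on the target device rather than modifying the original in place, so the result has to be assigned.</p>
<p>Use it to place data on a device <strong>explicitly</strong>, so that later operations don't fail.</p>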
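<pre><code class="language-python">dvc = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Use the GPU if one is available, otherwise fall back to the CPU

# The values are arbitrary example data
a = torch.tensor([1, 2, 3]).to(dvc)
b = torch.tensor([4, 5, 6]).to(dvc)
print(a.device)
print(b.device)
c = a + b
print(c)
# Output (on a machine with CUDA):
# cuda:0
# cuda:0
# tensor([5, 7, 9], device='cuda:0')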
</code></pre>
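<p>Training code often passes batches around as dicts or lists of tensors, so calling <code>.to()</code> on each one by hand is easy to forget. The sketch below is not from the original post; the helper <code>to_device</code> and the batch layout are assumptions made for illustration.</p>

```python
import torch


def to_device(obj, device):
    """Recursively move tensors inside dicts, lists, and tuples to `device`.

    Hypothetical helper for illustration; it is not a PyTorch API.
    """
    if torch.is_tensor(obj):
        return obj.to(device)
    if isinstance(obj, dict):
        return {key: to_device(value, device) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(to_device(item, device) for item in obj)
    return obj  # non-tensor values (ints, strings, ...) are returned unchanged


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Made-up batch: the keys and shapes are arbitrary example data.
batch = {"inputs": torch.zeros(2, 3), "labels": [torch.ones(2)], "step": 7}
batch = to_device(batch, device)
print(batch["inputs"].device, batch["labels"][0].device)
```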
<h3 id="summary">Summary</h3>
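<p>Explicitly checking and setting where tensors live is a good habit, especially in multi-GPU training.</p>
<ol>
<li><strong>A key clue when tracking down errors</strong></li>
</ol>
<p>The most common error in a multi-GPU environment is:</p>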
<blockquote>
<pre><code>RuntimeError: Expected all tensors to be on the same device
</code></pre>
</blockquote>
<p>If you print <code>.device</code> at key points, you can quickly see which tensor has gone astray.</p>
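<p>One way to make that check systematic is to group the tensors by device, so the odd one out is named directly. This is a rough sketch and not from the original post; <code>group_by_device</code> and the tensor names <code>state</code> and <code>action</code> are hypothetical.</p>

```python
import torch


def group_by_device(**tensors):
    """Map each device string to the names of the tensors that live on it.

    Hypothetical helper for illustration; it is not a PyTorch API.
    """
    groups = {}
    for name, tensor in tensors.items():
        groups.setdefault(str(tensor.device), []).append(name)
    return groups


state = torch.zeros(4)
action = torch.zeros(4)
if torch.cuda.is_available():
    action = action.to("cuda")  # deliberately create a mismatch when a GPU exists

groups = group_by_device(state=state, action=action)
if len(groups) > 1:
    print("device mismatch:", groups)
else:
    print("all tensors on:", next(iter(groups)))
```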
<ol start="2">
<li><strong>Preventing device mismatches (e.g. the model on GPU 0 and the data on GPU 1)</strong></li>
</ol>
<pre><code class="language-python">print("model on:", next(model.parameters()).device)
print("inputs on:", inputs.device)
</code></pre>
<p>This shows at a glance whether the two match.</p>
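<p>The same check can drive the fix: read the device from the model and move the inputs there before the forward pass. The sketch below uses a small <code>nn.Linear</code> as a stand-in model with made-up shapes; neither comes from the original training code.</p>

```python
import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(4, 2).to(device)  # stand-in model; Module.to() moves parameters in place
inputs = torch.randn(8, 4)          # created on the CPU by default

model_device = next(model.parameters()).device
if inputs.device != model_device:
    # Tensor.to() returns a new tensor, so the result must be reassigned.
    inputs = inputs.to(model_device)

outputs = model(inputs)
print(outputs.shape, outputs.device)
```

<p>Note the asymmetry: calling <code>.to()</code> on a module moves its parameters in place, while calling it on a tensor only returns a moved copy.</p>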
<ol start="3">
<li><strong>Helping with debugging and logging</strong></li>
</ol>
<p>Print device information in the training log, for example:</p>
<pre><code class="language-python">print(f"Epoch {epoch}: input.device={inputs.device}, label.device={labels.device}")
</code></pre>
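<p>This gives you a clearer picture of the program's state on remote servers, in asynchronous runs, and under multi-GPU scheduling.</p><br><br>
Source: https://www.cnblogs.com/ZJNpeace/p/18969507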
