Andrew Ng Deep Learning, Course 4 (Computer Vision), Week 4: Applications of Convolutional Networks, Exercises and Code Practice

你的口气比脚气都大, posted on 2026-1-1 17:41:00


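This category collects my study notes for Andrew Ng's Deep Learning course. The series is now complete; click to open the full table of contents. 
Links to the course materials:
<ol>
<li>Original course videos: [Bilingual subtitles] Andrew Ng Deep Learning, deeplearning.ai</li>
<li>Course materials on GitHub, including slides and notes: Andrew Ng Deep Learning teaching materials</li>
<li>Course exercises (Chinese and English) with answers: Andrew Ng Deep Learning exercises and answers</li>
</ol>
This post covers the exercises and code practice for Course 4, Week 4.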
<hr>
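<h1 id="1-理论习题">1. Theory Exercises</h1>
[Chinese/English] [Andrew Ng course quiz] Course 4 - Convolutional Neural Networks - Week 4 Quiz 
The quiz is fairly straightforward, so we will not go through it here.
<h1 id="2代码实践">2. Code Practice</h1>
[Chinese/English] [Andrew Ng programming assignment] Course 4 - Convolutional Neural Networks - Week 4 Assignment 
A reminder once more about the Keras import issues in the assignment code. 
As before, we use mature existing frameworks to implement the two models introduced this week: face recognition and image style transfer.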
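<h2 id="21-人脸识别">2.1 Face Recognition</h2>
In current research and production deployments, the full face recognition pipeline is considerably more complex and complete than what the theory section covered, so we again go through it point by point.
<h3 id="1python-库insightface">(1) Python library: InsightFace</h3>
Face recognition is used in many areas of daily life. Just as object detection has ultralytics, which we introduced earlier, face recognition has a library that packages mature algorithms into an engineered, modular toolkit: InsightFace.
InsightFace is a face analysis library built on algorithms such as ArcFace. Its features include:
<ol>
<li>Face detection: detects one or more faces in an image and returns bounding boxes and keypoints;</li>
<li>Face alignment: uses the keypoints to rotate and scale faces into a canonical pose, which improves recognition accuracy;</li>
<li>Face recognition/verification: extracts embeddings for similarity computation or one-to-many search;</li>
<li>Gender, age, and pose estimation: built-in lightweight prediction models;</li>
<li>Modular and extensible: you can use the pretrained models directly or swap in models you trained yourself.</li>
</ol>
With InsightFace we rarely need to implement the algorithms from scratch; calling its interfaces is enough for face recognition experiments and demos. 
As before, InsightFace and its dependencies can be installed with pip: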
<pre><code class="language-bash">pip install insightface onnxruntime
</code></pre>
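Notes:
<ul>
<li>CPU: the default installation above is enough; no extra configuration is needed.</li>
<li>GPU: for GPU acceleration, install the GPU build of ONNX Runtime:</li>
</ul>
<pre><code class="language-bash">pip install onnxruntime-gpu
</code></pre>
A few caveats; skipping these setup steps can lead to errors:
<ol>
<li>The GPU build of ONNX Runtime requires CUDA and cuDNN versions that are compatible with the installed ONNX Runtime release.</li>
<li>On Windows, some components need Microsoft C++ Build Tools to compile C++/Cython extensions:
<ul>
<li>Install Visual Studio Installer.</li>
<li>Select the "Desktop development with C++" workload.</li>
<li>After that, <code>insightface</code> and other dependencies (such as <code>face3d</code>) should compile correctly.</li>
</ul>
</li>
</ol>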
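Before choosing between CPU and GPU, it can help to check which execution providers ONNX Runtime actually reports. The following is a minimal sketch, not part of the original post: <code>pick_ctx_id</code> is a hypothetical helper, and a listed CUDA provider does not by itself guarantee that the CUDA/cuDNN libraries load correctly at inference time.

```python
def available_providers():
    """Return the ONNX Runtime execution providers visible in this environment."""
    try:
        import onnxruntime as ort  # optional: absent until onnxruntime is installed
    except ImportError:
        return []
    return ort.get_available_providers()


def pick_ctx_id(providers):
    """Hypothetical helper: 0 (first GPU) if the CUDA provider is listed, else -1 (CPU)."""
    return 0 if "CUDAExecutionProvider" in providers else -1


providers = available_providers()
print("providers:", providers)
print("suggested ctx_id:", pick_ctx_id(providers))
```

The suggested value is meant for the <code>ctx_id</code> argument of <code>app.prepare</code> used in the examples in this post.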
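With InsightFace installed, let's look at how to use it.
<h3 id="2insightface-预训练模型">(2) InsightFace pretrained models</h3>
Pretrained models are nothing new to us by now. InsightFace ships a range of pretrained model packs, from lightweight to heavyweight, which can be downloaded and loaded through its API. 
Here is a short snippet:
<pre><code class="language-python">from insightface.app import FaceAnalysis

# Downloads the pretrained model pack automatically on first use.
# buffalo_l is the default pack; the "_l" suffix marks the large variant.
app = FaceAnalysis(name='buffalo_l')
</code></pre>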
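On first run the model pack is downloaded, unpacked, and cached under your user directory automatically; no manual setup is needed. On Windows the location is:
<pre><code>C:\Users\<username>\.insightface\models\
</code></pre>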
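To confirm that the download worked, the cache directory can be inspected directly. This is a small sketch under the assumption that the cache lives at <code>.insightface/models</code> under the home directory, as described above, and that the models are stored as <code>.onnx</code> files; <code>cached_model_files</code> is a hypothetical helper name.

```python
from pathlib import Path


def cached_model_files(root=None):
    """List cached .onnx files relative to the cache root (empty list if none exist)."""
    # Default location taken from the text above; `root` can be overridden.
    root = Path(root) if root is not None else Path.home() / ".insightface" / "models"
    if not root.is_dir():
        return []
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*.onnx"))


for name in cached_model_files():
    print(name)
```

An empty result before the first call to <code>FaceAnalysis</code> is expected, since nothing has been downloaded yet.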
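Note that <code>buffalo_l</code> is not a single recognition model. It bundles several stages of the face recognition pipeline:
<ol>
<li>Face detection: quickly locates face regions in the input image.</li>
<li>Keypoint localization: marks keypoints (eyes, mouth corners, nose tip, and so on) on each detected face.</li>
<li>3D face modeling (optional, some model packs only): predicts 3D structure information for the face.</li>
<li>Face feature extraction: maps each face to a vector in a high-dimensional space, which is the encoding we discussed earlier.</li>
<li>Gender and age prediction (some model packs only): predicts the gender and age range of the face.</li>
</ol>
Now that we know what it does, let's try it out:
<h3 id="3示例使用人脸检测">(3) Example: face detection</h3>
The following code initializes the model and runs face detection: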
<pre><code class="language-python">from insightface.app import FaceAnalysis
import cv2  # used to read the image

app = FaceAnalysis(name='buffalo_l')  # default (large) pretrained model pack
app.prepare(ctx_id=0, det_size=(640, 640))  # ctx_id=0: first GPU; ctx_id=-1: CPU
img = cv2.imread("images4.jpg")  # read the image as a BGR array
faces = app.get(img)  # run detection and feature extraction

if faces:
    print("Number of faces detected:", len(faces))
    # Print the first 20 dimensions of the embeddings of the first two faces
    for i, face in enumerate(faces[:2]):
        print(f"Face {i + 1} embedding:", face.embedding[:20])
</code></pre>
Here is the output: 
<img src="https://img2024.cnblogs.com/blog/3708248/202601/3708248-20260101173840687-1529887582.png" alt="image.png" loading="lazy"> 
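This completes the encoding of the faces in the image.
<h3 id="4示例使用关键点定位">(4) Example: keypoint localization</h3>
Likewise, as long as the chosen pretrained model pack supports it, the result of <code>app.get(img)</code> also contains the keypoint locations: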
<pre><code class="language-python">from insightface.app import FaceAnalysis
import cv2

app = FaceAnalysis(name='buffalo_l')
app.prepare(ctx_id=0, det_size=(640, 640))
img = cv2.imread("images4.jpg")
faces = app.get(img)

if faces:
    # Draw the detections on a copy of the image
    vis = img.copy()
    for face in faces:
        # Bounding box as plain-int pixel coordinates (x1, y1, x2, y2)
        x1, y1, x2, y2 = (int(v) for v in face.bbox)
        cv2.rectangle(vis, (x1, y1), (x2, y2), (0, 255, 0), 2)
        # 106-point 2D landmarks, if the model pack provides them
        if 'landmark_2d_106' in face:
            landmarks_2d = face['landmark_2d_106'].astype(int)
            for (x, y) in landmarks_2d:
                cv2.circle(vis, (int(x), int(y)), 2, (0, 0, 255), -1)
    # Save the result
    cv2.imwrite("output.jpg", vis)
    print("Result saved to output.jpg")
</code></pre>
Here is the result: 
<img src="https://img2024.cnblogs.com/blog/3708248/202601/3708248-20260101173840103-664111316.png" alt="image.png" loading="lazy"> 
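The facial landmarks are located quite accurately as well.
<h3 id="5通过相似度学习实现人脸识别">(5) Face recognition through similarity learning</h3>
With the basic features demonstrated, let's return to the main topic and recall the principle: face recognition does not train a network that answers "who are you", but one that answers "whom do you most resemble". In other words, it learns similarity rather than classification, which is what makes the system practical to deploy. 
We can therefore feed two images into the pretrained model and compute their similarity from the two encodings: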
<pre><code class="language-python">import cv2
import numpy as np
from insightface.app import FaceAnalysis

# 1. Initialize InsightFace
app = FaceAnalysis(name='buffalo_l')
app.prepare(ctx_id=-1, det_size=(640, 640))  # ctx_id=-1 runs on CPU

# 2. Read the images
img1 = cv2.imread("images1.jpg")
img2 = cv2.imread("images2.jpg")

# 3. Detect faces and take the embedding of the first face in each image
faces1 = app.get(img1)
faces2 = app.get(img2)
if not faces1 or not faces2:
    raise SystemExit("No face detected in at least one of the images")
emb1 = faces1[0].embedding
emb2 = faces2[0].embedding

# 4. Cosine similarity, a commonly used alternative to Euclidean distance
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

sim = cosine_similarity(emb1, emb2)
print(f"Cosine similarity: {sim:.4f}")

# 5. Threshold decision
if sim > 0.60:
    print("Probably the same person")
else:
    print("Probably not the same person")
</code></pre>
Here is the result: 
<img src="https://img2024.cnblogs.com/blog/3708248/202601/3708248-20260101173840005-440514446.png" alt="image.png" loading="lazy"> 
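In this way, by learning similarity we avoid the architecture and retraining problems that come with a growing number of identities, and still achieve face recognition. 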
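The script above scores pairs with cosine similarity, while the theory part compared encodings by squared Euclidean distance. For L2-normalized vectors the two are tied by the identity d² = 2 − 2·cos, so thresholding one is equivalent to thresholding the other. The sketch below checks this numerically on toy 3-dimensional vectors (illustrative values, not real embeddings), using plain Python lists to stay dependency-free.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy vectors standing in for two face embeddings
a = l2_normalize([3.0, 4.0, 0.0])
b = l2_normalize([4.0, 3.0, 0.0])

cos = cosine_similarity(a, b)
d2 = squared_distance(a, b)
print(f"cosine similarity: {cos:.4f}")
print(f"squared distance:  {d2:.4f}")
print(f"2 - 2*cos:         {2 - 2 * cos:.4f}")
```

The identity only holds after normalization; for raw embeddings of different lengths the two measures can rank pairs differently.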
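In a real application, we can run every enrolled identity through the model in advance and store the resulting encodings. When someone scans their face, the captured image is encoded by the same model, its encoding is compared against each stored one in turn, and the next action is chosen based on the result. 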
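The enroll-then-compare flow just described can be sketched as a one-to-many search. This is an illustrative sketch, not the author's code: the gallery holds toy 3-dimensional vectors in place of real embeddings, <code>identify</code> is a hypothetical helper, and the 0.60 threshold is simply carried over from the script above.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify(probe, gallery, threshold=0.60):
    """Return (name, similarity) of the best match, or (None, similarity) if below threshold."""
    best_name, best_sim = None, -1.0
    for name, emb in gallery.items():
        sim = cosine_similarity(probe, emb)
        if sim > best_sim:
            best_name, best_sim = name, sim
    if best_sim > threshold:
        return best_name, best_sim
    return None, best_sim


# Encodings computed once at enrollment time (toy values)
gallery = {
    "alice": [1.0, 0.0, 0.0],
    "bob": [0.0, 1.0, 0.0],
}

print(identify([0.9, 0.1, 0.0], gallery))  # close to alice's encoding
print(identify([0.5, 0.5, 0.7], gallery))  # not close enough to anyone
```

Adding a new person only means adding one entry to the gallery; the network itself is untouched, which is the deployment advantage of learning similarity instead of classification.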
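If a different trade-off between accuracy and speed is needed, another model pack can be selected through the <code>name</code> argument.
One question remains: what if we want to train our own model? 
InsightFace mainly provides pretrained models and inference/application interfaces, so it does not offer a fully packaged, out-of-the-box, end-to-end pipeline for training on your own dataset in the way that PyTorch or TF workflows do. 
So although we can reuse the network architectures defined by InsightFace, we still need PyTorch or TF to write the data loading, training, and gradient descent logic. 
Making the two work together requires a fair amount of extra setup. A more common approach is to build your own face recognition network entirely in PyTorch or TF, but that involves network architectures we have not covered yet. 
We will therefore not go further here and will return to this topic once the relevant theory has been covered.
Now for the other part: image style transfer.
<h2 id="22-图像风格转换">2.2 Image Style Transfer</h2>
Let's again start by recalling the core idea of image style transfer: with the parameters of a pretrained convolutional network held fixed, the intermediate-layer features are used to explicitly separate an image's content structure from its style statistics. The input image itself is then optimized directly by minimizing the corresponding cost functions in feature space, which reconstructs an image that satisfies both the content and the style constraints.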
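For reference, the cost can be written in the lecture's notation, with C the content image, S the style image, and G the generated image. The per-layer forms below are a sketch that mirrors the mean-squared differences used in the code in this section rather than the lecture's exact normalization constants; the layer sets L_c and L_s and the symbol M for the Gram matrix are notation introduced here for illustration.

```latex
J(G) = \alpha \, J_{\mathrm{content}}(C, G) + \beta \, J_{\mathrm{style}}(S, G)

J_{\mathrm{content}}(C, G) = \sum_{l \in L_c}
  \operatorname{mean}\!\left[ \left( a^{[l](G)} - a^{[l](C)} \right)^{2} \right]

J_{\mathrm{style}}(S, G) = \sum_{l \in L_s}
  \operatorname{mean}\!\left[ \left( M^{[l](G)} - M^{[l](S)} \right)^{2} \right],
\qquad
M^{[l]}_{kk'} = \frac{1}{n_C \, n_H \, n_W} \sum_{i=1}^{n_H} \sum_{j=1}^{n_W}
  a^{[l]}_{ijk} \, a^{[l]}_{ijk'}
```

In the code, α corresponds to <code>content_weight</code> and β to <code>style_weight</code>.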
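So to build an image style transfer pipeline, we first need a pretrained network that extracts image features reasonably well to serve as the tool. 
Here I use VGG16 as the pretrained model:
<pre><code class="language-python"># weights=... replaces the deprecated pretrained=True (torchvision 0.13 and later)
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.to(device).eval()  # evaluation mode
for param in vgg.parameters():  # freeze all network parameters
    param.requires_grad = False
</code></pre>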
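Now for the code. First, the content image and the style image need to be preprocessed:
<pre><code class="language-python"># Load and preprocess an input image: resize to a common size, convert to a
# tensor, add a batch dimension, and move it to the target device
def load_image(path, size=(256, 512)):  # size is (height, width), as in the appendix
    image = Image.open(path).convert("RGB")
    transform = transforms.Compose([
        transforms.Resize(size),
        transforms.ToTensor()
    ])
    return transform(image).unsqueeze(0).to(device)

content = load_image("content.jpg")
style = load_image("style.jpg")
</code></pre>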
We also need a few helper functions:
<pre><code class="language-python"># Convert an image tensor back to a PIL image for visualization and saving
def tensor_to_pil(tensor):
    image = tensor.detach().cpu().clone().squeeze(0)
    image = transforms.ToPILImage()(image.clamp(0, 1))
    return image

# Collect the feature maps of the requested layers for the cost computation
def get_features(x, model, layers):
    features = {}
    for name, layer in model._modules.items():
        x = layer(x)
        if name in layers:
            features[name] = x
    return features

# Compute the Gram matrix used by the style cost
def gram_matrix(features):
    b, ch, h, w = features.size()
    features = features.view(b, ch, h * w)
    gram = torch.bmm(features, features.transpose(1, 2))
    return gram / (ch * h * w)
</code></pre>
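To make the Gram matrix concrete, here is a tiny worked example in plain Python. It is a sketch that uses the same normalization as <code>gram_matrix</code> above (division by channels times spatial positions) on a made-up 2-channel, 2x2 feature map. It also shows why the Gram matrix captures style rather than layout: shuffling the spatial positions leaves it unchanged.

```python
def gram_matrix(features):
    """features: list of C channels, each a flat list of H*W activations."""
    c = len(features)
    n = len(features[0])
    return [
        [sum(a * b for a, b in zip(features[i], features[j])) / (c * n) for j in range(c)]
        for i in range(c)
    ]


# Made-up feature map: 2 channels, each a 2x2 map flattened to 4 values
feats = [
    [1.0, 2.0, 0.0, 1.0],  # channel 0
    [0.0, 1.0, 1.0, 0.0],  # channel 1
]
G = gram_matrix(feats)
print(G)

# Apply the same spatial permutation to every channel
perm = [3, 0, 2, 1]
shuffled = [[channel[p] for p in perm] for channel in feats]
print(gram_matrix(shuffled) == G)
```

Each entry measures how strongly two channels fire together across the whole image, with no record of where they fired.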
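Finally, we set the hyperparameters before running the optimization:
<pre><code class="language-python"># VGG layers used for the content representation (usually a deeper layer, which keeps semantic information)
content_layers = ['15']
# VGG layers used for the style representation (shallow to deep, capturing texture statistics at several scales)
style_layers = ['0', '5', '10', '15']
# Create the optimizable output image, initialized from the content image, with gradients enabled
output = content.clone().requires_grad_(True).to(device)
# Optimize the output image with L-BFGS (a common choice for style transfer)
optimizer = optim.LBFGS([output])
# Weight of the style loss: controls how strongly the style shows in the result
style_weight = 1e6
# Weight of the content loss: controls how close the result stays to the content image
content_weight = 1
# Precompute the content image features at the content layers
content_features = get_features(content, vgg, content_layers)
# Precompute the style image features at the style layers
style_features = get_features(style, vgg, style_layers)
# Precompute the Gram matrix of each style layer, which represents the style statistics
style_grams = {layer: gram_matrix(style_features[layer]) for layer in style_layers}
# Total number of optimization iterations
num_steps = 400
# Save or display an intermediate result every display_step iterations
display_step = 50
# Output images collected every display_step iterations, for visualizing the process
output_images = []
</code></pre>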
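The string indices in <code>content_layers</code> and <code>style_layers</code> are positions inside <code>vgg16().features</code>. The sketch below rebuilds the index-to-layer mapping in plain Python, assuming the standard VGG16 layout without batch normalization (convolution, ReLU, and pooling entries in sequence). The names such as <code>conv1_1</code> are conventional labels introduced here for readability, not torchvision attributes.

```python
# Channel widths per conv layer; "M" marks a max-pooling layer (assumed VGG16 layout)
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]


def vgg16_feature_names(cfg=VGG16_CFG):
    """Map each index in the feature stack to a readable label."""
    names = {}
    idx, block, conv = 0, 1, 1
    for v in cfg:
        if v == "M":
            names[str(idx)] = f"pool{block}"
            idx += 1
            block += 1
            conv = 1
        else:
            # Every convolution is followed by its own ReLU entry
            names[str(idx)] = f"conv{block}_{conv}"
            names[str(idx + 1)] = f"relu{block}_{conv}"
            idx += 2
            conv += 1
    return names


names = vgg16_feature_names()
for i in ["0", "5", "10", "15"]:
    print(i, names[i])
```

Under this assumption, index 15 is the ReLU output after the third convolution of block 3, while the other three style layers are raw convolution outputs.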
With that in place, we can finally run the optimization:
<pre><code class="language-python">run = [0]  # mutable step counter, so the closure can update it
while run[0] <= num_steps:
    def closure():
        optimizer.zero_grad()
        output_features = get_features(output, vgg, content_layers + style_layers)

        # Content loss
        content_loss = 0
        for layer in content_layers:
            content_loss += torch.mean((output_features[layer] - content_features[layer])**2)

        # Style loss
        style_loss = 0
        for layer in style_layers:
            G = gram_matrix(output_features[layer])
            style_loss += torch.mean((G - style_grams[layer])**2)

        total_loss = content_weight * content_loss + style_weight * style_loss
        total_loss.backward()

        # Record intermediate outputs
        if run[0] % display_step == 0:
            print(f"Step {run[0]}: Total Loss: {total_loss.item():.2f}")
            output_images.append(tensor_to_pil(output.clone()))
        run[0] += 1
        return total_loss

    # L-BFGS may evaluate the closure several times per step
    optimizer.step(closure)
</code></pre>
Now let's look at the results: 
<img src="https://img2024.cnblogs.com/blog/3708248/202601/3708248-20260101173839136-1949214051.png" alt="image.png" loading="lazy">
<img src="https://img2024.cnblogs.com/blog/3708248/202601/3708248-20260101173839578-1423866898.png" alt="myplot21312.png" loading="lazy"> 
That completes the image style transfer. You can swap in your own images to see how it performs.
<h1 id="3-附录">3. Appendix</h1>
<h2 id="31-图像风格转换代码-pytorch版">3.1 Image style transfer code, PyTorch version</h2>
<pre><code class="language-python">import torch
import torch.optim as optim
from torchvision import transforms, models
from PIL import Image
import matplotlib.pyplot as plt

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# Load and preprocess an input image: resize to a common size, convert to a
# tensor, add a batch dimension, and move it to the target device
def load_image(path, size=(256, 512)):  # size is (height, width)
    image = Image.open(path).convert("RGB")
    transform = transforms.Compose([
        transforms.Resize(size),
        transforms.ToTensor()
    ])
    return transform(image).unsqueeze(0).to(device)

content = load_image("content.jpg")
style = load_image("style.jpg")

# Convert an image tensor back to a PIL image for visualization and saving
def tensor_to_pil(tensor):
    image = tensor.detach().cpu().clone().squeeze(0)
    image = transforms.ToPILImage()(image.clamp(0, 1))
    return image

# Compute the Gram matrix used by the style cost
def gram_matrix(features):
    b, ch, h, w = features.size()
    features = features.view(b, ch, h * w)
    gram = torch.bmm(features, features.transpose(1, 2))
    return gram / (ch * h * w)

# weights=... replaces the deprecated pretrained=True (torchvision 0.13 and later)
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.to(device).eval()
for param in vgg.parameters():
    param.requires_grad = False

content_layers = ['15']
style_layers = ['0', '5', '10', '15']

# Collect the feature maps of the requested layers
def get_features(x, model, layers):
    features = {}
    for name, layer in model._modules.items():
        x = layer(x)
        if name in layers:
            features[name] = x
    return features

output = content.clone().requires_grad_(True).to(device)
optimizer = optim.LBFGS([output])

style_weight = 1e6
content_weight = 1

content_features = get_features(content, vgg, content_layers)
style_features = get_features(style, vgg, style_layers)
style_grams = {layer: gram_matrix(style_features[layer]) for layer in style_layers}

num_steps = 400
display_step = 50

# Output images collected every display_step iterations
output_images = []

print("Starting optimization...")
run = [0]  # mutable step counter, so the closure can update it
while run[0] <= num_steps:
    def closure():
        optimizer.zero_grad()
        output_features = get_features(output, vgg, content_layers + style_layers)

        # Content loss
        content_loss = 0
        for layer in content_layers:
            content_loss += torch.mean((output_features[layer] - content_features[layer])**2)

        # Style loss
        style_loss = 0
        for layer in style_layers:
            G = gram_matrix(output_features[layer])
            style_loss += torch.mean((G - style_grams[layer])**2)

        total_loss = content_weight * content_loss + style_weight * style_loss
        total_loss.backward()

        # Record intermediate outputs
        if run[0] % display_step == 0:
            print(f"Step {run[0]}: Total Loss: {total_loss.item():.2f}")
            output_images.append(tensor_to_pil(output.clone()))
        run[0] += 1
        return total_loss

    # L-BFGS may evaluate the closure several times per step
    optimizer.step(closure)

# Show the collected intermediate results in a grid
num_imgs = len(output_images)
cols = 3
rows = (num_imgs + cols - 1) // cols
plt.figure(figsize=(5 * cols, 5 * rows))
for i, img in enumerate(output_images):
    plt.subplot(rows, cols, i + 1)
    plt.imshow(img)
    plt.axis('off')
    plt.title(f"Step {i * display_step}")
plt.tight_layout()
plt.show()
</code></pre>
<h2 id="32-图像风格转换代码-tf版">3.2 Image style transfer code, TF version</h2>
<pre><code class="language-python">import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import image

def load_and_process_img(path, target_size=None, max_size=512):
    img = image.load_img(path)
    # Downscale so the longer side is at most max_size (PIL size is (width, height))
    if max(img.size) > max_size:
        scale = max_size / max(img.size)
        img = img.resize((int(img.size[0] * scale), int(img.size[1] * scale)))
    if target_size is not None:
        img = img.resize(target_size)
    img = image.img_to_array(img)
    img = np.expand_dims(img, axis=0)
    img = preprocess_input(img)  # RGB -> BGR and ImageNet mean subtraction
    return tf.convert_to_tensor(img, dtype=tf.float32)

device = "/GPU:0" if tf.config.list_physical_devices('GPU') else "/CPU:0"
print("Using device:", device)
content_path = "content.jpg"
style_path = "style.jpg"
max_size = 512
content_weight = 1.0
style_weight = 1e6
num_steps = 400
display_step = 50

# Work out the common (width, height) from the content image
content_pil = image.load_img(content_path)
if max(content_pil.size) > max_size:
    scale = max_size / max(content_pil.size)
    final_size = (int(content_pil.size[0] * scale), int(content_pil.size[1] * scale))
else:
    final_size = content_pil.size

print(f"Final common image size: {final_size}")

content_img = load_and_process_img(content_path, max_size=max_size)
style_img = load_and_process_img(style_path, target_size=final_size, max_size=max_size)
vgg = VGG16(include_top=False, weights='imagenet')
vgg.trainable = False

content_layers = ['block4_conv2']
style_layers = ['block1_conv1', 'block2_conv1', 'block3_conv1', 'block4_conv1']

# Model outputs: style layers first, then content layers
outputs = [vgg.get_layer(name).output for name in style_layers + content_layers]
model = Model(vgg.input, outputs)

def gram_matrix(tensor):
    # NHWC -> NCHW, then flatten the spatial dimensions
    x = tf.transpose(tensor, [0, 3, 1, 2])
    b, c, h, w = tf.unstack(tf.shape(x))
    features = tf.reshape(x, (b, c, h * w))
    gram = tf.matmul(features, features, transpose_b=True)
    return gram / tf.cast(c * h * w, tf.float32)

def get_features(x):
    outs = model(x)
    style_outs = outs[:len(style_layers)]
    content_outs = outs[len(style_layers):]
    return style_outs, content_outs

# Style targets come from the style image, content targets from the content image
style_features, _ = get_features(style_img)
_, content_features = get_features(content_img)
style_grams = [gram_matrix(f) for f in style_features]

# Start from the content image plus a little noise
noise = tf.random.uniform(tf.shape(content_img), -20., 20.)
output_img = tf.Variable(content_img + noise)

optimizer = tf.keras.optimizers.Adam(learning_rate=5.0)

# Undo preprocess_input: add the channel means back and convert BGR -> RGB
def deprocess_img(x):
    x = x.numpy()[0].copy()  # drop the batch dimension
    x[:, :, 0] += 103.939
    x[:, :, 1] += 116.779
    x[:, :, 2] += 123.68
    x = x[:, :, ::-1]  # BGR -> RGB
    x = np.clip(x, 0, 255).astype('uint8')
    return x

output_images = []

for step in range(num_steps):
    with tf.GradientTape() as tape:
        style_out, content_out = get_features(output_img)
        # Content loss
        content_loss = tf.add_n([tf.reduce_mean((a - b) ** 2)
                                 for a, b in zip(content_out, content_features)])
        # Style loss
        style_loss = tf.add_n([tf.reduce_mean((gram_matrix(a) - g) ** 2)
                               for a, g in zip(style_out, style_grams)])

        total_loss = content_weight * content_loss + style_weight * style_loss

    grads = tape.gradient(total_loss, output_img)
    optimizer.apply_gradients([(grads, output_img)])

    if step % display_step == 0:
        print(f"Step {step}, Total loss: {total_loss.numpy():.2f}")
        output_images.append(deprocess_img(output_img))

# Show the collected intermediate results in a grid
num_imgs = len(output_images)
cols = 3
rows = (num_imgs + cols - 1) // cols

plt.figure(figsize=(5 * cols, 5 * rows))
for i, img in enumerate(output_images):
    plt.subplot(rows, cols, i + 1)
    plt.imshow(img)
    plt.title(f"Step {i * display_step}")
    plt.axis('off')
plt.tight_layout()
plt.show()
</code></pre> 
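The constants in <code>deprocess_img</code> undo what the VGG16 <code>preprocess_input</code> does: it reorders RGB to BGR and subtracts a fixed per-channel mean without rescaling. The single-pixel round trip below is a plain-Python sketch of that pair of steps; <code>preprocess_pixel</code> and <code>deprocess_pixel</code> are hypothetical helpers, and the sketch rounds where the listing above truncates with <code>astype('uint8')</code>.

```python
# Per-channel means in BGR order, the same constants as in deprocess_img
BGR_MEAN = (103.939, 116.779, 123.68)


def preprocess_pixel(rgb):
    """RGB -> BGR, then subtract the channel means (no rescaling)."""
    r, g, b = rgb
    return tuple(v - m for v, m in zip((b, g, r), BGR_MEAN))


def deprocess_pixel(bgr_centered):
    """Add the means back, convert BGR -> RGB, and clip to the 0..255 range."""
    b, g, r = (v + m for v, m in zip(bgr_centered, BGR_MEAN))
    # Rounding here is a deliberate deviation that avoids off-by-one float errors
    return tuple(int(min(255, max(0, round(v)))) for v in (r, g, b))


pixel = (10, 200, 255)
centered = preprocess_pixel(pixel)
print(centered)
print(deprocess_pixel(centered))
```

This also explains why the noise added to the initial image spans roughly -20 to 20: the optimization runs in this mean-centered 0 to 255 scale, not in the 0 to 1 range used by the PyTorch version.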
Source: https://www.cnblogs.com/Goblinscholar/p/19430429
