One-Line Summary: Depthwise separable convolutions factorize a standard convolution into a spatial depthwise convolution and a channel-wise pointwise convolution, reducing computation by 8--9x with minimal accuracy loss.

Prerequisites: Convolution in neural networks, 1x1 convolutions, computational cost of convolutions

What Is Depthwise Separable Convolution?

Consider washing a stack of colored plates. A standard approach is to scrub each plate with a sponge that handles both the circular scrubbing motion (spatial) and the color-specific soap (channel) simultaneously. A depthwise separable approach splits this into two steps: first, scrub each plate individually with a plain sponge (spatial filtering per channel), then apply the color-specific treatment across all plates at once (channel mixing). By decoupling where to look from what to combine, each step becomes simpler and cheaper.

Technically, a standard convolution with kernel size KxK, C_in input channels, and C_out output channels applies C_out filters, each of size KxKxC_in. A depthwise separable convolution breaks this into:

  1. Depthwise convolution: Apply one KxK filter per input channel (total: C_in filters), producing C_in feature maps.
  2. Pointwise convolution: Apply C_out filters of size 1x1xC_in to mix channels, producing the final C_out feature maps.

How It Works

Standard Convolution Cost

For an input of spatial size H x W with C_in input channels, C_out output channels, and kernel size KxK, the cost in multiply-accumulates is:

  Cost_standard = H * W * C_in * C_out * K * K

Depthwise Convolution

Each of the C_in channels is convolved independently with its own KxK filter:

  Cost_depthwise = H * W * C_in * K * K

This produces C_in output feature maps. No inter-channel information is mixed at this stage. In group convolution terms, depthwise convolution is a grouped convolution with groups = C_in (one group per channel).

Pointwise Convolution

A standard 1x1 convolution combines the C_in depthwise outputs into C_out channels:

  Cost_pointwise = H * W * C_in * C_out

Total Reduction

The ratio of separable to standard convolution cost:

  (Cost_depthwise + Cost_pointwise) / Cost_standard = 1/C_out + 1/K^2

For K = 3 and C_out = 256: ratio = 1/256 + 1/9 ≈ 0.115, which is approximately an 8.7x reduction in computation. For larger C_out, the savings approach K^2 (9x for 3x3 kernels).
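These numbers can be checked with a few lines of Python (a quick sketch; the spatial size H x W is arbitrary, since it cancels out of the ratio):

```python
# Multiply-accumulate counts for a 3x3 convolution, C_in = C_out = 256
H, W, K, C_in, C_out = 32, 32, 3, 256, 256

standard  = H * W * C_in * C_out * K * K   # standard convolution
depthwise = H * W * C_in * K * K           # one KxK filter per channel
pointwise = H * W * C_in * C_out           # 1x1 channel mixing
separable = depthwise + pointwise

print(separable / standard)      # ~0.115
print(1 / C_out + 1 / K ** 2)    # closed-form ratio, same value
print(standard / separable)      # ~8.7x reduction
```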

Parameter Reduction

Standard: K^2 * C_in * C_out parameters. Separable: C_in * K^2 + C_in * C_out parameters.

For K = 3, C_in = C_out = 256: Standard has 589,824 params; separable has 67,840 -- an 8.7x reduction.
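One way to confirm these counts is to instantiate the layers and sum their parameters (a sketch assuming PyTorch is available; bias=False so only filter weights are counted):

```python
import torch.nn as nn

def count_params(module):
    return sum(p.numel() for p in module.parameters())

# Standard 3x3 convolution, 256 -> 256 channels
standard = nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False)

# Depthwise (groups=256) followed by pointwise (1x1)
separable = nn.Sequential(
    nn.Conv2d(256, 256, kernel_size=3, padding=1, groups=256, bias=False),
    nn.Conv2d(256, 256, kernel_size=1, bias=False),
)

print(count_params(standard))    # 589824
print(count_params(separable))   # 67840
```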

Implementation Details

# PyTorch implementation
import torch.nn as nn
 
# Depthwise convolution: groups = C_in
depthwise = nn.Conv2d(
    in_channels=64, out_channels=64,
    kernel_size=3, padding=1, groups=64  # key: groups = in_channels
)
 
# Pointwise convolution: 1x1
pointwise = nn.Conv2d(
    in_channels=64, out_channels=128,
    kernel_size=1
)

The groups parameter in nn.Conv2d controls the grouping. Setting groups=in_channels produces a depthwise convolution.
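The effect of groups is visible directly in the weight tensors: Conv2d weights have shape (out_channels, in_channels // groups, kH, kW), so with groups = in_channels each filter sees exactly one input channel. A small self-contained check, re-declaring the two layers from above:

```python
import torch
import torch.nn as nn

depthwise = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64)
pointwise = nn.Conv2d(64, 128, kernel_size=1)

print(depthwise.weight.shape)  # torch.Size([64, 1, 3, 3]) -- one 3x3 filter per channel
print(pointwise.weight.shape)  # torch.Size([128, 64, 1, 1]) -- pure channel mixing

# Chaining the two stages gives the full separable convolution
x = torch.randn(1, 64, 32, 32)
y = pointwise(depthwise(x))
print(y.shape)                 # torch.Size([1, 128, 32, 32])
```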

Batch Normalization and Activation

In practice, BN and ReLU (or ReLU6, swish) are applied after each stage:

Input -> DepthwiseConv -> BN -> ReLU -> PointwiseConv -> BN -> ReLU -> Output

Some architectures (MobileNetV2) apply a linear bottleneck after the pointwise convolution, omitting the final ReLU to prevent information loss in low-dimensional spaces.
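The Conv -> BN -> ReLU pattern above can be packaged as a single module, roughly following the MobileNetV1 layer structure (a sketch; the class name and hyperparameters are illustrative, not from any particular library):

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 -> BN -> ReLU -> pointwise 1x1 -> BN -> ReLU."""

    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, stride=stride, padding=1,
                      groups=c_in, bias=False),    # depthwise: one filter per channel
            nn.BatchNorm2d(c_in),
            nn.ReLU(inplace=True),
            nn.Conv2d(c_in, c_out, 1, bias=False),  # pointwise: channel mixing
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

layer = DepthwiseSeparableConv(64, 128, stride=2)
y = layer(torch.randn(1, 64, 32, 32))
print(y.shape)  # torch.Size([1, 128, 16, 16])
```

Striding is conventionally applied in the depthwise stage, so downsampling happens before the (more expensive) channel mixing.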

Why It Matters

  1. Enables mobile and edge deployment. The 8--9x computation reduction makes it feasible to run CNNs on smartphones, IoT devices, and other resource-constrained hardware.
  2. Foundation of efficient architectures. MobileNet, Xception, EfficientNet, and many NAS-discovered architectures rely on depthwise separable convolutions as their core building block.
  3. Minimal accuracy trade-off. On ImageNet, MobileNetV1 with depthwise separable convolutions achieves 70.6% top-1 accuracy with only 569M FLOPs, compared to VGG-16 at 71.5% with 15,500M FLOPs -- a 27x compute reduction for less than 1% accuracy loss.
  4. Influenced hardware design. Mobile NPUs and DSPs now include specific optimizations for depthwise convolutions due to their prevalence in deployed models.

Key Technical Details

  • The depthwise stage accounts for only about 3% of total computation but ~33% of total runtime on GPUs due to poor arithmetic intensity (low ratio of computation to memory access).
  • On CPUs and DSPs, depthwise separable convolutions are proportionally faster because these platforms are more memory-bound and the reduced computation translates directly to speedup.
  • Xception (Chollet, 2017) uses depthwise separable convolutions throughout and achieves 79.0% top-1 accuracy on ImageNet (vs. Inception V3's 78.8%) with the same parameter count.
  • The depthwise filter has only K^2 parameters per channel (9 for a 3x3 kernel), which is quite small. This low parameter count per channel can limit the expressive power of the spatial filtering step.
  • Grouped convolutions with group counts between 1 and C_in provide a tunable middle ground between standard and depthwise convolutions.
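That middle ground shows up in the parameter count of a grouped KxK convolution, which interpolates between the two extremes (a small sketch; any 1x1 mixing stage after the grouped convolution is omitted):

```python
K, C = 3, 256  # kernel size and channel count (C_in = C_out = C)

def grouped_params(groups):
    # Each output channel convolves C // groups input channels
    return K * K * (C // groups) * C

for g in (1, 4, 16, 64, 256):
    print(g, grouped_params(g))
# groups=1   -> 589824 (standard convolution)
# groups=256 -> 2304   (depthwise)
```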

Common Misconceptions

  • "Depthwise separable convolutions are always faster." On GPUs, the depthwise stage has very low arithmetic intensity, making it memory-bound and difficult to parallelize efficiently. The theoretical 8-9x FLOP reduction often translates to only 2-3x wall-clock speedup on GPUs without specialized kernels.
  • "The two stages are just a mathematical trick." The factorization assumes that spatial and channel correlations can be learned independently, which is a genuine modeling constraint. Chollet (2017) argued this is actually a beneficial inductive bias, as the Xception results suggest.
  • "Depthwise separable convolutions cannot match standard convolution accuracy." With sufficient width and depth, networks using depthwise separable convolutions can match or exceed standard convolution networks (Xception vs. Inception V3, EfficientNet vs. ResNet).

Connections to Other Concepts

  • convolution-in-neural-networks.md: Depthwise separable convolution is a strict factorization of the standard convolution operation into spatial and channel components.
  • inception.md: Inception's use of 1x1 convolutions for dimensionality reduction is conceptually related; depthwise separable convolutions take the factorization further by making every spatial convolution single-channel.
  • mobilenet.md: Built entirely on depthwise separable convolutions, applying them systematically for mobile-efficient architectures.
  • efficientnet.md: Uses depthwise separable convolutions as the primary operator within its mobile inverted bottleneck blocks.

Further Reading

  • Sifre & Mallat, "Rigid-Motion Scattering for Texture Classification" (2014) -- Early work proposing separable filters in deep networks.
  • Chollet, "Xception: Deep Learning with Depthwise Separable Convolutions" (2017) -- Systematic replacement of Inception modules with depthwise separable convolutions.
  • Howard et al., "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications" (2017) -- Established depthwise separable convolutions as the standard for efficient mobile CNNs.