Medical Image Analysis

One-Line Summary: Medical image analysis applies computer vision to radiology, pathology, and ophthalmology, where U-Net architectures dominate segmentation, data is scarce and 3D, and regulatory approval (FDA/CE) gates deployment.

Prerequisites: Convolutional Neural Networks, Image Segmentation, Transfer Learning, Data Augmentation, 3D Vision

What Is Medical Image Analysis?

Consider a radiologist who reads 50--100 CT scans per day, each containing hundreds of slices. Fatigue-related miss rates for subtle lung nodules reach 20--30%. Medical image analysis uses computer vision to assist -- flagging suspicious regions, measuring tumor volumes, or grading disease severity -- not to replace clinicians but to reduce errors and accelerate workflows.

Technically, medical image analysis encompasses classification (disease vs. normal), detection (localizing lesions), segmentation (delineating organ or tumor boundaries), and registration (aligning images across time or modalities). The field operates under unique constraints: 3D volumetric data, extreme class imbalance (most tissue is healthy), limited labeled data, and strict regulatory requirements.

How It Works

Imaging Modalities

Radiology: CT (3D volumes, ~512x512xN voxels, Hounsfield units), MRI (multiple sequences -- T1, T2, FLAIR), X-ray (2D projection).
Pathology: Whole-slide images (WSIs) at 40x magnification, often 100,000 x 100,000 pixels. Processed as patches (e.g., 256x256) due to memory constraints.
Ophthalmology: Fundus photography (2D) and Optical Coherence Tomography (OCT, 3D cross-sections of the retina).
Dermatology: Clinical photographs for lesion classification (melanoma vs. benign).

U-Net and Its Dominance

U-Net (Ronneberger et al., 2015) remains the backbone of medical image segmentation. Its encoder-decoder architecture with skip connections preserves spatial detail critical for precise boundary delineation.

The architecture:

Encoder: Repeated 3x3 convolutions + 2x2 max pooling (4 downsampling steps).
Bottleneck: Deepest feature representation.
Decoder: 2x2 transposed convolutions + concatenation of skip connections from the encoder + 3x3 convolutions.

Variants dominate the field:

3D U-Net (Cicek et al., 2016): Replaces 2D convolutions with 3D for volumetric data.
nnU-Net (Isensee et al., 2021): Auto-configuring U-Net that adapts preprocessing, architecture, and training to each dataset. Won 33 out of 53 medical segmentation challenges without manual tuning.
Attention U-Net: Adds attention gates to skip connections, focusing on relevant regions.
Swin-UNETR: Replaces the CNN encoder with a Swin Transformer for better long-range context.

The Dice loss is standard for medical segmentation, addressing class imbalance between foreground (lesion) and background:

$L_{D i ce} = 1 - \frac{2 \sum _{i} p _{i} g _{i} + ϵ}{\sum _{i} p _{i} + \sum _{i} g _{i} + ϵ}$

where $p_{i}$ and $g_{i}$ are predicted and ground-truth values at voxel $i$ , and $ϵ$ prevents division by zero.

Handling Limited Data

Medical datasets are small (often 100--1,000 cases) due to annotation cost (expert radiologist time) and privacy constraints (HIPAA, GDPR).

Transfer learning: Pretrain on ImageNet (even though natural images differ from medical ones) then fine-tune. This consistently improves performance by 3--10% when data is below 1,000 samples.
Self-supervised pretraining: Methods like Models Genesis or contrastive learning on unlabeled medical images are gaining traction.
Data augmentation: Elastic deformations, random rotations, intensity scaling, and mixup. nnU-Net uses aggressive augmentation by default.
Federated learning: Train across multiple hospitals without sharing patient data. Each site computes gradients locally; only aggregated updates are shared.

3D Volumetric Processing

CT and MRI scans are 3D. Two approaches:

2D slice-by-slice: Process each axial slice independently. Fast but ignores inter-slice context.
3D convolutions: Process the full volume (or patches, e.g., 128x128x128). Captures 3D context but requires 8x more memory per layer than 2D.
2.5D: Use three orthogonal 2D slices (axial, coronal, sagittal) as input channels. A practical compromise.

Regulatory Pathway

In the United States, medical AI software is regulated as a Software as a Medical Device (SaMD) by the FDA:

510(k): Substantial equivalence to a predicate device. Most AI-based tools use this pathway. Average clearance time: 3--6 months.
De Novo: For novel devices without predicates. Longer review process.
PMA (Premarket Approval): For highest-risk (Class III) devices. Rare for AI tools.

As of 2024, the FDA has authorized over 950 AI/ML-enabled medical devices, with radiology accounting for ~75% of approvals.

Why It Matters

AI-assisted screening for diabetic retinopathy (IDx-DR, FDA-approved 2018) enables diagnosis in primary care clinics without an ophthalmologist.
Lung nodule detection in low-dose CT screening reduces radiologist reading time by ~50% while maintaining sensitivity above 95%.
Digital pathology with AI assistance improves cancer grading consistency; human inter-observer agreement for Gleason grading is only ~60%, while AI-assisted agreement reaches ~80%.
Federated learning enables multi-institutional studies without compromising patient privacy, addressing one of the field's biggest bottlenecks.

Key Technical Details

nnU-Net automatically adapts patch size, batch size, and network depth to available GPU memory; trained on a single V100 (32 GB) in most cases.
Dice scores for organ segmentation (liver, kidney) typically exceed 0.95; for tumor segmentation, 0.70--0.85 is considered strong.
A typical chest CT is 512x512x300 voxels at 0.7mm in-plane resolution and 1--3mm slice spacing, requiring ~300 MB per volume in float32.
IDx-DR (autonomous diabetic retinopathy screening) achieved 87.2% sensitivity and 90.7% specificity in the pivotal clinical trial.
Training a 3D U-Net on 200 CT volumes with augmentation typically takes 12--24 hours on a single GPU.

Common Misconceptions

"Medical AI will replace radiologists." Current AI tools are decision-support systems, not autonomous diagnosticians (with rare exceptions like IDx-DR). Regulatory frameworks require physician oversight in almost all cases.
"ImageNet pretraining is useless for medical images." Despite the domain gap, ImageNet-pretrained features consistently outperform random initialization for medical tasks with limited data.
"More data always solves the problem." Label quality matters more than quantity. Expert annotations from board-certified radiologists outperform crowd-sourced labels, even at 10x fewer samples.

Connections to Other Concepts

Image Segmentation: U-Net and its variants are the primary architectures; Dice loss is the standard training objective.
3D Vision: Volumetric medical data requires 3D convolutions, point clouds (for surface meshes), or multi-view processing.
data-augmentation.md: Elastic deformations and intensity augmentations are essential for small medical datasets.
edge-deployment.md: Point-of-care devices in rural clinics require efficient models that run without cloud connectivity.