Course · 13 modules · 120 lessons · 636 min

Computer Vision Concepts

Image fundamentals through CNNs, object detection, segmentation, generative models, vision transformers, and 3D vision.

Image Fundamentals
· Color Spaces (5 min): A color space is a coordinate system that maps numerical tuples to perceivable colors, with different spaces optimizing for hardware convenience (RGB), perceptual uniformity (CIELAB), or separation of luminance from chrominance (YCbCr, HSV).
· Convolution and Filtering (6 min): Convolution slides a small kernel (weight matrix) across an image, computing weighted sums at each position to achieve effects like blurring, sharpening, and edge detection -- and is the same operation at the heart of convolutional neural networks. (A short sketch follows this list.)
· Digital Images and Pixels (6 min): A digital image is a rectangular grid of discrete numerical values (pixels) that approximates a continuous visual scene through spatial sampling and intensity quantization.
· Frequency Domain and Fourier Transform (6 min): The Fourier transform decomposes an image into a sum of sinusoidal components at different frequencies and orientations, enabling efficient filtering, compression, and analysis of periodic structures.
· Image Histograms (5 min): An image histogram counts the frequency of each pixel intensity level, providing a compact statistical summary that drives contrast enhancement, thresholding, and exposure analysis.
· Image Interpolation and Resampling (7 min): Image interpolation estimates pixel values at non-integer coordinates by combining nearby known samples, enabling image resizing, rotation, warping, and any geometric transformation that maps output pixels to fractional input positions.
· Image Noise and Denoising (7 min): Image noise is unwanted random variation in pixel values introduced during capture or transmission, and denoising methods attempt to suppress it while preserving edges and detail -- a fundamental tradeoff that runs through all of image processing.
· Image Pyramids and Scale Space (6 min): Image pyramids and scale-space representations capture an image at multiple resolutions and blur levels, enabling algorithms to detect features and objects regardless of their size in the scene.
· Morphological Operations (6 min): Morphological operations use a small structuring element to probe and modify the geometric structure of shapes in binary and grayscale images, enabling noise removal, shape analysis, and feature extraction through operations like erosion, dilation, opening, and closing.
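A minimal sketch of the sliding-window operation described in the Convolution and Filtering lesson, written in plain NumPy for illustration (the array sizes, the Sobel kernel, and the `convolve2d` helper are this example's own choices, not course code): slide a $3 \times 3$ kernel over a grayscale image and take a weighted sum at every position.

```python
import numpy as np

def convolve2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Valid-mode 2D convolution of a grayscale image with a small kernel."""
    kh, kw = kernel.shape
    k = np.flipud(np.fliplr(kernel))           # true convolution flips the kernel
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * k)
    return out

img = np.random.rand(8, 8)                                        # toy grayscale "image"
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)   # horizontal-gradient kernel
edges = convolve2d(img, sobel_x)
print(edges.shape)                                                # (6, 6): valid positions only
```

Swapping an averaging kernel in place of `sobel_x` blurs instead of detecting edges; the mechanism is identical, which is the point the lesson makes about CNNs reusing this operation with learned kernels.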
Feature Extraction and Classical Vision
· Camera Calibration and Geometry (6 min): Camera calibration determines the intrinsic and extrinsic parameters that define how a camera maps 3D world points to 2D image pixels, based on the pinhole model and its extensions.
· Corner Detection (4 min): Corner detection identifies points where image intensity changes sharply in multiple directions, producing stable landmarks for tracking and matching via methods like Harris and Shi-Tomasi.
· Edge Detection (4 min): Edge detection identifies boundaries in images where pixel intensity changes sharply, using gradient-based operators like Sobel and multi-stage pipelines like Canny.
· HOG (Histogram of Oriented Gradients) (4 min): HOG captures local shape and appearance by aggregating gradient orientation histograms over dense spatial cells, forming the classical descriptor behind the breakthrough Dalal-Triggs pedestrian detector.
· Hough Transform (5 min): The Hough transform detects parametric shapes (lines, circles, ellipses) by having each edge pixel vote in a parameter space, where peaks correspond to the shapes present in the image.
· Image Stitching and Homography (5 min): Image stitching combines overlapping photographs into seamless panoramas by matching features, estimating projective homographies with RANSAC, and blending warped images together.
· Optical Flow (5 min): Optical flow estimates the per-pixel apparent motion between consecutive video frames, using methods like Lucas-Kanade for sparse tracking and Horn-Schunck for dense fields.
· ORB and Binary Descriptors (5 min): ORB, BRIEF, and BRISK encode local image patches as compact binary strings compared via Hamming distance, enabling feature matching orders of magnitude faster than floating-point descriptors like SIFT.
· SIFT (Scale-Invariant Feature Transform) (4 min): SIFT detects keypoints and computes descriptors that remain stable across changes in scale, rotation, and illumination, enabling robust image matching and recognition.
· Template Matching (4 min): Template matching slides a reference image patch across a target image, computing a similarity score at every position to find where the template appears. (A short sketch follows this list.)
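The Template Matching lesson boils down to a sliding similarity score; here is an illustrative NumPy sketch using normalized cross-correlation (the image, the cropped template, and the `match_template_ncc` helper are invented for this example, not course code).

```python
import numpy as np

def match_template_ncc(image, template):
    """Score every placement of the template with normalized cross-correlation."""
    th, tw = template.shape
    t = template - template.mean()
    t_norm = np.sqrt((t ** 2).sum())
    scores = np.zeros((image.shape[0] - th + 1, image.shape[1] - tw + 1))
    for i in range(scores.shape[0]):
        for j in range(scores.shape[1]):
            patch = image[i:i + th, j:j + tw]
            p = patch - patch.mean()
            denom = np.sqrt((p ** 2).sum()) * t_norm
            scores[i, j] = (p * t).sum() / denom if denom > 1e-12 else 0.0
    return scores

img = np.random.rand(32, 32)
template = img[10:16, 20:26].copy()                       # crop a patch, then search for it
scores = match_template_ncc(img, template)
print(np.unravel_index(scores.argmax(), scores.shape))    # peak at (10, 20), where the crop came from
```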
Convolutional Neural Networks
· AlexNet (7 min): AlexNet won the 2012 ImageNet Large Scale Visual Recognition Challenge with 16.4% top-5 error, demonstrating that deep convolutional neural networks trained on GPUs could dramatically outperform traditional computer vision methods.
· Convolution in Neural Networks (6 min): A convolution layer slides small learned filters across an input, producing feature maps that detect local patterns through weight sharing and local connectivity.
· DenseNet (5 min): DenseNet connects every layer to every other layer within a dense block, maximizing feature reuse and achieving strong accuracy with substantially fewer parameters than ResNet.
· Depthwise Separable Convolutions (5 min): Depthwise separable convolutions factorize a standard convolution into a spatial depthwise convolution and a channel-wise pointwise convolution, reducing computation by 8--9x with minimal accuracy loss. (A worked example follows this list.)
· EfficientNet (6 min): EfficientNet uses compound scaling to uniformly scale network depth, width, and resolution with a fixed ratio, achieving state-of-the-art accuracy-efficiency tradeoffs from a neural architecture search baseline (B0) up to B7.
· Inception (GoogLeNet) (5 min): The Inception architecture uses parallel multi-scale convolution branches within each module and $1 \times 1$ convolutions for dimensionality reduction, achieving 6.7% top-5 error on ImageNet with only 6.8 million parameters.
· MobileNet (5 min): MobileNet is a family of efficient CNN architectures built on depthwise separable convolutions, designed for mobile and embedded deployment with tunable width and resolution multipliers.
· Neural Architecture Search (6 min): Neural Architecture Search (NAS) automates the design of neural network architectures by searching over a defined space of possible configurations, optimizing for accuracy, latency, or other objectives.
· Pooling Layers (6 min): Pooling layers reduce the spatial dimensions of feature maps by summarizing local regions, providing translation invariance and computational savings.
· Receptive Field (6 min): The receptive field of a neuron is the region of the input image that can influence its activation, growing with network depth through successive convolutions and pooling operations.
· ResNet (5 min): ResNet introduced skip connections that enable identity mappings, allowing successful training of networks up to 152 layers deep and achieving 3.57% top-5 error on ImageNet.
· VGGNet (6 min): VGGNet demonstrated that network depth with uniform $3 \times 3$ convolutions is a critical factor for representation quality, achieving 7.3% top-5 error on ImageNet with the VGG-16 and VGG-19 architectures.
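To make the 8--9x figure in the Depthwise Separable Convolutions lesson concrete, here is a back-of-the-envelope multiply-accumulate count for one layer (the 56x56 spatial size and 128 channels are assumed, typical mid-network values, not taken from the course).

```python
# Multiply-accumulate (MAC) counts for one convolutional layer.
def standard_conv_macs(h, w, c_in, c_out, k=3):
    return h * w * c_in * c_out * k * k          # every output mixes all channels spatially

def depthwise_separable_macs(h, w, c_in, c_out, k=3):
    depthwise = h * w * c_in * k * k             # one k x k filter per input channel
    pointwise = h * w * c_in * c_out             # 1x1 convolution mixes channels
    return depthwise + pointwise

h = w = 56
c_in = c_out = 128
std = standard_conv_macs(h, w, c_in, c_out)
sep = depthwise_separable_macs(h, w, c_in, c_out)
print(f"standard: {std:,}  separable: {sep:,}  savings: {std / sep:.1f}x")   # about 8.4x
```

The ratio works out to $k^2 c_{out} / (k^2 + c_{out})$, so with $3 \times 3$ kernels it approaches 9x as the number of output channels grows.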
Training and Optimization
· Batch Normalization (4 min): Batch normalization normalizes activations across the mini-batch at each layer, enabling higher learning rates, faster convergence, and acting as a mild regularizer.
· Data Augmentation (5 min): Data augmentation artificially expands the training set by applying random transformations to images, acting as the cheapest and most effective regularizer available.
· Dropout and Regularization (4 min): Dropout randomly zeroes neuron activations during training to prevent co-adaptation, while L2 regularization and its variants penalize large weights -- together they are the primary tools for controlling overfitting in deep networks.
· Knowledge Distillation (5 min): Knowledge distillation transfers the learned behavior of a large teacher network into a smaller student network by training the student to match the teacher's soft output probabilities, capturing inter-class relationships that hard labels miss.
· Label Smoothing (4 min): Label smoothing replaces hard one-hot target vectors with soft distributions, preventing the model from becoming overconfident and improving generalization and calibration.
· Learning Rate Scheduling (5 min): Learning rate scheduling systematically varies the learning rate during training -- typically warming up, then decaying -- to achieve faster convergence and better final accuracy than any fixed rate.
· Mixup and CutMix (5 min): Mixup linearly blends pairs of images and their labels, while CutMix cuts and pastes rectangular regions between images, both producing soft training targets that improve generalization, calibration, and robustness. (A short sketch follows this list.)
· Progressive Resizing (5 min): Progressive resizing starts training on small images and gradually increases resolution, achieving faster convergence and often better accuracy by providing a natural curriculum from coarse to fine features.
· Self-Supervised Pretraining (6 min): Self-supervised pretraining learns visual representations from unlabeled images by solving pretext tasks -- such as predicting masked patches or matching augmented views -- producing features that rival or exceed supervised ImageNet pretraining.
· Transfer Learning (4 min): Transfer learning reuses features learned on a large source dataset (typically ImageNet) to solve a different target task, eliminating the need to train from scratch and dramatically reducing data and compute requirements.
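A minimal sketch of the mixup idea from the Mixup and CutMix lesson (the image shapes, class names, and `alpha` value are illustrative assumptions): blend two training images and their one-hot labels with a weight drawn from a Beta distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Blend two images and their one-hot labels with a Beta(alpha, alpha) weight."""
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

img_a, img_b = np.random.rand(32, 32, 3), np.random.rand(32, 32, 3)
lab_a = np.array([1.0, 0.0, 0.0])               # one-hot "cat" (hypothetical classes)
lab_b = np.array([0.0, 1.0, 0.0])               # one-hot "dog"
x_mixed, y_mixed = mixup(img_a, lab_a, img_b, lab_b)
print(y_mixed)                                  # soft target [lam, 1 - lam, 0] instead of a hard one-hot
```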
Object Detection
· Anchor-Free Detection (5 min): Anchor-free detectors eliminate predefined anchor boxes by directly predicting object locations as per-pixel classifications (FCOS) or center-point heatmaps (CenterNet), removing a major source of hyperparameter tuning while matching or exceeding anchor-based accuracy.
· DETR (Detection Transformer) (5 min): DETR reformulates object detection as a direct set prediction problem using a transformer encoder-decoder architecture with bipartite matching, eliminating the need for anchors, non-maximum suppression, and most hand-designed components.
· Fast R-CNN and Faster R-CNN (5 min): Fast R-CNN shares convolutional computation across all proposals via RoI pooling and trains end-to-end, while Faster R-CNN replaces external proposals with a learned Region Proposal Network (RPN) to achieve near-real-time detection at ~5 FPS.
· Feature Pyramid Network (4 min): Feature Pyramid Networks (FPN) build a multi-scale feature hierarchy by combining top-down semantically strong features with bottom-up spatially precise features through lateral connections, enabling robust detection of objects at all sizes.
· Focal Loss (5 min): Focal loss down-weights the contribution of easy, well-classified examples during training by applying a modulating factor $(1 - p_t)^\gamma$, solving the extreme foreground-background class imbalance that limits single-stage detector accuracy.
· Intersection over Union (4 min): Intersection over Union (IoU) measures the overlap between two bounding boxes as the ratio of their intersection area to their union area, serving as the universal metric for evaluating localization quality in object detection.
· Multi-Scale Detection (6 min): Multi-scale detection addresses the challenge of recognizing objects that vary enormously in size (from a few pixels to thousands) within a single image, using strategies ranging from image pyramids to feature pyramids to scale-aware architectures.
· Non-Maximum Suppression (5 min): Non-maximum suppression (NMS) is a greedy post-processing algorithm that removes duplicate detections by iteratively keeping the highest-scoring box and discarding all boxes that overlap with it above an IoU threshold. (A short sketch follows this list.)
· R-CNN (4 min): Region-based Convolutional Neural Network (R-CNN) applies a deep CNN to each of ~2,000 region proposals independently, achieving a dramatic leap in detection accuracy while being prohibitively slow at 47 seconds per image.
· Sliding Window and Region Proposals (4 min): Sliding windows exhaustively scan every location and scale in an image, while region proposals intelligently suggest a small subset of likely object locations to dramatically reduce computation.
· SSD (Single Shot MultiBox Detector) (5 min): SSD performs object detection in a single forward pass by predicting bounding boxes and class scores from multiple convolutional feature maps at different scales, achieving 59 FPS with accuracy competitive with two-stage detectors.
· YOLO (You Only Look Once) (5 min): YOLO frames object detection as a single regression problem from image pixels to bounding box coordinates and class probabilities, enabling real-time detection by processing the entire image in one pass through the network.
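The IoU and NMS lessons are short enough to show end to end; below is an illustrative NumPy sketch (the boxes, scores, and 0.5 threshold are made-up values) that computes IoU for boxes in (x1, y1, x2, y2) format and then runs greedy suppression.

```python
import numpy as np

def iou(a, b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the best-scoring box, drop overlapping boxes, repeat."""
    order = np.argsort(scores)[::-1].tolist()
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]], float)
scores = np.array([0.9, 0.8, 0.75])
print(nms(boxes, scores))                        # [0, 2]: the near-duplicate box 1 is suppressed
```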
Image Segmentation
· Conditional Random Fields (6 min): Conditional Random Fields (CRFs) are probabilistic graphical models used as post-processing for segmentation networks, enforcing spatial consistency and refining noisy pixel-level predictions into sharp, boundary-respecting outputs.
· DeepLab and Atrous Convolution (5 min): DeepLab uses dilated (atrous) convolutions and Atrous Spatial Pyramid Pooling (ASPP) to expand the receptive field without reducing spatial resolution, achieving dense prediction with multi-scale context.
· Fully Convolutional Networks (5 min): Fully Convolutional Networks (FCNs) replace the fully connected layers of classification CNNs with convolutional layers, enabling dense, pixel-wise prediction on inputs of arbitrary spatial size.
· Instance Segmentation (5 min): Instance segmentation combines object detection and semantic segmentation to produce pixel-level masks for each *individual* object instance in an image, distinguishing between separate objects of the same class.
· Mask R-CNN (6 min): Mask R-CNN extends Faster R-CNN with a parallel mask prediction branch and introduces RoIAlign for pixel-accurate feature extraction, establishing the dominant framework for instance segmentation.
· Panoptic Segmentation (5 min): Panoptic segmentation unifies semantic segmentation and instance segmentation into a single coherent output, assigning every pixel both a class label and an instance ID -- covering both "stuff" (amorphous regions) and "things" (countable objects).
· Segment Anything (6 min): The Segment Anything Model (SAM) is a foundation model for image segmentation trained on over 1 billion masks, capable of zero-shot, promptable segmentation of any object in any image without task-specific fine-tuning.
· Semantic Segmentation (4 min): Semantic segmentation assigns a class label to every pixel in an image, producing a dense prediction map that tells you *what* is at each spatial location. (A short sketch follows this list.)
· U-Net (5 min): U-Net is a symmetric encoder-decoder architecture with skip connections that concatenate encoder features to decoder layers, enabling precise localization from very few training images -- particularly dominant in medical image segmentation.
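A tiny sketch of the dense-prediction step that the Semantic Segmentation and Fully Convolutional Networks lessons describe (the class count, spatial size, and random "logits" are stand-ins for real network output): per-pixel class scores become a label map via an argmax over the class axis.

```python
import numpy as np

num_classes, H, W = 3, 4, 6
logits = np.random.randn(num_classes, H, W)      # stand-in for a network's per-pixel class scores
label_map = logits.argmax(axis=0)                # (H, W) map: the winning class at each pixel
print(label_map.shape)                           # (4, 6)

mask = label_map == 2                            # binary mask for one class falls out for free
print(int(mask.sum()), "pixels assigned to class 2")
```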
Generative Models
· Autoencoders and VAEs (5 min): Autoencoders learn compressed latent representations by encoding inputs and reconstructing them, while Variational Autoencoders add a probabilistic structure that enables principled generation of new data.
· Diffusion Models (5 min): Diffusion models generate images by learning to reverse a gradual noising process, iteratively denoising random Gaussian noise into coherent images, and have dethroned GANs as the dominant paradigm for image synthesis. (A short sketch follows this list.)
· GAN Training Dynamics (5 min): Training GANs is notoriously unstable due to the adversarial minimax objective, with mode collapse and oscillation as primary failure modes, mitigated by architectural and loss function innovations.
· Generative Adversarial Networks (5 min): GANs pit a generator network against a discriminator network in a minimax game, producing remarkably realistic synthetic images when the two reach equilibrium.
· Image Inpainting (6 min): Image inpainting fills in missing or masked regions of an image with plausible content, using contextual reasoning from surrounding pixels through techniques ranging from partial convolutions to diffusion-based generation.
· Image Super-Resolution (5 min): Image super-resolution recovers high-resolution detail from low-resolution inputs, evolving from simple CNN upscaling (SRCNN) through GAN-based perceptual methods (SRGAN) to robust real-world models (Real-ESRGAN).
· Image-to-Image Translation (5 min): Image-to-image translation learns mappings between visual domains -- from sketches to photos, day to night, horses to zebras -- using paired supervision (Pix2Pix) or unpaired cycle-consistency constraints (CycleGAN).
· Latent Diffusion and Stable Diffusion (5 min): Latent diffusion models run the diffusion process in a compressed latent space rather than pixel space, dramatically reducing computational cost and enabling practical high-resolution text-to-image generation as realized in Stable Diffusion.
· Neural Style Transfer (6 min): Neural style transfer separates the content and style of images using CNN feature representations -- content captured by activation patterns, style captured by Gram matrices -- enabling artistic rendering of photographs in the style of any painting.
· StyleGAN (6 min): StyleGAN introduces a style-based generator architecture that injects learned styles at each resolution through adaptive instance normalization, enabling unprecedented control over face synthesis at $1024 \times 1024$ resolution.
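A small sketch of the forward (noising) half of the process the Diffusion Models lesson refers to, using the standard closed form $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$ with an assumed linear beta schedule and a toy image; the learned reverse (denoising) direction is not shown.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)               # assumed linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)              # fraction of the clean signal that survives step t

def noise_image(x0, t, rng=np.random.default_rng(0)):
    """x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = np.random.rand(32, 32)                      # toy "image" in [0, 1]
for t in (0, 250, 999):
    xt = noise_image(x0, t)
    print(t, round(float(alpha_bar[t]), 4))      # signal fraction shrinks toward 0 as t grows
```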
Vision Transformers
· Attention Mechanisms in Vision (7 min): Applying self-attention to images requires careful handling of 2D spatial structure, patch size tradeoffs, and the quadratic cost of attention over thousands of visual tokens -- design choices that fundamentally shape every vision Transformer.
· Data-Efficient Image Transformers (DeiT) (4 min): DeiT demonstrates that Vision Transformers can be trained competitively on ImageNet-1K alone -- without hundreds of millions of private images -- by using knowledge distillation from a CNN teacher and aggressive data augmentation.
· DINO (Self-Distillation with No Labels) (5 min): DINO trains a Vision Transformer through self-distillation -- a student network learns to match the output of a momentum-updated teacher network on different augmented views of the same image -- producing features that exhibit emergent object segmentation without any labels.
· Hybrid CNN-Transformer Architectures (5 min): Hybrid models use CNN layers for early-stage local feature extraction and Transformer layers for later-stage global reasoning, combining the inductive biases of convolutions with the flexibility of self-attention.
· Masked Image Modeling (6 min): Masked Image Modeling (MIM) pre-trains vision Transformers by masking a large portion of image patches and training the model to reconstruct the missing content -- either as discrete visual tokens (BEiT) or raw pixels (MAE).
· Swin Transformer (5 min): The Swin Transformer computes self-attention within local windows and shifts those windows between layers to achieve hierarchical feature maps and linear computational complexity with respect to image size.
· Vision Transformer (ViT) (5 min): The Vision Transformer splits an image into fixed-size patches, treats each patch as a token, and processes the sequence with a standard Transformer encoder to perform image classification. (A short sketch follows this list.)
· Vision Transformer Scaling (6 min): Vision Transformers follow predictable scaling laws where performance improves log-linearly with compute and data, but they require substantially more training data than CNNs to reach their potential -- a threshold that, once crossed, allows ViTs to decisively overtake convolutional models.
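The patch-splitting step from the Vision Transformer (ViT) lesson in a few lines of NumPy: a 224x224 image with 16-pixel patches (a common ViT configuration, assumed here) becomes 196 tokens of dimension 768. The learned linear projection and position embeddings that follow in a real ViT are omitted.

```python
import numpy as np

img = np.random.rand(224, 224, 3)                        # toy RGB image
P = 16                                                   # patch size
H, W, C = img.shape
patches = img.reshape(H // P, P, W // P, P, C)           # carve the image into a grid of patches
patches = patches.transpose(0, 2, 1, 3, 4)               # (14, 14, 16, 16, 3): patch grid, then pixels
tokens = patches.reshape(-1, P * P * C)                  # (196, 768): one flat vector per patch
print(tokens.shape)
```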
Video Understanding
· 3D Convolutions (6 min): 3D convolutions extend standard 2D spatial filters with a temporal dimension, enabling neural networks to learn spatiotemporal features directly from raw video clips.
· Action Recognition (6 min): Action recognition classifies human activities in video clips, evolving from hand-crafted features through two-stream CNNs and 3D convolutions to transformer-based models evaluated on benchmarks like Kinetics, UCF-101, and HMDB-51.
· Optical Flow Estimation (6 min): Optical flow estimation computes dense per-pixel motion vectors between consecutive video frames, evolving from variational energy minimization to learned architectures like FlowNet, PWC-Net, and RAFT.
· Two-Stream Networks (5 min): Two-stream networks process video through parallel spatial (RGB) and temporal (optical flow) pathways, fusing their predictions to capture both appearance and motion for action recognition.
· Video Generation (7 min): Video generation extends image synthesis to the temporal domain, using diffusion models or autoregressive approaches to produce temporally coherent frame sequences while battling flickering, motion artifacts, and immense computational costs.
· Video Object Tracking (6 min): Video object tracking localizes a target object across video frames, encompassing single-object tracking (SOT) with template matching and multi-object tracking (MOT) with detection-and-association pipelines.
· Video Representation (5 min): Video representation converts raw video into structured tensors suitable for neural networks through frame stacking, temporal differencing, and clip sampling strategies. (A short sketch follows this list.)
· Video Transformers (6 min): Video transformers apply self-attention to spatiotemporal tokens extracted from video, achieving strong accuracy but facing a quadratic cost challenge that demands factorized attention strategies.
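A minimal sketch of the clip-sampling strategy mentioned in the Video Representation lesson (the frame count, resolution, and 8-frame clip length are arbitrary choices for illustration): pick a fixed number of frames uniformly across the video and stack them into one tensor.

```python
import numpy as np

def sample_clip(video, num_frames=8):
    """Pick num_frames evenly spaced frames and stack them into a (T, H, W, C) clip."""
    idx = np.linspace(0, len(video) - 1, num_frames).round().astype(int)
    return video[idx]

video = np.random.rand(120, 112, 112, 3)         # 120-frame toy video
clip = sample_clip(video)
print(clip.shape)                                # (8, 112, 112, 3): input for a 3D CNN or video transformer
```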
3D Vision
· 3D Gaussian Splatting (5 min): 3D Gaussian Splatting represents scenes as collections of learnable 3D Gaussian primitives that are rasterized via differentiable tile-based splatting, achieving NeRF-quality novel views at real-time rendering speeds exceeding 100 FPS.
· 3D Object Detection (5 min): 3D object detection localizes objects with oriented 3D bounding boxes (x, y, z, width, height, length, yaw) from LiDAR point clouds, camera images, or fused sensor inputs.
· 3D Reconstruction (7 min): 3D reconstruction recovers the shape and appearance of objects or scenes from sensor observations, producing explicit representations (meshes, voxel grids, point clouds) or neural implicit surfaces (signed distance functions, occupancy fields).
· Depth Estimation (5 min): Depth estimation recovers per-pixel distance from the camera to the scene, either from a single image (monocular) or from stereo image pairs, enabling 3D understanding from 2D observations.
· Multi-View Geometry (6 min): Multi-view geometry provides the mathematical framework for relating 2D image observations from multiple cameras to 3D scene structure, grounded in epipolar geometry, the fundamental matrix, and triangulation.
· Neural Radiance Fields (NeRF) (5 min): NeRF represents a 3D scene as a continuous volumetric function, implemented by an MLP that maps 5D coordinates (position + viewing direction) to color and density, enabling photorealistic novel view synthesis.
· Point Cloud Processing (5 min): Point cloud processing handles unordered sets of 3D points acquired from LiDAR, depth cameras, or photogrammetry, using specialized data structures and algorithms for efficient spatial reasoning.
· PointNet (5 min): PointNet consumes raw, unordered 3D point clouds directly via shared MLPs and a symmetric max-pooling function, bypassing the need for voxelization or mesh conversion. (A short sketch follows this list.)
· SLAM (Simultaneous Localization and Mapping) (6 min): SLAM simultaneously estimates a sensor's pose (localization) and builds a map of the environment (mapping), solving the chicken-and-egg problem where you need a map to localize and a location to map.
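A toy illustration of the PointNet property named in the list above: because the per-point transform is shared and the pooling is a symmetric max, the global feature does not change when the points are shuffled. The single random linear layer stands in for the real shared MLP; nothing here is trained.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 64))                 # stand-in for a shared per-point MLP

def global_feature(points):
    per_point = np.maximum(points @ W, 0.0)      # same weights applied to every point, then ReLU
    return per_point.max(axis=0)                 # symmetric max-pool over the point set

cloud = rng.standard_normal((1024, 3))           # toy point cloud
shuffled = cloud[rng.permutation(len(cloud))]
print(np.allclose(global_feature(cloud), global_feature(shuffled)))   # True: order-invariant
```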
Multimodal and Foundation Models
· CLIP (Contrastive Language-Image Pretraining) (5 min): CLIP learns a shared embedding space for images and text by training on 400 million image-text pairs with a contrastive objective, enabling zero-shot visual recognition without task-specific fine-tuning.
· DINOv2 (5 min): DINOv2 is a family of self-supervised Vision Transformers trained by Meta with distillation at scale on 142 million curated images, producing visual features that match or surpass supervised pretraining across diverse downstream tasks without fine-tuning.
· Grounding DINO (6 min): Grounding DINO combines the DINO detection Transformer with grounded language pretraining to perform open-set object detection, localizing objects in images from arbitrary text descriptions without being limited to predefined categories.
· Image Captioning (5 min): Image captioning generates natural language descriptions of images using encoder-decoder architectures that attend to visual regions, evolving from CNN-LSTM models to modern multimodal LLMs like LLaVA and GPT-4V.
· Open-Vocabulary Detection (5 min): Open-vocabulary detection extends object detection beyond fixed label sets by conditioning on arbitrary text queries, enabling detection of any object category described in natural language.
· Text-to-Image Generation (6 min): Text-to-image generation synthesizes photorealistic or artistic images from natural language prompts using diffusion models guided by vision-language embeddings, with DALL-E, Stable Diffusion, and Midjourney as leading systems.
· Vision Foundation Models (6 min): Vision foundation models are large-scale, general-purpose visual backbones -- trained on broad data with self-supervised or language-supervised objectives -- that transfer to a wide range of downstream tasks without task-specific architecture changes.
· Visual Question Answering (VQA) (6 min): Visual question answering requires models to answer free-form natural language questions about images, demanding joint reasoning over visual content and linguistic structure.
· Zero-Shot Classification (5 min): Zero-shot classification recognizes visual categories never seen during training by using natural language descriptions as class prototypes in a shared vision-language embedding space. (A short sketch follows this list.)
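A sketch of the zero-shot recipe the CLIP and Zero-Shot Classification lessons describe, with random vectors standing in for the actual image and text encoders: embed the image and one prompt per class, normalize, and pick the class with the highest cosine similarity. The prompts, dimensionality, and embeddings are all placeholders.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
image_emb = normalize(rng.standard_normal(512))                 # stand-in for an image-encoder output
class_names = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = normalize(rng.standard_normal((3, 512)))            # stand-ins for text-encoder outputs

similarities = text_embs @ image_emb                            # cosine similarity between unit vectors
print(class_names[int(similarities.argmax())])                  # predicted class, no fine-tuning involved
```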
Applications and Deployment
· Anomaly Detection (6 min): Visual anomaly detection learns the distribution of "normal" images and flags deviations, using methods like PatchCore and student-teacher networks to detect manufacturing defects without requiring defect examples during training.
· Autonomous Driving Perception (5 min): Autonomous driving perception fuses cameras, LiDAR, and radar to build a real-time 3D understanding of the driving environment, using Bird's-Eye View representations and increasingly end-to-end architectures.
· Edge Deployment (5 min): Edge deployment runs computer vision models on mobile phones, embedded devices, and microcontrollers by applying quantization, pruning, and compiler optimizations to meet strict latency and power budgets. (A short sketch follows this list.)
· Face Detection and Recognition (5 min): Face detection locates faces in images while face recognition maps them to identities, evolving from Viola-Jones cascades to deep embedding models like ArcFace that achieve >99.8% verification accuracy.
· Image Classification in Practice (5 min): Deploying image classification beyond academic benchmarks requires handling class imbalance, domain shift, fine-grained distinctions, and serving millions of predictions per second.
· Image Retrieval (5 min): Image retrieval finds visually similar images in a database by encoding images as compact embedding vectors and performing approximate nearest neighbor search, powered by metric learning and contrastive losses.
· Medical Image Analysis (5 min): Medical image analysis applies computer vision to radiology, pathology, and ophthalmology, where U-Net architectures dominate segmentation, data is scarce and 3D, and regulatory approval (FDA/CE) gates deployment.
· OCR and Document Understanding (5 min): Optical Character Recognition (OCR) detects and recognizes text in images, while document understanding extends this to parsing layouts, tables, and semantic structure for automated information extraction.
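One of the edge-deployment levers named in the Edge Deployment lesson, post-training quantization, in a few lines: symmetric per-tensor int8 with invented weight values. Real toolchains add per-channel scales and calibration data; this is only a sketch of the core arithmetic.

```python
import numpy as np

def quantize_int8(w):
    """Map float weights to int8 with a single symmetric scale factor."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = (np.random.randn(256, 256) * 0.05).astype(np.float32)   # toy weight matrix
q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale                         # dequantize to inspect the error
print("max abs error:", float(np.abs(w - w_hat).max()))      # bounded by roughly scale / 2
```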
Evaluation and Datasets
· Benchmark Leaderboards (6 min): Benchmark leaderboards -- tracked by Papers With Code, COCO, and ImageNet evaluation servers -- standardize model comparison, drive competitive progress, and shape research priorities, but also introduce biases toward benchmark-specific optimization.
· Classification Metrics (5 min): Classification metrics -- accuracy, precision, recall, F1, and their variants -- quantify model performance from different angles, with the choice of metric depending on class balance, error costs, and deployment context. (A worked example follows this list.)
· Detection Metrics (6 min): Object detection evaluation uses mean Average Precision (mAP), computed over precision-recall curves at various IoU thresholds, with the COCO protocol (AP@[.50:.05:.95]) as the standard benchmark.
· Generative Model Metrics (6 min): Generative model quality is measured by FID (distribution distance, lower is better), Inception Score (diversity and quality), CLIP Score (text-image alignment), LPIPS (perceptual similarity), and KID (unbiased small-sample alternative to FID).
· Landmark Datasets (6 min): Landmark datasets -- ImageNet (1.2M images, 1K classes), COCO (330K images, 80 categories), Pascal VOC, ADE20K, Cityscapes, and Open Images -- define the benchmarks that drive computer vision progress and shape architectural design decisions.
· Segmentation Metrics (6 min): Segmentation is evaluated using mean Intersection over Union (mIoU) for semantic tasks, Dice/F1 for medical imaging, pixel accuracy for basic assessment, and Panoptic Quality (PQ = SQ × RQ) for unified panoptic evaluation.
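A worked example (with toy counts, not real results) of why the Classification Metrics lesson stresses class balance: on a rare-defect problem, accuracy looks excellent while precision, recall, and F1 reveal the actual detection quality.

```python
# Toy confusion-matrix counts for a rare "defect" class: 90 true defects among 1,000 images.
tp, fp, fn, tn = 80, 20, 10, 890

precision = tp / (tp + fp)                       # of the images flagged, how many were real defects
recall = tp / (tp + fn)                          # of the real defects, how many were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# accuracy = 0.97 only because negatives dominate; f1 = 0.84 is the more honest summary.
```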