Mathematics for AI

Mathematical derivation and practical implementation of core ML algorithms, dimensionality reduction, and advanced Vision-Language Models.

March 1, 2025 → July 1, 2025

Introduction

This project brings together extensive research and hands-on labs on the mathematical foundations of Artificial Intelligence. It bridges the gap between theoretical mathematics (Linear Algebra, Calculus, Probability) and practical implementation in Machine Learning and Deep Learning.

The research spans from fundamental dimensionality reduction techniques to state-of-the-art multi-modal Vision-Language Models.


Part 1: PCA and Clustering (Lab 2)

We implemented Principal Component Analysis (PCA) from scratch to understand the mathematics behind dimensionality reduction before applying clustering algorithms.

Mathematical Foundation of PCA

PCA transforms data from its original space to a new space with uncorrelated principal components, retaining the maximum variance.

  1. Z-score Standardization: We first normalize the data: $Z = \frac{X - \mu}{\sigma}$
  2. Covariance Matrix: We compute the covariance matrix $A \in \mathbb{R}^{d \times d}$ to understand the relationships between variables: $A = \frac{1}{N} X_{\text{centered}}^T X_{\text{centered}}$
  3. Eigen Decomposition: We find the eigenvalues $\lambda_i$ and eigenvectors $v_i$ of $A$, where $A v_i = \lambda_i v_i$. The eigenvectors corresponding to the largest eigenvalues are the principal components (see the sketch after this list).
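A minimal NumPy sketch of these three steps, assuming a data matrix `X` of shape `(N, d)`; the function name and variable names are ours for illustration, not the lab's actual code:

```python
import numpy as np

def pca_fit(X, k):
    """Project X (N x d) onto its top-k principal components."""
    # 1. Z-score standardization
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    Z = (X - mu) / sigma

    # 2. Covariance matrix A = (1/N) Z^T Z, shape (d, d)
    A = (Z.T @ Z) / Z.shape[0]

    # 3. Eigen decomposition; eigh handles the symmetric matrix A
    eigvals, eigvecs = np.linalg.eigh(A)
    order = np.argsort(eigvals)[::-1]          # sort descending by explained variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Keep the k eigenvectors with the largest eigenvalues
    components = eigvecs[:, :k]
    return Z @ components, eigvals             # projected data, full spectrum
```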
Note (Evaluation Metrics)

We used the Explained Variance Ratio (EVR) and Cumulative Explained Variance Ratio (CEVR) to determine the optimal number of principal components $k$ to retain (usually aiming for CEVR $\ge 0.95$).
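Given the eigenvalue spectrum from the decomposition above, both metrics reduce to a few lines; a small sketch of how $k$ could be chosen against the $0.95$ threshold (the helper name is ours):

```python
import numpy as np

def choose_k(eigvals, threshold=0.95):
    """Smallest k whose cumulative explained variance ratio meets the threshold."""
    evr = eigvals / eigvals.sum()      # explained variance ratio per component
    cevr = np.cumsum(evr)              # cumulative EVR
    # First index where CEVR reaches the threshold, converted to a count
    return int(np.searchsorted(cevr, threshold) + 1)
```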

Clustering Applications

After applying PCA, we clustered the reduced data using two unsupervised learning algorithms: K-Means and Gaussian Mixture Models (GMM).
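A sketch of this pipeline with scikit-learn, assuming `X_reduced` is the PCA-projected data from the step above; the cluster counts, seeds, and placeholder data are ours, not the lab's settings:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# X_reduced: PCA-projected data of shape (N, k); random placeholder for illustration
X_reduced = np.random.randn(500, 10)

# Hard clustering: each point gets exactly one cluster label
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_reduced)

# Soft clustering: GMM models each cluster as a Gaussian and yields probabilities
gmm = GaussianMixture(n_components=2, random_state=0).fit(X_reduced)
gmm_labels = gmm.predict(X_reduced)
gmm_probs = gmm.predict_proba(X_reduced)   # per-cluster membership probabilities
```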

Warning (Results on ABIDE II Dataset)

When applied to the complex ABIDE II dataset (autism brain imaging), K-Means achieved $58.1\%$ accuracy and GMM achieved $56.7\%$. The low F1-scores highlight the limitations of purely mathematical/statistical transformations on highly complex medical data without deep learning architectures such as Graph Convolutional Networks (GCN).


Part 2: Contrastive Language-Image Pretraining (CLIP)

The final project shifted focus to advanced Deep Learning with an in-depth study of OpenAI’s CLIP model, a foundation model that connects computer vision and natural language processing.

The Power of Natural Language Supervision

Unlike traditional models trained on fixed label sets (e.g., ImageNet’s 1000 classes), CLIP is trained on WIT (WebImageText), a dataset of 400 million image-text pairs. It uses natural language supervision, allowing it to learn highly generalized representations and perform zero-shot classification on completely unseen datasets.
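A hedged sketch of zero-shot classification with the open-source `clip` package released alongside the paper; the class names, prompt template, and image path below are placeholders:

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Arbitrary label set the model has never been explicitly trained on
classes = ["cat", "dog", "airplane"]
texts = clip.tokenize([f"a photo of a {c}" for c in classes]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder path

with torch.no_grad():
    logits_per_image, _ = model(image, texts)   # scaled image-text similarities
    probs = logits_per_image.softmax(dim=-1)

print(classes[probs.argmax().item()])           # predicted class, no fine-tuning
```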

Architecture

CLIP uses a dual-encoder architecture:

  1. Image Encoder: either a modified ResNet (with anti-aliased blur pooling and attention pooling) or a Vision Transformer (ViT).
  2. Text Encoder: a decoder-style Transformer with masked self-attention, operating on Byte Pair Encoding (BPE) tokens (see the sketch after this list).
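In code, the dual-encoder design means each modality is embedded independently into a shared space; a brief sketch using the same `clip` package, where the image path and caption are placeholders:

```python
import clip
import torch
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cpu")

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder path
text = clip.tokenize(["a photo of a face"])                 # placeholder caption

with torch.no_grad():
    img_emb = model.encode_image(image)   # ViT-B/32 image encoder -> 512-d vector
    txt_emb = model.encode_text(text)     # Transformer text encoder -> 512-d vector

# Both embeddings live in the same space, so cosine similarity is meaningful
sim = torch.cosine_similarity(img_emb, txt_emb)
```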

Contrastive Learning

Instead of predicting the exact text caption (generative), CLIP uses a Symmetric Cross-Entropy Loss to maximize the cosine similarity between the $N$ correct image-text pairs in a batch while minimizing it for the $N^2 - N$ incorrect pairs.
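The CLIP paper presents this objective as short pseudocode; a minimal PyTorch sketch under the assumption of already-computed embedding batches `img_emb` and `txt_emb` (the temperature value here is illustrative):

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric cross-entropy over an N x N similarity matrix."""
    img_emb = F.normalize(img_emb, dim=-1)        # unit norm: dot product = cosine
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature    # N x N pairwise similarities

    # The N matching pairs lie on the diagonal; everything else is a negative
    targets = torch.arange(len(logits), device=logits.device)
    loss_i = F.cross_entropy(logits, targets)     # image -> text direction
    loss_t = F.cross_entropy(logits.T, targets)   # text -> image direction
    return (loss_i + loss_t) / 2
```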

Solution (Application: Face Recognition)

We applied a pretrained CLIP (ViT-B/32) as a feature extractor combined with FAISS k-NN for face recognition on the Labeled Faces in the Wild dataset. This approach achieved significantly better accuracy and generalization ($0.73$ accuracy) than fine-tuning traditional CNNs such as ResNet or MobileNet, demonstrating the robustness of CLIP’s multi-modal embedding space.
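A hedged sketch of the retrieval step, assuming the gallery of known faces has already been embedded with CLIP; the index type, placeholder data, and `k` are our illustrative choices:

```python
import faiss
import numpy as np

# gallery_emb: CLIP embeddings of known faces (n x 512); query_emb: one probe face
gallery_emb = np.random.randn(100, 512).astype("float32")   # placeholder
query_emb = np.random.randn(1, 512).astype("float32")       # placeholder

faiss.normalize_L2(gallery_emb)       # L2-normalize so inner product = cosine
faiss.normalize_L2(query_emb)

index = faiss.IndexFlatIP(512)        # exact inner-product (cosine) search
index.add(gallery_emb)

scores, neighbors = index.search(query_emb, 5)   # 5 nearest gallery faces
```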

Model Comparisons

We also researched and compared CLIP against similar models: