Obesity Level Classification and Analysis

Project Overview

This project analyzes obesity levels using multiple statistical and machine learning techniques on a dataset containing demographic, dietary, and lifestyle information. The analysis implements dimensionality reduction, classification algorithms, and clustering methods to understand and predict obesity categories.

Methodology

1. Data Visualization

Initial exploratory analysis of numerical variables:

Age, Height, Weight: demographic and anthropometric characteristics
FCVC (frequency of vegetable consumption), NCP (number of main meals), CH2O (water consumption): dietary habits
FAF (physical activity frequency), TUE (time using technology): behavioral habits

Distribution histograms and pairwise correlation plots were generated to identify relationships between variables and obesity levels.
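
The exploratory step could look roughly like the sketch below. It uses base R only and a small synthetic data frame as a stand-in for the real dataset; the object name `obesity`, the label column `ObesityLevel`, and all generated values are assumptions made for illustration, not the project's actual data or code.

```r
# Synthetic stand-in for the real data (assumption: same numerical column names).
set.seed(1)
n <- 200
obesity <- data.frame(
  Age    = runif(n, 16, 60),
  Height = rnorm(n, 1.70, 0.09),
  Weight = rnorm(n, 85, 25),
  FCVC   = runif(n, 1, 3),
  NCP    = runif(n, 1, 4),
  CH2O   = runif(n, 1, 3),
  FAF    = runif(n, 0, 3),
  TUE    = runif(n, 0, 2),
  # Hypothetical label column; the real dataset may name it differently.
  ObesityLevel = factor(sample(c("Normal", "Overweight", "Obesity"), n, replace = TRUE))
)
num_vars <- c("Age", "Height", "Weight", "FCVC", "NCP", "CH2O", "FAF", "TUE")

pdf(NULL)  # discard plots in non-interactive runs; drop this line to view them

# One histogram per numerical variable.
op <- par(mfrow = c(2, 4))
for (v in num_vars) hist(obesity[[v]], main = v, xlab = v)
par(op)

# Pairwise scatter plots, colored by obesity level.
pairs(obesity[num_vars], col = as.integer(obesity$ObesityLevel), pch = 19, cex = 0.5)
invisible(dev.off())

# Pairwise Pearson correlations between the numerical variables.
cor_mat <- cor(obesity[num_vars])
print(round(cor_mat, 2))
```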

2. Principal Component Analysis (PCA)

PCA was applied to reduce dimensionality while preserving variance:

Numerical variables were standardized
Categorical variables (Gender, family_history_with_overweight, FAVC, CAEC, SMOKE, SCC, CALC, MTRANS) were treated as supplementary qualitative variables
Eight principal components were extracted, one per standardized numerical variable
A scree plot showed the variance explained by each component
Individual projections were colored by obesity level to reveal separation patterns
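
The PCA steps above can be sketched with base R's `prcomp`. This is a swapped-in approach rather than necessarily the project's own tooling, and `prcomp` has no built-in notion of supplementary variables, so the sketch projects one categorical variable afterwards as per-category centroids. The data are synthetic, and `obesity`, `ObesityLevel`, and the generated values are illustrative assumptions.

```r
# Synthetic stand-in data (assumption: values are made up for illustration).
set.seed(42)
n <- 150
level <- factor(sample(c("Normal", "Overweight", "Obesity"), n, replace = TRUE),
                levels = c("Normal", "Overweight", "Obesity"))
shift <- as.integer(level)
obesity <- data.frame(
  Age    = runif(n, 16, 60),
  Height = rnorm(n, 1.70, 0.09),
  Weight = rnorm(n, 55 + 15 * shift, 8),
  FCVC   = runif(n, 1, 3),
  NCP    = runif(n, 1, 4),
  CH2O   = runif(n, 1, 3),
  FAF    = runif(n, 0, 3),
  TUE    = runif(n, 0, 2),
  Gender = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  ObesityLevel = level  # hypothetical label column name
)
num_vars <- c("Age", "Height", "Weight", "FCVC", "NCP", "CH2O", "FAF", "TUE")

# PCA on centered, unit-variance numerical variables only.
pca <- prcomp(obesity[num_vars], center = TRUE, scale. = TRUE)

# Proportion of variance explained by each of the eight components.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

pdf(NULL)  # discard plots in non-interactive runs
# Scree plot.
barplot(var_explained, names.arg = paste0("PC", seq_along(var_explained)),
        ylab = "Proportion of variance")
# Individuals on the first two components, colored by obesity level.
plot(pca$x[, 1:2], col = as.integer(obesity$ObesityLevel), pch = 19)
invisible(dev.off())

# Supplementary qualitative variable: it does not influence the axes,
# it is only summarized by the mean position of each category.
gender_centroids <- aggregate(as.data.frame(pca$x[, 1:2]),
                              by = list(Gender = obesity$Gender), FUN = mean)
print(gender_centroids)
```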

3. Linear Discriminant Analysis (LDA)

LDA was implemented for multiclass classification:

Only numerical variables were retained for the model
Data split: 70% training, 30% testing
Projections onto the first two discriminant functions (LD1 and LD2) were plotted to visualize class separation
Model performance was evaluated on the test set using a confusion matrix and overall accuracy
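
A minimal sketch of this workflow, assuming the `MASS` package (shipped with standard R installations) is what provides LDA. The data frame is synthetic with only three classes and four predictors, and the names `obesity` and `ObesityLevel` are hypothetical.

```r
library(MASS)

# Synthetic stand-in data (assumption: three classes, made-up values).
set.seed(123)
n <- 300
level <- factor(sample(c("Normal", "Overweight", "Obesity"), n, replace = TRUE),
                levels = c("Normal", "Overweight", "Obesity"))
obesity <- data.frame(
  Age    = runif(n, 16, 60),
  Height = rnorm(n, 1.70, 0.09),
  Weight = rnorm(n, 55 + 15 * as.integer(level), 8),
  FAF    = runif(n, 0, 3),
  ObesityLevel = level  # hypothetical label column name
)

# 70% training / 30% testing split.
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- obesity[train_idx, ]
test  <- obesity[-train_idx, ]

# Fit LDA on the numerical predictors and predict the held-out rows.
fit  <- lda(ObesityLevel ~ ., data = train)
pred <- predict(fit, newdata = test)

# Confusion matrix and overall accuracy on the test set.
conf_mat <- table(Predicted = pred$class, Actual = test$ObesityLevel)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(conf_mat)

# Test observations projected onto the first two discriminant functions.
pdf(NULL)  # discard plots in non-interactive runs
plot(pred$x[, "LD1"], pred$x[, "LD2"],
     col = as.integer(test$ObesityLevel), pch = 19, xlab = "LD1", ylab = "LD2")
invisible(dev.off())
```

With K classes LDA yields at most K - 1 discriminant functions, so the toy data give exactly two, while the seven-level problem would give up to six, of which LD1 and LD2 are the first two.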

4. Naive Bayes Classifier

A probabilistic classifier that assumes features are conditionally independent given the class:

Age, Height, and Weight were kept as continuous variables and normalized
The remaining numerical variables (FCVC, NCP, CH2O, FAF, TUE) carried excessive decimal precision and were rounded so they could be treated as categorical
All non-continuous variables were converted to factors
Model trained on 70% of data, tested on remaining 30%
Accuracy calculated from confusion matrix
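
The preprocessing described above can be sketched in base R as follows. Several things here are assumptions: the data are synthetic, `obesity` and `ObesityLevel` are hypothetical names, "normalized" is read as z-score standardization, and the classifier is taken from the `e1071` package, which is one common choice and is only used if it happens to be installed.

```r
# Synthetic stand-in data (assumption: made-up values, subset of columns).
set.seed(7)
n <- 200
obesity <- data.frame(
  Age    = runif(n, 16, 60),
  Height = rnorm(n, 1.70, 0.09),
  Weight = rnorm(n, 85, 25),
  FCVC   = runif(n, 1, 3),
  NCP    = runif(n, 1, 4),
  CH2O   = runif(n, 1, 3),
  FAF    = runif(n, 0, 3),
  TUE    = runif(n, 0, 2),
  SMOKE  = sample(c("yes", "no"), n, replace = TRUE),
  ObesityLevel = sample(c("Normal", "Overweight", "Obesity"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
continuous <- c("Age", "Height", "Weight")
to_round   <- c("FCVC", "NCP", "CH2O", "FAF", "TUE")

# Round the fractional scores and treat the results as categories.
obesity[to_round] <- lapply(obesity[to_round], function(x) factor(round(x)))

# Standardize the three variables that stay continuous (assumed z-scores).
obesity[continuous] <- lapply(obesity[continuous], function(x) as.numeric(scale(x)))

# Everything else, including the label, becomes a factor.
other <- setdiff(names(obesity), c(continuous, to_round))
obesity[other] <- lapply(obesity[other], factor)

# 70% training / 30% testing split.
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- obesity[train_idx, ]
test  <- obesity[-train_idx, ]

# Train and evaluate only if e1071 is available (assumed classifier package).
if (requireNamespace("e1071", quietly = TRUE)) {
  nb   <- e1071::naiveBayes(ObesityLevel ~ ., data = train)
  pred <- predict(nb, newdata = test)
  conf_mat <- table(Predicted = pred, Actual = test$ObesityLevel)
  accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
  print(conf_mat)
}
```

Converting to factors before splitting keeps the factor levels identical in the training and test sets, which avoids unseen-level problems at prediction time.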

5. K-Means Clustering

Unsupervised learning to discover natural groupings:

All categorical variables encoded numerically
Optimal cluster number determined using:
- Elbow method: identifies diminishing returns in within-cluster sum of squares
- Silhouette method: measures cluster cohesion and separation
K-Means applied for k=2 and k=3 clusters
Results visualized via PCA projection and pairwise variable plots
Clusters compared against actual obesity levels to assess alignment
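
The clustering workflow above might be sketched as follows, using base R's `kmeans` and `silhouette` from the `cluster` package (shipped with standard R installations). The data are synthetic, the names `obesity` and `level` are hypothetical, and scaling the encoded variables before clustering is an added assumption that the text does not state.

```r
library(cluster)

# Synthetic stand-in data (assumption: made-up values, few columns).
set.seed(2024)
n <- 180
level <- factor(sample(c("Normal", "Overweight", "Obesity"), n, replace = TRUE),
                levels = c("Normal", "Overweight", "Obesity"))
obesity <- data.frame(
  Height = rnorm(n, 1.70, 0.09),
  Weight = rnorm(n, 55 + 15 * as.integer(level), 8),
  FAF    = runif(n, 0, 3),
  Gender = factor(sample(c("Female", "Male"), n, replace = TRUE))
)

# Encode categorical variables as integer codes, then scale (assumed step).
X <- obesity
X[] <- lapply(X, function(col) if (is.factor(col)) as.integer(col) else col)
X <- scale(X)

# Elbow method: total within-cluster sum of squares for k = 1..6.
wss <- sapply(1:6, function(k) kmeans(X, centers = k, nstart = 10)$tot.withinss)

# Silhouette method: average silhouette width for k = 2..6.
dist_mat <- dist(X)
avg_sil <- sapply(2:6, function(k) {
  cl <- kmeans(X, centers = k, nstart = 10)$cluster
  mean(silhouette(cl, dist_mat)[, "sil_width"])
})

# Final models for k = 2 and k = 3.
km2 <- kmeans(X, centers = 2, nstart = 10)
km3 <- kmeans(X, centers = 3, nstart = 10)

# Cross-tabulate clusters against the known labels to assess alignment.
alignment <- table(Cluster = km3$cluster, Level = level)
print(alignment)
```

In the elbow curve one looks for the k after which `wss` stops dropping sharply, while the silhouette criterion favors the k with the largest average width.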

Risk Factors Analysis (from PDF)

The analysis identifies key obesity risk factors across multiple dimensions:

Dietary Factors:

High calorie food consumption (FAVC)
Low vegetable intake (FCVC)
Eating patterns (CAEC, NCP)
Water consumption (CH2O)

Behavioral Factors:

Physical activity frequency (FAF)
Technology usage time (TUE)
Smoking habits (SMOKE)
Calorie monitoring (SCC)

Genetic/Demographic:

Family history of overweight
Age, gender, height, weight

These factors collectively contribute to obesity classification across seven levels: Insufficient Weight, Normal Weight, Overweight Level I-II, and Obesity Type I-III.

Results Summary

PCA: Effectively reduced dimensionality while maintaining class distinctions
LDA: Provided strong classification accuracy with clear discriminant functions
Naive Bayes: Achieved competitive performance despite its independence assumption
Clustering: Revealed natural groupings partially aligned with obesity categories

Obesity Risk Factor Analysis in R