Obesity Risk Factor Analysis in R

Obesity Level Classification and Analysis

Project Overview

This project analyzes obesity levels using multiple statistical and machine learning techniques on a dataset containing demographic, dietary, and lifestyle information. The analysis implements dimensionality reduction, classification algorithms, and clustering methods to understand and predict obesity categories.

Methodology

1. Data Visualization

Initial exploratory analysis of numerical variables:

  • Age, Height, Weight: demographic characteristics
  • FCVC (frequency of vegetable consumption), NCP (number of main meals), CH2O (water consumption): dietary habits
  • FAF (physical activity frequency), TUE (time using technology): behavioral habits

Distribution histograms and pairwise correlation plots were generated to identify relationships between variables and obesity levels.
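
As a rough illustration of this step, the sketch below draws one histogram per numerical variable and a pairwise plot colored by obesity level, using base R only. The data is synthetic, and the column name `Level` is a placeholder rather than the name used in the real dataset; the original scripts may rely on other plotting packages.

```r
# Synthetic stand-in for the real dataset; `Level` and all values are hypothetical.
set.seed(42)
n <- 300
levels7 <- c("Insufficient Weight", "Normal Weight", "Overweight Level I",
             "Overweight Level II", "Obesity Type I", "Obesity Type II",
             "Obesity Type III")
Level <- factor(sample(levels7, n, replace = TRUE), levels = levels7)
obesity <- data.frame(
  Age    = round(runif(n, 14, 61)),
  Height = rnorm(n, 1.70, 0.09),
  Weight = rnorm(n, 40 + 12 * as.integer(Level), 6),
  FCVC   = sample(1:3, n, replace = TRUE),
  NCP    = sample(1:4, n, replace = TRUE),
  CH2O   = sample(1:3, n, replace = TRUE),
  FAF    = sample(0:3, n, replace = TRUE),
  TUE    = sample(0:2, n, replace = TRUE),
  Level  = Level
)
num_vars <- setdiff(names(obesity), "Level")

# One histogram per numerical variable on a 2 x 4 grid.
old_par <- par(mfrow = c(2, 4))
for (v in num_vars) {
  hist(obesity[[v]], main = v, xlab = v, col = "grey80")
}
par(old_par)

# Pairwise scatter plots, colored by obesity level.
pairs(obesity[num_vars], col = as.integer(obesity$Level), pch = 19, cex = 0.4)

# Matching correlation matrix of the numerical variables.
cor_mat <- cor(obesity[num_vars])
print(round(cor_mat, 2))
```

With the real data, `obesity` would be read from the project's data file instead of being simulated.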

2. Principal Component Analysis (PCA)

PCA was applied to reduce dimensionality while preserving variance:

  • Numerical variables were standardized
  • Categorical variables (Gender, family_history_with_overweight, FAVC, CAEC, SMOKE, SCC, CALC, MTRANS) were treated as supplementary qualitative variables
  • 8 principal components were extracted (one per numerical variable)
  • Scree plot visualized variance explained by each component
  • Individual projections were colored by obesity level to identify separation patterns
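
The steps above can be sketched with base R's `prcomp`. This is an approximation under stated assumptions: the data is synthetic, and the class labels are used only to color the projection, which mimics the role of a supplementary qualitative variable without reproducing it exactly. The wording "supplementary qualitative variables" suggests the project may use a dedicated package such as FactoMineR (whose `PCA` function accepts a `quali.sup` argument), but the source does not confirm this.

```r
# Synthetic numerical data; variable values and class labels are hypothetical.
set.seed(1)
n <- 300
Level <- factor(sample(paste("Level", 1:7), n, replace = TRUE))
X <- data.frame(
  Age = runif(n, 14, 61), Height = rnorm(n, 1.70, 0.09),
  Weight = rnorm(n, 40 + 12 * as.integer(Level), 6),
  FCVC = runif(n, 1, 3), NCP = runif(n, 1, 4), CH2O = runif(n, 1, 3),
  FAF = runif(n, 0, 3), TUE = runif(n, 0, 2)
)

# center = TRUE and scale. = TRUE standardize each variable before the decomposition.
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each of the 8 components.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Scree plot.
barplot(var_explained, names.arg = paste0("PC", seq_along(var_explained)),
        ylab = "Proportion of variance", main = "Scree plot")

# Individuals on the first two components, colored by obesity level.
plot(pca$x[, 1:2], col = as.integer(Level), pch = 19,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(Level), col = seq_along(levels(Level)),
       pch = 19, cex = 0.7)
```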

3. Linear Discriminant Analysis (LDA)

LDA was implemented for multiclass classification:

  • Only numerical variables were retained for the model
  • Data split: 70% training, 30% testing
  • Projection onto discriminant functions (LD1 and LD2) visualized class separation
  • Confusion matrix and accuracy metrics evaluated model performance
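
A minimal sketch of this workflow, assuming `MASS::lda` as the LDA implementation (the source does not name the function it used). The data is synthetic, with class-dependent weights so that the discriminant functions have something to separate, and the column names are placeholders.

```r
library(MASS)  # recommended package shipped with standard R distributions

# Synthetic data; `Level` and all values are hypothetical.
set.seed(7)
n <- 700
Level <- factor(sample(paste("Level", 1:7), n, replace = TRUE))
dat <- data.frame(
  Age = runif(n, 14, 61), Height = rnorm(n, 1.70, 0.09),
  Weight = rnorm(n, 40 + 12 * as.integer(Level), 4),
  FAF = runif(n, 0, 3),
  Level = Level
)

# 70% training / 30% testing split.
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Fit on the training set, predict on the held-out set.
fit  <- lda(Level ~ ., data = train)
pred <- predict(fit, newdata = test)

# Projection of the test set onto the first two discriminant functions.
plot(pred$x[, "LD1"], pred$x[, "LD2"], col = as.integer(test$Level), pch = 19,
     xlab = "LD1", ylab = "LD2")

# Confusion matrix (rows: predicted, columns: actual) and overall accuracy.
conf_mat <- table(Predicted = pred$class, Actual = test$Level)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(conf_mat)
print(accuracy)
```

Accuracy here is the share of test observations on the diagonal of the confusion matrix, so it reflects the synthetic data and says nothing about the project's actual results.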

4. Naive Bayes Classifier

A probabilistic classifier that assumes the features are conditionally independent given the class:

  • Numerical variables stored with excessive precision (FCVC, NCP, CH2O, FAF, TUE) were rounded to integers so that they could be treated as categorical
  • Only Age, Height, and Weight remained as continuous variables (normalized)
  • All other variables converted to factors
  • Model trained on 70% of data, tested on remaining 30%
  • Accuracy calculated from confusion matrix
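
The preprocessing and model fit could look roughly like the sketch below. It assumes the `e1071` package for `naiveBayes` (the source does not say which implementation was used) and reads "normalized" as min-max scaling, which is only one possible interpretation. The data is synthetic and the column names are placeholders.

```r
library(e1071)  # assumed package; must be installed separately

# Synthetic data; `Level` and all values are hypothetical.
set.seed(11)
n <- 700
Level <- factor(sample(paste("Level", 1:7), n, replace = TRUE))
dat <- data.frame(
  Age = runif(n, 14, 61), Height = rnorm(n, 1.70, 0.09),
  Weight = rnorm(n, 40 + 12 * as.integer(Level), 4),
  FCVC = runif(n, 1, 3),  # fractional values standing in for over-precise data
  FAF  = runif(n, 0, 3),
  Gender = sample(c("Female", "Male"), n, replace = TRUE),
  Level = Level
)

# Round the over-precise variables and convert them (and Gender) to factors.
for (v in c("FCVC", "FAF")) dat[[v]] <- factor(round(dat[[v]]))
dat$Gender <- factor(dat$Gender)

# Min-max normalization of the variables kept as continuous (an assumption).
for (v in c("Age", "Height", "Weight")) {
  dat[[v]] <- (dat[[v]] - min(dat[[v]])) / (max(dat[[v]]) - min(dat[[v]]))
}

# 70% training / 30% testing split.
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Factors get per-class frequency tables; numeric columns get per-class Gaussians.
nb   <- naiveBayes(Level ~ ., data = train)
pred <- predict(nb, newdata = test)

conf_mat <- table(Predicted = pred, Actual = test$Level)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(accuracy)
```

Converting to factors before splitting keeps the factor levels identical in the training and test sets.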

5. K-Means Clustering

Unsupervised learning to discover natural groupings:

  • All categorical variables encoded numerically
  • Optimal cluster number determined using:
    • Elbow method: identifies diminishing returns in within-cluster sum of squares
    • Silhouette method: measures cluster cohesion and separation
  • K-Means applied for k=2 and k=3 clusters
  • Results visualized via PCA projection and pairwise variable plots
  • Clusters compared against actual obesity levels to assess alignment
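
The clustering steps can be sketched as follows, using base R's `kmeans` and `silhouette` from the `cluster` package. Several choices here are assumptions not stated in the source: the variables are scaled before clustering, `nstart = 25` is used for stability, candidate values of k run from 1 to 6, and the CAEC level names are illustrative. The data is synthetic, with two weight groups standing in for the actual obesity levels.

```r
library(cluster)  # recommended package bundled with R; provides silhouette()

# Synthetic data with two weight groups; all values and labels are hypothetical.
set.seed(3)
n <- 300
dat <- data.frame(
  Weight = c(rnorm(n / 2, 60, 6), rnorm(n / 2, 110, 8)),
  Height = rnorm(n, 1.70, 0.09),
  CAEC   = sample(c("no", "Sometimes", "Frequently", "Always"), n, replace = TRUE)
)
actual <- factor(rep(c("Lower weight", "Higher weight"), each = n / 2))

# Encode the categorical variable numerically, then scale (scaling is an assumption).
dat$CAEC <- as.integer(factor(dat$CAEC,
                              levels = c("no", "Sometimes", "Frequently", "Always")))
X <- scale(dat)

# Elbow method: total within-cluster sum of squares for each k.
ks  <- 1:6
wss <- sapply(ks, function(k) kmeans(X, centers = k, nstart = 25)$tot.withinss)
plot(ks, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")

# Silhouette method: average silhouette width for each k >= 2.
d   <- dist(X)
sil <- sapply(2:6, function(k) {
  cl <- kmeans(X, centers = k, nstart = 25)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})
names(sil) <- 2:6
print(round(sil, 3))

# Final models for k = 2 and k = 3.
km2 <- kmeans(X, centers = 2, nstart = 25)
km3 <- kmeans(X, centers = 3, nstart = 25)

# PCA projection colored by cluster membership.
pc <- prcomp(X)$x[, 1:2]
plot(pc, col = km2$cluster, pch = 19, xlab = "PC1", ylab = "PC2")

# Cross-tabulation of clusters against the known groups.
print(table(Cluster = km2$cluster, Actual = actual))
```

The cross-tabulation plays the role of the comparison against actual obesity levels: a cluster that concentrates in one column aligns well with that group.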

Risk Factors Analysis (from PDF)

The analysis identifies key obesity risk factors across multiple dimensions:

Dietary Factors:

  • High calorie food consumption (FAVC)
  • Low vegetable intake (FCVC)
  • Eating patterns (CAEC, NCP)
  • Water consumption (CH2O)

Behavioral Factors:

  • Physical activity frequency (FAF)
  • Technology usage time (TUE)
  • Smoking habits (SMOKE)
  • Calorie monitoring (SCC)

Genetic/Demographic:

  • Family history of overweight
  • Age, gender, height, weight

These factors collectively contribute to obesity classification across seven levels: Insufficient Weight, Normal Weight, Overweight Level I and II, and Obesity Type I, II, and III.

Results Summary

  • PCA: Effectively reduced dimensionality while maintaining class distinctions
  • LDA: Provided strong classification accuracy with clear discriminant functions
  • Naive Bayes: Achieved competitive performance despite independence assumption
  • Clustering: Revealed natural groupings partially aligned with obesity categories

Technologies used

R