ETL Pipeline for Social Media Analysis with Spark and Hadoop
Project Overview
This project implements a distributed ETL (Extract, Transform, Load) pipeline using Apache Spark and Hadoop. It demonstrates the transition from low-level RDD (Resilient Distributed Dataset) manipulations to high-level Spark SQL analytics for processing textual social media data from platforms like Reddit.
Technical Implementation
Distributed Extraction
Data is ingested from distributed storage or remote text sources. Using Spark’s `sc.textFile`, large volumes of unstructured data (such as the Spanish classic El Quijote for NLP benchmarking) are loaded into the cluster as RDDs for parallel processing.
RDD Transformations & Actions
The pipeline implements core functional programming patterns to clean and structure data:
- Transformations: Utilizes `map`, `flatMap`, and `filter` for tokenization and cleaning.
- Key-Value Operations: Employs `reduceByKey` and `join` to calculate word frequencies and correlate datasets across different partitions.
- Optimization: Implements `coalesce` and `repartition` to manage data movement across the cluster, and `cache()`/`persist()` to minimize re-computation during iterative actions.
Spark SQL & DataFrames
For structured analysis, the pipeline converts raw JSON data (Reddit submissions) into Spark DataFrames.
- Aggregations: Computes user activity metrics and subreddit statistics using `groupBy` and `agg`.
- Custom Logic: Defines User Defined Functions (UDFs) for complex text metrics, such as character counting and regex-based filtering.
- SQL Interoperability: Declares temporary views enabling complex data querying via standard SQL syntax.
Analysis & Visualization
The pipeline integrates with `pyspark.pandas` (formerly Koalas) to bridge the gap between distributed computing and data science visualization. It generates insights such as:
- User posting frequency distributions.
- Subreddit-specific WordClouds to identify trending topics.
- Distribution of content length using logarithmic scaling.
Key Features
- Scalability: Designed to run on Spark clusters with Hadoop-compatible filesystems.
- Efficiency: Uses broadcast variables and accumulators to optimize cluster communication.
- Fault Tolerance: Leverages Spark’s lineage-based RDDs to ensure resilience against node failure.