NLP Corpus for Sentiment Analysis in Game Reviews

Project Overview

This corpus management system provides a complete pipeline for processing BoardGameGeek review data for sentiment analysis. Built with modularity and performance in mind, it handles raw data ingestion, text preprocessing, linguistic analysis, feature extraction, vectorization, and persistence. The architecture emphasizes clean separation of concerns, allowing researchers to swap components (preprocessing strategies, vectorizers, backends) without touching other parts of the system.

Technical Stack

Core Libraries

  • Data Processing: Polars for efficient DataFrame operations, NumPy for numerical computations
  • NLP Frameworks: spaCy for industrial-strength linguistic analysis, NLTK for lightweight preprocessing
  • Machine Learning: scikit-learn for TF-IDF and count vectorization
  • Sparse Matrices: SciPy for memory-efficient high-dimensional vector storage
  • Multiprocessing: Python’s multiprocessing with shared counters for parallel feature extraction

Key Features

  • Generator-based streaming for memory-efficient data loading
  • Document-level caching with lazy evaluation
  • Dual-backend support (NLTK/spaCy) with automatic fallback
  • Parallel feature extraction with real-time progress tracking
  • Comprehensive persistence layer with format auto-detection

Architecture Highlights

Document-Centric Design

The Document class serves as the central data structure, providing lazy access to raw content and processed forms. Results of expensive operations (tokenization, POS tagging, dependency parsing) are cached at the document level, minimizing redundant computation while keeping the interface clean:

doc = Document(doc_id="game123_user456", raw_text="Great game!", rating=9.0)
tokens = doc.get_tokens()        # First call: processes and caches
tokens = doc.get_tokens()        # Subsequent calls: instant retrieval
pos_tags = doc.get_pos_tags()    # Separate cache for POS tags

Documents store references to shared pipeline components (PreprocessingPipeline, LinguisticAnalyzer), enabling consistent processing across the entire corpus without duplicating configuration.
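A minimal sketch of this caching pattern (whitespace tokenization stands in for the real pipeline, and the `_cache` dict and `clear_cache` method are illustrative, not the actual API):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Document:
    """Illustrative document with per-result lazy caching."""
    doc_id: str
    raw_text: str
    rating: Optional[float] = None
    _cache: dict = field(default_factory=dict)

    def get_tokens(self) -> list:
        # First call computes and stores; later calls return the cached list.
        if "tokens" not in self._cache:
            self._cache["tokens"] = self.raw_text.split()
        return self._cache["tokens"]

    def clear_cache(self) -> None:
        # Caches are clearable per document, as noted under memory strategies.
        self._cache.clear()

doc = Document(doc_id="game123_user456", raw_text="Great game!", rating=9.0)
assert doc.get_tokens() is doc.get_tokens()  # second call is a cache hit
```

Each expensive result (tokens, POS tags, parses) would get its own cache key, so clearing or recomputing one does not invalidate the others.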

Chain-of-Responsibility Preprocessing

The PreprocessingPipeline implements a registry-based pattern where each step is an independent class. Steps can be composed into custom pipelines declaratively:

pipeline = PreprocessingPipeline([
    'lowercase',
    'remove_html',
    'expand_contractions',
    'lemmatize_simple',
    'remove_stopwords'  # Preserves negations for sentiment analysis
])

Notable preprocessing features:

  • Context-aware contraction expansion: Handles special cases (“can’t” → “cannot”) before general patterns
  • Sentiment-aware stopword removal: Excludes negation words from stopword list
  • Optional POS-based lemmatization: Uses part-of-speech tags for accurate word normalization

New preprocessing steps can be added to the registry without modifying the pipeline class.
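The registry pattern might look like the following sketch (only two of the steps are shown, and the decorator name and signatures are hypothetical):

```python
import re

STEP_REGISTRY = {}

def register_step(name):
    """Decorator that adds a step function to the registry."""
    def deco(fn):
        STEP_REGISTRY[name] = fn
        return fn
    return deco

@register_step("lowercase")
def lowercase(text):
    return text.lower()

@register_step("remove_html")
def remove_html(text):
    # Replace tags with a space so adjacent words don't merge.
    return re.sub(r"<[^>]+>", " ", text)

class PreprocessingPipeline:
    def __init__(self, step_names):
        # Fail fast on unknown step names.
        unknown = [s for s in step_names if s not in STEP_REGISTRY]
        if unknown:
            raise ValueError(f"Unknown preprocessing steps: {unknown}")
        self.steps = [STEP_REGISTRY[s] for s in step_names]

    def run(self, text):
        for step in self.steps:
            text = step(text)
        return text

pipeline = PreprocessingPipeline(["remove_html", "lowercase"])
```

Registering a new step is just another decorated function; the pipeline class itself never changes, which is the Open/Closed behavior described later.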

Dual-Backend Linguistic Analysis

The LinguisticAnalyzer abstracts linguistic operations behind a unified interface, supporting both NLTK (lightweight, no model downloads) and spaCy (industrial-strength, requires model):

analyzer = LinguisticAnalyzer(backend='spacy', spacy_model='en_core_web_sm')

Both backends provide:

  • Sentence segmentation
  • Tokenization
  • POS tagging
  • Dependency parsing (spaCy only; NLTK returns placeholder structure)

The analyzer includes specialized methods for sentiment analysis:

  • find_negations(): Identifies negated tokens using dependency relations
  • find_intensifiers(): Detects modifiers like “very” or “extremely”

This enables feature extraction methods to leverage syntactic structure for better sentiment detection (e.g., distinguishing “not good” from “not the”).
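One way to sketch these helpers in a backend-agnostic form is to operate on pre-computed dependency triples rather than a live spaCy `Doc` (the triple layout and the intensifier list here are illustrative assumptions, not the real interface):

```python
# Each triple is (token_text, dependency_label, head_index); spaCy marks
# negation modifiers with the dependency label "neg".
INTENSIFIERS = {"very", "extremely", "really", "so"}

def find_negations(triples):
    """Return indices of tokens negated via a 'neg' dependency relation."""
    return [head for _, dep, head in triples if dep == "neg"]

def find_intensifiers(tokens):
    """Return indices of intensifying modifiers like 'very'."""
    return [i for i, tok in enumerate(tokens) if tok.lower() in INTENSIFIERS]

# "not a good game": 'not' attaches to 'good' (index 2) with label 'neg',
# so 'good' is reported as negated rather than the unrelated 'the'/'a'.
triples = [("not", "neg", 2), ("a", "det", 3),
           ("good", "amod", 3), ("game", "ROOT", 3)]
```

Using the dependency head rather than simple adjacency is what lets the extractor tell "not good" apart from "not the".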

Multi-Format Corpus Reader

The CorpusReader handles diverse data formats with automatic detection:

  1. Single CSV file with Polars-based streaming
  2. JSON files in root directory (one per game)
  3. Nested directory structure ({game_id}/reviews.json)

Column name normalization handles variations (“text” vs. “comment”, “user” vs. “user_id”), and the stream_reviews() generator enables processing datasets larger than memory:

for review in reader.stream_reviews(game_ids=['174430', '161936']):
    process_review(review)  # Never loads entire dataset
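A simplified version of the streaming reader, using the stdlib `csv` module where the real implementation uses Polars (the alias table and column names are illustrative):

```python
import csv
from pathlib import Path
from typing import Iterator, Optional

def stream_reviews(csv_path: Path,
                   game_ids: Optional[set] = None) -> Iterator[dict]:
    """Yield one review dict at a time; the file is never fully loaded."""
    # Normalize common column-name variants ("comment" -> "text", etc.).
    aliases = {"comment": "text", "user": "user_id"}
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            row = {aliases.get(k, k): v for k, v in row.items()}
            if game_ids is None or row.get("game_id") in game_ids:
                yield row
```

Because the generator yields rows lazily, filtering by `game_ids` happens during the single pass over the file, with constant memory regardless of dataset size.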

Parallel Feature Extraction

The FeatureExtractor computes linguistic features for sentiment analysis using multiprocessing with sophisticated progress tracking:

feature_dicts = extractor.extract_features_batch(
    documents,
    show_progress=True,
    n_jobs=8
)

Features include:

  • Opinion word counts (positive/negative using VADER or basic lexicon)
  • Negated opinion words (uses dependency parsing)
  • Intensifiers and mitigators
  • Domain-specific vocabulary
  • Structural features (length, sentence count)
  • VADER compound scores

The parallel implementation uses a shared counter updated by worker processes, with a separate thread in the main process polling the counter to update a tqdm progress bar. This avoids serialization issues and progress bar corruption common in naive multiprocessing approaches.
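The shared-counter pattern can be sketched as follows (a plain `print` stands in for tqdm, the feature function is a toy, and the counter is shared via a pool initializer so it is inherited rather than pickled with each task):

```python
import multiprocessing as mp
import threading
import time

_counter = None  # set per worker by the pool initializer

def _init_worker(counter):
    global _counter
    _counter = counter

def _extract(doc):
    features = {"length": len(doc)}  # stand-in for real feature extraction
    with _counter.get_lock():
        _counter.value += 1          # workers only bump the shared counter
    return features

def extract_features_batch(documents, n_jobs=2):
    counter = mp.Value("i", 0)
    done = threading.Event()

    def poll():
        # Only this main-process thread touches the progress display,
        # so the bar is never corrupted by interleaved worker writes.
        while not done.is_set():
            print(f"\rprocessed {counter.value}/{len(documents)}", end="")
            time.sleep(0.1)

    reporter = threading.Thread(target=poll, daemon=True)
    reporter.start()
    with mp.Pool(n_jobs, initializer=_init_worker,
                 initargs=(counter,)) as pool:
        results = pool.map(_extract, documents)
    done.set()
    reporter.join()
    return results
```

Workers never touch the progress bar and the bar object is never pickled, which sidesteps both failure modes mentioned above.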

Flexible Vectorization System

The VectorManager converts text and linguistic features into numerical representations:

TF-IDF Vectorization with optimized parameters:

  • Sublinear term frequency scaling (log transform)
  • N-gram support (unigrams + bigrams)
  • Document frequency filtering (min_df, max_df)
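These parameters map directly onto scikit-learn's `TfidfVectorizer`; the threshold values below are illustrative and should be tuned to the corpus size:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    sublinear_tf=True,   # log-scaled term frequency
    ngram_range=(1, 2),  # unigrams + bigrams
    min_df=2,            # drop terms appearing in fewer than 2 documents
    max_df=0.95,         # drop near-ubiquitous terms
)

texts = [
    "great game great components",
    "terrible rules terrible components",
    "great components but terrible rules",
]
X_tfidf = vectorizer.fit_transform(texts)  # sparse matrix, one row per doc
```

`fit_transform` returns a SciPy sparse matrix, which is what makes the memory figures in the persistence and efficiency sections attainable.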

Linguistic Feature Vectorization:

  • Converts feature dictionaries to NumPy arrays
  • Handles type coercion (booleans → integers)
  • Fills missing values with zeros
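A hand-rolled sketch of that conversion, covering all three bullets (the function name and fixed column ordering are assumptions about the real code):

```python
import numpy as np

def vectorize_feature_dicts(feature_dicts):
    """Fixed column order; bools coerced to floats; missing keys -> 0."""
    keys = sorted({k for d in feature_dicts for k in d})
    matrix = np.zeros((len(feature_dicts), len(keys)), dtype=np.float64)
    for i, d in enumerate(feature_dicts):
        for j, k in enumerate(keys):
            matrix[i, j] = float(d.get(k, 0))  # True -> 1.0, missing -> 0.0
    return matrix, keys

X_feat, names = vectorize_feature_dicts([
    {"n_positive": 3, "has_negation": True},
    {"n_positive": 1, "n_intensifiers": 2},
])
```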

Feature Combination:

X_combined = vector_manager.combine_features(X_tfidf, feature_matrix)

Automatically converts dense arrays to sparse format before combining, maintaining efficiency for high-dimensional data.
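That dense-to-sparse promotion amounts to a guarded `scipy.sparse.hstack` (sketch; the real method signature may differ):

```python
import numpy as np
from scipy import sparse

def combine_features(X_text, X_ling):
    """Horizontally stack text vectors with linguistic feature columns.

    Dense inputs are converted to CSR first so the result stays sparse.
    """
    if not sparse.issparse(X_text):
        X_text = sparse.csr_matrix(X_text)
    if not sparse.issparse(X_ling):
        X_ling = sparse.csr_matrix(X_ling)
    return sparse.hstack([X_text, X_ling], format="csr")

X_text = sparse.random(3, 100, density=0.05, format="csr")
X_ling = np.ones((3, 6))           # dense linguistic features
X = combine_features(X_text, X_ling)
print(X.shape)  # (3, 106)
```

Stacking in CSR format keeps the combined matrix directly usable by scikit-learn estimators without a dense detour.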

Comprehensive Persistence Layer

The PersistenceManager handles serialization with a clear directory structure:

base_path/
├── raw_data/                    # Original data (never modified)
├── processed_data/              # CSV exports
├── vector_representations/      # NPZ/NPY matrices
└── data_splits/                 # Train/test/val metadata

Smart format selection:

  • Sparse matrices → NPZ (compressed)
  • Dense matrices → NPY or Parquet
  • Tabular data → CSV (Polars-compatible)
  • Arbitrary objects → Pickle

The system protects raw data by always writing processed outputs to separate directories.
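The format dispatch could look like this sketch (the function name is hypothetical, and the Parquet/CSV branches are omitted for brevity):

```python
import pickle
from pathlib import Path

import numpy as np
from scipy import sparse

def save_artifact(obj, path: Path):
    """Pick a serialization format from the object's type."""
    path.parent.mkdir(parents=True, exist_ok=True)
    if sparse.issparse(obj):
        # Sparse matrices -> compressed NPZ
        sparse.save_npz(path.with_suffix(".npz"), obj.tocsr())
    elif isinstance(obj, np.ndarray):
        # Dense matrices -> NPY
        np.save(path.with_suffix(".npy"), obj)
    else:
        # Fallback for arbitrary objects -> pickle
        with open(path.with_suffix(".pkl"), "wb") as f:
            pickle.dump(obj, f)
```

Callers never name a format explicitly, so swapping a vectorizer that produces dense output for one that produces sparse output requires no persistence changes.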

Corpus-Level Operations

The Corpus class orchestrates document management with high-level operations:

Label Assignment: Maps numerical ratings to sentiment categories:

corpus = Corpus(label_map={
    'positive': [7, 10],
    'negative': [1, 4]
})
corpus.assign_labels()
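Assuming the ranges are inclusive and ratings falling in neither range (here, roughly 4–7) are treated as neutral/ambiguous, the mapping reduces to:

```python
def assign_label(rating, label_map):
    """Map a numeric rating to a sentiment label via inclusive ranges."""
    for label, (lo, hi) in label_map.items():
        if lo <= rating <= hi:
            return label
    return None  # outside every range: drop as neutral/ambiguous

label_map = {'positive': [7, 10], 'negative': [1, 4]}
print(assign_label(9.0, label_map))  # positive
print(assign_label(5.5, label_map))  # None
```

Leaving a gap between the ranges is a common choice for review data, since mid-scale ratings carry weak sentiment signal.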

Dataset Balancing: Handles class imbalance via undersampling or oversampling

Stratified Splitting: Creates train/test/validation splits while preserving label proportions

Statistics Generation: Computes comprehensive corpus metrics (label distribution, text lengths, unique games, reviews per game)

Filtering: Creates corpus subsets by game IDs without copying documents

Memory Efficiency Strategies

  1. Generator-Based Streaming: Review data yielded one at a time
  2. Sparse Matrix Storage: 200× memory reduction for TF-IDF vectors
  3. Document-Level Caching: Results cached per-document, clearable individually
  4. Polars DataFrames: More efficient than Pandas for CSV operations
  5. Lean Queries: Only required columns loaded from disk

Design Philosophy

The system embodies key software engineering principles:

  • Separation of Concerns: Each class has a single responsibility
  • Open/Closed Principle: Easy to extend (new preprocessing steps, features) without modification
  • Dependency Inversion: Components depend on abstractions, not concrete implementations
  • Lazy Evaluation: Expensive operations deferred until needed, then cached
  • Fail-Fast: Invalid inputs raise descriptive exceptions immediately
  • Reproducibility: Random seeds, deterministic ordering, saved artifacts enable replication

The modular architecture and intelligent caching make it easy to experiment with different preprocessing strategies, feature sets, and vectorization approaches while maintaining clean, maintainable code.

Technologies Used

Python
spaCy
SciPy
Polars
scikit-learn