NLP Corpus for Sentiment Analysis in Game Reviews

Project Overview

This corpus management system provides a complete pipeline for processing BoardGameGeek review data for sentiment analysis. Built with modularity and performance in mind, it handles raw data ingestion, text preprocessing, linguistic analysis, feature extraction, vectorization, and persistence. The architecture emphasizes clean separation of concerns, allowing researchers to swap components (preprocessing strategies, vectorizers, backends) without touching other parts of the system.

Technical Stack

Core Libraries

  • Data Processing: Polars for efficient DataFrame operations, NumPy for numerical computations
  • NLP Frameworks: spaCy for industrial-strength linguistic analysis, NLTK for lightweight preprocessing
  • Machine Learning: scikit-learn for TF-IDF and count vectorization
  • Sparse Matrices: SciPy for memory-efficient high-dimensional vector storage
  • Multiprocessing: Python’s multiprocessing with shared counters for parallel feature extraction

Key Features

  • Generator-based streaming for memory-efficient data loading
  • Document-level caching with lazy evaluation
  • Dual-backend support (NLTK/spaCy) with automatic fallback
  • Parallel feature extraction with real-time progress tracking
  • Comprehensive persistence layer with format auto-detection

Architecture Highlights

Document-Centric Design

The Document class serves as the central data structure, providing lazy access to raw content and processed forms. Results of expensive operations (tokenization, POS tagging, dependency parsing) are cached at the document level, minimizing redundant computation while keeping the interface clean:

doc = Document(doc_id="game123_user456", raw_text="Great game!", rating=9.0)
tokens = doc.get_tokens()        # First call: processes and caches
tokens = doc.get_tokens()        # Subsequent calls: instant retrieval
pos_tags = doc.get_pos_tags()    # Separate cache for POS tags

Documents store references to shared pipeline components (PreprocessingPipeline, LinguisticAnalyzer), enabling consistent processing across the entire corpus without duplicating configuration.
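A minimal sketch of this caching pattern (whitespace tokenization stands in for the real pipeline, and the `_cache` dict and `clear_cache` method are illustrative, not the actual API):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Document:
    """Illustrative document with per-result lazy caching."""
    doc_id: str
    raw_text: str
    rating: Optional[float] = None
    _cache: dict = field(default_factory=dict)

    def get_tokens(self) -> list:
        # First call computes and stores; later calls return the cached list.
        if "tokens" not in self._cache:
            self._cache["tokens"] = self.raw_text.split()
        return self._cache["tokens"]

    def clear_cache(self) -> None:
        # Caches are clearable per document, as noted under memory strategies.
        self._cache.clear()

doc = Document(doc_id="game123_user456", raw_text="Great game!", rating=9.0)
assert doc.get_tokens() is doc.get_tokens()  # second call is a cache hit
```

Each expensive result (tokens, POS tags, parses) would get its own cache key, so clearing or recomputing one does not invalidate the others.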

Chain-of-Responsibility Preprocessing

The PreprocessingPipeline implements a registry-based pattern where each step is an independent class. Steps can be composed into custom pipelines declaratively:

pipeline = PreprocessingPipeline([
    'lowercase',
    'remove_html',
    'expand_contractions',
    'lemmatize_simple',
    'remove_stopwords'  # Preserves negations for sentiment analysis
])

Notable preprocessing features:

  • Context-aware contraction expansion: Handles special cases (“can’t” → “cannot”) before general patterns
  • Sentiment-aware stopword removal: Excludes negation words from stopword list
  • Optional POS-based lemmatization: Uses part-of-speech tags for accurate word normalization

New preprocessing steps can be added to the registry without modifying the pipeline class.
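The registry pattern might look like the following sketch (only two of the steps are shown, and the decorator name and signatures are hypothetical):

```python
import re

STEP_REGISTRY = {}

def register_step(name):
    """Decorator that adds a step function to the registry."""
    def deco(fn):
        STEP_REGISTRY[name] = fn
        return fn
    return deco

@register_step("lowercase")
def lowercase(text):
    return text.lower()

@register_step("remove_html")
def remove_html(text):
    # Replace tags with a space so adjacent words don't merge.
    return re.sub(r"<[^>]+>", " ", text)

class PreprocessingPipeline:
    def __init__(self, step_names):
        # Fail fast on unknown step names.
        unknown = [s for s in step_names if s not in STEP_REGISTRY]
        if unknown:
            raise ValueError(f"Unknown preprocessing steps: {unknown}")
        self.steps = [STEP_REGISTRY[s] for s in step_names]

    def run(self, text):
        for step in self.steps:
            text = step(text)
        return text

pipeline = PreprocessingPipeline(["remove_html", "lowercase"])
```

Registering a new step is just another decorated function; the pipeline class itself never changes, which is the Open/Closed behavior described later.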

Dual-Backend Linguistic Analysis

The LinguisticAnalyzer abstracts linguistic operations behind a unified interface, supporting both NLTK (lightweight, no model downloads) and spaCy (industrial-strength, requires model):

analyzer = LinguisticAnalyzer(backend='spacy', spacy_model='en_core_web_sm')

Both backends provide:

  • Sentence segmentation
  • Tokenization
  • POS tagging
  • Dependency parsing (spaCy only; NLTK returns placeholder structure)

The analyzer includes specialized methods for sentiment analysis:

  • find_negations(): Identifies negated tokens using dependency relations
  • find_intensifiers(): Detects modifiers like “very” or “extremely”

This enables feature extraction methods to leverage syntactic structure for better sentiment detection (e.g., distinguishing “not good” from “not the”).
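One way to sketch these helpers in a backend-agnostic form is to operate on pre-computed dependency triples rather than a live spaCy `Doc` (the triple layout and the intensifier list here are illustrative assumptions, not the real interface):

```python
# Each triple is (token_text, dependency_label, head_index); spaCy marks
# negation modifiers with the dependency label "neg".
INTENSIFIERS = {"very", "extremely", "really", "so"}

def find_negations(triples):
    """Return indices of tokens negated via a 'neg' dependency relation."""
    return [head for _, dep, head in triples if dep == "neg"]

def find_intensifiers(tokens):
    """Return indices of intensifying modifiers like 'very'."""
    return [i for i, tok in enumerate(tokens) if tok.lower() in INTENSIFIERS]

# "not a good game": 'not' attaches to 'good' (index 2) with label 'neg',
# so 'good' is reported as negated rather than the unrelated 'the'/'a'.
triples = [("not", "neg", 2), ("a", "det", 3),
           ("good", "amod", 3), ("game", "ROOT", 3)]
```

Using the dependency head rather than simple adjacency is what lets the extractor tell "not good" apart from "not the".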

Multi-Format Corpus Reader

The CorpusReader handles diverse data formats with automatic detection:

  1. Single CSV file with Polars-based streaming
  2. JSON files in root directory (one per game)
  3. Nested directory structure ({game_id}/reviews.json)

Column name normalization handles variations (“text” vs. “comment”, “user” vs. “user_id”), and the stream_reviews() generator enables processing datasets larger than memory:

for review in reader.stream_reviews(game_ids=['174430', '161936']):
    process_review(review)  # Never loads entire dataset
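A simplified version of the streaming reader, using the stdlib `csv` module where the real implementation uses Polars (the alias table and column names are illustrative):

```python
import csv
from pathlib import Path
from typing import Iterator, Optional

def stream_reviews(csv_path: Path,
                   game_ids: Optional[set] = None) -> Iterator[dict]:
    """Yield one review dict at a time; the file is never fully loaded."""
    # Normalize common column-name variants ("comment" -> "text", etc.).
    aliases = {"comment": "text", "user": "user_id"}
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            row = {aliases.get(k, k): v for k, v in row.items()}
            if game_ids is None or row.get("game_id") in game_ids:
                yield row
```

Because the generator yields rows lazily, filtering by `game_ids` happens during the single pass over the file, with constant memory regardless of dataset size.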

Parallel Feature Extraction

The FeatureExtractor computes linguistic features for sentiment analysis using multiprocessing with sophisticated progress tracking:

feature_dicts = extractor.extract_features_batch(
    documents,
    show_progress=True,
    n_jobs=8
)

Features include:

  • Opinion word counts (positive/negative using VADER or basic lexicon)
  • Negated opinion words (uses dependency parsing)
  • Intensifiers and mitigators
  • Domain-specific vocabulary
  • Structural features (length, sentence count)
  • VADER compound scores

The parallel implementation uses a shared counter updated by worker processes, with a separate thread in the main process polling the counter to update a tqdm progress bar. This avoids serialization issues and progress bar corruption common in naive multiprocessing approaches.
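The shared-counter pattern can be sketched as follows (a plain `print` stands in for tqdm, the feature function is a toy, and the counter is shared via a pool initializer so it is inherited rather than pickled with each task):

```python
import multiprocessing as mp
import threading
import time

_counter = None  # set per worker by the pool initializer

def _init_worker(counter):
    global _counter
    _counter = counter

def _extract(doc):
    features = {"length": len(doc)}  # stand-in for real feature extraction
    with _counter.get_lock():
        _counter.value += 1          # workers only bump the shared counter
    return features

def extract_features_batch(documents, n_jobs=2):
    counter = mp.Value("i", 0)
    done = threading.Event()

    def poll():
        # Only this main-process thread touches the progress display,
        # so the bar is never corrupted by interleaved worker writes.
        while not done.is_set():
            print(f"\rprocessed {counter.value}/{len(documents)}", end="")
            time.sleep(0.1)

    reporter = threading.Thread(target=poll, daemon=True)
    reporter.start()
    with mp.Pool(n_jobs, initializer=_init_worker,
                 initargs=(counter,)) as pool:
        results = pool.map(_extract, documents)
    done.set()
    reporter.join()
    return results
```

Workers never touch the progress bar and the bar object is never pickled, which sidesteps both failure modes mentioned above.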

Flexible Vectorization System

The VectorManager converts text and linguistic features into numerical representations:

TF-IDF Vectorization with optimized parameters:

  • Sublinear term frequency scaling (log transform)
  • N-gram support (unigrams + bigrams)
  • Document frequency filtering (min_df, max_df)
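These parameters map directly onto scikit-learn's `TfidfVectorizer`; the threshold values below are illustrative and should be tuned to the corpus size:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    sublinear_tf=True,   # log-scaled term frequency
    ngram_range=(1, 2),  # unigrams + bigrams
    min_df=2,            # drop terms appearing in fewer than 2 documents
    max_df=0.95,         # drop near-ubiquitous terms
)

texts = [
    "great game great components",
    "terrible rules terrible components",
    "great components but terrible rules",
]
X_tfidf = vectorizer.fit_transform(texts)  # sparse matrix, one row per doc
```

`fit_transform` returns a SciPy sparse matrix, which is what makes the memory figures in the persistence and efficiency sections attainable.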

Linguistic Feature Vectorization:

  • Converts feature dictionaries to NumPy arrays
  • Handles type coercion (booleans → integers)
  • Fills missing values with zeros
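A hand-rolled sketch of that conversion, covering all three bullets (the function name and fixed column ordering are assumptions about the real code):

```python
import numpy as np

def vectorize_feature_dicts(feature_dicts):
    """Fixed column order; bools coerced to floats; missing keys -> 0."""
    keys = sorted({k for d in feature_dicts for k in d})
    matrix = np.zeros((len(feature_dicts), len(keys)), dtype=np.float64)
    for i, d in enumerate(feature_dicts):
        for j, k in enumerate(keys):
            matrix[i, j] = float(d.get(k, 0))  # True -> 1.0, missing -> 0.0
    return matrix, keys

X_feat, names = vectorize_feature_dicts([
    {"n_positive": 3, "has_negation": True},
    {"n_positive": 1, "n_intensifiers": 2},
])
```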

Feature Combination:

X_combined = vector_manager.combine_features(X_tfidf, feature_matrix)

Automatically converts dense arrays to sparse format before combining, maintaining efficiency for high-dimensional data.
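That dense-to-sparse promotion amounts to a guarded `scipy.sparse.hstack` (sketch; the real method signature may differ):

```python
import numpy as np
from scipy import sparse

def combine_features(X_text, X_ling):
    """Horizontally stack text vectors with linguistic feature columns.

    Dense inputs are converted to CSR first so the result stays sparse.
    """
    if not sparse.issparse(X_text):
        X_text = sparse.csr_matrix(X_text)
    if not sparse.issparse(X_ling):
        X_ling = sparse.csr_matrix(X_ling)
    return sparse.hstack([X_text, X_ling], format="csr")

X_text = sparse.random(3, 100, density=0.05, format="csr")
X_ling = np.ones((3, 6))           # dense linguistic features
X = combine_features(X_text, X_ling)
print(X.shape)  # (3, 106)
```

Stacking in CSR format keeps the combined matrix directly usable by scikit-learn estimators without a dense detour.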

Comprehensive Persistence Layer

The PersistenceManager handles serialization with a clear directory structure:

base_path/
├── raw_data/                    # Original data (never modified)
├── processed_data/              # CSV exports
├── vector_representations/      # NPZ/NPY matrices
└── data_splits/                 # Train/test/val metadata

Smart format selection:

  • Sparse matrices → NPZ (compressed)
  • Dense matrices → NPY or Parquet
  • Tabular data → CSV (Polars-compatible)
  • Arbitrary objects → Pickle

The system protects raw data by always writing processed outputs to separate directories.
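The format dispatch could look like this sketch (the function name is hypothetical, and the Parquet/CSV branches are omitted for brevity):

```python
import pickle
from pathlib import Path

import numpy as np
from scipy import sparse

def save_artifact(obj, path: Path):
    """Pick a serialization format from the object's type."""
    path.parent.mkdir(parents=True, exist_ok=True)
    if sparse.issparse(obj):
        # Sparse matrices -> compressed NPZ
        sparse.save_npz(path.with_suffix(".npz"), obj.tocsr())
    elif isinstance(obj, np.ndarray):
        # Dense matrices -> NPY
        np.save(path.with_suffix(".npy"), obj)
    else:
        # Fallback for arbitrary objects -> pickle
        with open(path.with_suffix(".pkl"), "wb") as f:
            pickle.dump(obj, f)
```

Callers never name a format explicitly, so swapping a vectorizer that produces dense output for one that produces sparse output requires no persistence changes.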

Corpus-Level Operations

The Corpus class orchestrates document management with high-level operations:

Label Assignment: Maps numerical ratings to sentiment categories:

corpus = Corpus(label_map={
    'positive': [7, 10],
    'negative': [1, 4]
})
corpus.assign_labels()
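Assuming the ranges are inclusive and ratings falling in neither range (here, roughly 4–7) are treated as neutral/ambiguous, the mapping reduces to:

```python
def assign_label(rating, label_map):
    """Map a numeric rating to a sentiment label via inclusive ranges."""
    for label, (lo, hi) in label_map.items():
        if lo <= rating <= hi:
            return label
    return None  # outside every range: drop as neutral/ambiguous

label_map = {'positive': [7, 10], 'negative': [1, 4]}
print(assign_label(9.0, label_map))  # positive
print(assign_label(5.5, label_map))  # None
```

Leaving a gap between the ranges is a common choice for review data, since mid-scale ratings carry weak sentiment signal.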

Dataset Balancing: Handles class imbalance via undersampling or oversampling

Stratified Splitting: Creates train/test/validation splits while preserving label proportions

Statistics Generation: Computes comprehensive corpus metrics (label distribution, text lengths, unique games, reviews per game)

Filtering: Creates corpus subsets by game IDs without copying documents

Memory Efficiency Strategies

  1. Generator-Based Streaming: Review data yielded one at a time
  2. Sparse Matrix Storage: 200× memory reduction for TF-IDF vectors
  3. Document-Level Caching: Results cached per-document, clearable individually
  4. Polars DataFrames: More efficient than Pandas for CSV operations
  5. Lean Queries: Only required columns loaded from disk

Design Philosophy

The system embodies key software engineering principles:

  • Separation of Concerns: Each class has a single responsibility
  • Open/Closed Principle: Easy to extend (new preprocessing steps, features) without modification
  • Dependency Inversion: Components depend on abstractions, not concrete implementations
  • Lazy Evaluation: Expensive operations deferred until needed, then cached
  • Fail-Fast: Invalid inputs raise descriptive exceptions immediately
  • Reproducibility: Random seeds, deterministic ordering, saved artifacts enable replication

The modular architecture and intelligent caching make it easy to experiment with different preprocessing strategies, feature sets, and vectorization approaches while maintaining clean, maintainable code.

Technologies Used

Python
spaCy
SciPy
Polars
scikit-learn