NLP Corpus for Sentiment Analysis in Game Reviews
Project Overview
This corpus management system provides a complete pipeline for processing BoardGameGeek review data for sentiment analysis. Built with modularity and performance in mind, it handles raw data ingestion, text preprocessing, linguistic analysis, feature extraction, vectorization, and persistence. The architecture emphasizes clean separation of concerns, allowing researchers to swap components (preprocessing strategies, vectorizers, backends) without touching other parts of the system.
Technical Stack
Core Libraries
- Data Processing: Polars for efficient DataFrame operations, NumPy for numerical computations
- NLP Frameworks: spaCy for industrial-strength linguistic analysis, NLTK for lightweight preprocessing
- Machine Learning: Scikit-learn for TF-IDF and count vectorization
- Sparse Matrices: SciPy for memory-efficient high-dimensional vector storage
- Multiprocessing: Python’s multiprocessing with shared counters for parallel feature extraction
Key Features
- Generator-based streaming for memory-efficient data loading
- Document-level caching with lazy evaluation
- Dual-backend support (NLTK/spaCy) with automatic fallback
- Parallel feature extraction with real-time progress tracking
- Comprehensive persistence layer with format auto-detection
Architecture Highlights
Document-Centric Design
The Document class serves as the central data structure, providing lazy access to raw content and processed forms. Results of expensive operations (tokenization, POS tagging, dependency parsing) are cached at the document level, minimizing redundant computation while keeping the interface clean:
doc = Document(doc_id="game123_user456", raw_text="Great game!", rating=9.0)
tokens = doc.get_tokens() # First call: processes and caches
tokens = doc.get_tokens() # Subsequent calls: instant retrieval
pos_tags = doc.get_pos_tags() # Separate cache for POS tags
Documents store references to shared pipeline components (PreprocessingPipeline, LinguisticAnalyzer), enabling consistent processing across the entire corpus without duplicating configuration.
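The caching behavior described above can be sketched as follows; the internals here (a `_cache` dict, the trivial whitespace tokenizer) are illustrative assumptions, not the actual implementation:

```python
# Minimal sketch of document-level lazy caching; the _cache dict and the
# whitespace tokenizer are illustrative stand-ins for the real pipeline.
class Document:
    def __init__(self, doc_id, raw_text, rating=None):
        self.doc_id = doc_id
        self.raw_text = raw_text
        self.rating = rating
        self._cache = {}  # one slot per expensive operation

    def get_tokens(self):
        # First call computes and caches; later calls return the cached list.
        if "tokens" not in self._cache:
            self._cache["tokens"] = self.raw_text.lower().split()
        return self._cache["tokens"]

    def clear_cache(self):
        # Caches are clearable per document to bound memory use.
        self._cache.clear()


doc = Document("game123_user456", "Great game!", rating=9.0)
first = doc.get_tokens()
second = doc.get_tokens()
print(first is second)  # cached: the same object comes back on repeat calls
```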
Chain-of-Responsibility Preprocessing
The PreprocessingPipeline implements a registry-based pattern where each step is an independent class. Steps can be composed into custom pipelines declaratively:
pipeline = PreprocessingPipeline([
    'lowercase',
    'remove_html',
    'expand_contractions',
    'lemmatize_simple',
    'remove_stopwords'  # Preserves negations for sentiment analysis
])
Notable preprocessing features:
- Context-aware contraction expansion: Handles special cases (“can’t” → “cannot”) before general patterns
- Sentiment-aware stopword removal: Excludes negation words from stopword list
- Optional POS-based lemmatization: Uses part-of-speech tags for accurate word normalization
New preprocessing steps can be added to the registry without modifying the pipeline class.
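One way the registry pattern could look; the `STEP_REGISTRY` name and `register_step` decorator are assumptions for illustration, not the real API:

```python
# Hedged sketch of a registry-based preprocessing pipeline: steps register
# themselves by name, and pipelines are composed declaratively from names.
import re

STEP_REGISTRY = {}

def register_step(name):
    # New steps are added to the registry without touching the pipeline class.
    def decorator(fn):
        STEP_REGISTRY[name] = fn
        return fn
    return decorator

@register_step("lowercase")
def lowercase(text):
    return text.lower()

@register_step("remove_html")
def remove_html(text):
    return re.sub(r"<[^>]+>", " ", text)

class PreprocessingPipeline:
    def __init__(self, step_names):
        # Fail fast on unknown steps with a descriptive error.
        unknown = [s for s in step_names if s not in STEP_REGISTRY]
        if unknown:
            raise ValueError(f"Unknown preprocessing steps: {unknown}")
        self.steps = [STEP_REGISTRY[s] for s in step_names]

    def run(self, text):
        for step in self.steps:
            text = step(text)
        return text

pipeline = PreprocessingPipeline(["lowercase", "remove_html"])
print(pipeline.run("<b>Great</b> GAME!"))
```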
Dual-Backend Linguistic Analysis
The LinguisticAnalyzer abstracts linguistic operations behind a unified interface, supporting both NLTK (lightweight, no model downloads) and spaCy (industrial-strength, requires model):
analyzer = LinguisticAnalyzer(backend='spacy', spacy_model='en_core_web_sm')
Both backends provide:
- Sentence segmentation
- Tokenization
- POS tagging
- Dependency parsing (spaCy only; NLTK returns placeholder structure)
The analyzer includes specialized methods for sentiment analysis:
- find_negations(): Identifies negated tokens using dependency relations
- find_intensifiers(): Detects modifiers like “very” or “extremely”
This enables feature extraction methods to leverage syntactic structure for better sentiment detection (e.g., distinguishing “not good” from “not the”).
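A toy illustration of negation lookup over a dependency parse; spaCy's "neg" relation is real, but the tuple-based parse below is faked for self-containment rather than produced by a parser:

```python
# (token, head, dep) triples as a dependency parse of "not good" might
# produce; in the real system these would come from spaCy's parser.
parse = [("not", "good", "neg"), ("good", "good", "ROOT")]

def find_negations(parse):
    # A token counts as negated if some child attaches to it via "neg".
    return {head for token, head, dep in parse if dep == "neg"}

print(find_negations(parse))  # {'good'}
```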
Multi-Format Corpus Reader
The CorpusReader handles diverse data formats with automatic detection:
- Single CSV file with Polars-based streaming
- JSON files in root directory (one per game)
- Nested directory structure ({game_id}/reviews.json)
Column name normalization handles variations (“text” vs. “comment”, “user” vs. “user_id”), and the stream_reviews() generator enables processing datasets larger than memory:
for review in reader.stream_reviews(game_ids=['174430', '161936']):
    process_review(review)  # Never loads entire dataset
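A minimal self-contained sketch of such a streaming generator, assuming CSV input and the column normalization described above; the sample data and exact dict keys are illustrative:

```python
# Generator-based streaming: yield one normalized review dict at a time,
# never materializing the whole dataset in memory.
import csv
import io

SAMPLE_CSV = """game_id,user,comment,rating
174430,alice,Loved it,9
161936,bob,Too long,5
999999,carol,Fine,7
"""

def stream_reviews(csv_file, game_ids=None):
    wanted = set(game_ids) if game_ids else None
    for row in csv.DictReader(csv_file):
        if wanted and row["game_id"] not in wanted:
            continue
        yield {
            "game_id": row["game_id"],
            "user_id": row["user"],    # normalize "user" -> "user_id"
            "text": row["comment"],    # normalize "comment" -> "text"
            "rating": float(row["rating"]),
        }

reviews = list(stream_reviews(io.StringIO(SAMPLE_CSV),
                              game_ids=["174430", "161936"]))
print(len(reviews))  # 2
```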
Parallel Feature Extraction
The FeatureExtractor computes linguistic features for sentiment analysis using multiprocessing with sophisticated progress tracking:
feature_dicts = extractor.extract_features_batch(
    documents,
    show_progress=True,
    n_jobs=8
)
Features include:
- Opinion word counts (positive/negative using VADER or basic lexicon)
- Negated opinion words (uses dependency parsing)
- Intensifiers and mitigators
- Domain-specific vocabulary
- Structural features (length, sentence count)
- VADER compound scores
The parallel implementation uses a shared counter updated by worker processes, with a separate thread in the main process polling the counter to update a tqdm progress bar. This avoids serialization issues and progress bar corruption common in naive multiprocessing approaches.
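The shared-counter pattern can be sketched as follows; the worker pool and polling thread mirror the description, but the names are illustrative, a plain print stands in for the tqdm bar, and the fork start method is used only to keep the sketch simple:

```python
# Workers increment a shared multiprocessing.Value; a thread in the main
# process polls it (the real system refreshes a tqdm bar during polling).
import multiprocessing as mp
import threading
import time

_counter = None

def _init_worker(counter):
    # Each worker process receives the shared counter at pool start-up.
    global _counter
    _counter = counter

def _extract_one(text):
    # Stand-in for real per-document feature extraction.
    features = {"length": len(text), "n_words": len(text.split())}
    with _counter.get_lock():
        _counter.value += 1
    return features

def extract_features_batch(texts, n_jobs=2):
    ctx = mp.get_context("fork")  # POSIX-only shortcut for this sketch
    counter = ctx.Value("i", 0)
    done = threading.Event()

    def poll():
        while not done.is_set():
            time.sleep(0.05)  # a tqdm bar would be refreshed here
        print(f"processed {counter.value}/{len(texts)}")

    watcher = threading.Thread(target=poll)
    watcher.start()
    try:
        with ctx.Pool(n_jobs, initializer=_init_worker,
                      initargs=(counter,)) as pool:
            results = pool.map(_extract_one, texts)
    finally:
        done.set()
        watcher.join()
    return results

feats = extract_features_batch(["Great game!", "Too random.", "A classic."])
print(len(feats))  # 3
```

Because only the main process touches the progress display, worker results and the bar never contend for stdout, which is the corruption the described design avoids.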
Flexible Vectorization System
The VectorManager converts text and linguistic features into numerical representations:
TF-IDF Vectorization with optimized parameters:
- Sublinear term frequency scaling (log transform)
- N-gram support (unigrams + bigrams)
- Document frequency filtering (min_df, max_df)
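Assuming the vectorizer is scikit-learn's TfidfVectorizer, the listed parameters map onto a configuration like this; the exact threshold values are illustrative, not the project's settings:

```python
# Plausible TF-IDF configuration matching the parameters listed above.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    sublinear_tf=True,   # log-scaled term frequency
    ngram_range=(1, 2),  # unigrams + bigrams
    min_df=2,            # drop terms seen in fewer than 2 documents
    max_df=0.95,         # drop terms seen in more than 95% of documents
)

docs = ["great game great fun", "terrible game", "great fun for the family"]
X = vectorizer.fit_transform(docs)  # sparse matrix, docs x vocabulary
print(X.shape)
```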
Linguistic Feature Vectorization:
- Converts feature dictionaries to NumPy arrays
- Handles type coercion (booleans → integers)
- Fills missing values with zeros
Feature Combination:
X_combined = vector_manager.combine_features(X_tfidf, feature_matrix)
Automatically converts dense arrays to sparse format before combining, maintaining efficiency for high-dimensional data.
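A plausible implementation of this step with SciPy; the body of combine_features here is an assumption about the internals:

```python
# Coerce dense linguistic features to sparse, then stack columns so the
# high-dimensional TF-IDF block is never densified.
import numpy as np
from scipy import sparse

def combine_features(X_tfidf, feature_matrix):
    if not sparse.issparse(feature_matrix):
        feature_matrix = sparse.csr_matrix(feature_matrix)
    return sparse.hstack([X_tfidf, feature_matrix], format="csr")

X_tfidf = sparse.random(4, 1000, density=0.01, format="csr")
feature_matrix = np.ones((4, 6))  # 6 dense linguistic features per document
X_combined = combine_features(X_tfidf, feature_matrix)
print(X_combined.shape)  # (4, 1006)
```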
Comprehensive Persistence Layer
The PersistenceManager handles serialization with a clear directory structure:
base_path/
├── raw_data/ # Original data (never modified)
├── processed_data/ # CSV exports
├── vector_representations/ # NPZ/NPY matrices
└── data_splits/ # Train/test/val metadata
Smart format selection:
- Sparse matrices → NPZ (compressed)
- Dense matrices → NPY or Parquet
- Tabular data → CSV (Polars-compatible)
- Arbitrary objects → Pickle
The system protects raw data by always writing processed outputs to separate directories.
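The format selection rule might be expressed as a small dispatch function; the `choose_format` helper is illustrative, not the actual PersistenceManager API:

```python
# Sketch of the smart-format-selection rule described above.
import numpy as np
from scipy import sparse

def choose_format(obj):
    if sparse.issparse(obj):
        return "npz"     # compressed sparse matrix
    if isinstance(obj, np.ndarray):
        return "npy"     # dense array
    if isinstance(obj, list) and obj and isinstance(obj[0], dict):
        return "csv"     # tabular rows
    return "pickle"      # arbitrary objects

print(choose_format(sparse.eye(3, format="csr")))  # npz
print(choose_format(np.zeros(3)))                  # npy
print(choose_format([{"a": 1}]))                   # csv
print(choose_format(object()))                     # pickle
```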
Corpus-Level Operations
The Corpus class orchestrates document management with high-level operations:
Label Assignment: Maps numerical ratings to sentiment categories:
corpus = Corpus(label_map={
    'positive': [7, 10],
    'negative': [1, 4]
})
corpus.assign_labels()
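Assuming the [low, high] pairs are inclusive rating bounds, label assignment reduces to a range lookup; this standalone sketch is not the actual Corpus method:

```python
# Map a numerical rating to a sentiment label via inclusive ranges.
LABEL_MAP = {"positive": (7, 10), "negative": (1, 4)}

def assign_label(rating, label_map=LABEL_MAP):
    for label, (low, high) in label_map.items():
        if low <= rating <= high:
            return label
    return None  # ratings outside all ranges (e.g. 5-6) stay unlabeled

print(assign_label(9.0))  # positive
print(assign_label(2.5))  # negative
print(assign_label(5.5))  # None
```

Leaving mid-range ratings unlabeled is a common choice for sentiment corpora, since lukewarm reviews add label noise to both classes.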
Dataset Balancing: Handles class imbalance via undersampling or oversampling
Stratified Splitting: Creates train/test/validation splits while preserving label proportions
Statistics Generation: Computes comprehensive corpus metrics (label distribution, text lengths, unique games, reviews per game)
Filtering: Creates corpus subsets by game IDs without copying documents
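Stratified splitting as described can be illustrated with scikit-learn's train_test_split; the Corpus class may implement it differently, and the toy data below is for demonstration only:

```python
# Split document IDs while preserving label proportions in both halves.
from sklearn.model_selection import train_test_split

doc_ids = [f"doc{i}" for i in range(10)]
labels = ["positive"] * 6 + ["negative"] * 4

train_ids, test_ids, y_train, y_test = train_test_split(
    doc_ids, labels,
    test_size=0.5,
    stratify=labels,     # keep the 60/40 label ratio in each split
    random_state=42,     # fixed seed for reproducibility
)
print(y_train.count("positive"), y_test.count("positive"))  # 3 3
```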
Memory Efficiency Strategies
- Generator-Based Streaming: Review data yielded one at a time
- Sparse Matrix Storage: 200× memory reduction for TF-IDF vectors
- Document-Level Caching: Results cached per-document, clearable individually
- Polars DataFrames: More efficient than Pandas for CSV operations
- Lean Queries: Only required columns loaded from disk
Design Philosophy
The system embodies key software engineering principles:
- Separation of Concerns: Each class has a single responsibility
- Open/Closed Principle: Easy to extend (new preprocessing steps, features) without modification
- Dependency Inversion: Components depend on abstractions, not concrete implementations
- Lazy Evaluation: Expensive operations deferred until needed, then cached
- Fail-Fast: Invalid inputs raise descriptive exceptions immediately
- Reproducibility: Random seeds, deterministic ordering, saved artifacts enable replication
The modular architecture and intelligent caching make it easy to experiment with different preprocessing strategies, feature sets, and vectorization approaches while maintaining clean, maintainable code.