Database Performance Analysis

Research Project

Project Overview

This project focuses on the architectural evaluation of four distinct database management systems: MongoDB, PostgreSQL, SQLite, and DuckDB. The objective was to build a standardized benchmarking suite that measures performance across CRUD operations, specifically analyzing how data volume, batching strategies, and connection management impact hardware resource utilization (CPU and RAM).

1. High-Fidelity Synthetic Data Generation

The benchmarking process began with the development of a robust data generation engine designed to create realistic datasets for two primary entities: Users and Vehicles. To ensure a high degree of realism, a custom provider system was built on top of the Faker library.

Specialized Data Providers

  • Identity & Location: Implementation of the DNI (Spanish National ID) checksum algorithm to generate valid identifiers. Location data was curated from a real dataset of Spanish postal codes, linking cities to their specific provinces and generating landline numbers with correct geographic prefixes.
  • Vehicle Logistics: A custom manufacturer-model hierarchy was developed, alongside two different Spanish license plate formats (the current “0000 AAA” and the historical province-based system). The system also calculated the “registration period” based on realistic plate age data.
  • Normalization Tools: Integrated unidecode to generate valid email addresses by stripping accents/special characters from names, following common corporate patterns.
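The DNI checksum and the accent-stripping email logic described above can be sketched in a few lines. The checksum table and the `12345678Z` example are the real, official algorithm; the function names are illustrative, and the stdlib `unicodedata` module stands in here for `unidecode`:

```python
import unicodedata

# Official DNI control-letter table: the letter is indexed by number mod 23.
DNI_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"

def make_dni(number: int) -> str:
    """Format an 8-digit DNI number with its valid checksum letter."""
    return f"{number:08d}{DNI_LETTERS[number % 23]}"

def email_for(first: str, last: str, domain: str = "example.com") -> str:
    """Build a corporate-style email, stripping accents to plain ASCII."""
    base = f"{first}.{last}".lower()
    ascii_base = unicodedata.normalize("NFKD", base).encode("ascii", "ignore").decode()
    return f"{ascii_base}@{domain}"
```

In the project these would sit behind custom Faker providers; the pure functions above are just the core logic.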

Multi-threaded Performance

To scale the dataset to millions of records, the generator utilized Python’s multiprocessing module.

  • Collision Avoidance: To ensure uniqueness without inter-thread communication overhead, DNI and License Plate intervals were segmented and assigned to specific processes.
  • Output Optimization: Used ujson and direct csv.DictWriter streams with file locks to minimize serialization bottlenecks during the transition from memory to disk.

2. Standardized Benchmarking Architecture

A significant challenge was creating a fair environment to compare disparate engines like a document-based NoSQL store (MongoDB) and a columnar OLAP database (DuckDB). This was solved with an object-oriented design built around a shared abstract interface.

The Abstract DB Interface

A base class DB was implemented to define a uniform interface for all operations:

  • Connection Management: Logic to toggle between Keep-Alive (persistent connection) and Close-on-Op (opening/closing for every transaction) to measure the impact of connection overhead.
  • Batching Engine: Configurable batch sizes for INSERT operations, allowing the study of how write-ahead logs (WAL) and memory buffers behave under different pressures.
  • Caching Layer: Integrated pymemcache as an external caching layer to compare native query performance against memory-object caching.
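A minimal sketch of such a uniform interface; the method names and the `keep_alive`/`batch_size` parameters are assumptions about the design, not the project's exact API:

```python
from abc import ABC, abstractmethod
from typing import Any, Iterable

class DB(ABC):
    def __init__(self, keep_alive: bool = True, batch_size: int = 1_000):
        self.keep_alive = keep_alive   # persistent vs. close-on-op connections
        self.batch_size = batch_size   # rows per INSERT batch

    @abstractmethod
    def connect(self) -> None: ...
    @abstractmethod
    def close(self) -> None: ...
    @abstractmethod
    def insert_many(self, table: str, rows: Iterable[dict[str, Any]]) -> None: ...

    def run_op(self, op, *args, **kwargs):
        """Wrap every operation so Close-on-Op mode pays the connection cost."""
        if not self.keep_alive:
            self.connect()
        try:
            return op(*args, **kwargs)
        finally:
            if not self.keep_alive:
                self.close()
```

Each engine manager subclasses `DB`, so the benchmark driver runs identical scenarios against every backend.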

Database Engine Managers

  • PostgreSQL: Implemented via SQLAlchemy and psycopg2 to manage remote server interactions.
  • MongoDB: Utilized pymongo with bulk-write operations for high-concurrency simulation.
  • SQLite/DuckDB: Leveraged native drivers for local file-based and in-memory performance comparisons.
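The bulk-write path can be sketched as a stdlib chunker feeding pymongo's `bulk_write`. The chunking helper is testable standalone; the Mongo call (commented out) assumes a local server, and the collection names are illustrative:

```python
from itertools import islice

def batched(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk

# Usage against a live server (assumes MongoDB on localhost):
# from pymongo import MongoClient, InsertOne
# coll = MongoClient()["bench"]["users"]
# for chunk in batched(users, 5_000):
#     coll.bulk_write([InsertOne(doc) for doc in chunk], ordered=False)
```

`ordered=False` lets the server apply writes in parallel, which is what makes the high-concurrency simulation meaningful.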

3. Technical Execution and Resource Monitoring

The project prioritized objective hardware measurement over simple execution time.

Performance Instrumentation

The system tracked three primary metrics per operation:

  1. Real Time (Latency): Calculated using time.perf_counter to capture the end-user experience.
  2. CPU Time (Throughput): Measured via time.process_time to separate effective computational work from idle I/O wait states.
  3. Memory Profiling: Integrated memory_profiler to measure the average and peak RAM consumption during massive inserts and full table scans.
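Metrics 1 and 2 pair naturally in a decorator: when wall-clock time far exceeds CPU time, the operation was I/O-bound. A sketch with an illustrative decorator name:

```python
import time
from functools import wraps

def instrumented(fn):
    """Attach real (latency) and CPU (work) timings to an operation's result."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        t0, c0 = time.perf_counter(), time.process_time()
        result = fn(*args, **kwargs)
        metrics = {
            "real_s": time.perf_counter() - t0,   # end-user latency
            "cpu_s": time.process_time() - c0,    # actual computation
        }
        return result, metrics
    return wrapper

@instrumented
def sleepy():
    time.sleep(0.05)   # pure wait: real time grows, CPU time barely moves
    return "done"
```

Peak RAM would be captured separately via memory_profiler, since it samples the process rather than wrapping a call site.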

Operation Profiles

The benchmark executed four distinct test scenarios across datasets of 10^3, 10^4, and 10^5 records:

  • Bulk Insert: Comparing the overhead of relational schema enforcement vs. JSON document flexibility.
  • Indexed Search: Evaluating B-tree efficiency in relational engines vs. MongoDB’s indexing performance on specific fields like DNI or VIN.
  • Complex Joins: Measuring the performance of relational JOIN operations against MongoDB’s $lookup aggregation pipeline.
  • Massive Updates: Analyzing the “In-place” update behavior versus row/column rewriting.
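The Complex Joins scenario compares two expressions of the same query. Side by side, with illustrative collection and field names (users, vehicles, owner_dni):

```python
# Relational form: a standard inner join on the owner's DNI.
JOIN_SQL = """
SELECT u.name, v.plate
FROM users AS u
JOIN vehicles AS v ON v.owner_dni = u.dni;
"""

# MongoDB form: a $lookup aggregation stage playing the role of the JOIN.
LOOKUP_PIPELINE = [
    {"$lookup": {
        "from": "vehicles",         # joined collection
        "localField": "dni",        # field on the users documents
        "foreignField": "owner_dni",
        "as": "vehicles",
    }},
    {"$unwind": "$vehicles"},       # one output document per (user, vehicle) pair
    {"$project": {"name": 1, "vehicles.plate": 1, "_id": 0}},
]
```

The structural difference is the point of the benchmark: the relational engine plans the join, while MongoDB materializes the joined array per document before unwinding it.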

4. Key Implementation Details

  • Type Mapping: A custom dictionary was implemented to map Pandas data types to specific SQL types across different dialects (e.g., mapping int64 to BIGINT in PostgreSQL and INTEGER in SQLite, or float64 to REAL).
  • Index Management: Automatic creation of unique indexes and secondary indexes after data insertion to isolate the cost of index maintenance during writes.
  • DuckDB Persistence: Specific handling for DuckDB’s columnar format, involving table deletion and recreation for clean-state testing.
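The dialect-aware type map can be sketched as a nested dictionary; the specific mappings below are reasonable assumptions, not the project's table verbatim:

```python
# Pandas dtype (as a string) -> SQL column type, per dialect.
PANDAS_TO_SQL = {
    "postgresql": {"int64": "BIGINT", "float64": "DOUBLE PRECISION",
                   "object": "TEXT", "bool": "BOOLEAN",
                   "datetime64[ns]": "TIMESTAMP"},
    "sqlite":     {"int64": "INTEGER", "float64": "REAL",
                   "object": "TEXT", "bool": "INTEGER",
                   "datetime64[ns]": "TEXT"},
}

def sql_type(dialect: str, dtype) -> str:
    """Resolve a Pandas dtype to the dialect's column type, defaulting to TEXT."""
    return PANDAS_TO_SQL[dialect].get(str(dtype), "TEXT")
```

Keeping the map keyed on `str(dtype)` means the same lookup works whether the caller passes a dtype object or its name.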

Technologies used

Python
MongoDB
PostgreSQL
SQLite
DuckDB