Database Performance Analysis
Project Overview
This project focuses on the architectural evaluation of four distinct database management systems: MongoDB, PostgreSQL, SQLite, and DuckDB. The objective was to build a standardized benchmarking suite that measures performance across CRUD operations, specifically analyzing how data volume, batching strategies, and connection management impact hardware resource utilization (CPU and RAM).
1. High-Fidelity Synthetic Data Generation
The benchmarking process began with the development of a robust data generation engine designed to create realistic datasets for two primary entities: Users and Vehicles. To ensure a high degree of realism, a custom provider system was built on top of the Faker library.
Specialized Data Providers
- Identity & Location: Implementation of the DNI (Spanish National ID) checksum algorithm to generate valid identifiers. Location data was curated from a real dataset of Spanish postal codes, linking cities to their specific provinces and generating landline numbers with correct geographic prefixes.
- Vehicle Logistics: A custom manufacturer-model hierarchy was developed, alongside two different Spanish license plate formats (the current “0000 AAA” and the historical province-based system). The system also calculated the “registration period” based on realistic plate age data.
- Normalization Tools: Integrated `unidecode` to generate valid email addresses by stripping accents/special characters from names, following common corporate patterns.
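The DNI generation mentioned above relies on the official modulo-23 control letter. A minimal sketch of that checksum (function names here are illustrative, not the project's actual code):

```python
import random

# Official DNI control-letter table: letter = DNI_LETTERS[number % 23]
DNI_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"

def dni_checksum(number: int) -> str:
    """Return the control letter for an 8-digit DNI number."""
    return DNI_LETTERS[number % 23]

def generate_dni(rng: random.Random) -> str:
    """Generate a random, checksum-valid Spanish DNI, e.g. '12345678Z'."""
    number = rng.randint(0, 99_999_999)
    return f"{number:08d}{dni_checksum(number)}"
```

Because the letter is a pure function of the number, validity can be verified independently of how the identifier was generated.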
Multi-threaded Performance
To scale the dataset to millions of records, the generator utilized Python’s multiprocessing module.
- Collision Avoidance: To ensure uniqueness without inter-thread communication overhead, DNI and License Plate intervals were segmented and assigned to specific processes.
- Output Optimization: Used `ujson` and direct `csv.DictWriter` streams with file locks to minimize serialization bottlenecks during the transition from memory to disk.
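The interval segmentation described above can be sketched as follows (`split_range` and `worker` are hypothetical names; the real generator's partitioning logic may differ):

```python
def split_range(total: int, workers: int) -> list[range]:
    """Partition [0, total) into disjoint contiguous intervals, one per worker,
    so each process draws identifiers from its own slice with no locking."""
    step = total // workers
    bounds = [i * step for i in range(workers)] + [total]  # last slice absorbs the remainder
    return [range(bounds[i], bounds[i + 1]) for i in range(workers)]

def worker(interval: range) -> list[str]:
    """Format DNI numbers only from the assigned interval (illustrative body)."""
    return [f"{n:08d}" for n in interval]

# In the generator, each interval would be handed to a separate process, e.g.:
#   with multiprocessing.Pool(4) as pool:
#       chunks = pool.map(worker, split_range(total, 4))
```

Since the intervals never overlap, uniqueness is guaranteed by construction rather than by inter-process coordination.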
2. Standardized Benchmarking Architecture
A significant challenge was creating a fair environment to compare disparate engines such as a document-oriented NoSQL store (MongoDB) and a columnar OLAP database (DuckDB). This was solved with an object-oriented design built around a shared abstract interface.
The Abstract DB Interface
A base class DB was implemented to define a uniform interface for all operations:
- Connection Management: Logic to toggle between
Keep-Alive(persistent connection) andClose-on-Op(opening/closing for every transaction) to measure the impact of connection overhead. - Batching Engine: Configurable batch sizes for
INSERToperations, allowing the study of how write-ahead logs (WAL) and memory buffers behave under different pressures. - Caching Layer: Integrated
pymemcacheas an external caching layer to compare native query performance against memory-object caching.
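Put together, the abstract interface might look like this sketch (method names and signatures are illustrative, not the project's actual code):

```python
from abc import ABC, abstractmethod

class DB(ABC):
    """Uniform benchmark interface; each engine manager subclasses this."""

    def __init__(self, keep_alive: bool = True, batch_size: int = 1000):
        self.keep_alive = keep_alive    # Keep-Alive vs Close-on-Op
        self.batch_size = batch_size    # rows per INSERT batch

    @abstractmethod
    def connect(self): ...

    @abstractmethod
    def close(self): ...

    @abstractmethod
    def insert_many(self, table: str, rows: list[dict]) -> None: ...

    @abstractmethod
    def find(self, table: str, field: str, value) -> list[dict]: ...

    def batches(self, rows: list[dict]):
        """Yield rows in batch_size chunks for the batching engine."""
        for i in range(0, len(rows), self.batch_size):
            yield rows[i:i + self.batch_size]
```

Keeping the chunking logic in the base class means every engine is measured against identical batch boundaries.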
Database Engine Managers
- PostgreSQL: Implemented via `SQLAlchemy` and `psycopg2` to manage remote server interactions.
- MongoDB: Utilized `pymongo` with bulk-write operations for high-concurrency simulation.
- SQLite/DuckDB: Leveraged native drivers for local file-based and in-memory performance comparisons.
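As a runnable illustration of the local-driver path, a minimal SQLite manager toggling the two connection modes might look like this (class and method names are assumptions, not the project's code):

```python
import sqlite3

class SQLiteManager:
    """Minimal local-file manager toggling Keep-Alive vs Close-on-Op."""

    def __init__(self, path: str = ":memory:", keep_alive: bool = True):
        self.path = path
        self.keep_alive = keep_alive
        # Keep-Alive: open once and reuse; Close-on-Op: open per call.
        self._conn = sqlite3.connect(path) if keep_alive else None

    def _connection(self):
        return self._conn if self.keep_alive else sqlite3.connect(self.path)

    def execute(self, sql: str, params: tuple = ()):
        conn = self._connection()
        try:
            cur = conn.execute(sql, params)
            conn.commit()
            return cur.fetchall()
        finally:
            if not self.keep_alive:
                conn.close()   # Close-on-Op: pay the connection cost every call
```

Note that with an in-memory database, `Close-on-Op` would discard state on every call, so that mode only makes sense against a file path.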
3. Technical Execution and Resource Monitoring
The project prioritized objective hardware measurement over simple execution time.
Performance Instrumentation
The system tracked three primary metrics per operation:
- Real Time (Latency): Calculated using `time.perf_counter` to capture the end-user experience.
- CPU Time (Throughput): Measured via `time.process_time` to evaluate effective computational work versus idle wait states (I/O).
- Memory Profiling: Integrated `memory_profiler` to measure the average and peak RAM consumption during massive inserts and full table scans.
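The first two metrics can be captured together with a simple decorator; this sketch (the `profile` name is hypothetical) omits the memory side, which `memory_profiler` handles separately:

```python
import time

def profile(fn):
    """Return (result, wall-clock seconds, CPU seconds) for each call."""
    def wrapper(*args, **kwargs):
        t0_real, t0_cpu = time.perf_counter(), time.process_time()
        result = fn(*args, **kwargs)
        real = time.perf_counter() - t0_real   # latency, including I/O waits
        cpu = time.process_time() - t0_cpu     # pure computational work
        return result, real, cpu
    return wrapper

@profile
def busy(n: int) -> int:
    return sum(range(n))
```

A large gap between the two timings indicates an I/O-bound operation, which is exactly the signal the benchmark uses to separate waiting from working.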
Operation Profiles
The benchmark executed four distinct test scenarios across datasets of increasing record counts:
- Bulk Insert: Comparing the overhead of relational schema enforcement vs. JSON document flexibility.
- Indexed Search: Evaluating B-tree efficiency in relational engines vs. MongoDB’s indexing performance on specific fields like DNI or VIN.
- Complex Joins: Measuring the performance of relational `JOIN` operations against MongoDB’s `$lookup` aggregation pipeline.
- Massive Updates: Analyzing in-place update behavior versus row/column rewriting.
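The join scenario pairs a relational query with its MongoDB aggregation counterpart; the two shapes being compared might look like this (table, collection, and field names are illustrative):

```python
# Relational side: a standard inner JOIN on the owner's DNI.
sql_join = """
SELECT u.name, v.plate
FROM users u
JOIN vehicles v ON v.owner_dni = u.dni
"""

# MongoDB side: the equivalent $lookup stage, with $unwind to
# flatten the joined array into one document per (user, vehicle) pair.
lookup_pipeline = [
    {"$lookup": {
        "from": "vehicles",
        "localField": "dni",
        "foreignField": "owner_dni",
        "as": "vehicles",
    }},
    {"$unwind": "$vehicles"},
]
```

Both produce one row/document per matching user-vehicle pair, which keeps the result cardinality comparable across engines.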
4. Key Implementation Details
- Type Mapping: A custom dictionary was implemented to map Pandas data types to specific SQL types across different dialects (e.g., mapping `int64` to `INTEGER` in Postgres and `REAL` in SQLite).
- Index Management: Automatic creation of unique and secondary indexes after data insertion to isolate the cost of index maintenance during writes.
- DuckDB Persistence: Specific handling for DuckDB’s columnar format, involving table deletion and recreation for clean-state testing.
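The dialect-aware type map could be sketched as follows (the concrete entries are illustrative, mirroring the `int64` → `INTEGER`/`REAL` mapping stated above; the project's full table is larger):

```python
# Pandas dtype -> SQL type, per dialect (illustrative subset).
TYPE_MAP = {
    "postgresql": {"int64": "INTEGER", "float64": "DOUBLE PRECISION",
                   "object": "TEXT", "bool": "BOOLEAN"},
    "sqlite":     {"int64": "REAL", "float64": "REAL",
                   "object": "TEXT", "bool": "INTEGER"},
}

def sql_type(dialect: str, pandas_dtype: str) -> str:
    """Resolve a Pandas dtype to the target dialect's SQL type, defaulting to TEXT."""
    return TYPE_MAP[dialect].get(pandas_dtype, "TEXT")
```

Centralizing the mapping in one dictionary keeps `CREATE TABLE` generation identical across dialects except for the type tokens themselves.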