Reinforcement Learning System
Project Overview
This project is a comprehensive reinforcement learning suite designed to solve Markov Decision Processes (MDPs), specifically the FrozenLake environment. It bridges the gap between classic tabular methods and modern deep learning approaches. The system allows for comparative analysis between on-policy (SARSA) and off-policy (Q-Learning, DQN) algorithms, featuring a highly optimized Deep Q-Network implementation for efficient training on CPU-constrained environments.
Technical Stack
- Core Logic: Python 3.10+
- Deep Learning: TensorFlow 2.x and Keras for neural network architecture
- Environment: Gymnasium (OpenAI Gym) for standardized RL benchmarks
- Mathematics: NumPy for vectorization and Q-table manipulation
- Visualization: Matplotlib for training dynamics and JSHTML for frame-by-frame episode animations
- Optimization: tf.function JIT compilation for neural network inference
Architecture Highlights
Ultra-Optimized Deep Q-Agent
The UltraOptimizedDQNAgent is engineered to minimize computational overhead while maintaining convergence speed. Key design choices include:
- State Mapping: Pre-calculates one-hot encoding for discrete states to avoid redundant computation during the training loop.
- Graph Compilation: Uses @tf.function to compile the prediction and target models, significantly reducing per-call Python dispatch overhead by executing a compiled graph.
- Training Throttle: Implements a train_every counter to perform Experience Replay only every N steps, balancing sample efficiency with wall-clock time.
- Early Stopping: Monitors the rolling average reward to terminate training once the environment is considered “solved,” preventing over-fitting and resource waste.
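The throttle and early-stopping logic described above can be sketched as follows (a minimal illustration; the class name, buffer size, and solved threshold are assumptions, with only `train_every` taken from the text):

```python
from collections import deque

class ThrottledTrainer:
    """Illustrative sketch: replay training every N steps, plus early stopping."""

    def __init__(self, train_every=4, solved_threshold=0.8, window=100):
        self.train_every = train_every          # train only every N environment steps
        self.solved_threshold = solved_threshold
        self.rewards = deque(maxlen=window)     # rolling reward window
        self.step_count = 0
        self.train_calls = 0

    def on_step(self):
        """Called once per environment step; trains only every `train_every` steps."""
        self.step_count += 1
        if self.step_count % self.train_every == 0:
            self.train_calls += 1  # stand-in for an experience-replay update

    def on_episode_end(self, episode_reward):
        """Record the reward; return True once the rolling mean says 'solved'."""
        self.rewards.append(episode_reward)
        avg = sum(self.rewards) / len(self.rewards)
        return len(self.rewards) == self.rewards.maxlen and avg >= self.solved_threshold
```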
# ...existing code...
def _compile_tf_functions(self):
    """
    Precompiles TensorFlow functions for faster inference.
    """
    # Wrap the online and target models in optimized TensorFlow graph functions
    self.predict_model = tf.function(lambda x: self.model(x, training=False))
    self.predict_target = tf.function(lambda x: self.target_model(x, training=False))
# ...existing code...
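The state-mapping optimization described above amounts to building every one-hot vector once, before the training loop; a minimal sketch (`n_states = 16` is an assumption matching the 4x4 FrozenLake grid, and `encode` is a hypothetical helper name):

```python
import numpy as np

n_states = 16  # 4x4 FrozenLake has 16 discrete states

# Pre-calculate every one-hot encoding once, before training starts
state_encodings = np.eye(n_states, dtype=np.float32)

def encode(state):
    """O(1) table lookup instead of rebuilding the one-hot vector each step."""
    return state_encodings[state]
```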
Comparative Benchmarking Framework
The system includes a robust experiment runner (run_experiments) that evaluates multiple algorithms simultaneously. It tracks:
- Reward Convergence: Moving averages of rewards per episode block.
- Temporal Difference (TD) Error: Average absolute error to monitor value function stability.
- Episode Length: Tracking efficiency in finding the shortest path to the goal.
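The reward-convergence metric above is a moving average over episode blocks; a minimal NumPy version (the block size of 100 is an illustrative choice, not taken from the source):

```python
import numpy as np

def moving_average(rewards, block=100):
    """Mean reward per consecutive block of episodes."""
    rewards = np.asarray(rewards, dtype=np.float64)
    n_blocks = len(rewards) // block
    # Drop the trailing partial block so every average covers exactly `block` episodes
    return rewards[: n_blocks * block].reshape(n_blocks, block).mean(axis=1)
```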
Tabular RL Implementations
The framework provides clean, modular implementations of fundamental RL algorithms:
- Q-Learning: Optimal for off-policy learning, updating the Q-value based on the maximum possible future reward.
- SARSA: An on-policy alternative that incorporates the agent’s actual exploration policy into the update rule, leading to safer convergence in stochastic environments.
# ...existing code...
# Update Q-table and state
Qtable[state, action] += learning_rate * (
    reward + gamma * np.max(Qtable[next_state]) - Qtable[state, action]
)
# ...existing code...
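For comparison, the SARSA update described above replaces the max over next-state actions with the action the agent actually selects; a sketch using the same variable names (the function wrapper and default hyperparameters are illustrative assumptions):

```python
import numpy as np

def sarsa_update(Qtable, state, action, reward, next_state, next_action,
                 learning_rate=0.1, gamma=0.99):
    """On-policy TD update: bootstraps from the action actually taken."""
    td_target = reward + gamma * Qtable[next_state, next_action]
    td_error = td_target - Qtable[state, action]
    Qtable[state, action] += learning_rate * td_error
    return td_error
```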
Design Philosophy
Performance over Complexity
Instead of using deep architectures (e.g., ResNet), the DQN uses a shallow 2-layer MLP (16-16 units). This choice acknowledges that for discrete environments like FrozenLake (4x4 or 8x8), model capacity is less critical than the frequency of weight updates and the stability of the target network.
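The shallow 16-16 MLP described above might look like this in Keras (a sketch, assuming a one-hot state input and the standard MSE loss; the function name and optimizer choice are illustrative):

```python
import tensorflow as tf

def build_q_network(n_states=16, n_actions=4):
    """Shallow 2-layer MLP: capacity matters less than update frequency here."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(n_states,)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(n_actions, activation="linear"),  # raw Q-values
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```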
Visual Debugging
The inclusion of an animation system allows for qualitative evaluation. By generating jshtml animations of a “greedy” agent, the developer can verify if the agent is stuck in local minima (like falling into the same hole) or if it has successfully learned the optimal path.
Stability Mechanisms
To handle the “moving target” problem in DQN, the system employs a decoupled target_model. Its weights are synchronized periodically (update_target_freq), which keeps the TD targets stable and prevents Q-value oscillation during training.
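The periodic hard-update schedule can be sketched as follows (the class is a hypothetical stand-in; only `update_target_freq` comes from the text, and `copy_weights` would wrap `target_model.set_weights(model.get_weights())` in a real agent):

```python
class TargetSync:
    """Sketch of the periodic hard-update schedule for a decoupled target net."""

    def __init__(self, update_target_freq=100):
        self.update_target_freq = update_target_freq
        self.step = 0
        self.syncs = 0

    def on_step(self, copy_weights):
        """Call once per training step; syncs target weights every N steps."""
        self.step += 1
        if self.step % self.update_target_freq == 0:
            copy_weights()  # e.g. target_model.set_weights(model.get_weights())
            self.syncs += 1
```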