Reinforcement Learning System
Project Overview
This project is a comprehensive reinforcement learning suite designed to solve Markov Decision Processes (MDPs), specifically the FrozenLake environment. It bridges the gap between classic tabular methods and modern deep learning approaches. The system allows for comparative analysis between on-policy (SARSA) and off-policy (Q-Learning, DQN) algorithms, featuring a highly optimized Deep Q-Network implementation for efficient training on CPU-constrained environments.
Technical Stack
- Core Logic: Python 3.10+
- Deep Learning: TensorFlow 2.x and Keras for neural network architecture
- Environment: Gymnasium (OpenAI Gym) for standardized RL benchmarks
- Mathematics: NumPy for vectorization and Q-table manipulation
- Visualization: Matplotlib for training dynamics and JSHTML for frame-by-frame episode animations
- Optimization: tf.function JIT compilation for neural network inference
Architecture Highlights
Ultra-Optimized Deep Q-Agent
The UltraOptimizedDQNAgent is engineered to minimize computational overhead while maintaining convergence speed. Key design choices include:
- State Mapping: Pre-calculates one-hot encoding for discrete states to avoid redundant computation during the training loop.
- Graph Compilation: Uses @tf.function to compile the prediction and target models, significantly reducing per-call Python dispatch overhead by executing a compiled graph.
- Training Throttle: Implements a train_every counter to perform Experience Replay only every N steps, balancing sample efficiency with wall-clock time.
- Early Stopping: Monitors the rolling average reward to terminate training once the environment is considered “solved,” preventing over-fitting and resource waste.
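The throttle and early-stopping logic described above can be sketched as follows (a minimal illustration; the class name, buffer size, and solved threshold are assumptions, with only `train_every` taken from the text):

```python
from collections import deque

class ThrottledTrainer:
    """Illustrative sketch: replay training every N steps, plus early stopping."""

    def __init__(self, train_every=4, solved_threshold=0.8, window=100):
        self.train_every = train_every          # train only every N environment steps
        self.solved_threshold = solved_threshold
        self.rewards = deque(maxlen=window)     # rolling reward window
        self.step_count = 0
        self.train_calls = 0

    def on_step(self):
        """Called once per environment step; trains only every `train_every` steps."""
        self.step_count += 1
        if self.step_count % self.train_every == 0:
            self.train_calls += 1  # stand-in for an experience-replay update

    def on_episode_end(self, episode_reward):
        """Record the reward; return True once the rolling mean says 'solved'."""
        self.rewards.append(episode_reward)
        avg = sum(self.rewards) / len(self.rewards)
        return len(self.rewards) == self.rewards.maxlen and avg >= self.solved_threshold
```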
# ...existing code...
def _compile_tf_functions(self):
    """
    Precompiles TensorFlow functions for faster inference.
    """
    # Wrap the online and target models in optimized TensorFlow graph functions
    self.predict_model = tf.function(lambda x: self.model(x, training=False))
    self.predict_target = tf.function(lambda x: self.target_model(x, training=False))
# ...existing code...
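The state-mapping optimization described above amounts to building every one-hot vector once, before the training loop; a minimal sketch (`n_states = 16` is an assumption matching the 4x4 FrozenLake grid, and `encode` is a hypothetical helper name):

```python
import numpy as np

n_states = 16  # 4x4 FrozenLake has 16 discrete states

# Pre-calculate every one-hot encoding once, before training starts
state_encodings = np.eye(n_states, dtype=np.float32)

def encode(state):
    """O(1) table lookup instead of rebuilding the one-hot vector each step."""
    return state_encodings[state]
```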
Comparative Benchmarking Framework
The system includes a robust experiment runner (run_experiments) that evaluates multiple algorithms simultaneously. It tracks:
- Reward Convergence: Moving averages of rewards per episode block.
- Temporal Difference (TD) Error: Average absolute error to monitor value function stability.
- Episode Length: Tracking efficiency in finding the shortest path to the goal.
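The reward-convergence metric above is a moving average over episode blocks; a minimal NumPy version (the block size of 100 is an illustrative choice, not taken from the source):

```python
import numpy as np

def moving_average(rewards, block=100):
    """Mean reward per consecutive block of episodes."""
    rewards = np.asarray(rewards, dtype=np.float64)
    n_blocks = len(rewards) // block
    # Drop the trailing partial block so every average covers exactly `block` episodes
    return rewards[: n_blocks * block].reshape(n_blocks, block).mean(axis=1)
```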
Tabular RL Implementations
The framework provides clean, modular implementations of fundamental RL algorithms:
- Q-Learning: Optimal for off-policy learning, updating the Q-value based on the maximum possible future reward.
- SARSA: An on-policy alternative that incorporates the agent’s actual exploration policy into the update rule, leading to safer convergence in stochastic environments.
# ...existing code...
# Update Q-table and state
Qtable[state, action] += learning_rate * (
    reward + gamma * np.max(Qtable[next_state]) - Qtable[state, action]
)
# ...existing code...
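For comparison, the SARSA update described above replaces the max over next-state actions with the action the agent actually selects; a sketch using the same variable names (the function wrapper and default hyperparameters are illustrative assumptions):

```python
import numpy as np

def sarsa_update(Qtable, state, action, reward, next_state, next_action,
                 learning_rate=0.1, gamma=0.99):
    """On-policy TD update: bootstraps from the action actually taken."""
    td_target = reward + gamma * Qtable[next_state, next_action]
    td_error = td_target - Qtable[state, action]
    Qtable[state, action] += learning_rate * td_error
    return td_error
```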
Design Philosophy
Performance over Complexity
Instead of using deep architectures (e.g., ResNet), the DQN uses a shallow 2-layer MLP (16-16 units). This choice acknowledges that for discrete environments like FrozenLake (4x4 or 8x8), model capacity is less critical than the frequency of weight updates and the stability of the target network.
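The shallow 16-16 MLP described above might look like this in Keras (a sketch, assuming a one-hot state input and the standard MSE loss; the function name and optimizer choice are illustrative):

```python
import tensorflow as tf

def build_q_network(n_states=16, n_actions=4):
    """Shallow 2-layer MLP: capacity matters less than update frequency here."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(n_states,)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(n_actions, activation="linear"),  # raw Q-values
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```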
Visual Debugging
The inclusion of an animation system allows for qualitative evaluation. By generating jshtml animations of a “greedy” agent, the developer can verify if the agent is stuck in local minima (like falling into the same hole) or if it has successfully learned the optimal path.
Stability Mechanisms
To handle the “moving target” problem in DQN, the system employs a decoupled target_model. Its weights are synchronized periodically (update_target_freq), which keeps the TD targets stable and prevents Q-value oscillation during training.
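The periodic hard-update schedule can be sketched as follows (the class is a hypothetical stand-in; only `update_target_freq` comes from the text, and `copy_weights` would wrap `target_model.set_weights(model.get_weights())` in a real agent):

```python
class TargetSync:
    """Sketch of the periodic hard-update schedule for a decoupled target net."""

    def __init__(self, update_target_freq=100):
        self.update_target_freq = update_target_freq
        self.step = 0
        self.syncs = 0

    def on_step(self, copy_weights):
        """Call once per training step; syncs target weights every N steps."""
        self.step += 1
        if self.step % self.update_target_freq == 0:
            copy_weights()  # e.g. target_model.set_weights(model.get_weights())
            self.syncs += 1
```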