MDP (Markov Decision Process)
Q-learning with a Q-table
The heuristic would be computed from parameters such as the Q-value, etc.
Binary rewards: success or failure only, without actually tracking the pancake, like being blind
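
A rough sketch of how the Q-table idea in these notes could look with a binary reward. Everything here is an assumption for illustration: the state labels, the number of actions, the helper names (`make_q_table`, `choose_action`, `update`), and the values of `ALPHA`, `GAMMA`, and `EPSILON` are all hypothetical. The update is the standard tabular Q-learning rule, not necessarily the exact scheme intended.

```python
import random
from collections import defaultdict

ALPHA = 0.1    # learning rate (assumed value)
GAMMA = 0.9    # discount factor (assumed value)
EPSILON = 0.1  # exploration rate (assumed value)


def make_q_table(n_actions):
    """Q-table mapping each state to a list of Q-values, one per action."""
    return defaultdict(lambda: [0.0] * n_actions)


def choose_action(q_table, state, n_actions, epsilon=EPSILON, rng=random):
    """Epsilon-greedy: explore with probability epsilon, else take the best known action."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    values = q_table[state]
    return max(range(n_actions), key=lambda a: values[a])


def update(q_table, state, action, reward, next_state, done,
           alpha=ALPHA, gamma=GAMMA):
    """Standard tabular Q-learning update; returns the new Q(state, action)."""
    # No future value is bootstrapped from a terminal transition.
    best_next = 0.0 if done else max(q_table[next_state])
    td_target = reward + gamma * best_next
    q_table[state][action] += alpha * (td_target - q_table[state][action])
    return q_table[state][action]


if __name__ == "__main__":
    q = make_q_table(n_actions=2)
    # Binary reward: 1.0 on success, 0.0 otherwise (hypothetical labels).
    update(q, "before_flip", 1, 1.0, "after_flip", done=True)
    print(q["before_flip"])
```

With a binary reward the agent gets no signal about where the pancake is during an episode, so the Q-values are the only thing that carries information back to earlier states.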