Definitions - learn from the rewards/punishments of actions

MDP (Markov Decision Process)

Q-learning - q-table
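
Sketch of what a q-table could look like in code (not from the notes, just the standard tabular update Q(s,a) += alpha * (r + gamma * max Q(s',a') - Q(s,a))). The names `make_q_table`, `choose_action`, `update` and the values of `ALPHA`, `GAMMA`, `EPSILON` are all assumptions picked for illustration.

```python
import random
from collections import defaultdict

# Hypothetical hyperparameters, chosen only for illustration.
ALPHA = 0.1    # learning rate
GAMMA = 0.9    # discount factor
EPSILON = 0.1  # exploration probability


def make_q_table(n_actions):
    """Q-table: maps each state to a list of Q-values, one per action."""
    return defaultdict(lambda: [0.0] * n_actions)


def choose_action(q_table, state, n_actions, rng):
    """Epsilon-greedy: mostly the best-known action, sometimes a random one."""
    if rng.random() < EPSILON:
        return rng.randrange(n_actions)
    values = q_table[state]
    return max(range(n_actions), key=values.__getitem__)


def update(q_table, state, action, reward, next_state, done):
    """One tabular Q-learning step for the transition (s, a, r, s')."""
    # No future value is bootstrapped from a terminal state.
    best_next = 0.0 if done else max(q_table[next_state])
    target = reward + GAMMA * best_next
    q_table[state][action] += ALPHA * (target - q_table[state][action])
```

Usage, roughly: `q = make_q_table(2)`, then in each step pick `a = choose_action(q, s, 2, random.Random())`, act, and call `update(q, s, a, r, s_next, done)`. States only need to be hashable (strings, tuples, etc).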

The heuristic would be computed from parameters such as the Q-value, etc.

binary rewards - reward only success or failure, without actually tracking the pancake (like being blind)