Pooling is a local summarization step, not a local pattern-learning step. Your prototype already teaches this well: the learner sees a small window move across an existing feature map, then sees a single summarized value written into a smaller output map. That visual contrast with convolution is the key lesson. Convolution learns a pattern by using trainable weights; pooling compresses an existing map by applying a fixed rule such as maximum or average.
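To make the "fixed rule" point concrete, here is a minimal pure-Python sketch of pooling over a small feature map. The function name `pool2d`, the 2x2 window, the stride of 2, and the sample values are illustrative assumptions, not details from your prototype. Note that the function holds no weights: nothing in it could be trained.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Summarize each size x size window with a fixed rule (no trainable weights)."""
    rows = len(feature_map)
    cols = len(feature_map[0])
    out = []
    # Slide the window across the map; only fully contained windows are used (no padding).
    for r in range(0, rows - size + 1, stride):
        out_row = []
        for c in range(0, cols - size + 1, stride):
            window = [
                feature_map[r + i][c + j]
                for i in range(size)
                for j in range(size)
            ]
            if mode == "max":
                out_row.append(max(window))                 # keep the strongest local signal
            elif mode == "avg":
                out_row.append(sum(window) / len(window))   # keep the local mean
            else:
                raise ValueError(f"unknown mode: {mode!r}")
        out.append(out_row)
    return out


# Hypothetical 4x4 feature map, as if produced by an earlier convolution layer.
fmap = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]

max_out = pool2d(fmap, mode="max")  # [[6, 2], [2, 9]]
avg_out = pool2d(fmap, mode="avg")  # [[3.5, 1.0], [1.0, 6.0]]
```

Both calls turn the 4x4 map into a 2x2 map; the only difference is the rule applied inside each window, which is the contrast with convolution that the visual is meant to convey.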
The output becomes spatially smaller, so later layers work with fewer positions, and each of those positions summarizes a larger region of the original input. In this prototype, the beginner should notice four things: the current window, the output cell it maps to, the compression ratio between the input and output maps, and the fact that max pooling preserves the strongest local signal. The supporting text should therefore emphasize spatial compression, local summary, and the absence of learned parameters.
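If the supporting text states the compression ratio or the window-to-cell mapping as numbers, the arithmetic could be sketched as follows. The helper names `pooled_size` and `window_for`, the 6x6 input, and the unpadded 2x2 window with stride 2 are assumptions for illustration; your prototype may use different settings.

```python
def pooled_size(n, size=2, stride=2):
    """Output length along one axis for unpadded pooling."""
    return (n - size) // stride + 1


def window_for(out_r, out_c, size=2, stride=2):
    """Input rows and columns covered by one output cell, as half-open ranges."""
    r0, c0 = out_r * stride, out_c * stride
    return (r0, r0 + size), (c0, c0 + size)


in_h, in_w = 6, 6                                     # hypothetical input feature map
out_h, out_w = pooled_size(in_h), pooled_size(in_w)   # 3, 3
ratio = (in_h * in_w) / (out_h * out_w)               # 4.0: four input cells per output cell

# Output cell (1, 2) summarizes input rows 2..3 and columns 4..5.
rows, cols = window_for(1, 2)                         # (2, 4), (4, 6)
```

With these settings every output cell stands in for four input cells, which is the figure a beginner would read as the compression ratio.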