When I was a child, my friends and I used to come up with challenges like eating a yoghurt hands-free or cookie speed-eating. We all had 30 seconds to think of a strategy and then the challenge started. Cookie crumbs flew left and right and the yoghurt rarely ended up where it was supposed to, landing instead in noses, eyes and even ears. The initial strategies we devised were rarely successful and so the real challenge was to constantly improve on them or invent even better ones. We did so by trial and error while attempting to achieve some success with the best strategy we’d come up with in the meantime. To take the cookie challenge as an example, given the size, number and dryness of the cookies, it was common to alternate between taking small bites in quick succession and gobbling them down whole in order to determine which strategy was most effective.
In its general form, this approach is used widely to solve any problem with an as yet unknown optimum solution, not just by children playing around. This trial-and-error approach, seeking to maximise the gains over a defined period of time, is a common learning paradigm. It plays a central role in the machine learning discipline of reinforcement learning, and is generally referred to as exploration-exploitation.
Let us consider the following example to illustrate the general problem more closely: you have been invited to participate in a game show and are faced with ten boxes filled with different amounts of money. The game consists of ten rounds. During each one, you must choose one of the boxes to discover and receive its previously unknown contents. The now empty box is filled with the same amount as before and you get to play again. Your goal is evidently to obtain as much money as possible. Intuitively, given the constraint of having only ten rounds, we would seek to choose the combination of boxes which would yield the largest amount of money, but the problem is that we still do not know the amounts in each box. The only way to explore the contents of a box is to actually choose it and discover its contents. How do you know if the chosen box is the one with the highest amount of money in it? Only by knowing the contents of the other boxes. Once you have determined the desired box, you can exploit your knowledge by repeatedly choosing that box.
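The ten-box game can be sketched in a few lines of Python. The box contents and the split between exploration and exploitation rounds below are hypothetical choices for illustration, not part of the original game:

```python
# Hypothetical, hidden box contents (the player does not know these).
boxes = [12, 95, 40, 7, 63, 28, 81, 5, 50, 33]

ROUNDS = 10
EXPLORE_ROUNDS = 4  # assumed split: explore first, then exploit

seen = {}   # box index -> observed contents
total = 0

for t in range(ROUNDS):
    if t < EXPLORE_ROUNDS:
        choice = t                        # exploration: open a new box
    else:
        choice = max(seen, key=seen.get)  # exploitation: best box so far
    gain = boxes[choice]                  # the box is refilled afterwards
    seen[choice] = gain
    total += gain

print(total)
```

Note the trade-off already at work here: exploring all ten boxes would guarantee finding the best one, but would leave no rounds in which to exploit it.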
Technically speaking, you want to maximise the total gains over a finite amount of time by choosing the right actions from a vast action pool. Every new action (e.g. choose box no. 7) results in a well-defined but previously unknown outcome, its gains (e.g. the amount of money in box no. 7). You want to choose actions so that the sum of their gains is maximised, and we say that an action performs well if it results in higher gains than the majority of the remaining actions. The exploration-exploitation paradigm is a strategy that attempts to maximise the total gains. It performs exploration steps which harvest information about the available actions, i.e. the action pool, and exploitation steps which seek to maximise the gains using the information acquired:
- Exploration is defined as choosing unknown actions and thus determining their gains. You discover how well, or how poorly, an action performs, and you become increasingly able to assess the average gains resulting from actions. This allows you to estimate whether an action performs well or poorly compared to the overall action pool without knowing the gains resulting from every single action.
- Exploitation is defined as choosing the action that, in your view, performs the best.
Deciding when to explore and when to exploit is not an easy task, and is an active subject of research in reinforcement learning. A fundamental trade-off underlies exploration and exploitation, sometimes termed the exploration-exploitation dilemma: the more you explore, the less you exploit, and vice versa. This follows from the fact that you can either explore or exploit, but cannot do both simultaneously. An entire theory exists on how best to approach this in different settings, but, essentially, it all comes down to intuition: if you’re lacking concrete information about the actions, you explore. Once you have enough information and think you have found the best, or at least a well-performing, action, you exploit.
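One common heuristic embodying this intuition is the epsilon-greedy strategy from reinforcement learning: explore with a small probability, exploit otherwise. The sketch below is a minimal illustration; the function name and parameters are chosen for this example:

```python
import random

def epsilon_greedy(avg_gains, n_actions, epsilon=0.1, rng=random):
    """Pick an action index: explore with probability epsilon, else exploit.

    avg_gains maps action index -> average gain observed so far.
    """
    unexplored = [a for a in range(n_actions) if a not in avg_gains]
    # Explore if nothing is known yet, or with probability epsilon
    # while unknown actions remain.
    if unexplored and (not avg_gains or rng.random() < epsilon):
        return rng.choice(unexplored)
    # Exploit: choose the action believed to perform best.
    return max(avg_gains, key=avg_gains.get)
```

With epsilon set to zero and every action already explored, the choice is purely greedy; with an empty record of gains, the only sensible move is to explore.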
You have received another invitation to participate in the game show, but this time the rules are slightly different: in each round you can either open a new box or choose to increase the amount of money in a box which is already open. If you decide on the latter, a random amount is added to the money in that box. At the end of the ten rounds, you receive the box with the highest amount of money in it (of course only opened boxes are considered). In order to maximise the amount of money you receive at the end of the game, you want to find the box with the highest-value contents as quickly as possible in order to have a maximum timespan left to increase the amount of money in that box.
Generally speaking, you are no longer interested in maximising the total gains, but instead in maximising the individual gains resulting from your actions. This means that actions no longer have fixed gains; instead, you are able to randomly improve their performance. In addition to the gains being unknown a priori, you also do not know beforehand how large the improvements will be. Therefore, let us define exploration-optimisation as a variant of exploration-exploitation:
- Exploration is defined as choosing a new action and determining its gains.
- Optimisation is defined as choosing a known action and improving on it.
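Under assumed initial contents and top-up amounts, the second game might be simulated as follows; the explore/optimise split is again an arbitrary choice for illustration:

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical, hidden initial contents of the ten boxes.
contents = [30, 75, 10, 55, 90, 20, 65, 5, 45, 80]

ROUNDS = 10
EXPLORE_ROUNDS = 3  # assumed: open three boxes, then optimise
opened = {}         # box index -> current amount

for t in range(ROUNDS):
    if t < EXPLORE_ROUNDS:
        opened[t] = contents[t]             # exploration: open a new box
    else:
        best = max(opened, key=opened.get)
        opened[best] += rng.randint(1, 20)  # optimisation: top up the best box

payout = max(opened.values())  # you receive the fullest opened box
```

Opening more boxes raises the chance of finding the 90 box, but leaves fewer rounds in which to grow its contents: the same trade-off in a different guise.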
Exploration-optimisation comes with the same trade-off as exploration-exploitation, since one cannot simultaneously explore and improve. The following section investigates the structure of a system which relies on an exploration and optimisation approach in order to achieve its goals. After describing the system, we will relate it back to the principle of exploration-optimisation.
Taken to Practice
In my last post on how programmers and learners can mutually benefit from open source code documentation, I introduced an approach describing how to use beneficial synergies to improve the creation of open source code documentation and the learning experience of online coding tutorials. Open source code is integrated into online coding tutorials as reading exercises so that the tutorial’s users learn to read and understand code written by other people. In return, the learner’s task is to provide documentation relating to the piece of code they have been working on, which can subsequently be integrated into the open source projects to enhance usability and user-friendliness.
Specifically, the intermediary system facilitating this beneficial synergy aims to create a set of different documentation versions for a given piece of software in order to increase the odds of producing accurate and understandable documentation. After obtaining several different versions of documentation for the respective software, we must determine which of these is best. However, it would be unrealistic to assume that these initial documentation versions are already as accurate and comprehensible as we would like them to be, so we seek to improve them further. It follows that we aim to improve existing documentation, simultaneously determining how well it documents the respective code.
The main goals are as follows:
- Creation of different versions of documentation for the given code
- Systematic improvement on the documentation available
- Rating the various documentation versions to determine the highest-performing solution
The intermediary system consists of two distinct parts, each containing mechanisms designed to achieve the above-mentioned goals. The first part is a solution generator responsible for the documentation’s initial creation. The second part, a feedback loop, is responsible for rating and improving on the different versions of the documentation.
Furthermore, we will assume that the Input step has already occurred, consisting of open source code being uploaded to the intermediary platform and tagged appropriately (e.g. programming language, dependencies, level of difficulty, etc.). Later on, the tags are used by the coding tutorials to navigate and filter the available code. Every time a piece of code is requested by a coding tutorial, the Distribution step takes place and the intermediary system has to decide whether to launch the solution generator or the feedback loop.
After this brief introduction to the mechanisms used by our intermediary system, let us take a step back in order to relate the abstract concept of exploration-optimisation to the concrete steps of the solution generator and the feedback loop. By defining our set of different documentation versions as the action pool and the gains of an action as its performance score (describing how well that documentation fits the code), we can, in formal terms, describe our search for effective documentation as an exploration-optimisation problem:
- Exploration is implemented by the solution generator, and constitutes the creation of new solutions. This is required as an initialisation process and after poorly-performing solutions have been discarded.
- Optimisation is implemented by the feedback loop and primarily constitutes the rating of documentation versions to determine their gains, as well as the improvement of said documentation. Part of the optimisation process is the decision to discard poor solutions in order to allow for further exploration.
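As a rough sketch, the decision made in the Distribution step could look like the following. The threshold values and function names are assumptions made for illustration, not part of the system described above:

```python
MIN_VERSIONS = 3     # assumed: keep exploring until this many versions exist
DISCARD_BELOW = 0.2  # assumed: scores below this mark a poor solution

def distribute(doc_versions):
    """Decide which part of the intermediary system to launch.

    doc_versions maps a documentation version id to its performance
    score, or to None if the version has not been rated yet.
    """
    # Discard clearly poor solutions to make room for further exploration.
    for vid, score in list(doc_versions.items()):
        if score is not None and score < DISCARD_BELOW:
            del doc_versions[vid]
    if len(doc_versions) < MIN_VERSIONS:
        return "solution_generator"  # exploration: create a new version
    return "feedback_loop"           # optimisation: rate and improve
```

Every incoming request thus either grows the action pool or refines it, mirroring the explore/optimise split described above.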
In theory, exploration-optimisation can be used to systematically improve on any kind of user-generated solution by defining the action pool appropriately, provided that mechanisms to determine the gains of your actions exist, as well as the means to alter these mechanisms accordingly. Thus, exploration-optimisation provides a conceptual and abstract framework for building systems which aim to determine optimum solutions for a given problem in an iterative manner. The intermediary system I have described in this article is a concrete example of how to transition from the conceptual framework to a real system. I believe its mechanisms and functionalities can be extrapolated and applied to a vast range of similar problems. I look forward to seeing the solutions and approaches involving exploration-optimisation developed by others in future.