**Introduction**

In this article we give a general overview of sequential prediction, with particular attention to adversarial models. The aim is to lay out the theoretical foundations of the problem and to discuss real-life applications.

As concrete examples consider the following three real-life situations:

1. Every Sunday you can choose either to go on a trip to the Costa Brava or to study in the library. To go on the trip you need to buy tickets the day before, and you want to be sure that you will enjoy a sunny day. Basing your decision on a single weather forecast might not be a good idea, so you consider multiple forecasts. How to take the “best” possible action in the presence of expert advice is an issue that we address in this article.

2. After getting too much rain at the beach and studying on too many sunny days, you give up on the trips, but to compensate you decide to go out for dinner every Saturday. Your objective now is to pick the best possible restaurant based on your preferences (or, more likely, on your girlfriend’s preferences), but you know very few restaurants in Barcelona. We will address this issue by studying a strategy for the so-called multi-armed bandit problem.

3. Every month some friends visit you. Once they arrive you always have to recommend a restaurant where they can go on their own while you are studying in the library. After they have been to the restaurant you ask for feedback, but you always receive partial or even no feedback from your lazy friends. This problem can be modeled as a partial monitoring multi-armed bandit problem, which we describe in the last section.

For completeness, we provide proofs of our statements whenever doing so does not make the discussion too heavy. The reader may skip the proofs if not interested.

**Sequential prediction: General Set Up**

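Before entering into the discussion we provide a general set up. Let $\mathcal{Y}$ be the outcome space. At each time $t = 1, 2, \dots$ the forecaster chooses an action $\hat p_t \in \mathcal{D}$, where $\mathcal{D}$ is the decision space, and she suffers a loss $\ell(\hat p_t, y_t)$, where $y_t \in \mathcal{Y}$ is the outcome at time $t$. To be concrete, in Example 1 the loss might be 1 if you either went to the beach on a rainy day or studied on a sunny day, and 0 otherwise.

In an oblivious game the environment chooses the outcome regardless of the strategy of the forecaster. We define $\hat L_n = \sum_{t=1}^{n} \ell(\hat p_t, y_t)$ as the cumulative loss of the forecaster. Without loss of generality we can impose that $\ell \in [0,1]$.

The regret at time $t$ for choosing action $\hat p_t$ instead of action $i$, with $i \in \{1, \dots, M\}$ indexing the available actions, is defined as $r_{i,t} = \ell(\hat p_t, y_t) - \ell(i, y_t)$. A natural objective function is the average maximum regret faced by the forecaster, defined as:

$$\frac{1}{n} \max_{i = 1, \dots, M} \sum_{t=1}^{n} r_{i,t} = \frac{1}{n}\Big(\hat L_n - \min_{i = 1, \dots, M} L_{i,n}\Big), \qquad L_{i,n} = \sum_{t=1}^{n} \ell(i, y_t).$$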

Considering again Example 1, if you went to the beach on a rainy day, but every expert had forecast sun so that following any of them would have led to the same choice, your regret would simply be 0.

Forecasting strategies that guarantee that the average regret goes to zero almost surely, for all possible strategies of the environment, are called *Hannan consistent*.
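
As a minimal sketch of these definitions, the following Python function computes the average regret of a forecaster against the best fixed action in hindsight. The names `average_regret`, `forecaster_losses` and `action_losses`, as well as the toy data, are illustrative choices and not part of the original discussion.

```python
def average_regret(forecaster_losses, action_losses):
    """Average regret of the forecaster against the best fixed action.

    forecaster_losses: list with the loss suffered by the forecaster per round.
    action_losses: list of rounds, each a list with the loss of every action.
    """
    n = len(forecaster_losses)
    num_actions = len(action_losses[0])
    cumulative_forecaster = sum(forecaster_losses)
    # Cumulative loss of each fixed action over the n rounds.
    cumulative_actions = [
        sum(round_losses[i] for round_losses in action_losses)
        for i in range(num_actions)
    ]
    return (cumulative_forecaster - min(cumulative_actions)) / n


# Four Sundays, two actions (0 = beach, 1 = library), 0-1 losses.
action_losses = [[1, 0], [0, 1], [0, 1], [0, 1]]
forecaster_losses = [1, 0, 1, 0]  # hypothetical losses of the forecaster
regret = average_regret(forecaster_losses, action_losses)
```

Here the best fixed action is the beach, with cumulative loss 1, while the forecaster accumulates a loss of 2 over four rounds.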

**Prediction with expert advice**

We start our discussion by considering regret minimization for prediction with expert advice [3] under a convex loss function. With an abuse of notation, we take the decision space $\mathcal{D}$ to be convex and denote by $f_{i,t} \in \mathcal{D}$ the prediction of expert $i \in \{1, \dots, M\}$ at time $t$.

*Prediction protocol with convex loss function*

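For each round $t = 1, \dots, n$

1. the environment chooses the next outcome $y_t \in \mathcal{Y}$ without revealing it;

2. each expert $i$ reveals its prediction $f_{i,t}$ to the forecaster;

3. the forecaster chooses $\hat p_t \in \mathcal{D}$;

4. the environment reveals $y_t$;

5. the forecaster suffers the loss $\ell(\hat p_t, y_t)$ and each expert $i$ suffers a loss $\ell(f_{i,t}, y_t)$.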

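Consider the following strategy, known as the exponentially weighted average forecaster:

$$\hat p_t = \frac{\sum_{i=1}^{M} w_{i,t-1}\, f_{i,t}}{\sum_{j=1}^{M} w_{j,t-1}}$$

with

$w_{i,t-1} = e^{-\eta L_{i,t-1}}$, $L_{i,t-1} = \sum_{s=1}^{t-1} \ell(f_{i,s}, y_s)$, $\eta > 0$,

where $f_{i,t}$ is the prediction of expert $i$ at time $t$ and $M$ is the number of experts.

Then, for a suitable choice of $\eta$, the average regret of this strategy goes to zero at a rate proportional to $\sqrt{\ln(M)/n}$.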

This result shows that by assigning exponential weights to each expert you end up with a strategy that is consistent at the optimal rate.

Notice that this result relies strongly on the assumption that the loss function is convex. In the next section we relax this assumption and show that similar results can be obtained.
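
The strategy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the toy data, the absolute loss and the tuning $\eta = \sqrt{8 \ln M / n}$ (the standard choice for this forecaster) are not taken from the original text.

```python
import math


def exp_weighted_forecast(expert_predictions, cumulative_losses, eta):
    """Weighted average of the expert predictions with exponential weights."""
    # Subtracting the smallest cumulative loss improves numerical stability
    # and leaves the normalized weights unchanged.
    base = min(cumulative_losses)
    weights = [math.exp(-eta * (loss - base)) for loss in cumulative_losses]
    total = sum(weights)
    return sum(w * f for w, f in zip(weights, expert_predictions)) / total


def run_forecaster(predictions, outcomes, loss, eta):
    """Play the protocol; return cumulative losses of forecaster and experts."""
    num_experts = len(predictions[0])
    expert_losses = [0.0] * num_experts
    forecaster_loss = 0.0
    for expert_preds, y in zip(predictions, outcomes):
        p_hat = exp_weighted_forecast(expert_preds, expert_losses, eta)
        forecaster_loss += loss(p_hat, y)
        for i, f in enumerate(expert_preds):
            expert_losses[i] += loss(f, y)
    return forecaster_loss, expert_losses


# Toy run: absolute loss (convex, in [0, 1]) and two constant experts.
n = 100
outcomes = [1.0 if t % 4 else 0.0 for t in range(n)]  # mostly "sunny"
predictions = [[0.0, 1.0] for _ in range(n)]
eta = math.sqrt(8 * math.log(2) / n)  # assumed standard tuning
forecaster_loss, expert_losses = run_forecaster(
    predictions, outcomes, lambda p, y: abs(p - y), eta
)
regret = forecaster_loss - min(expert_losses)
bound = math.sqrt(n / 2 * math.log(2))  # sqrt((n / 2) ln M) with M = 2
```

On this toy sequence the cumulative regret stays below `bound`, which is what the regret guarantee for convex bounded losses predicts for any outcome sequence.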

*Proof.*

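Let $W_t = \sum_{i=1}^{M} w_{i,t} = \sum_{i=1}^{M} e^{-\eta L_{i,t}}$ denote the sum of the weights after round $t$, where $L_{i,t} = \sum_{s=1}^{t} \ell(f_{i,s}, y_s)$ is the cumulative loss of expert $i$, so that $W_0 = M$. Consider

$$\ln\frac{W_n}{W_0} = \ln W_n - \ln W_0 = \ln\Big(\sum_{i=1}^{M} e^{-\eta L_{i,n}}\Big) - \ln M.$$

Then a lower bound is obtained by keeping only the largest term of the sum:

$$\ln\frac{W_n}{W_0} \ge \ln\Big(\max_{i} e^{-\eta L_{i,n}}\Big) - \ln M = -\eta \min_{i} L_{i,n} - \ln M.$$

Notice then that

$$\ln\frac{W_t}{W_{t-1}} = \ln \sum_{i=1}^{M} q_{i,t}\, e^{-\eta \ell(f_{i,t}, y_t)}, \quad \text{with } q_{i,t} = \frac{w_{i,t-1}}{W_{t-1}}.$$

Given that the loss is bounded between 0 and 1, by Hoeffding's inequality:

$$\ln\frac{W_t}{W_{t-1}} \le -\eta \sum_{i=1}^{M} q_{i,t}\, \ell(f_{i,t}, y_t) + \frac{\eta^2}{8}.$$

By Jensen’s inequality, since $\ell$ is convex in its first argument and $\hat p_t = \sum_{i} q_{i,t} f_{i,t}$,

$$\ell(\hat p_t, y_t) \le \sum_{i=1}^{M} q_{i,t}\, \ell(f_{i,t}, y_t).$$

Therefore, summing over $t = 1, \dots, n$, we get $\ln(W_n / W_0) \le -\eta \hat L_n + n\eta^2/8$, where $\hat L_n = \sum_{t=1}^{n} \ell(\hat p_t, y_t)$. Combining this with the lower bound and rearranging, we get

$$\hat L_n - \min_{i} L_{i,n} \le \frac{\ln M}{\eta} + \frac{n\eta}{8}.$$

By the first-order condition on the upper bound we choose $\eta = \sqrt{8 \ln M / n}$, and by substituting this value we get:

$$\hat L_n - \min_{i} L_{i,n} \le \sqrt{\frac{n}{2} \ln M}.$$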

Sharper bounds can be obtained by imposing some structure on the behavior of the loss function.
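
The tuning step of the proof can be checked numerically. The sketch below assumes that the upper bound has the standard form $\ln(M)/\eta + n\eta/8$ for this forecaster; the function name `regret_bound` and the values of `n` and `num_experts` are arbitrary illustrative choices.

```python
import math


def regret_bound(n, num_experts, eta):
    """Upper bound ln(M) / eta + n * eta / 8 on the cumulative regret."""
    return math.log(num_experts) / eta + n * eta / 8


n, num_experts = 1000, 10
# Minimizer given by the first-order condition (assumed form of the bound).
eta_star = math.sqrt(8 * math.log(num_experts) / n)
best = regret_bound(n, num_experts, eta_star)
closed_form = math.sqrt(n / 2 * math.log(num_experts))
# Halving or doubling eta makes the bound strictly worse.
worse = [regret_bound(n, num_experts, eta_star * c) for c in (0.5, 2.0)]
```

At `eta_star` the two terms of the bound are equal, which is why the optimized value collapses to the closed form $\sqrt{(n/2)\ln M}$.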

**New Prediction Protocol**

For each round $t = 1, \dots, n$

1. the environment chooses the next outcome $y_t$ without revealing it;

2. each expert reveals its prediction to the forecaster;

3. the forecaster chooses a probability vector $\mathbf{p}_t = (p_{1,t}, \dots, p_{M,t})$ over the set of $M$ actions and draws an action $I_t$ with $\mathbb{P}(I_t = i) = p_{i,t}$;

4. the environment reveals $y_t$;

5. the forecaster suffers the loss $\ell(I_t, y_t)$ and each expert $i$ suffers a loss $\ell(i, y_t)$.

In this case, when $\mathbf{p}_t$ is given by the exponential weights $p_{i,t} \propto e^{-\eta L_{i,t-1}}$ with $\eta = \sqrt{8 \ln M / n}$, the average regret is bounded by $\sqrt{\frac{\ln M}{2n}} + \sqrt{\frac{1}{2n} \ln\frac{1}{\delta}}$ with probability at least $1 - \delta$.

Again the strategy is consistent at the optimal rate. The importance of this result lies in the fact that we now have a consistent strategy for deciding which weather forecast to trust, for any bounded loss we have in mind! What follows is a technical proof that the reader may prefer to skip if not interested in the details of the theorem.
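
A randomized forecaster of this kind can be sketched as follows. The sketch assumes full information (the loss of every action is revealed after each round) and exponential weights on the cumulative losses; the function names, the seed and the toy losses are illustrative assumptions.

```python
import math
import random


def exp_weights_distribution(cumulative_losses, eta):
    """Probability vector proportional to exp(-eta * cumulative loss)."""
    base = min(cumulative_losses)  # shift for numerical stability
    weights = [math.exp(-eta * (loss - base)) for loss in cumulative_losses]
    total = sum(weights)
    return [w / total for w in weights]


def randomized_forecaster(loss_rows, eta, rng):
    """loss_rows[t][i] is the loss of action i at round t (full information)."""
    num_actions = len(loss_rows[0])
    cumulative = [0.0] * num_actions
    forecaster_loss = 0.0
    for losses in loss_rows:
        p = exp_weights_distribution(cumulative, eta)
        # Draw the action I_t at random according to p.
        action = rng.choices(range(num_actions), weights=p)[0]
        forecaster_loss += losses[action]
        # The loss of every action is revealed, so all totals are updated.
        cumulative = [c + l for c, l in zip(cumulative, losses)]
    return forecaster_loss, cumulative


rng = random.Random(0)  # fixed seed, only for reproducibility
n = 200
# 0-1 losses: no convexity is needed in the randomized setting.
loss_rows = [[1.0, 0.0] if t % 3 else [0.0, 1.0] for t in range(n)]
eta = math.sqrt(8 * math.log(2) / n)  # assumed standard tuning
forecaster_loss, cumulative = randomized_forecaster(loss_rows, eta, rng)
```

Because the action is drawn at random, the regret guarantee holds only with high probability, so a single run may deviate from the expected behavior.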

*Proof.*

To prove the result we first make use of the following lemma:

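*Lemma 1.* Let $X_1, \dots, X_n$ be random variables such that $\mathbb{E}[X_t \mid \mathcal{F}_{t-1}] = 0$ and $X_t \in [A_t, A_t + 1]$ for some $\mathcal{F}_{t-1}$-measurable $A_t$, where $\mathcal{F}_{t-1}$ is the filtration at time $t-1$. Then, by the Hoeffding-Azuma inequality, for every $\varepsilon > 0$,

$$\mathbb{P}\Big(\sum_{t=1}^{n} X_t > \varepsilon\Big) \le e^{-2\varepsilon^2/n}.$$

*Proof of Lemma 1.* By the Chernoff bound,

$$\mathbb{P}\Big(\sum_{t=1}^{n} X_t > \varepsilon\Big) \le e^{-s\varepsilon}\, \mathbb{E}\Big[e^{s \sum_{t=1}^{n} X_t}\Big]$$

for any $s > 0$. By using the law of iterated expectations and Hoeffding's inequality,

$$\mathbb{E}\Big[e^{s \sum_{t=1}^{n} X_t}\Big] = \mathbb{E}\Big[e^{s \sum_{t=1}^{n-1} X_t}\, \mathbb{E}\big[e^{s X_n} \mid \mathcal{F}_{n-1}\big]\Big] \le e^{s^2/8}\, \mathbb{E}\Big[e^{s \sum_{t=1}^{n-1} X_t}\Big].$$

The argument can be repeated $n$ times to get $\mathbb{E}\big[e^{s \sum_{t=1}^{n} X_t}\big] \le e^{n s^2/8}$. Using this result and minimizing over $s$ gives the bound stated in the lemma.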
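
For readers who want the last step spelled out, here is a sketch of the optimization over the free parameter of the Chernoff bound. The symbols $s$, $\varepsilon$ and $X_t$ are notation chosen here, and the sketch assumes martingale increments that each lie in an interval of length one.

```latex
\begin{aligned}
\mathbb{P}\Big(\sum_{t=1}^{n} X_t > \varepsilon\Big)
  &\le \inf_{s>0} \exp\Big(-s\varepsilon + \frac{n s^2}{8}\Big), \\
\frac{d}{ds}\Big(-s\varepsilon + \frac{n s^2}{8}\Big) = 0
  &\iff s^\ast = \frac{4\varepsilon}{n}, \\
-s^\ast\varepsilon + \frac{n (s^\ast)^2}{8}
  &= -\frac{4\varepsilon^2}{n} + \frac{2\varepsilon^2}{n}
   = -\frac{2\varepsilon^2}{n}.
\end{aligned}
```

Setting the resulting bound $e^{-2\varepsilon^2/n}$ equal to $\delta$ gives $\varepsilon = \sqrt{(n/2)\ln(1/\delta)}$, which is the form in which the inequality is typically used.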

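We can now move to the actual proof. Define $\bar\ell(\mathbf{p}_t, y_t) = \sum_{i=1}^{M} p_{i,t}\, \ell(i, y_t) = \mathbb{E}[\ell(I_t, y_t) \mid \mathcal{F}_{t-1}]$, where $\mathcal{F}_{t-1}$ is the filtration at time $t-1$ and $p_{i,t}$ is the probability of drawing action $i$ at time $t$. Furthermore, notice that $\ell(I_t, y_t) - \bar\ell(\mathbf{p}_t, y_t)$ is a martingale difference sequence and has conditional expectation $0$. Using Hoeffding-Azuma,

$$\sum_{t=1}^{n} \ell(I_t, y_t) \le \sum_{t=1}^{n} \bar\ell(\mathbf{p}_t, y_t) + \sqrt{\frac{n}{2} \ln\frac{1}{\delta}}$$

with probability at least $1 - \delta$; therefore the losses are concentrated around their expectation. Notice now that $\bar\ell(\mathbf{p}, y)$ is convex (linear in this case) in the first variable. Therefore, by the result for convex losses applied to the exponentially weighted probabilities $\mathbf{p}_t$,

$$\sum_{t=1}^{n} \bar\ell(\mathbf{p}_t, y_t) - \min_{i} L_{i,n} \le \sqrt{\frac{n}{2} \ln M}, \qquad L_{i,n} = \sum_{t=1}^{n} \ell(i, y_t).$$

By adding and subtracting $\sum_{t=1}^{n} \bar\ell(\mathbf{p}_t, y_t)$:

$$\sum_{t=1}^{n} \ell(I_t, y_t) - \min_{i} L_{i,n} \le \sqrt{\frac{n}{2} \ln M} + \sqrt{\frac{n}{2} \ln\frac{1}{\delta}},$$

which concludes the proof.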

**Multi-Armed Bandit Problem**

We now move to the problem of choosing the best restaurant without knowing many restaurants and without the help of TripAdvisor!

Consider the following prediction protocol:

*Prediction Protocol: Multi-Armed Bandit Problem*

For each round $t = 1, \dots, n$

1. the environment chooses the next outcome $y_t$ without revealing it;

2. the forecaster chooses a probability vector $\mathbf{p}_t = (p_{1,t}, \dots, p_{M,t})$ over the set of $M$ actions and draws an action $I_t$;

3. the forecaster suffers the loss $\ell(I_t, y_t)$;

4. only $\ell(I_t, y_t)$ is revealed to the forecaster; the losses of all other actions remain unknown.

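The objective function of the forecaster remains the regret. Clearly the situation is much more challenging, since the forecaster no longer observes the loss incurred at each round by the actions she did not choose. We define the following unbiased estimator:

$$\tilde\ell(i, y_t) = \frac{\ell(i, y_t)\, \mathbb{1}_{\{I_t = i\}}}{p_{i,t}},$$

where $p_{i,t}$ is the probability of choosing action $i$ at time $t$ and $\mathbb{1}_{A}$ is the indicator variable equal to 1 if $A$ is true, 0 otherwise. Notice that

$$\mathbb{E}\big[\tilde\ell(i, y_t) \mid I_1, \dots, I_{t-1}\big] = p_{i,t}\, \frac{\ell(i, y_t)}{p_{i,t}} = \ell(i, y_t).$$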
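
The unbiasedness of an importance-weighted estimator of this kind can be verified exactly on a small example. The sketch below assumes the usual form of the estimator (observed loss divided by the probability of the drawn action, zero for the other actions); the probabilities and losses are made-up values, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def estimated_loss(i, drawn_action, loss, probs):
    """Importance-weighted estimate: loss / p_i if action i was drawn, else 0."""
    return loss[i] / probs[i] if drawn_action == i else Fraction(0)


def expected_estimate(i, loss, probs):
    """Exact expectation of the estimate over the random draw of the action."""
    return sum(
        probs[k] * estimated_loss(i, k, loss, probs) for k in range(len(probs))
    )


probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]  # illustrative p_t
loss = [Fraction(1, 4), Fraction(1), Fraction(0)]         # illustrative losses
expectations = [expected_estimate(i, loss, probs) for i in range(3)]
```

Each entry of `expectations` coincides with the true loss of the corresponding action, even though only one loss is observed per round.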

**A forecasting strategy in Multi-Armed Bandit Problem**

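We define the gain $g(i, y_t) = 1 - \ell(i, y_t)$ and the estimated unbiased gain $\tilde g(i, y_t) = g(i, y_t)\, \mathbb{1}_{\{I_t = i\}} / p_{i,t}$. Notice that $g(i, y_t) - \tilde g(i, y_t)$ is at most 1, a property used for a martingale-type bound. Choose positive parameters $\beta, \eta, \gamma \le 1$. Initialize $w_{i,0} = 1$ and $p_{i,1} = 1/M$ for $i = 1, \dots, M$.

For each round $t = 1, \dots, n$

1. Select an action $I_t$ according to the probability distribution $\mathbf{p}_t$;

2. calculate the estimated gain: $g'(i, y_t) = \tilde g(i, y_t) + \beta / p_{i,t}$;

3. update the weights $w_{i,t} = w_{i,t-1}\, e^{\eta g'(i, y_t)}$;

4. update the probabilities $p_{i,t+1}$ with exponential weights.

Note that by introducing the parameter $\beta$ we give up the unbiasedness of the estimate to guarantee that the estimated cumulative gains are, with large probability, not much smaller than the actual cumulative gains.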

Under the conditions of Theorem 6.10 in [2], the average regret is again of order $1/\sqrt{n}$ with high probability. Therefore, even without any clue about what the loss would have been at the restaurants you did not visit, the strategy is consistent at the optimal rate!
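
The four steps of the strategy can be sketched as follows. This is an illustration under assumptions, not the tuned algorithm of [2]: the probability update is assumed to mix the normalized exponential weights with the uniform distribution through an exploration parameter `gamma`, and the parameter values and toy gains are arbitrary.

```python
import math
import random


def bandit_forecaster(gain_rows, eta, beta, gamma, rng):
    """Run the strategy on gain_rows[t][i] = gain of action i at round t.

    Only the gain of the drawn action is used at each round.
    """
    num_actions = len(gain_rows[0])
    log_weights = [0.0] * num_actions          # w_{i,0} = 1
    probs = [1.0 / num_actions] * num_actions  # p_{i,1} = 1 / M
    total_gain = 0.0
    for gains in gain_rows:
        # 1. Draw an action from the current distribution.
        action = rng.choices(range(num_actions), weights=probs)[0]
        observed = gains[action]               # bandit feedback only
        total_gain += observed
        # 2. Biased gain estimates: importance weighting plus beta / p.
        estimates = [
            (observed if i == action else 0.0) / probs[i] + beta / probs[i]
            for i in range(num_actions)
        ]
        # 3. Multiplicative weight update, kept in log space for stability.
        log_weights = [lw + eta * g for lw, g in zip(log_weights, estimates)]
        # 4. Exponential weights mixed with the uniform distribution
        #    (assumed form of the probability update).
        top = max(log_weights)
        weights = [math.exp(lw - top) for lw in log_weights]
        total = sum(weights)
        probs = [(1 - gamma) * w / total + gamma / num_actions for w in weights]
    return total_gain, probs


rng = random.Random(1)  # fixed seed, only for reproducibility
gain_rows = [[0.2, 0.9, 0.4] for _ in range(500)]  # three "restaurants"
total_gain, probs = bandit_forecaster(
    gain_rows, eta=0.05, beta=0.01, gamma=0.1, rng=rng
)
```

The values of `eta`, `beta` and `gamma` above are chosen only for illustration; the regret guarantee requires the specific tuning given in the theorem.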

**Discussion**

We want to stress that the main ingredient for an optimal rate of convergence in probability is the exploration-exploitation trade-off. In fact, notice that the probabilities are updated as

$$p_{i,t+1} = (1 - \gamma)\, \frac{w_{i,t}}{\sum_{j=1}^{M} w_{j,t}} + \frac{\gamma}{M},$$

where $w_{i,t}$ is the weight of action $i$ after round $t$ and $\gamma$ is the exploration parameter. The first term, multiplied by $(1 - \gamma)$, contains information regarding the losses of the actions taken in the past. The second term instead lets the forecaster keep a non-zero probability of exploring new actions. In practice, the strategy gives you a guide for how often you should explore new restaurants and how often you should go back to good restaurants where you have already been.
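
The effect of the exploration term can be seen on a tiny example. The helper `mix_with_uniform` and the weights below are hypothetical; the sketch assumes that the probabilities are a mixture of the normalized weights and the uniform distribution.

```python
def mix_with_uniform(weights, gamma):
    """Return (1 - gamma) * normalized weights + gamma * uniform."""
    total = sum(weights)
    m = len(weights)
    return [(1 - gamma) * w / total + gamma / m for w in weights]


# One action dominates the weights, yet every action keeps probability
# at least gamma / M, so it is still explored from time to time.
probs = mix_with_uniform([1000.0, 1.0, 1.0, 0.0], gamma=0.2)
```

Even the action with zero weight is drawn with probability `gamma / M`, while the dominant action keeps most of the remaining mass.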

**Partial Monitoring Multi-Armed Bandit Problem**

As a further motivating example of a partial monitoring regret minimization problem consider the following dynamic pricing model.

A vendor sells a product to customers one by one. She can select a different price for each customer, but no bargaining is allowed and no further information can be exchanged between the buyer and the seller. Assume that the willingness to pay of each buyer is $y_t \in [0,1]$, the actual price offered by the seller is $x_t \in [0,1]$, and the loss incurred by the seller at time $t$ is

$$\ell(x_t, y_t) = (y_t - x_t)\, \mathbb{1}_{\{x_t \le y_t\}} + c\, \mathbb{1}_{\{x_t > y_t\}}$$

with $c \in [0,1]$. The seller can only observe whether or not the customer buys the product and has no clue about the empirical distribution of $y_t$. A natural question is whether there exists a randomized strategy for the seller that is Hannan consistent.
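
To make the pricing example concrete, here is a small sketch of the loss and of the feedback seen by the seller. It assumes the usual form of the loss for this example (the money left on the table when the sale happens, and a fixed constant `c` when it does not); the function names and the numbers are illustrative.

```python
def pricing_loss(price, willingness, c):
    """Seller's loss: money left on the table if the sale happens, c otherwise."""
    if price <= willingness:
        return willingness - price
    return c


def pricing_feedback(price, willingness):
    """The seller only sees whether the customer bought the product."""
    return price <= willingness


# (price, willingness to pay) pairs; c = 0.5 is an arbitrary constant.
examples = [(0.25, 0.75), (0.75, 0.75), (1.0, 0.75)]
losses = [pricing_loss(p, w, c=0.5) for p, w in examples]
sold = [pricing_feedback(p, w) for p, w in examples]
```

The first two prices produce the same feedback (a sale) but different losses, which is exactly why the seller cannot reconstruct her loss from what she observes.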

In a more general setting we define the following prediction protocol:

*Prediction Protocol: Partial Monitoring Multi-Armed Bandit Problem*

For each round $t = 1, \dots, n$

1. the environment chooses the next outcome $y_t$ without revealing it;

2. the forecaster chooses a probability vector $\mathbf{p}_t$ over the set of $M$ actions and draws an action $I_t$;

3. the forecaster suffers the loss $\ell(I_t, y_t)$;

4. only a feedback $h(I_t, y_t)$ is revealed to the forecaster.

The losses of the forecaster can be summarized in the loss matrix $\mathbf{L} = [\ell(i,j)]_{M \times N}$, where $N$ denotes the number of possible outcomes. With no loss of generality $\ell(i,j) \in [0,1]$. At every iteration the forecaster chooses an action $I_t$ and suffers a loss $\ell(I_t, y_t)$, but she only observes a feedback $h(I_t, y_t)$ determined by a given feedback function $h$ that assigns to each action/outcome pair an element of a finite set of signals. The values $h(i,j)$ are collected in the feedback matrix $\mathbf{H} = [h(i,j)]_{M \times N}$. Notice that the forecaster at time $t$ has access only to the information $h(I_1, y_1), \dots, h(I_{t-1}, y_{t-1})$. In [1] the following strategy was shown to be Hannan consistent at a sub-optimal rate of order $n^{-1/3}$.

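Assume that there exists a matrix $\mathbf{K} = [k(i,j)]_{M \times M}$ such that $\mathbf{L} = \mathbf{K}\mathbf{H}$, where $\mathbf{L}$ is the loss matrix and $\mathbf{H}$ the feedback matrix; that is, $\mathbf{H}$ and the stacked matrix $\begin{bmatrix} \mathbf{H} \\ \mathbf{L} \end{bmatrix}$ have the same rank. Define $\tilde\ell(i, y_t)$ as an unbiased estimator of the loss:

$$\tilde\ell(i, y_t) = \frac{k(i, I_t)\, h(I_t, y_t)}{p_{I_t,t}}$$

with $p_{I_t,t}$ being the probability of having chosen action $I_t$ at time $t$ and $\tilde L_{i,t} = \sum_{s=1}^{t} \tilde\ell(i, y_s)$. Initialize $\tilde L_{i,0} = 0$ for all $i$. For each round $t$

1. Let $\eta_t$ and $\gamma_t$ be the learning rate and the exploration rate at round $t$, tuned as in [1];

2. choose an action $I_t$ from the set of actions at random according to the distribution $\mathbf{p}_t$ defined by

$$p_{i,t} = (1 - \gamma_t)\, \frac{e^{-\eta_t \tilde L_{i,t-1}}}{\sum_{k=1}^{M} e^{-\eta_t \tilde L_{k,t-1}}} + \frac{\gamma_t}{M};$$

3. let $\tilde L_{i,t} = \tilde L_{i,t-1} + \tilde\ell(i, y_t)$ for all $i = 1, \dots, M$.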

In [1] the authors showed that under some mild conditions the cumulative regret of the strategy is bounded by a quantity proportional to $n^{2/3}$, that is, the average regret converges at a rate $n^{-1/3}$. Interestingly, in the scenario of a simple multi-armed bandit problem, where the feedback is the loss itself, the theorem still leads to an average regret of order $n^{-1/3}$, much slower than the result obtained in the previous section. Finding the class of problems for which this bound can be improved remains a challenging research question.
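
The role of the linear relation between the loss matrix and the feedback matrix can be illustrated on a toy game. Everything in the sketch below is hypothetical: the 2-by-2 matrices are chosen for illustration, and the estimator is assumed to have the form $k(i, I_t)\, h(I_t, y_t) / p_{I_t,t}$. The code checks exactly, using fractions, that the expected estimate equals the true loss.

```python
from fractions import Fraction

# Hypothetical 2-action, 2-outcome game (values chosen for illustration).
L = [[0, 1], [1, 0]]   # loss matrix, rows = actions, columns = outcomes
H = [[1, 0], [1, 1]]   # feedback matrix: action 0 is informative,
                       # action 1 always returns the same signal
K = [[-1, 1], [1, 0]]  # chosen so that L = K H


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def estimated_loss(i, drawn_action, outcome, probs):
    """k(i, I_t) * h(I_t, y_t) / p_{I_t, t}, using only the observed signal."""
    signal = H[drawn_action][outcome]
    return K[i][drawn_action] * signal / probs[drawn_action]


def expected_estimate(i, outcome, probs):
    """Exact expectation over the random draw of the action."""
    return sum(
        probs[a] * estimated_loss(i, a, outcome, probs) for a in range(len(probs))
    )


probs = [Fraction(1, 4), Fraction(3, 4)]  # illustrative distribution p_t
expectations = [
    [expected_estimate(i, y, probs) for y in range(2)] for i in range(2)
]
```

Note that an individual estimate can be negative or larger than 1 here, since some entries of `K` are negative; only its expectation matches the loss.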

**Discussion**

We have just seen that a consistent strategy exists even without full information on the loss incurred, as in the third example of the introductory section, although convergence is slower.