
Saturday, February 5, 2022

The rules of probability

At the simplest level, a model, be it for machine learning or a more classical method such as linear regression, is a mathematical description of how various kinds of data relate to one another.

In the task of modeling, we usually think about separating the variables of our dataset into two broad classes:

1. Independent data, which primarily means inputs to a model, are denoted by X. These could be categorical features (such as a "0" or "1" in six columns indicating which of six schools a student attends), continuous (such as the heights or test scores of the same students), or ordinal (the rank of a student in the class).

2. Dependent data, conversely, are the outputs of our models, and are denoted by Y. (Note that in some cases Y is a label that can be used to condition a generative output, such as in a conditional GAN.) As with the independent variables, these can be continuous, categorical, or ordinal, and they can be an individual element or a multidimensional matrix (tensor) for each element of the dataset.

So how can we describe the data in our model using statistics? In other words, how can we quantitatively describe what values we are likely to see, how frequently, and which values are more likely to appear together? One way is by asking the likelihood of observing a particular value in the data, or the probability of that value.

For example, if we were to ask what the probability is of observing a roll of 4 on a six-sided die, the answer is that, on average, we would observe a 4 once every six rolls. We write this as follows:

P(X=4) = 1/6 ≈ 16.67%

where P denotes "the probability of."
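
To make this concrete, here is a minimal Python sketch (the trial count of 100,000 is an arbitrary choice for illustration) that estimates this probability by simulating a fair die and confirms that the frequency of 4s approaches 1/6:

import random

# Rough simulation: estimate P(X = 4) for a fair six-sided die.
trials = 100_000
rolls = [random.randint(1, 6) for _ in range(trials)]
print(rolls.count(4) / trials)   # close to 1/6 ≈ 0.1667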

What defines the allowed probability values for a particular dataset? If we imagine the set of all possible values of a dataset, such as all the faces of a die, then a probability maps each value to a number between 0 and 1. The minimum is 0 because we can't have a negative chance of seeing a result; the most unlikely result is that we would never see a particular value at all, a 0% probability, such as rolling a 7 on a six-sided die. Similarly, we can't have greater than a 100% probability of observing a result, represented by the value 1; an outcome with probability 1 is absolutely certain. The set of values associated with a dataset can belong to discrete classes (such as the faces of a die) or an infinite set of potential values (such as variations in height or weight). In either case, however, these values have to follow certain rules, the Probability Axioms described by the mathematician Andrey Kolmogorov in 1933:

1. The probability of an observation (a die roll, a particular height, and so on) is a non-negative, finite number between 0 and 1.

2. The probability of at least one of the observations in the space of all possible observations occurring is 1.

3. The probability of the union of distinct, mutually exclusive events is the sum of the probabilities of the individual events.

While these rules might seem abstract, you will see in Chapter 3, Building Blocks of Deep Neural Networks, that they have direct relevance to developing neural network models. For example, an application of rule 1 is to generate a probability between 0 and 1 for a particular outcome in a softmax function for predicting target classes.

Rule 3 is used to normalize these outcomes into the range 0-1, under the guarantee that they are mutually exclusive predictions of a deep neural network (in other words, a real-world image logically can't be classified as both a dog and a cat, but rather a dog or a cat, with the probabilities of these two outcomes being additive). Finally, the second rule provides the theoretical guarantee that we can generate data at all using these models.
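
As a quick illustration of how these rules appear in practice, the sketch below (the logit values are made up for this example) applies a softmax to raw class scores and checks that each output lies between 0 and 1 and that the outputs sum to 1:

import math

def softmax(scores):
    # Exponentiate each score and normalize so the outputs sum to 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, -0.5]   # hypothetical raw scores for three classes
probs = softmax(logits)
print(probs)                # each value lies between 0 and 1 (rule 1)
print(sum(probs))           # the values sum to 1.0 (rules 2 and 3)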

However, in the context of machine learning and modeling, we are not usually interested in just the probability of observing a piece of input data, X; we instead want to know the conditional probability of an outcome, Y, given the data, X. In other words, we want to know how likely a label is for a set of data, based on that data. We write this as the probability of Y given X, or the probability of Y conditional on X:

P(Y|X)
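
As a small worked sketch of what this means numerically (the counts below are invented for illustration), we can estimate P(Y|X) from a table of joint counts by dividing the joint probability by the marginal probability of X:

# Invented counts of students by school (X) and exam outcome (Y).
counts = {
    ("A", "pass"): 30, ("A", "fail"): 10,
    ("B", "pass"): 20, ("B", "fail"): 40,
}
total = sum(counts.values())

def p_y_given_x(y, x):
    # P(Y=y | X=x) = P(X=x, Y=y) / P(X=x)
    joint = counts[(x, y)] / total
    marginal = sum(v for (xi, _), v in counts.items() if xi == x) / total
    return joint / marginal

print(p_y_given_x("pass", "A"))   # 30 / 40 = 0.75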

Another question we could ask about Y and X is how likely they are to occur together or their joint probability, which can be expressed using the preceding conditional probability expression as follows:

P(X,Y) = P(Y|X)P(X) = P(X|Y)P(Y)

This formula expresses the probability of X and Y. In the case of X and Y being completely independent of one another, this is simply their product:

P(X|Y)P(Y) = P(Y|X)P(X) = P(X)P(Y)
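
For instance, a quick sketch with two independent fair dice (enumerating all 36 equally likely outcomes) confirms that the joint probability of a particular pair of faces equals the product of the individual probabilities:

from fractions import Fraction

# Enumerate all 36 equally likely outcomes of rolling two fair dice, X and Y.
outcomes = [(x, y) for x in range(1, 7) for y in range(1, 7)]
n = len(outcomes)

p_x4 = Fraction(sum(1 for x, _ in outcomes if x == 4), n)            # P(X=4) = 1/6
p_y2 = Fraction(sum(1 for _, y in outcomes if y == 2), n)            # P(Y=2) = 1/6
p_joint = Fraction(sum(1 for x, y in outcomes if x == 4 and y == 2), n)

print(p_joint)          # 1/36
print(p_x4 * p_y2)      # 1/36, matching because the dice are independent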

You will see that these expressions become important in our discussion of complementary priors in Chapter 4, Teaching Networks to Generate Digits, and the ability of restricted Boltzmann machines to simulate independent data samples.

They are also important as building blocks of Bayes' theorem, which we will discuss next.





