
Friday, February 11, 2022

Discriminative and generative modeling and Bayes' theorem

 Now let's consider how these rules of conditional and joint probability relate to the kinds of predictive models that we build for various machine learning applications. In most cases, such as predicting whether an email is fraudulent or the dollar amount of the future lifetime value of a customer, we are interested in the conditional probability P(Y|X=x), where Y is the set of outcomes we are trying to model, X represents the input features, and x is a particular value of the input features. As discussed, this approach is known as discriminative modeling.

Discriminative modeling attempts to learn a direct mapping between the data, X, and the outcomes, Y.

Another way to understand discriminative modeling is in the context of Bayes' theorem, which relates the conditional and joint probabilities of a dataset:

P(Y|X) = P(X|Y)P(Y) / P(X) = P(X,Y) / P(X)

In Bayes' formula, the expression P(X|Y)/P(X) is known as the likelihood, or the supporting evidence that the observation X gives to the likelihood of observing Y.

P(Y) is the prior, or the plausibility of the outcome, and P(Y|X) is the posterior, or the probability of the outcome given all the independent data we have observed related to the outcome thus far. Conceptually, Bayes' theorem states that the probability of an outcome is the product of its baseline probability and the probability of the input data conditional on this outcome.
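As a concrete illustration, here is a minimal sketch in Python of how the posterior is computed from a prior, a likelihood, and the evidence. The spam-filtering numbers below are invented purely for illustration and are not from the text.

```python
# Hypothetical numbers, for illustration only: classifying an email as fraudulent (Y=1)
# given that it contains the word "transfer" (X=1).
p_y = 0.05          # prior P(Y=1): baseline rate of fraudulent email
p_x_given_y = 0.60  # likelihood P(X=1|Y=1): fraudulent emails containing the word
p_x = 0.08          # evidence P(X=1): fraction of all emails containing the word

# Bayes' theorem: P(Y|X) = P(X|Y) * P(Y) / P(X)
p_y_given_x = p_x_given_y * p_y / p_x
print(f"Posterior P(Y=1|X=1) = {p_y_given_x:.3f}")  # ~0.375
```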


The theorem was published two years after the author's death, and in a foreword Richard Price described it as a mathematical argument for the existence of God, which was perhaps appropriate given that Thomas Bayes served as a reverend during his life.


In the context of discriminative learning, we can thus see that a discriminative model directly computes the posterior; we could have a model of the likelihood or prior, but it is not required in this approach. Even though you may not have realized it, most of the models you have probably used in the machine learning toolkit are discriminative, such as the following:

- Linear regression

- Logistic regression

- Random forests

- Gradient-boosted decision trees (GBDT)

- Support vector machines (SVM)


The first two (linear and logistic regression) model the outcome, Y, conditional on the data, X, using a normal or Gaussian (linear regression) or sigmoidal (logistic regression) probability function. In contrast, the last three have no formal probability model; they compute a function (an ensemble of trees for random forests or GBDT, or an inner product for SVM) that maps X to Y, using a loss or error function to tune those estimates. Given this nonparametric nature, some authors have argued that these constitute a separate class of non-model discriminative algorithms.
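To make the discriminative view concrete, the following sketch fits a logistic regression that directly estimates the posterior P(Y|X) from labeled data; it assumes scikit-learn is available, and the synthetic data is purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration: one feature, binary outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# A discriminative model: learn P(Y|X) directly, with no model of P(X) or P(X|Y).
clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[0.5]]))  # estimated [P(Y=0|x), P(Y=1|x)] at x = 0.5
```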


In contrast, a generative model attempts to learn the joint distribution P(Y,X) of the labels and the input data. Recall that using the definition of joint probability:

P(X,Y) = P(X|Y)P(Y)

We can rewrite Bayes' theorem as follows:

P(Y|X) = P(X,Y) / P(X)

Instead of learning a direct mapping of X to Y using P(Y|X), as in the discriminative case, our goal is to model the joint probability of X and Y using P(X,Y). While we can use the resulting joint distribution of X and Y to compute the posterior, P(Y|X), and learn a targeted model, we can also use this distribution to sample new instances of the data, either by jointly sampling new tuples (x, y) or by sampling new data inputs for a target label, Y, using the following expression:

P(X|Y=y) = P(X,Y) / P(Y)
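The sketch below illustrates this idea with a simple class-conditional Gaussian model (a toy example of mine, not one from the text): we estimate P(Y) and P(X|Y) from data, which together specify the joint P(X,Y), and then draw new x values for a chosen label y.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labeled data: class 0 centered at -2, class 1 centered at +2.
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(2, 1, 700)])
y = np.concatenate([np.zeros(300, dtype=int), np.ones(700, dtype=int)])

# Estimate the pieces of the joint distribution P(X,Y) = P(X|Y)P(Y).
p_y = np.bincount(y) / len(y)                        # prior P(Y)
mu = np.array([x[y == k].mean() for k in (0, 1)])    # class-conditional means
sigma = np.array([x[y == k].std() for k in (0, 1)])  # class-conditional std devs

# Sample new inputs conditioned on a target label, i.e. draw from P(X|Y=1).
new_x = rng.normal(mu[1], sigma[1], size=5)
print(p_y, new_x)
```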

Examples of generative models include the following:

- Naive Bayes classifiers

- Gaussian mixture models

- Latent Dirichlet Allocation (LDA)

- Hidden Markov models

- Deep Boltzmann machines

- VAEs

- GANs

Naive Bayes classifiers, though used for the discriminative task of classification, utilize Bayes' theorem to learn the joint distribution of X and Y under the assumption that the X variables are conditionally independent of one another given Y. Similarly, Gaussian mixture models describe the likelihood of a data point belonging to one of a group of normal distributions using the joint probability of the label and these distributions.


LDA represents a document as the joint probability of a word and a set of underlying keyword lists (topics) that are used in a document. Hidden Markov models express the joint probability of a state and the next state of the data, such as the weather on successive days of the week. As you will see in Chapter 4, Teaching Networks to Generate Digits, deep Boltzmann machines learn the joint probability of a label and the data vector it is associated with. The VAE and GAN models we will cover in Chapters 5, 6, 7, and 11 also utilize joint distributions to map between complex data types. This mapping allows us to generate data from random vectors or transform one kind of data into another.


As already mentioned, another view of generative models is that they allow us to generate samples of X if we know an outcome, Y. In the first four models in the previous list, this conditional probability is just a component of the model formula, with the posterior estimates still being the ultimate objective. However, in the last three examples, which are all deep neural network models, learning the conditional distribution of X dependent upon a hidden, or latent, variable, Z, is actually the main objective, in order to generate new data samples. Using the rich structure allowed by multilayered neural networks, these models can approximate the distribution of complex data types such as images, natural language, and sound. Also, instead of being a target value, Z is often a random number in these applications, serving merely as an input from which to generate a large space of hypothetical data points. To the extent that we have a label (such as whether a generated image should be of a dog or a dolphin, or the genre of a generated song), the model is P(X|Y=y, Z=z), where the label Y controls the generation of data that is otherwise unrestricted by the random nature of Z.
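As a minimal, purely illustrative sketch of this idea (not a trained neural network), the "generator" below is just a fixed linear map: the label y picks a class-specific offset while the random latent vector z supplies the otherwise unconstrained variation. All names and values here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, data_dim = 4, 2

# Hypothetical fixed parameters standing in for a trained generator's weights.
W = rng.normal(size=(data_dim, latent_dim))
class_offsets = {"dog": np.array([5.0, 0.0]), "dolphin": np.array([-5.0, 0.0])}

def generate(y, z):
    """Toy stand-in for P(X|Y=y, Z=z): label y shifts the output, z adds random variation."""
    return class_offsets[y] + W @ z

z = rng.normal(size=latent_dim)   # random latent input
print(generate("dog", z))         # a hypothetical 'dog-like' sample
print(generate("dolphin", z))     # same z, different label
```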




Saturday, February 5, 2022

The rules of probability

 At the simplest level, a model, be it for machine learning or a more classical method such as linear regression, is a mathematical description of how various kinds of data relate to one another.

In the task of modeling, we usually think about separating the variables of our dataset into two broad classes:

1. Independent data, which primarily means inputs to a model, are denoted by X. These could be categorical features (such as a "0" or "1" in six columns indicating which of six schools a student attends), continuous (such as the heights or test scores of the same students), or ordinal (the rank of a student in the class).

2. Dependent data, conversely, are the outputs of our models, and are denoted by Y. (Note that in some cases Y is a label that can be used to condition a generative output, such as in a conditional GAN.) As with the independent variables, these can be continuous, categorical, or ordinal, and they can be an individual element or a multidimensional matrix (tensor) for each element of the dataset.

So how can we describe the data in our model using statistics? In other words, how can we quantitatively describe what values we are likely to see, and how frequently, and which values are more likely to appear together? One way is by asking the likelihood of observing a particular value in the data, or the probability of that value.

For example, if we were to ask what the probability is of observing a roll of 4 on a six-sided die, the answer is that, on average, we would observe a 4 once every six rolls. We write this as follows:

P(X=4) = 1/6 = 16.67%

where P denotes probability of.
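A quick simulation, assuming NumPy is available, shows the empirical frequency approaching this value:

```python
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)  # 100,000 rolls of a fair six-sided die
print((rolls == 4).mean())                # empirical P(X=4), close to 1/6 ≈ 0.1667
```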

What defines the allowed probability values for a particular dataset? If we imagine the set of all possible values of a dataset, such as all the faces of a die, then a probability maps each value to a number between 0 and 1. The minimum is 0 because we can't have a negative chance of seeing a result; the most unlikely result is that we would never see a particular value, or 0% probability, such as rolling a 7 on a six-sided die. Similarly, we can't have greater than 100% probability of observing a result, represented by the value 1; an outcome with probability 1 is absolutely certain. The values of a dataset may belong to discrete classes (such as the faces of a die) or an infinite set of potential values (such as variations in height or weight). In either case, however, the probabilities associated with them have to follow certain rules, the Probability Axioms described by the mathematician Andrey Kolmogorov in 1933:

1. The probability of an observation (a die roll, a particular height, and so on) is a non-negative, finite number between 0 and 1.

2. The probability of at least one of the observations in the space of all possible observations occurring is 1.

3. The probability of any one of a set of distinct, mutually exclusive events occurring is the sum of the probabilities of the individual events.

While these rules might seem abstract, you will see in Chapter 3, Building Blocks of Deep Neural Networks, that they have direct relevance to developing neural network models. For example, an application of rule 1 is to generate a probability between 0 and 1 for a particular outcome in a softmax function for predicting target classes.

Rule 3 is used to normalize these outcomes into the range 0-1, under the guarantee that they are mutually distinct predictions of a deep neural network (in other words, a real-world image logically can't be classified as both a dog and a cat, but rather a dog or a cat, with the probabilities of these two outcomes being additive). Finally, the second rule provides the theoretical guarantee that we can generate data at all using these models.
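As a small sketch of how a softmax respects these rules, the NumPy snippet below maps arbitrary network outputs (logits) to values that each lie in [0, 1] and sum to 1 across mutually exclusive classes:

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; the result is unchanged mathematically.
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

probs = softmax(np.array([2.0, -1.0, 0.5]))  # e.g. raw scores for dog/cat/bird
print(probs)        # each value lies between 0 and 1 (rule 1)
print(probs.sum())  # the mutually exclusive outcomes sum to 1.0 (rules 2 and 3)
```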

However, in the context of machine learning and modeling, we are not usually interested in just the probability of observing a piece of input data, X; we instead want to know the conditional probability of an outcome, Y, given the data, X. In other words, we want to know how likely a label is for a set of data, based on that data. We write this as the probability of Y given X, or the probability of Y conditional on X:

P(Y|X)

Another question we could ask about Y and X is how likely they are to occur together or their joint probability, which can be expressed using the preceding conditional probability expression as follows:

P(X,Y) = P(Y|X)P(X) = P(X|Y)P(Y)

This formula expresses the joint probability of X and Y. In the case of X and Y being completely independent of one another, this is simply their product:

P(X|Y)P(Y) = P(Y|X)P(X) = P(X)P(Y)
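These identities are easy to check numerically. The toy joint distribution below (numbers invented for illustration) verifies that P(X,Y) = P(Y|X)P(X), and shows that the joint table only equals the product of the marginals when independence holds:

```python
import numpy as np

# A toy joint distribution P(X, Y) over two binary variables (rows: X, columns: Y).
joint = np.array([[0.10, 0.30],
                  [0.20, 0.40]])

p_x = joint.sum(axis=1)             # marginal P(X)
p_y_given_x = joint / p_x[:, None]  # conditional P(Y|X)

# P(X,Y) = P(Y|X) P(X), recovered exactly from the factors.
print(np.allclose(p_y_given_x * p_x[:, None], joint))  # True

# If X and Y were independent, the joint would be the product of the marginals.
p_y = joint.sum(axis=0)
independent_joint = np.outer(p_x, p_y)
print(independent_joint)  # differs from `joint` here, so X and Y are not independent
```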

You will see that these expressions become important in our discussion of complementary priors in Chapter 4, Teaching Networks to Generate Digits, and the ability of restricted Boltzmann machines to simulate independent data samples.

They are also important as building blocks of Bayes' theorem, which we will discuss next.






Friday, February 4, 2022

Implementing generative models

 While generative models could theoretically be implemented using a wide variety of machine learning algorithms, in practice, they are usually built with deep neural networks, which are well suited to capturing complex variations in data such as images or language.

In this book, we will focus on implementing these deep generative models for many different applications using TensorFlow 2.0. TensorFlow is a C++ framework, with APIs in the Python programming language, used to develop and productionize deep learning models. It was open sourced by Google in 2015, and has become one of the most popular libraries for the research and deployment of neural network models.

With the 2.0 release, much of the boilerplate code that characterized development in earlier versions of the library was cleaned up with high-level abstractions, allowing us to focus on the model rather than the plumbing of the computations. 

The latest version also introduced eager execution, allowing network computations to be run on demand, which will be an important benefit for implementing some of our models.
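As a minimal sketch of eager execution in TensorFlow 2 (assuming the tensorflow package is installed), operations run immediately and return concrete values, with no separate session or graph-compilation step required:

```python
import tensorflow as tf

# In TensorFlow 2, eager execution is on by default: ops run as they are called.
x = tf.constant([[1.0, 2.0]])
w = tf.Variable([[0.5], [-1.0]])

y = tf.matmul(x, w)  # computed immediately, no session needed
print(y.numpy())     # [[-1.5]]
```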

In upcoming chapters, you will learn not only the underlying theory behind these models, but also the practical skills needed to implement them in popular programming frameworks. In Chapter 2, Setting up a TensorFlow Lab, you will learn how to set up a cloud environment that will allow you to run TensorFlow in a distributed fashion, using the Kubeflow framework to catalog your experiments.


Indeed, as I will describe in more detail in Chapter 3, Building Blocks of Deep Neural Networks, since 2006 an explosion of research into deep learning using large neural network models has produced a wide variety of generative modeling applications.

The first of these was the restricted Boltzmann machine, which is stacked in multiple layers to create a deep belief network. I will describe both of these models in Chapter 4, Teaching Networks to Generate Digits. Later innovations included Variational Autoencoders (VAEs), which can efficiently generate complex data samples from random numbers, using techniques that I will describe in Chapter 5, Painting Pictures with Neural Networks Using VAEs.

We will also cover the algorithm used to create The Portrait of Edmond Belamy, the GAN, in more detail in Chapter 6, Image Generation with GANs, of this book.

Conceptually, the GAN model creates a competition between two neural networks. One (termed the generator) produces realistic (or, in the case of the experiments by Obvious, artistic) images starting from a set of random numbers and applying a mathematical transformation. In a sense, the generator is like an art student, producing new paintings from brushstrokes and creative inspiration.


The second network, known as the discriminator, attempts to classify whether a picture comes from a set of real-world images, or whether it was created by the generator. Thus, the discriminator acts like a teacher, grading whether the student has produced work comparable to the paintings they are attempting to mimic. As the generator becomes better at fooling the discriminator, its output becomes closer and closer to the historical examples it is designed to copy.
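The sketch below, using tf.keras layers (an assumption on my part, since the text has not yet introduced specific APIs), shows this two-network structure in its simplest form: an untrained generator maps random vectors to fake samples, and a discriminator scores how "real" a sample looks. The actual adversarial training loop is covered in Chapter 6.

```python
import tensorflow as tf

latent_dim, data_dim = 16, 64  # illustrative sizes, not from the text

# Generator: random noise -> synthetic data sample.
generator = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(data_dim),
])

# Discriminator: data sample -> probability that it is real.
discriminator = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

z = tf.random.normal((1, latent_dim))  # "creative inspiration": a random vector
fake = generator(z)                    # the "art student" produces a sample
print(discriminator(fake).numpy())     # the untrained "teacher's" guess that it is real
```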

There are many classes of GAN models, with additional variants covered in Chapter 7, Style Transfer with GANs, and Chapter 11, Composing Music with Generative Models, in our discussion of advanced models. Another key innovation in generative models is in the domain of natural language data. By representing the complex interrelationships between words in a sentence in a computationally scalable way, the Transformer network and the Bidirectional Encoder Representations from Transformers (BERT) model built on top of it present powerful building blocks to generate textual data in applications such as chatbots, which we'll cover in more detail in Chapter 9, The Rise of Methods for Text Generation, and Chapter 10, NLP 2.0: Using Transformers to Generate Text.


In Chapter 12, Play Video Games with Generative AI: GAIL, you will also see how models such as GANs and VAEs can be used to generate not just images or text, but sets of rules that allow game-playing networks developed with reinforcement learning algorithms to process and navigate their environment more efficiently; in essence, learning to learn. Generative models are a huge field of research that is constantly growing, so unfortunately, we can't cover every topic in this book. For the interested reader, references to further topics will be provided in Chapter 13, Emerging Applications in Generative AI.

To get started with some background information, let's discuss the rules of probability.




Discriminative and generative models

 These other examples of AI differ in an important way from the model that generated The Portrait of Edmond Belamy. In all of these other applications, the model is presented with a set of inputs (data such as English text, images from X-rays, or the positions on a game board) that is paired with a target output, such as the next word in a translated sentence, the diagnostic classification of an X-ray, or the next move in a game. Indeed, this is probably the kind of AI model you are most familiar with from prior experiences of predictive modeling; such models are broadly known as discriminative models, whose purpose is to create a mapping between a set of input variables and a target output. The target output could be a set of discrete classes (such as which word in the English language appears next in a translation) or a continuous outcome (such as the expected amount of money a customer will spend in an online store over the next 12 months).

It should be noted that this kind of model, in which data is labeled or scored, represents only half the capabilities of modern machine learning. Another class of algorithms, such as the one that generated the artificial portrait sold at Christie's, doesn't compute a score or label from input variables, but rather generates new data. Unlike discriminative models, the input variables are often vectors of numbers that aren't related to real-world values at all, and are often even randomly generated.

This kind of model, known as a generative model, can produce complex outputs such as text, music, or images from random noise, and it is the topic of this book.

Even if you didn't know it at the time, you have probably seen other instances of generative models in the news alongside the discriminative example given earlier.

A prominent example is deep fakes, which are videos in which one person's face has been systematically replaced with another's by using a neural network to remap the pixels.

Maybe you have also seen stories about AI models that generate fake news, which scientists at the firm OpenAI were initially terrified to release to the public due to concerns they could be used to create propaganda and misinformation online.

In these and other applications, such as Google's voice assistant Duplex, which can make a restaurant reservation by dynamically creating a conversation with a human in real time, or software that can generate original musical compositions, we are surrounded by the outputs of generative AI algorithms.

These models are able to handle complex information in a variety of domains: creating photorealistic images or stylistic filters on pictures (Figure 1.4), synthetic sound, conversational text, and even rules for optimally playing video games. You might ask, where did these models come from? How can I implement them myself?

We will discuss more on that in the next section.





Applications of AI

 In New York City in October 2018, the international auction house Christie's sold the Portrait of Edmond Belamy (Figure 1.1) during the show Prints & Multiples for $432,500.00. This sale was remarkable both because the sale price was 45 times higher than the initial estimates for the piece, and due to the unusual origin of this portrait. Unlike the majority of other artworks sold by Christie's since the 18th century, the Portrait of Edmond Belamy is not painted using oil or watercolors, nor is its creator even human; rather, it is an entirely digital image produced by a sophisticated machine learning algorithm. The creators, a Paris-based collective named Obvious, used a collection of 15,000 portraits created between the 14th and 20th centuries to tune an artificial neural network model capable of generating aesthetically similar, albeit synthetic, images.

Portraiture is far from the only area in which machine learning has demonstrated astonishing results. Indeed, if you have paid attention to the news in the last few years, you have likely seen many stories about the ground-breaking results of modern AI systems applied to diverse problems, from the hard sciences to digital art.

Deep neural network models, such as the one created by Obvious, can now classify X-ray images of human anatomy on the level of trained physicians, beat human masters at both classic board games such as Go (an Asian game similar to chess) and multiplayer computer games, and translate French into English with amazing sensitivity to grammatical nuances.




1. An Introduction to Generative AI: "Drawing" Data from Models

 In this chapter, we will dive into the various applications of generative models.

Before that, we will take a step back and examine how exactly generative models are different from other types of machine learning. The difference lies with the basic units of any machine learning algorithm: probability and the various ways we use mathematics to quantify the shape and distribution of data we encounter in the world.


In the rest of this chapter, we will cover:

- Applications of AI

- Discriminative and generative models

- Implementing generative models

- The rules of probability

- Why use generative models?

- Unique challenges of generative models



Tuesday, February 1, 2022

1.3 MODELS OF A NEURON

 A neuron is an information-processing unit that is fundamental to the operation of a neural network. The block diagram of Fig. 1.5 shows the model of a neuron, which forms the basis for designing (artificial) neural networks. Here we identify three basic elements of the neuronal model:

1. A set of synapses or connecting links, each of which is characterized by a weight or strength of its own. Specifically, a signal xj at the input of synapse j connected to neuron k is multiplied by the synaptic weight wkj. It is important to make a note of the manner in which the subscripts of the synaptic weight wkj are written: the first subscript refers to the neuron in question and the second subscript refers to the input end of the synapse to which the weight refers. Unlike a synapse in the brain, the synaptic weight of an artificial neuron may lie in a range that includes negative as well as positive values (a small numerical sketch of this weighting follows below).
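A minimal numerical sketch of this first element (the toy values below are mine, purely for illustration): each input signal xj is multiplied by the corresponding synaptic weight wkj, and the weighted signals are then passed on to the summing junction described next in the model.

```python
import numpy as np

# Toy values for neuron k with three input synapses (j = 1, 2, 3).
x = np.array([0.5, -1.0, 2.0])    # input signals x_j
w_k = np.array([0.8, -0.4, 0.3])  # synaptic weights w_kj; may be negative or positive

weighted_inputs = w_k * x  # each signal scaled by its synaptic weight
print(weighted_inputs)     # [0.4, 0.4, 0.6]
print(weighted_inputs.sum())  # 1.4, the value passed on to the summing junction
```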