
1.4.2 Natural Language Processing

Machine Translation. In the past, machine translation systems were usually based on statistical machine translation models, which were also the technology used by Google's translation system before 2016. In November 2016, Google launched the Google Neural Machine Translation (GNMT) system based on the Seq2Seq model, realizing direct translation from the source language to the target language for the first time, with a 50%~90% improvement on multiple tasks. Commonly used machine translation models include Seq2Seq, BERT, GPT, and GPT-2. Among them, the GPT-2 model proposed by OpenAI has about 1.5 billion parameters. Initially, OpenAI declined to open-source the GPT-2 model, citing technical security concerns.

Chatbot. Chatbots are another mainstream natural language processing task. Machines automatically learn to converse with humans, provide satisfactory automatic responses to simple human requests, and improve customer service efficiency and service quality. Chatbots are often used in consulting systems, entertainment systems, and smart homes.

1.4.1 Computer Vision

Image classification is a common classification problem. The input of the neural network is a picture, and the output is the probability that the current sample belongs to each category. Generally, the category with the highest probability is selected as the predicted category of the sample.
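The selection step described above can be sketched in a few lines of plain Python; the class names and logit values below are hypothetical, and the softmax function converts raw network outputs into the per-category probabilities mentioned in the text:

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["cat", "dog", "ship"]   # hypothetical categories
logits = [2.0, 1.0, 0.1]           # hypothetical raw network outputs
probs = softmax(logits)
# Pick the category with the highest probability as the prediction.
predicted = classes[probs.index(max(probs))]
```

In a real model the logits would come from the network's final layer; everything else stays the same.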

Image recognition is one of the earliest successful applications of deep learning. Classic neural network models include VGG series, Inception series, and ResNet series.

Object detection refers to the automatic detection by an algorithm of the approximate location of common objects in a picture. The location is usually represented by a bounding box, and the category of the object inside the bounding box is also classified, as shown in Figure 1-15. Common object detection algorithms include RCNN, Fast RCNN, Faster RCNN, Mask RCNN, SSD, and the YOLO series.
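Bounding boxes such as those in the algorithms above are commonly compared using intersection over union (IoU), the standard overlap metric used for matching predictions to ground truth. A minimal sketch, assuming boxes are given as (x1, y1, x2, y2) corner coordinates:

```python
def iou(box_a, box_b):
    # Boxes as (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so non-overlapping boxes give zero intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

box_a = (0.0, 0.0, 2.0, 2.0)   # hypothetical predicted box
box_b = (1.0, 1.0, 3.0, 3.0)   # hypothetical ground-truth box
overlap = iou(box_a, box_b)    # intersection area 1, union area 7
```

Detectors typically treat a prediction as correct when its IoU with a ground-truth box exceeds a threshold such as 0.5.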

Semantic segmentation is an algorithm to automatically segment and identify the content in a picture. We can understand semantic segmentation as the classification of each pixel, analyzing the category information of every pixel, as shown in Figure 1-16. Common semantic segmentation models include the FCN, U-Net, SegNet, and DeepLab series.
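The per-pixel classification view can be illustrated directly: given a score for every class at every pixel, the predicted label map is the per-pixel argmax. A minimal sketch with a hypothetical 2 × 2 picture and two classes:

```python
def pixelwise_argmax(score_map):
    # score_map: H x W x C nested lists of per-class scores for each pixel.
    # Returns an H x W map holding the index of the highest-scoring class.
    return [[pixel.index(max(pixel)) for pixel in row]
            for row in score_map]

scores = [  # hypothetical 2 x 2 picture, 2 classes per pixel
    [[0.1, 0.9], [0.8, 0.2]],
    [[0.3, 0.7], [0.6, 0.4]],
]
labels = pixelwise_argmax(scores)
```

Models like FCN produce exactly such a dense score map, and the final segmentation mask is obtained by this argmax over the class dimension.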

Video Understanding. As deep learning achieves better results on 2D picture-related tasks, 3D video understanding tasks with an additional temporal dimension (the third dimension is the sequence of frames) are receiving more and more attention. Common video understanding tasks include video classification, behavior detection, and video subject extraction. Common models are C3D, TSN, DOVF, and TS-LSTM.

Image generation learns the distribution of real pictures and samples from the learned distribution to obtain highly realistic generated pictures. At present, common image generation models include the VAE series and the GAN series. Among them, the GAN series of algorithms has made great progress in recent years. The pictures produced by the latest GAN models have reached a level where it is difficult to distinguish real from generated with the naked eye, as shown in Figure 1-17.

In addition to the preceding applications, deep learning has also achieved significant results in other areas, such as artistic style transfer (Figure 1-18), super-resolution, picture denoising/dehazing, grayscale picture colorization, and many others.


1.4 Deep Learning Applications

Deep learning algorithms have been widely used in our daily life, such as voice assistants in mobile phones, intelligent assisted driving in cars, and face payments. We will introduce some mainstream applications of deep learning, starting with computer vision, natural language processing, and reinforcement learning.

1.3.4 General Intelligence

In the past, in order to improve the performance of an algorithm on a certain task, it was often necessary to use prior knowledge to manually design corresponding features to help the algorithm better converge to the optimal solution. This type of feature extraction method is often strongly tied to the specific task. Once the scenario changes, these hand-designed features and prior assumptions cannot adapt to the new scenario, and the algorithms often have to be redesigned.

Designing a universal intelligent mechanism that can automatically learn and self-adjust like the human brain has always been a common vision of human beings. Deep learning is one of the algorithms closest to general intelligence. In the computer vision field, previous methods that needed to design features for specific tasks and add a priori assumptions have been abandoned by deep learning algorithms. At present, almost all algorithms in image recognition, object detection, and semantic segmentation are based on end-to-end deep learning models, which show good performance and strong adaptability. On the Atari game platform, the DQN algorithm designed by DeepMind can reach a level equivalent to humans in 49 games under the same algorithm, model structure, and hyperparameter settings, showing a certain degree of general intelligence. Figure 1-14 shows the network structure of the DQN algorithm. It is not designed for a specific game but can play 49 games on the Atari game platform.


1.3.3 Network Scale

Early perceptron models and multilayer neural networks had only one layer, or two to four layers, and the network parameters numbered only around tens of thousands. With the development of deep learning and the improvement of computing capabilities, models such as AlexNet (8 layers), VGG16 (16 layers), GoogLeNet (22 layers), ResNet-50 (50 layers), and DenseNet-121 (121 layers) have been proposed successively, while the size of input pictures has also gradually increased from 28 × 28 to 224 × 224, 299 × 299, and even larger. These changes push the total number of network parameters into the tens of millions, as shown in Figure 1-13.

The increase in network scale correspondingly enhances the capacity of neural networks, enabling them to learn more complex data modalities and improving model performance accordingly. On the other hand, a larger network also requires more training data and computational power to avoid overfitting.
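The parameter counts discussed above can be estimated directly from layer widths. For a fully connected network, each pair of adjacent layers contributes a weight matrix plus a bias vector; the layer sizes below are hypothetical:

```python
def dense_param_count(layer_sizes):
    # Each adjacent pair (n_in, n_out) contributes an n_in x n_out weight
    # matrix plus an n_out bias vector.
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Hypothetical network: 784 inputs, one hidden layer of 256 units, 10 outputs.
n_params = dense_param_count([784, 256, 10])
```

Even this small two-layer network already has over 200,000 parameters, which illustrates how quickly deeper and wider architectures reach the tens of millions.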

1.3.2 Computing Power

The increase in computing power is an important factor in the third artificial intelligence renaissance. In fact, the basic theory of modern deep learning was proposed in the 1980s, but the real potential of deep learning was not realized until the release of AlexNet, trained on two GTX 580 GPUs, in 2012. Traditional machine learning algorithms do not have stringent requirements on data volume and computing power the way deep learning does; usually, serial training on a CPU can yield satisfactory results. But deep learning relies heavily on parallel acceleration devices. Most current neural networks use parallel acceleration chips such as NVIDIA GPUs and Google TPUs to train model parameters. For example, the AlphaGo Zero program needed to be trained from scratch on 64 GPUs for 40 days before surpassing all historical AlphaGo versions, and an automatic network architecture search algorithm used 800 GPUs to find a better network structure.

At present, the deep learning acceleration hardware available to ordinary consumers is mainly NVIDIA GPUs. Looking at the period from 2008 to 2017, it can be seen that the x86 CPU curve changes relatively slowly, while the floating-point computing capacity of NVIDIA GPUs grows exponentially, driven mainly by the growing gaming and deep learning computing businesses.

1.3.1 Data Volume

Early machine learning algorithms are relatively simple and fast to train, and the size of the required dataset is relatively small, such as the Iris flower dataset collected by the British statistician Ronald Fisher in 1936, which contains only three categories of flowers, with each category having 50 samples. With the development of computer technology, the designed algorithms became more and more complex, and the demand for data volume also increased. The MNIST handwritten digit picture dataset collected by Yann LeCun in 1998 contains ten categories of digits from 0 to 9, with up to 7,000 pictures in each category. With the rise of neural networks, especially deep learning networks, the number of network layers is generally large, and the number of model parameters can reach one million, ten million, or even one billion. To prevent overfitting, the size of the training dataset is usually huge. The popularity of modern social media also makes it possible to collect huge amounts of data. For example, the ImageNet dataset released in 2010 included a total of 14,197,122 pictures, and the compressed file size of the entire dataset was 154 GB. Figures 1-10 and 1-11 list the number of samples and the size of datasets over time.

Although deep learning has a high demand for large datasets, collecting data, especially labeled data, is often very expensive. Building a dataset usually requires manually collecting or crawling raw data, cleaning out invalid samples, and then annotating the data samples with human labor, so subjective bias and random errors are inevitably introduced. Therefore, algorithms with small data volume requirements are a very hot research topic.