As we will see in Chapter 3, Building Blocks of Deep Neural Networks, a deep neural network in essence consists of matrix operations (addition, subtraction, multiplication), nonlinear transformations, and gradient-based updates computed by using the derivatives of these components.
In the world of academia, researchers have historically often used efficient prototyping tools such as MATLAB to run models and prepare analyses. While this approach allows for rapid experimentation, it lacks elements of industrial software development, such as object-oriented (OO) development, that support reproducibility and the clean software abstractions needed for tools to be adopted by large organizations. These tools also had difficulty scaling to large datasets and could carry heavy licensing fees for such industrial use cases. However, prior to 2006, this type of computational tooling was largely sufficient for most use cases.
However, as the datasets being tackled with deep neural network algorithms grew, groundbreaking results were achieved such as:
- Image classification on the ImageNet dataset
- Large-scale unsupervised discovery of image patterns in YouTube videos
- The creation of artificial agents capable of playing Atari video games and the Asian board game Go with human-like skill
- State-of-the-art language translation via the BERT model developed by Google
The models developed in these studies exploded in complexity along with the size of the datasets they were applied to (see Table 2.2 to get a sense of the immense scale of some of these models). As industrial use cases required robust and scalable frameworks to develop and deploy new neural networks, several academic groups and large technology companies invested in the development of generic toolkits for the implementation of deep learning models. These software libraries codified common patterns into reusable abstractions, often allowing even complex models to be embodied in relatively simple experimental scripts.
| Model Name | Year | # Parameters |
|---|---|---|
| AlexNet | 2012 | 61M |
| YouTube CNN | 2012 | 1B |
| Inception | 2014 | 5M |
| VGG-16 | 2014 | 138M |
| BERT | 2018 | 340M |
| GPT-3 | 2020 | 175B |
Some early examples of these frameworks include Theano, a Python package developed at the University of Montreal; Torch, a library written in the Lua language that was later ported to Python by researchers at Facebook; and TensorFlow, a C++ runtime with Python bindings developed by Google.
In this book, we will primarily use TensorFlow 2.0, due to its widespread adoption and its convenient high-level interface, Keras, which abstracts much of the repetitive plumbing of defining runtime layers and model architecture.
TensorFlow is an open-source version of an internal tool developed at Google called DistBelief. The DistBelief framework consisted of distributed workers (independent computational processes running on a cluster of machines) that would compute forward and backward gradient descent passes on a network (a common way to train neural networks that we will discuss in Chapter 3, Building Blocks of Deep Neural Networks) and send the results to a Parameter Server that aggregated the updates. The neural networks in the DistBelief framework were represented as a Directed Acyclic Graph (DAG), terminating in a loss function that yielded a scalar (numerical value) comparing the network predictions with the observed target (such as an image class, or the probability distribution over a vocabulary representing the most probable next word in a sentence in a translation model).
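The worker and parameter server arrangement described above can be illustrated with a small sketch in plain Python. This is not DistBelief code: the names `worker_gradient`, `ParameterServer`, and `shards` are hypothetical, the model is a one-parameter linear fit, and the workers run one after another in a single process rather than in parallel on a cluster.

```python
def worker_gradient(weight, shard):
    """One worker: forward and backward pass on its own shard of data.

    The toy model is prediction = weight * x, trained with mean squared error.
    Returns the gradient of that error with respect to weight.
    """
    grad = 0.0
    for x, y in shard:
        prediction = weight * x               # forward pass
        grad += 2.0 * (prediction - y) * x    # backward pass (derivative of squared error)
    return grad / len(shard)


class ParameterServer:
    """Holds the shared parameter and aggregates updates sent by workers."""

    def __init__(self, weight, learning_rate):
        self.weight = weight
        self.learning_rate = learning_rate

    def apply(self, gradients):
        # Aggregate by averaging the workers' gradients, then take one step.
        mean_grad = sum(gradients) / len(gradients)
        self.weight -= self.learning_rate * mean_grad


# Hypothetical data generated from y = 2 * x, split across two workers.
shards = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (4.0, 8.0)],
]

server = ParameterServer(weight=0.0, learning_rate=0.05)
for step in range(50):
    # Each worker reads the current weight and reports a gradient.
    gradients = [worker_gradient(server.weight, shard) for shard in shards]
    server.apply(gradients)

print(server.weight)  # approaches 2.0, the slope used to generate the data
```

In this sketch all workers report before the server updates; a production system could let workers send updates independently, which this simplified loop does not attempt to model.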
A DAG is a software data structure consisting of nodes (operations) and edges (data), where information flows in only a single direction along the edges (thus directed) and where there are no loops (hence acyclic).
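As a minimal sketch of this idea, the following uses Python's standard-library `graphlib` module (Python 3.9+) to describe a tiny graph that terminates in a scalar loss. The node names, the `dependencies` mapping, and the squared-error loss are assumptions chosen for illustration, not part of any framework.

```python
from graphlib import CycleError, TopologicalSorter

# Each node maps to the set of nodes whose outputs it consumes.
# Information flows from the inputs toward "loss" and never back.
dependencies = {
    "x": set(),
    "w": set(),
    "target": set(),
    "prediction": {"x", "w"},
    "error": {"prediction", "target"},
    "loss": {"error"},
}

# The operation attached to each non-input node.
operations = {
    "prediction": lambda v: v["x"] * v["w"],
    "error": lambda v: v["prediction"] - v["target"],
    "loss": lambda v: v["error"] ** 2,
}


def is_acyclic(graph):
    """Return True if the graph has no loops, so a valid ordering exists."""
    try:
        tuple(TopologicalSorter(graph).static_order())
        return True
    except CycleError:
        return False


def evaluate(inputs):
    """Visit nodes so that every node runs after its inputs; return the loss."""
    values = dict(inputs)
    for node in TopologicalSorter(dependencies).static_order():
        if node in operations:
            values[node] = operations[node](values)
    return values["loss"]


loss = evaluate({"x": 3.0, "w": 2.0, "target": 5.0})
print(loss)  # (3.0 * 2.0 - 5.0) ** 2 = 1.0
print(is_acyclic({"a": {"b"}, "b": {"a"}}))  # False: a and b depend on each other
```

Because the graph is acyclic, a topological ordering always exists, which is what lets every operation be computed exactly once with all of its inputs already available.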
While DistBelief allowed Google to productionize several large models, it had limitations:
- First, the Python scripting interface was developed with a set of pre-defined layers corresponding to underlying implementations in C++; adding novel layer types required coding in C++, which represented a barrier to productivity.
- Secondly, while the system was well adapted for training feed-forward networks using basic Stochastic Gradient Descent (SGD) (an algorithm we will describe in more detail in Chapter 3, Building Blocks of Deep Neural Networks) on large-scale data, it lacked the flexibility to accommodate recurrent, reinforcement learning, or adversarial learning paradigms, the last of which is crucial to many of the algorithms we will implement in this book.
- Finally, this system was difficult to scale down, for example to run the same job on a desktop with GPUs as well as in a distributed environment with multiple cores per machine, and deployment also required a different technical stack.
Jointly, these considerations prompted the development of TensorFlow as a generic deep learning computational framework: one that could allow scientists to flexibly experiment with new layer architectures or cutting-edge training paradigms, while also allowing this experimentation to be run with the same tools on both a laptop (for early-stage work) and a computing cluster (to scale up more mature models), and easing the transition between research and development code by providing a common runtime for both.
Though both libraries share the concept of the computation graph (networks represented as a graph of operations (nodes) and data (edges)) and a dataflow programming model (where data passes along the directed edges of a graph and has operations applied to it), TensorFlow, unlike DistBelief, was designed with the edges of the graph being tensors (n-dimensional matrices) and the nodes of the graph being atomic operations (addition, subtraction, nonlinear operations). This allows for much greater flexibility in defining new computations, and even allows for mutation and stateful updates (these being simply additional nodes in the graph).
The dataflow graph in essence serves as a "placeholder" where data is slotted into defined variables, and it can be executed on a single machine or on multiple machines. TensorFlow optimizes the constructed dataflow graph in the C++ runtime upon execution, allowing optimization, for example, in issuing commands to the GPU. The different computations of the graph can also be executed across multiple machines and hardware, including CPUs, GPUs, and TPUs (custom tensor processing chips developed by Google and available in the Google Cloud computing environment), as the same computations described at a high level in TensorFlow are implemented to execute on multiple backend systems.
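To make the dataflow idea more concrete, here is a toy sketch in plain Python, not TensorFlow's actual API. The classes `Node`, `Placeholder`, `Variable`, and `AssignAdd` are hypothetical stand-ins, and tensors are simplified to flat lists of floats. The graph is defined once, data is slotted in only when it is run, and a stateful update is expressed as just another node.

```python
class Node:
    """A graph node: an atomic operation applied to the outputs of its inputs."""

    def __init__(self, op, *inputs):
        self.op = op
        self.inputs = inputs

    def run(self, feed):
        args = [node.run(feed) for node in self.inputs]
        return self.op(*args)


class Placeholder(Node):
    """A named slot whose value is supplied only at execution time."""

    def __init__(self, name):
        super().__init__(None)
        self.name = name

    def run(self, feed):
        return feed[self.name]


class Variable(Node):
    """Mutable state that lives inside the graph."""

    def __init__(self, value):
        super().__init__(None)
        self.value = value

    def run(self, feed):
        return self.value


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def multiply(a, b):
    return [x * y for x, y in zip(a, b)]


def relu(a):
    return [max(0.0, x) for x in a]


class AssignAdd(Node):
    """A stateful update expressed as a node: adds its input to a variable."""

    def __init__(self, variable, delta):
        super().__init__(None, delta)
        self.variable = variable

    def run(self, feed):
        self.variable.value = add(self.variable.value, self.inputs[0].run(feed))
        return self.variable.value


# Define the graph once: output = relu(x * w + b), element-wise.
x = Placeholder("x")
w = Variable([2.0, -1.0])
b = Variable([0.5, 0.5])
output = Node(relu, Node(add, Node(multiply, x, w), b))

# Execute the same graph with different data slotted into the placeholder.
first = output.run({"x": [1.0, 1.0]})
second = output.run({"x": [3.0, 0.0]})
print(first)   # [2.5, 0.0]
print(second)  # [6.5, 0.5]

# Mutating state is simply running one more node in the graph.
update_b = AssignAdd(b, Placeholder("delta"))
update_b.run({"delta": [1.0, 1.0]})
third = output.run({"x": [1.0, 1.0]})
print(third)   # [3.5, 0.5]
```

This sketch evaluates nodes recursively in a single Python process; it does not attempt the graph optimization or device placement that the text attributes to the TensorFlow runtime.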
Because the dataflow graph allows mutable state, there is also no longer a centralized parameter server as was the case for DistBelief (though TensorFlow can also be run in a distributed manner with a parameter server configuration), since the nodes that hold state can execute the same operations as any other worker node. Further, control flow operations such as loops allow for training on variable-length inputs, such as in recurrent networks (see Chapter 3, Building Blocks of Deep Neural Networks). In the context of training neural networks, the gradients of each layer are simply represented as additional operations in the graph, allowing optimizations such as velocity (as in the RMSProp or Adam optimizers, described in Chapter 3, Building Blocks of Deep Neural Networks) to be included using the same framework rather than by modifying the parameter server logic. In the context of distributed training, TensorFlow also has several checkpointing and redundancy mechanisms ("backup" workers in case of a single task failure) that make it suited to robust training in distributed environments.
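The point about velocity can be sketched as follows. This example uses classical momentum on a one-parameter quadratic loss, which is a simpler technique than RMSProp or Adam and is swapped in here only for brevity. The function names, the `state` dictionary, and the hyperparameter values are assumptions. The idea it illustrates is that the gradient and the velocity update are ordinary operations on stored state, so no separate server logic has to change.

```python
def loss(w):
    """Toy loss with its minimum at w = 3.0."""
    return (w - 3.0) ** 2


def gradient(w):
    """Derivative of the toy loss, written as one more ordinary operation."""
    return 2.0 * (w - 3.0)


# Both the parameter and the optimizer's velocity are plain mutable state.
state = {"w": 0.0, "velocity": 0.0}


def momentum_step(state, learning_rate=0.1, momentum=0.9):
    """One update: blend the previous velocity with the new gradient, then move w."""
    g = gradient(state["w"])
    state["velocity"] = momentum * state["velocity"] - learning_rate * g
    state["w"] += state["velocity"]


initial_loss = loss(state["w"])
for _ in range(200):
    momentum_step(state)

print(state["w"])        # close to 3.0
print(loss(state["w"]))  # far smaller than initial_loss
```

Changing the optimizer in this sketch means changing only `momentum_step` and the extra entries kept in `state`; the loss and gradient operations stay as they are.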