
Monday, May 16, 2022

1.3.1 Data Volume

Early machine learning algorithms were relatively simple and fast to train, and the required datasets were relatively small. A classic example is the Iris flower dataset collected by the British statistician Ronald Fisher in 1936, which contains only three categories of flowers, with 50 samples in each category. With the development of computer technology, the algorithms being designed have become more and more complex, and the demand for data has grown accordingly. The MNIST handwritten digit dataset collected by Yann LeCun in 1998 contains ten categories of digits, 0 through 9, with up to 7,000 images per category. With the rise of neural networks, especially deep learning networks, the number of network layers is generally large, and the number of model parameters can reach millions, tens of millions, or even billions. To prevent overfitting, the training dataset therefore usually needs to be very large. The popularity of modern social media also makes it possible to collect huge amounts of data. For example, the ImageNet dataset released in 2010 includes a total of 14,197,122 images, and the compressed file of the entire dataset is about 154 GB. Figures 1-10 and 1-11 list the number of samples and the size of datasets over time.
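To make these numbers concrete, the following minimal sketch loads two of the datasets mentioned above and prints their sizes. It assumes scikit-learn and TensorFlow/Keras are available in the environment; neither library is prescribed by the text itself.

```python
# Minimal sketch: compare the sample counts of the Iris and MNIST datasets.
# Assumes scikit-learn and TensorFlow/Keras are installed (an assumption,
# not a requirement stated in the original text).
from sklearn.datasets import load_iris
from tensorflow.keras.datasets import mnist

# Iris (1936): 3 flower categories, 50 samples each -> 150 samples in total.
iris = load_iris()
print("Iris samples:", iris.data.shape[0])        # 150
print("Iris classes:", len(set(iris.target)))     # 3

# MNIST (1998): 10 digit categories, about 7,000 images per category
# -> 70,000 images in total (60,000 training + 10,000 test).
(x_train, y_train), (x_test, y_test) = mnist.load_data()
print("MNIST samples:", x_train.shape[0] + x_test.shape[0])  # 70000
print("MNIST image shape:", x_train.shape[1:])               # (28, 28)
```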

Although deep learning has a high demand for large datasets, collecting data, especially labeled data, is often very expensive. Building a dataset usually requires manually collecting or crawling raw data, cleaning out invalid samples, and then annotating the samples by hand, which inevitably introduces subjective bias and random errors. Therefore, algorithms that require only small amounts of data are a very hot research topic.

