Machine Learning Approach for Cloud Data Analytics in IoT. Group of authors
… a model tries to predict sales, then the error rate will be the difference between the predicted sales and the actual sales.
Figure 3.3 General framework of proposed model for predictive data analytics.
Random forest regression may also be employed for prediction problems, since random forests handle both classification and regression. A random forest builds a collection of decision trees, each of which partitions the data using split criteria. The quality of each split is then measured using mean squared error or mean absolute error. Finally, the forest averages the predictions of its trees to improve the accuracy of prediction.
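The split-quality and averaging ideas above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the chapter's implementation: the function names (`mse`, `split_mse`, `best_split`), the toy discount and units values, and the three "tree predictions" are all assumptions made up for the example.

```python
from statistics import mean


def mse(targets):
    """Mean squared error of targets around their own mean (0.0 if empty)."""
    if not targets:
        return 0.0
    center = mean(targets)
    return mean((t - center) ** 2 for t in targets)


def split_mse(xs, ys, threshold):
    """Size-weighted MSE of the two children produced by x <= threshold."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    return (len(left) * mse(left) + len(right) * mse(right)) / len(ys)


def best_split(xs, ys):
    """Try each observed x as a threshold; keep the lowest weighted MSE."""
    return min(sorted(set(xs)), key=lambda t: split_mse(xs, ys, t))


# Toy data, invented for illustration: discount level vs. units sold.
discount = [0.0, 0.1, 0.2, 0.5, 0.6, 0.7]
units = [10.0, 11.0, 9.0, 30.0, 31.0, 29.0]

threshold = best_split(discount, units)

# A forest averages its trees; these three outputs are hypothetical.
tree_predictions = [28.0, 31.0, 31.0]
forest_prediction = mean(tree_predictions)

print(threshold, forest_prediction)
```

A real random forest additionally trains each tree on a bootstrap sample and considers a random subset of features at each split; those steps are omitted here to keep the split measure visible.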
The authors of the chapter propose using the bootstrap aggregating ML algorithm, also referred to as the bagging algorithm. Bagging aims to improve the efficiency and accuracy of ML algorithms by reducing variance. Using bagging therefore supports building an efficient and accurate predictive model. The accuracy of the proposed model increases rapidly over time.
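As a rough illustration of bootstrap aggregating, the sketch below fits several base models on bootstrap samples (drawn with replacement) and averages their predictions. The base learner here is a 1-nearest-neighbour regressor chosen only because it is short and high-variance; the chapter does not state which base learner is used. The names `fit_nearest_neighbor`, `bagging_fit`, and `bagging_predict`, and the toy data, are assumptions.

```python
import random
from statistics import mean


def fit_nearest_neighbor(sample):
    """Return a 1-nearest-neighbour regressor fitted on (x, y) pairs."""
    def predict(x):
        nearest = min(sample, key=lambda pair: abs(pair[0] - x))
        return nearest[1]
    return predict


def bagging_fit(data, n_models, rng):
    """Fit one base model per bootstrap sample of the training data."""
    models = []
    for _ in range(n_models):
        # Bootstrap: same size as the data, sampled with replacement.
        bootstrap = [rng.choice(data) for _ in data]
        models.append(fit_nearest_neighbor(bootstrap))
    return models


def bagging_predict(models, x):
    """Aggregate by averaging the base models' predictions."""
    return mean(model(x) for model in models)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Toy data, invented for illustration: (month index, sales).
data = [(1, 100.0), (2, 120.0), (3, 90.0),
        (4, 130.0), (5, 110.0), (6, 125.0)]

models = bagging_fit(data, n_models=25, rng=rng)
prediction = bagging_predict(models, 3.5)
print(prediction)
```

A single nearest-neighbour model would jump between 90 and 130 around x = 3.5 depending on which points it saw; averaging over many bootstrap fits smooths that out, which is the variance reduction the text refers to.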
3.4.1 Case Study
To illustrate the implementation of AI in the retail industry, the authors of the chapter consider a case study. They use a dataset pertaining to a retail store. The dataset comprises observations covering a period of 4 years, from 2011 to 2015. It has been taken from Kaggle (https://www.kaggle.com/jr2ngb/superstore-data, username: jr2ngb). The dataset has 16 variables. Of these 16 features, 10 are categorical features, 5 are numerical features, and 1 is a date feature, as listed below.
# | Feature Name | Non-Null | Dtype |
--- | --- | --- | --- |
0 | Order Date | 51290 | datetime64[ns] |
1 | Customer_Name | 51290 | object |
2 | Segment | 51290 | object |
3 | City | 51290 | object |
4 | State | 51290 | object |
5 | Country | 51290 | object |
6 | Category | 51290 | object |
7 | Sub-Category | 51290 | object |
8 | Product Name | 51290 | object |
9 | Sales | 51290 | float64 |
10 | Quantity | 51290 | int64 |
11 | Discount | 51290 | float64 |
12 | Profit | 51290 | float64 |
13 | year | 51290 | int64 |
14 | month | 51290 | int64 |
15 | Day | 51290 | object |
The number of observations in the considered dataset is 51,290. The considered retail store broadly deals in three types of products, viz., office supplies, technology, and furniture.
First, the authors attempt to understand the correlation among the various features of the dataset. To do so, they employ Pearson’s correlation, which measures the linear correlation between two variables. Its value lies between −1 and +1. A negative value indicates a negative linear correlation, 0 signifies no linear correlation, and a positive value indicates a positive linear correlation, with −1 and +1 denoting perfect linear relationships. The Pearson’s correlation among the various attributes of the dataset is shown in Figure 3.4.
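For reference, Pearson's correlation between two numeric columns can be computed directly from its definition (covariance divided by the product of the standard deviations). The sketch below is a minimal plain-Python version; the function name `pearson` and the toy sales, profit, and discount values are assumptions for illustration and are not taken from the dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / sqrt(var_x * var_y)


# Toy columns, invented for illustration.
sales = [100.0, 200.0, 300.0, 400.0]
profit = [10.0, 20.0, 30.0, 40.0]   # rises with sales
discount = [0.4, 0.3, 0.2, 0.1]     # falls as sales rise

r_sales_profit = pearson(sales, profit)
r_sales_discount = pearson(sales, discount)
print(r_sales_profit, r_sales_discount)
```

Applying such a function to every pair of numeric columns yields the kind of correlation matrix that a heatmap like Figure 3.4 visualizes.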
Next, the authors demonstrate how the dataset can be used to understand how the retailer’s customers are distributed across the country. This helps the retailer understand where in the country its largest market share lies and thus enables it to focus on the weaker market segments. This can be done through region-wise analysis, as shown in Figure 3.5. The figure shows the histogram plot of the frequency of customers across various states in India. From Figure 3.5, it is evident that Maharashtra has the highest number of customers in the country, followed by Uttar Pradesh. In contrast, places like Manipur, Tripura, Chandigarh, and Pondicherry have the lowest number of customers.
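A region-wise frequency count of this kind can be sketched with `collections.Counter`. The records below are invented placeholders that reuse the `Customer_Name` and `State` column names from the table above; counting one entry per row (rather than per distinct customer) is an assumption about how the histogram was built.

```python
from collections import Counter

# Hypothetical rows; only the column names come from the dataset.
orders = [
    {"Customer_Name": "Customer A", "State": "Maharashtra"},
    {"Customer_Name": "Customer B", "State": "Maharashtra"},
    {"Customer_Name": "Customer C", "State": "Uttar Pradesh"},
    {"Customer_Name": "Customer D", "State": "Maharashtra"},
    {"Customer_Name": "Customer E", "State": "Manipur"},
    {"Customer_Name": "Customer F", "State": "Uttar Pradesh"},
]

# Frequency of rows per state: the data behind a state-level histogram.
state_counts = Counter(row["State"] for row in orders)

# Strongest markets first; the weakest market is the state with the
# fewest rows.
ranked = state_counts.most_common()
weakest = min(state_counts, key=state_counts.get)

for state, count in ranked:
    print(f"{state:15s} {'#' * count} ({count})")
```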
The analysis can be drilled down further to find the best and worst performing cities in a state, so as to identify the specific region or branch. Such a drilled-down histogram is shown in Figure 3.6. For Maharashtra, it shows that the top performing cities in the state are Mumbai, Pune, Thane, and Nagpur.
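Drilling down amounts to filtering on the dimensions already chosen and then counting along the next one. The helper below, `drill_down`, is a hypothetical name, and the rows are invented placeholders that borrow the `State`, `City`, and `Category` column names from the table above; it is a sketch of the idea rather than the authors' code.

```python
from collections import Counter


def drill_down(rows, filters, dimension):
    """Count rows along `dimension`, keeping only rows matching `filters`."""
    return Counter(
        row[dimension]
        for row in rows
        if all(row[key] == value for key, value in filters.items())
    )


# Hypothetical rows; only the column names come from the dataset.
rows = [
    {"State": "Maharashtra", "City": "Mumbai", "Category": "Office Supplies"},
    {"State": "Maharashtra", "City": "Mumbai", "Category": "Office Supplies"},
    {"State": "Maharashtra", "City": "Mumbai", "Category": "Technology"},
    {"State": "Maharashtra", "City": "Pune", "Category": "Furniture"},
    {"State": "Maharashtra", "City": "Pune", "Category": "Technology"},
    {"State": "Maharashtra", "City": "Nagpur", "Category": "Furniture"},
    {"State": "Uttar Pradesh", "City": "Lucknow", "Category": "Technology"},
]

# Level 1: cities within one state.
city_counts = drill_down(rows, {"State": "Maharashtra"}, "City")

# Level 2: the same helper reused along another column within one city.
category_counts = drill_down(
    rows, {"State": "Maharashtra", "City": "Mumbai"}, "Category"
)

print(city_counts.most_common())
print(category_counts.most_common())
```

Keeping the filters as a dictionary means each further level of drill-down only adds one key, without any new code.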
Figure 3.4 Pearson’s correlation among various attributes of dataset.
Figure 3.5 Histogram plot for the frequency of customers at the country level (India).
Further, it is evident from the above two graphs that Mumbai has the highest number of customers. Hence, the retailer is next interested in finding the best performing product in the city, which calls for a histogram along the product dimension. As shown in Figure 3.7, the office supplies category has the highest count in the city.