Wiki source code of Tutoriel Prédictif

Last modified by jhurst on 2021/04/21 10:01

1 Predictive analytics in DigDash Enterprise is available through three features:
2
3 1. Predictive measures
4 1. What-if measures
5 1. Smart grouping (clustering)
6
7 This document illustrates a possible use of these tools.
8
9 (% class="box warningmessage" %)
10 (((
11 Warning: predictive measures and what-if measures only work when the cube is processed server side.
12 )))
13
14 {{ddtoc/}}
15
16 = Predictive measures =
17
18 == Creation of a predictive measure ==
19
20 You can create **predictive measures** based on measures and derived measures. A predictive measure lets you predict the value of a given measure along a temporal dimension.
21
22 (% class="box warningmessage" %)
23 (((
24 Warning: predictive measures work only when the cube is processed server side.
25 )))
26
27 To create a predictive measure, click the arrow of the drop-down menu located next to the button used to add a **derived measure**. In the menu that appears, select **Create a predictive measure…** The **Predictive measure** dialog box opens:
28
29 [[image:predictive_tutorial_en_html_2265a9181188016b.png||height="269" width="554"]]
30 \\
31
32 Enter the name of your predictive measure.
33
34 In the **Columns** group, you need to specify:
35
36 * The measure you want to predict
37 * The temporal dimension you want to explore in order to make the prediction
38 * The aggregation of this measure
39
40 In the **Model type** group, you need to specify:
41
42 * The algorithm used to predict the measure selected in the **Columns** group
43 * The value to predict (value of the measure, lower and upper bounds of the confidence interval)
44
45 If the selected algorithm is the **moving average**, you also need to indicate the number of points over which the moving average is calculated (the order, equal to 2 by default).
46
47 For each algorithm, a brief description is displayed next to the **help** icon at the bottom left. You can also click the **Help model choice** button to get guidance in choosing your algorithm. This opens the **Help model choice** window:
48
49 [[image:predictive_tutorial_en_html_464b020944755ae.png||height="288" width="523"]]
50 \\
51
52 Answering these two questions reduces the number of available algorithms, making the choice easier.
53
54 Moreover, above the **OK** button in the **Predictive measure** window, you can extend the temporal dimension selected in the **Columns** group.
55
56 == Available models and model choice ==
57
58 DigDash Enterprise offers 9 prediction models, presented in the table below:
59
60 [[image:predictive_tutorial_en_html_e97cfe15ce5a6cd9.png||height="245" width="554"]]
61 \\
62
63 DigDash Enterprise also offers a moving average algorithm, but it cannot be considered a prediction algorithm since it does not predict future values; it is only used to study past data.
64
65 The model choice depends on the measure you want to predict and the kind of modelling you want to do. It is important to ask yourself the following questions:
66
67 * Does my data have a trend?
68 * Does my data have cycles? If so, are they complex?
69 * Do I want to smooth my data?
70 * Do I want a simple model that is easy to visualize, or a more accurate one that is harder to understand?
71
72 Thanks to its help interface (image below), DigDash Enterprise reduces the number of available models by asking the user the first two questions.
73
74 [[image:predictive_tutorial_en_html_464b020944755ae.png||height="288" width="523"]]
75 \\
76
77 The next section presents the different models in more detail so that the user can understand them more easily.
78
79 == Presentation of the models ==
80
81 The following definitions of the algorithms should be understood in the context of predictive measures in DigDash Enterprise.
82
83 __Linear regression:__
84
85 Linear regression is a basic modelling tool. It searches for a linear relationship between the measure to predict and the time axis.
86
87 This model is ideally suited when the measure to predict is proportional to the time axis. Even when it is not, the model may still be worth choosing for its simple visualization (a line), which makes it easily understandable by a wide audience.
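As an illustration of the idea (a minimal Python sketch with hypothetical data, not DigDash's implementation), a least-squares line can be fitted to a measure indexed by time as follows:

```python
def linear_fit(ts, ys):
    """Least-squares fit y = a*t + b; returns (a, b)."""
    n = len(ts)
    mt = sum(ts) / n
    my = sum(ys) / n
    cov = sum((t - mt) * (y - my) for t, y in zip(ts, ys))
    var = sum((t - mt) ** 2 for t in ts)
    a = cov / var
    b = my - a * mt
    return a, b

def linear_predict(a, b, t):
    return a * t + b

# A perfectly linear turnover: the fit recovers the exact trend.
years = [2010, 2011, 2012, 2013, 2014]
turnover = [100.0, 110.0, 120.0, 130.0, 140.0]
a, b = linear_fit(years, turnover)
forecast_2015 = linear_predict(a, b, 2015)
```

On real data the fit is rarely exact, but the fitted line still shows the overall trend at a glance.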
88
89 Example 1:
90
91 We want to model the closing price of the Google share over time.
92
93 [[image:predictive_tutorial_en_html_21f30c92fdca7d46.png||height="277" width="554"]]
94 \\
95
96 We can see that the forecast is not very accurate in this case. However, it makes it easy to visualize the trend of the curve.
97
98 Example 2:
99
100 We want to model the evolution of the turnover of a company over time.
101
102 [[image:predictive_tutorial_en_html_cb8d5ededb752504.png||height="300" width="554"]]
103 \\
104
105 This example is an ideal case for linear regression: there is a strong linear relationship between the turnover and the time axis.
106
107 __Logarithmic regression:__
108
109 Logarithmic regression has the same properties as linear regression, except that it searches for a logarithmic relationship between the measure to predict and the time axis.
110
111 Example:
112
113 [[image:predictive_tutorial_en_html_97aba1e3dac31552.png||height="285" width="554"]]
114 \\
115
116 __Polynomial regression:__
117
118 Polynomial regression is a more complex form of linear regression: it approximates the measure not with a line, but with a polynomial of order 2 or 3.
119
120 Example:
121
122 [[image:predictive_tutorial_en_html_32c98e09aaa684c7.png||height="296" width="554"]]
123 \\
124
125 __Moving average:__
126
127 The goal of the moving average is not to predict but to smooth the data in order to eliminate the least significant fluctuations. For example, a moving average of order 3 is a sliding average which, for each point **p**, computes the average of points **p-1**, **p** and **p+1**. In this case, the moving average can only be computed from the second point up to the penultimate one.
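The order-3 computation described above can be sketched as follows (an illustrative Python snippet with made-up values; it assumes a centered window with an odd order, whereas DigDash's default order is 2):

```python
def moving_average(values, order=3):
    """Centered moving average: each point is replaced by the mean of the
    window of `order` points around it, so for order 3 the result is only
    defined from the second point up to the penultimate one."""
    half = order // 2
    return [sum(values[i - half:i + half + 1]) / order
            for i in range(half, len(values) - half)]

consumption = [10, 30, 20, 60, 40]   # hypothetical yearly consumption
smoothed = moving_average(consumption)
# averages of (10,30,20), (30,20,60) and (20,60,40)
```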
128
129 Example:
130
131 [[image:predictive_tutorial_en_html_2dd1b375a1f84c00.png||height="292" width="554"]]
132 \\
133
134 We can see that the moving average of order 3 smoothed the water-consumption curve. It removed abrupt variations, such as the one in 1953, so that the user can focus on the overall shape of the curve.
135
136 __Simple exponential smoothing:__
137
138 Unlike classical regression techniques, which are not specific to time series, smoothing takes the specific nature of the temporal variable into account: the importance given to a value decreases over time. For example, to predict the turnover for 2017, we naturally give more weight to the 2016 turnover than to the 2008 one. The various smoothing techniques account for this depreciation of information over time.
139
140 Simple exponential smoothing smooths the data and also predicts the next value. It applies to data showing neither trend nor cycle.
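The recursion behind simple exponential smoothing can be sketched as follows (illustrative Python with made-up values; in practice the smoothing parameter is estimated from the data, here it is fixed to 0.5 for readability):

```python
def simple_exponential_smoothing(values, alpha=0.5):
    """s[0] = x[0]; s[t] = alpha * x[t] + (1 - alpha) * s[t-1].
    Recent points weigh more than old ones; the last smoothed value
    serves as the one-step-ahead forecast."""
    smoothed = [values[0]]
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

costs = [12.0, 8.0, 10.0, 9.0]   # hypothetical communication costs
s = simple_exponential_smoothing(costs)
next_value_forecast = s[-1]
```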
141
142 Example:
143
144 We want to predict the communication cost over time. The data are quite chaotic: they show no trend or cycle. Prediction is difficult in this situation, which is why exponential smoothing predicts only the next value.
145
146
147
148 [[image:predictive_tutorial_en_html_8cb67f915d976636.png||height="253" width="554"]]
149 \\
150
151 __Double exponential smoothing:__
152
153 Double exponential smoothing is an improved version of simple exponential smoothing: it can take a trend in the data into account. However, it cannot predict cyclic data.
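One classical formulation (Brown's method) smooths the series twice and combines the two results into a level and a trend estimate. The sketch below is illustrative Python with a fixed parameter and fabricated data, not necessarily DigDash's exact formulation:

```python
def brown_double_smoothing(values, alpha=0.5):
    """Apply simple exponential smoothing twice, then combine the two
    smoothed series into a level and a trend (Brown's method).
    Returns the one-step-ahead forecast."""
    def ses(xs):
        s = [xs[0]]
        for x in xs[1:]:
            s.append(alpha * x + (1 - alpha) * s[-1])
        return s

    s1 = ses(values)   # first smoothing
    s2 = ses(s1)       # second smoothing
    level = 2 * s1[-1] - s2[-1]
    trend = alpha / (1 - alpha) * (s1[-1] - s2[-1])
    return level + trend

orders = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical trending series
forecast = brown_double_smoothing(orders)
```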
154
155 Example:
156
157 We want to predict the number of orders placed with a supplier over time. Observing our data, we see a trend but no seasonality: the application conditions of double exponential smoothing are met.
158
159 [[image:predictive_tutorial_en_html_621529e350cc6377.png||height="301" width="554"]]
160 \\
161
162 __Holt smoothing:__
163
164 Holt smoothing is an improved version of double exponential smoothing: it uses two parameters instead of one.
165
166
167 Example:
168
169 [[image:predictive_tutorial_en_html_252d1e51cea2571a.png||height="253" width="554"]]
170
171
172 __Holt-Winters smoothing:__
173
174 Holt-Winters smoothing can handle data with both a trend and a seasonality. DigDash Enterprise offers two versions of Holt-Winters smoothing:
175
176 1. Additive version
177 1. Multiplicative version
178
179 The additive case corresponds to seasons whose amplitude remains constant over time (see image below):
180
181 [[image:predictive_tutorial_en_html_d7e10e7b78cccfc8.png||height="276" width="554"]]
182
183
184 The multiplicative case corresponds to situations where the amplitude of the seasons varies over time (see image below):
185
186
187 [[image:predictive_tutorial_en_html_987d2ad27b73685.png||height="297" width="554"]]
188 Contrary to the first example (the additive case), we can see that the amplitude of the cycles increases over time, hence the need for a multiplicative model instead of an additive one.
189
190 __Fourier transform:__
191
192 The algorithm used to predict is not the Fourier transform itself, but an algorithm based on it. Technically, the goal is to decompose the measure to predict into a sum of sine and cosine functions of different periods. This lets the algorithm handle more complex cycles/seasons than the ones presented in the Holt-Winters section.
193
194 Example:
195
196 In this example we want to predict the number of houses sold by a real estate agency. Observing the data, we can see a complex cyclic pattern, hence the need to use the algorithm based on the Fourier transform rather than a Holt-Winters model.
197
198 [[image:predictive_tutorial_en_html_3f6bcfd1b18ffa1a.png||height="298" width="554"]]
199
200
201 == Values to predict ==
202
203 Predictive measures can predict 3 kinds of values:
204
205 1. The value of the measure
206 1. The lower bound of the 95% confidence interval
207 1. The upper bound of the 95% confidence interval
208
209 The confidence interval is defined so that the predicted value has a 95% probability of lying between its two bounds.
210
211 Be careful: this result is only valid when the data are normally distributed (Gaussian). However, it can still be interesting to plot these intervals even when this hypothesis does not hold.
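As a rough illustration of where such bounds come from (a simplified Python sketch under the Gaussian hypothesis, with hypothetical numbers; DigDash's exact computation may differ), a 95% interval can be built from the standard deviation of the model's residuals:

```python
import math

def residual_std(actual, fitted):
    """Standard deviation of the residuals (actual - fitted)."""
    res = [a - f for a, f in zip(actual, fitted)]
    mean = sum(res) / len(res)
    return math.sqrt(sum((r - mean) ** 2 for r in res) / len(res))

def confidence_bounds(prediction, sigma, z=1.96):
    """~95% interval under a Gaussian assumption: prediction +/- 1.96*sigma."""
    return prediction - z * sigma, prediction + z * sigma

# Hypothetical numbers: a predicted value of 150 with residual std 5.
low, high = confidence_bounds(150.0, 5.0)
```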
212
213 Example:
214
215 We apply the confidence interval on our first example of linear regression.
216
217 [[image:predictive_tutorial_en_html_66334faff9557ca5.png||height="287" width="554"]]
218
219
220 == Mathematical considerations ==
221
222 In the case of algorithms which don't have exact mathematical solutions, the parameters are estimated by minimizing the sum of squared errors.
223
224 The number of iterations used to estimate these parameters cannot yet be modified by the user.
225
226 Moreover, it is important to understand that the algorithms presented above use data from the past in order to predict future values. This implies several points:
227
228 * The more data there is, the better the prediction
229 * Algorithms are based on the past and assume the future will be similar; therefore, structural shocks cannot be predicted
230 * The accuracy of the prediction cannot be guaranteed and the prediction must not be considered deterministic
231
232 = What-if measures =
233
234 A what-if measure aims to study the influence of several measures on a given measure.
235
236 == Creation of a what-if measure ==
237
238 You can create What-if measures based on measures and derived measures.
239
240 Warning: What-if measures only work when the cube is processed server side.
241
242 To create a What-if measure, click the arrow of the drop-down menu located next to the button used to add a **derived measure**. In the menu that appears, select **Create a What-if measure…** The **What-if Analysis** dialog box opens.
243
244 [[image:predictive_tutorial_en_html_ed194e5c9e201179.png||height="251" width="554"]]
245
246
247 Enter the name of your What-if measure.
248
249 In the **Measure to analyze** group you need to select:
250
251 * The measure you want to analyze
252 * The aggregation of this measure
253
254 In the **Dependent measures** group, select the measures you want to include in the model of the measure to analyze. For each measure you need to select:
255
256 1. The aggregation of the measure
257 1. The variable associated with this measure. If you don't want to change the value of the measure, select **None**
258
259 The other fields of the **Dependent measures** group are not required, but their usefulness is explained below.
260
261 The **Automatic model** button automatically selects the measures used to model the measure to analyze. To use it, you need to select the dimension(s) along which the measures are explored to determine the model.
262
263 The **Compute quality of the model** button computes the adjusted R² coefficient, which measures the accuracy of the model while also taking its complexity into account. To compute it, you need to select the measures of your model and their aggregations.
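The adjusted R² can be sketched as follows (illustrative Python with hypothetical values; n is the number of observations and p the number of dependent measures):

```python
def r_squared(actual, fitted):
    """Classical R²: share of the variance explained by the model."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - f) ** 2 for a, f in zip(actual, fitted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """R² penalized for model complexity (p explanatory measures)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# A perfect fit has R² = 1, whatever the complexity penalty.
r2 = r_squared([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])
adj = adjusted_r_squared(0.9, n=100, p=5)   # hypothetical values
```

Unlike plain R², the adjusted version decreases when adding a measure that does not improve the fit enough, which is why it is useful to compare models of different sizes.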
264
265 == Example ==
266
267 Our example concerns a supermarket chain. We have the following data for each store:
268
269 * Annual net sales
270 * Surface
271 * Stocks
272 * Advertising budget
273 * Number of families in the area
274 * Number of competitors in the area
275
276 We want to study the effect of the advertising budget on the annual net sales. To do so, we create a what-if measure by completing the fields as shown in the picture below:
277
278 [[image:predictive_tutorial_en_html_889f2aa21b3aebdf.png||height="251" width="554"]]
279
280
281 We have selected all the measures because we want to use them all to model the annual net sales of our stores. However, since we only want to test the influence of the advertising budget, we leave **Associated variable** set to None for all measures except **k$ spend on ads**.
282
283 The variable associated with the advertising budget is **var ad budget**. It is a classical variable, like those used in derived measures.
284
285 Once the what-if measure has been created, we can create a bar chart displaying the measure **annual net sales (k$)** and our new what-if measure, grouped by store. In the dashboard we add our variable **var ad budget** so that we can change its value and observe its influence on **Estimation net sales (ad budget)**.
286
287 [[image:predictive_tutorial_en_html_796771fc10f44f4c.png||height="216" width="554"]]
288
289
290 == Mathematical model ==
291
292 What-if measures use multiple linear regression to model the relationship between the measure to analyze and the dependent measures. This means that DigDash Enterprise cannot automatically model measures whose relationship is not linear. However, the user can still manually apply transformations to the data by using a data transformer or by creating a derived measure.
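To make the model concrete, here is a minimal sketch of ordinary least squares via the normal equations (illustrative Python with fabricated store data; not DigDash's implementation):

```python
def fit_multiple_linear_regression(X, y):
    """Ordinary least squares via the normal equations (A^T A) b = A^T y.
    X: rows of explanatory values; an intercept column is added."""
    A = [[1.0] + list(row) for row in X]
    n, m = len(A), len(A[0])
    # Build the normal equations.
    M = [[sum(A[k][i] * A[k][j] for k in range(n)) for j in range(m)]
         for i in range(m)]
    v = [sum(A[k][i] * y[k] for k in range(n)) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for i in range(m):
        piv = max(range(i, m), key=lambda r: abs(M[r][i]))
        M[i], M[piv] = M[piv], M[i]
        v[i], v[piv] = v[piv], v[i]
        for r in range(i + 1, m):
            f = M[r][i] / M[i][i]
            for c in range(i, m):
                M[r][c] -= f * M[i][c]
            v[r] -= f * v[i]
    # Back substitution.
    coeffs = [0.0] * m
    for i in range(m - 1, -1, -1):
        coeffs[i] = (v[i] - sum(M[i][j] * coeffs[j]
                                for j in range(i + 1, m))) / M[i][i]
    return coeffs  # [intercept, b1, b2, ...]

# Fabricated stores: net sales = 50 + 2*ad_budget + 0.1*surface.
X = [[10, 100], [20, 200], [30, 150], [40, 300], [15, 250]]
y = [50 + 2 * a + 0.1 * s for a, s in X]
intercept, b_ad, b_surface = fit_multiple_linear_regression(X, y)
```

Each fitted coefficient answers a what-if question: here, every extra unit of advertising budget adds about `b_ad` units of net sales, other measures being held constant.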
293
294 Moreover, as with predictive measures, it is important to understand that the algorithms presented above use data from the past in order to predict future values. This implies several points:
295
296 * The more data there is, the better the prediction
297 * Algorithms are based on the past and assume the future will be similar; therefore, structural shocks cannot be predicted
298 * The accuracy of the prediction cannot be guaranteed and the prediction must not be considered deterministic
299
300 = Smart grouping (clustering) =
301
302 == Creation of a smart grouping ==
303
304 Smart grouping (clustering) groups the members of a dimension according to several measures. The goal is to group “similar” members together, where similarity is meant in the mathematical sense: members whose values are close for the given measures.
305
306 In the **Hierarchies** section, click **Add** then **Smart grouping**. The **Smart grouping editor** dialog box opens:
307
308 [[image:predictive_tutorial_en_html_75a563b04bc9ee46.png||height="315" width="554"]]
309
310
311 The upper part of the window is similar to the one used to create a manual hierarchy. We can define the identifier of the hierarchy (Group 0 by default), add levels, and change their identifiers.
312
313 The differences lie in the **Level definition** and **Details** groups.
314
315 In the **Level definition** group, we define the measures according to which the members of the current level are grouped. You need to select the measures and, for each of them, an aggregation.
316
317 The **Details** group lets you specify:
318
319 1. The number of clusters (groups) of the current level
320 1. The name of these clusters
321
322 The names of these clusters can be composed of several keywords, added with the **Add keyword** button located above the **Cancel** button.
323
324 The keywords available to name the clusters are the following:
325
326 * **${bestMeasures}**, which names the clusters after the 3 most discriminative measures. Each cluster name has the form **Measure1Position,Measure2Position,Measure3Position**, where the position is between 0 and the number of clusters of the level and indicates how the mean of the measure ranks compared to the other clusters (for more details, see the example at the end of this section).
327 * **${Measure(MeasureName)}**, which lets you choose the measures to show in the cluster names.
328
329 By default, the cluster name is **${bestMeasures}**.
330
331 (% class="box warningmessage" %)
332 (((
333 Be careful, it is not possible to combine **${bestMeasures}** with **${Measure(MeasureName)}**.
334 )))
335
336 The user can also add custom text to the cluster names.
337
338 To finalize the first step, click the **Next** button. This triggers the computation of the groups and opens a second window, identical to the one used to create a **manual hierarchy**, but with the groups already built.
339
340 This second step is thus identical to the creation of a manual hierarchy, with exactly the same possibilities. For further details, please refer to the section on the **creation of a manual hierarchy**.
341
342 == Examples ==
343
344 __Example 1:__
345
346 We have data about the stores of a supermarket chain. We want to create 5 groups of stores that are similar in terms of advertising budget and annual net sales. We select **${bestMeasures}** to name our groups of stores.
347
348 [[image:predictive_tutorial_en_html_b10c6cf531cabf2c.png||height="317" width="554"]]
349
350
351 By clicking the **Next** button, we get the following result:
352
353 [[image:predictive_tutorial_en_html_214849af462ca32.png||height="315" width="554"]]
354
355
356 We get 5 groups of stores, as requested in the previous step. The group names follow the pattern **annual net sales k$Position,k$ spend on adsPosition**.
357
358 The group **annual net sales k$0,k$ spend on ads0** contains the stores whose annual net sales and advertising budget are the lowest (rank 0). Conversely, the group **annual net sales k$4,k$ spend on ads4** contains the stores whose annual net sales and advertising budget are the highest.
359
360 __Example 2:__
361
362 We have data about wines:
363
364 * Quality of the wine (average user rating)
365 * Chemical indicators (pH, alcohol content, density, etc.)
366
367 We want to group our wines into four groups depending on their quality and their alcohol content.
368
369 To do so, we complete the **Smart grouping** interface as shown in the picture below, then click the **Next** button:
370
371 [[image:predictive_tutorial_en_html_ec43ac17c6fab1f7.png||height="315" width="554"]]
372 \\
373
374 In the second window we validate the creation of our hierarchy.
375
376 Then we create a **scatter plot**: the alcohol content on the **Y axis** and the quality of the wines on the **X axis**. In the **Bubbles** field we set our dimension id wine at the root level. In the **Cycle color** field we set id wine again, this time at level 0, so that each group gets a different color.
377
378 We get the following graph:
379
380 [[image:predictive_tutorial_en_html_95e3ca2d12161461.png||height="278" width="519"]]
381
382
383 == Mathematical considerations ==
384
385 Groups (clusters) are determined using the k-means++ algorithm, whose result depends on its (random) initialization. Thus, if you create a smart grouping several times (even with the same parameters), you may not always get the same results.
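To illustrate the principle, here is a simplified Python sketch of k-means with a k-means++-style seeding, run on made-up store data (not DigDash's implementation):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means with k-means++-style seeding (simplified sketch).
    points: list of coordinate tuples. Returns one cluster index per point."""
    rng = random.Random(seed)

    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # k-means++ seeding: each new centre is drawn with probability
    # proportional to its squared distance to the nearest existing centre.
    centres = [rng.choice(points)]
    while len(centres) < k:
        d2 = [min(dist2(p, c) for c in centres) for p in points]
        centres.append(rng.choices(points, weights=d2, k=1)[0])

    for _ in range(iters):
        # Assign each point to its nearest centre, then move each centre
        # to the mean of its members.
        labels = [min(range(k), key=lambda i: dist2(p, centres[i]))
                  for p in points]
        for i in range(k):
            members = [p for p, l in zip(points, labels) if l == i]
            if members:
                centres[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

# Two well-separated groups of "stores": (net sales, ad budget).
data = [(1, 1), (1.2, 0.9), (0.8, 1.1), (10, 10), (9.5, 10.2), (10.3, 9.8)]
labels = kmeans(data, k=2)
```

Because the seeding draws centres at random, two runs with different seeds can label or split the groups differently, which is exactly the non-determinism mentioned above.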