Wiki source code of Tutoriel Prédictif

Last modified by jhurst on 2021/04/21 10:01

1 Predictive analytics in DigDash Enterprise is available through three features:
2
3 1. Predictive measures
4 1. What-if measures
5 1. Smart grouping (clustering)
6
7 This document illustrates a possible use of these tools.
8
9 (% class="box warningmessage" %)
10 (((
11 Warning: predictive measures and what-if measures only work when the cube is processed server side.
12 )))
13
14 {{ddtoc/}}
15
16 = Predictive measures =
17
18 == Creation of a predictive measure ==
19
20 You can create **predictive measures** based on measures and derived measures. A predictive measure lets you predict the value of a given measure along a temporal dimension.
21
22 (% class="box warningmessage" %)
23 (((
24 Warning: predictive measures work only when the cube is processed server side.
25 )))
26
27 To create a predictive measure, click the arrow of the drop-down menu located next to the button used to add a **derived measure**. In the menu that appears, select **Create a predictive measure…** The **Predictive measure** dialog box opens:
28
29 [[image:predictive_tutorial_en_html_2265a9181188016b.png||height="269" width="554"]]
30 \\
31
32 Enter the name of your predictive measure.
33
34 In the **Columns** group, you need to specify:
35
36 * The measure you want to predict
37 * The temporal dimension you want to explore in order to make the prediction
38 * The aggregation of this measure
39
40 In the **Model type** group, you need to specify:
41
42 * The algorithm used to predict the measure selected in the **Columns** group
43 * The value to predict (value of the measure, lower and upper bounds of the confidence interval)
44
45 If the selected algorithm is the **moving average**, you also need to indicate the number of points over which the moving average is calculated (the order, equal to 2 by default).
46
47 For each algorithm, a brief description is displayed next to the **help** icon at the bottom left. You can also click the **Help model choice** button to get guidance in choosing your algorithm. This opens the **Help model choice** window:
48
49 [[image:predictive_tutorial_en_html_464b020944755ae.png||height="288" width="523"]]
50 \\
51
52 Answering these two questions reduces the number of available algorithms, making the choice easier.
53
54 Moreover, above the **OK** button in the **Predictive measure** window, you can extend the temporal dimension selected in the **Columns** group.
55
56 == Available models and model choice ==
57
58 DigDash Enterprise offers 9 prediction models, presented in the table below:
59
60 [[image:predictive_tutorial_en_html_e97cfe15ce5a6cd9.png||height="245" width="554"]]
61 \\
62
63 DigDash Enterprise also offers a moving average algorithm, but it cannot be considered a prediction algorithm since it does not predict future values; it is only used to study past data.
64
65 The model choice depends on the measure you want to predict and the kind of modelling you want to do. It is important to ask yourself the following questions:
66
67 * Does my data have a trend?
68 * Does my data have cycles? If so, are they complex?
69 * Do I want to smooth my data?
70 * Do I want a simple model that is easy to visualize, or a more accurate one that is harder to understand?
71
72 Thanks to its help interface (image below), DigDash Enterprise reduces the number of available models by asking the user the first two questions.
73
74 [[image:predictive_tutorial_en_html_464b020944755ae.png||height="288" width="523"]]
75 \\
76
77 The next section presents the different models in more detail so that the user can understand them more easily.
78
79 == Presentation of the models ==
80
81 The following definitions of the algorithms should be understood in the context of predictive measures in DigDash Enterprise.
82
83 __Linear regression:__
84
85 Linear regression is a basic modelling tool. It searches for a linear relationship between the measure to predict and the time axis.
86
87 This model is ideally suited when the measure to predict is proportional to the time axis. Even when it is not, the model may still be worth choosing for its simple visualization (a line), which makes it easily understandable by a wide audience.
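As an illustration of the idea (a minimal Python sketch with hypothetical data, not DigDash's implementation), a least-squares line can be fitted to a measure indexed by time as follows:

```python
def linear_fit(ts, ys):
    """Least-squares fit y = a*t + b; returns (a, b)."""
    n = len(ts)
    mt = sum(ts) / n
    my = sum(ys) / n
    cov = sum((t - mt) * (y - my) for t, y in zip(ts, ys))
    var = sum((t - mt) ** 2 for t in ts)
    a = cov / var
    b = my - a * mt
    return a, b

def linear_predict(a, b, t):
    return a * t + b

# A perfectly linear turnover: the fit recovers the exact trend.
years = [2010, 2011, 2012, 2013, 2014]
turnover = [100.0, 110.0, 120.0, 130.0, 140.0]
a, b = linear_fit(years, turnover)
forecast_2015 = linear_predict(a, b, 2015)
```

On real data the fit is rarely exact, but the fitted line still shows the overall trend at a glance.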
88
89 Example 1:
90
91 We want to model the closing price of the Google share over time.
92
93 [[image:predictive_tutorial_en_html_21f30c92fdca7d46.png||height="277" width="554"]]
94 \\
95
96 We can see that the forecast is not very accurate in this case. However, it makes it easy to visualize the trend of the curve.
97
98 Example 2:
99
100 We want to model the evolution of the turnover of a company over time.
101
102 [[image:predictive_tutorial_en_html_cb8d5ededb752504.png||height="300" width="554"]]
103 \\
104
105 This example is an ideal case for linear regression: there is a strong linear relationship between the turnover and the time axis.
106
107 __Logarithmic regression:__
108
109 Logarithmic regression has the same properties as linear regression, except that it searches for a logarithmic relationship between the measure to predict and the time axis.
110
111 Example:
112
113 [[image:predictive_tutorial_en_html_97aba1e3dac31552.png||height="285" width="554"]]
114 \\
115
116 __Polynomial regression:__
117
118 Polynomial regression is a more complex form of linear regression: it approximates the measure not with a line, but with a polynomial of order 2 or 3.
119
120 Example:
121
122 [[image:predictive_tutorial_en_html_32c98e09aaa684c7.png||height="296" width="554"]]
123 \\
124
125 __Moving average:__
126
127 The goal of the moving average is not to predict but to smooth the data in order to eliminate the least significant fluctuations. For example, a moving average of order 3 is a sliding average which, for each point **p**, computes the average of points **p-1**, **p** and **p+1**. In this case, the moving average can only be computed from the second point up to the penultimate one.
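The order-3 computation described above can be sketched as follows (an illustrative Python snippet with made-up values; it assumes a centered window with an odd order, whereas DigDash's default order is 2):

```python
def moving_average(values, order=3):
    """Centered moving average: each point is replaced by the mean of the
    window of `order` points around it, so for order 3 the result is only
    defined from the second point up to the penultimate one."""
    half = order // 2
    return [sum(values[i - half:i + half + 1]) / order
            for i in range(half, len(values) - half)]

consumption = [10, 30, 20, 60, 40]   # hypothetical yearly consumption
smoothed = moving_average(consumption)
# averages of (10,30,20), (30,20,60) and (20,60,40)
```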
128
129 Example:
130
131 [[image:predictive_tutorial_en_html_2dd1b375a1f84c00.png||height="292" width="554"]]
132 \\
133
134 We can see that the moving average of order 3 smoothed the water-consumption curve. It removed abrupt variations, such as the one in 1953, so that the user can focus on the overall shape of the curve.
135
136 __Simple exponential smoothing:__
137
138 Unlike classical regression techniques, which are not specific to time series, smoothing takes the specific nature of the temporal variable into account: the importance given to a value decreases over time. For example, to predict the turnover for 2017, we naturally give more weight to the 2016 turnover than to the 2008 one. The various smoothing techniques account for this depreciation of information over time.
139
140 Simple exponential smoothing smooths the data and also predicts the next value. It applies to data showing neither trend nor cycle.
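The recursion behind simple exponential smoothing can be sketched as follows (illustrative Python with made-up values; in practice the smoothing parameter is estimated from the data, here it is fixed to 0.5 for readability):

```python
def simple_exponential_smoothing(values, alpha=0.5):
    """s[0] = x[0]; s[t] = alpha * x[t] + (1 - alpha) * s[t-1].
    Recent points weigh more than old ones; the last smoothed value
    serves as the one-step-ahead forecast."""
    smoothed = [values[0]]
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

costs = [12.0, 8.0, 10.0, 9.0]   # hypothetical communication costs
s = simple_exponential_smoothing(costs)
next_value_forecast = s[-1]
```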
141
142 Example:
143
144 We want to predict the communication cost over time. The data are quite chaotic: they show no trend or cycle. Prediction is difficult in this situation, which is why exponential smoothing predicts only the next value.
145
146
147
148 [[image:predictive_tutorial_en_html_8cb67f915d976636.png||height="253" width="554"]]
149 \\
150
151 __Double exponential smoothing:__
152
153 Double exponential smoothing is an improved version of simple exponential smoothing: it can take a trend in the data into account. However, it cannot predict cyclic data.
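One classical formulation (Brown's method) smooths the series twice and combines the two results into a level and a trend estimate. The sketch below is illustrative Python with a fixed parameter and fabricated data, not necessarily DigDash's exact formulation:

```python
def brown_double_smoothing(values, alpha=0.5):
    """Apply simple exponential smoothing twice, then combine the two
    smoothed series into a level and a trend (Brown's method).
    Returns the one-step-ahead forecast."""
    def ses(xs):
        s = [xs[0]]
        for x in xs[1:]:
            s.append(alpha * x + (1 - alpha) * s[-1])
        return s

    s1 = ses(values)   # first smoothing
    s2 = ses(s1)       # second smoothing
    level = 2 * s1[-1] - s2[-1]
    trend = alpha / (1 - alpha) * (s1[-1] - s2[-1])
    return level + trend

orders = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical trending series
forecast = brown_double_smoothing(orders)
```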
154
155 Example:
156
157 We want to predict the number of orders placed with a supplier over time. Observing our data, we see a trend but no seasonality: the application conditions of double exponential smoothing are met.
158
159 [[image:predictive_tutorial_en_html_621529e350cc6377.png||height="301" width="554"]]
160 \\
161
162 __Holt smoothing:__
163
164 Holt smoothing is an improved version of double exponential smoothing: it uses two parameters instead of one.
165
166
167 Example:
168
169 [[image:predictive_tutorial_en_html_252d1e51cea2571a.png||height="253" width="554"]]
170
171
172 __Holt-Winters smoothing:__
173
174 Holt-Winters smoothing can handle data with both a trend and a seasonality. DigDash Enterprise offers two versions of Holt-Winters smoothing:
175
176 1. Additive version
177 1. Multiplicative version
178
179 The additive case corresponds to seasons whose amplitude remains constant over time (see image below):
180
181 [[image:predictive_tutorial_en_html_d7e10e7b78cccfc8.png||height="276" width="554"]]
182
183
184 The multiplicative case corresponds to situations where the amplitude of the seasons varies over time (see image below):
185
186
187 [[image:predictive_tutorial_en_html_987d2ad27b73685.png||height="297" width="554"]]
188 Contrary to the first example (the additive case), we can see that the amplitude of the cycles increases over time, hence the need for a multiplicative model instead of an additive one.
189
190 __Fourier transform:__
191
192 The algorithm used to predict is not the Fourier transform itself, but an algorithm based on it. Technically, the goal is to decompose the measure to predict into a sum of sine and cosine functions of different periods. This lets the algorithm handle more complex cycles/seasons than the ones presented in the Holt-Winters section.
193
194 Example:
195
196 In this example we want to predict the number of houses sold by a real estate agency. Observing the data, we can see a complex cyclic pattern, hence the need to use the algorithm based on the Fourier transform rather than a Holt-Winters model.
197
198 [[image:predictive_tutorial_en_html_3f6bcfd1b18ffa1a.png||height="298" width="554"]]
199
200
201 == Values to predict ==
202
203 Predictive measures can predict 3 kinds of values:
204
205 1. The value of the measure
206 1. The lower bound of the 95% confidence interval
207 1. The upper bound of the 95% confidence interval
208
209 The confidence interval is defined so that the predicted value has a 95% probability of lying between its two bounds.
210
211 Be careful: this result is only valid when the data are normally distributed (Gaussian). However, it can still be interesting to plot these intervals even when this hypothesis does not hold.
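As a rough illustration of where such bounds come from (a simplified Python sketch under the Gaussian hypothesis, with hypothetical numbers; DigDash's exact computation may differ), a 95% interval can be built from the standard deviation of the model's residuals:

```python
import math

def residual_std(actual, fitted):
    """Standard deviation of the residuals (actual - fitted)."""
    res = [a - f for a, f in zip(actual, fitted)]
    mean = sum(res) / len(res)
    return math.sqrt(sum((r - mean) ** 2 for r in res) / len(res))

def confidence_bounds(prediction, sigma, z=1.96):
    """~95% interval under a Gaussian assumption: prediction +/- 1.96*sigma."""
    return prediction - z * sigma, prediction + z * sigma

# Hypothetical numbers: a predicted value of 150 with residual std 5.
low, high = confidence_bounds(150.0, 5.0)
```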
212
213 Example:
214
215 We apply the confidence interval on our first example of linear regression.
216
217 [[image:predictive_tutorial_en_html_66334faff9557ca5.png||height="287" width="554"]]
218
219
220 == Mathematical considerations ==
221
222 In the case of algorithms which don't have exact mathematical solutions, the parameters are estimated by minimizing the sum of squared errors.
223
224 The number of iterations used to estimate these parameters cannot yet be modified by the user.
225
226 Moreover, it is important to understand that the algorithms presented above use data from the past in order to predict future values. This implies several points:
227
228 * The more data there is, the better the prediction
229 * Algorithms are based on the past and assume the future will be similar; therefore, structural shocks cannot be predicted
230 * The accuracy of the prediction cannot be guaranteed and the prediction must not be considered deterministic
231
232 = What-if measures =
233
234 A what-if measure aims to study the influence of several measures on a given measure.
235
236 == Creation of a what-if measure ==
237
238 You can create What-if measures based on measures and derived measures.
239
240 Warning: What-if measures only work when the cube is processed server side.
241
242 To create a What-if measure, click the arrow of the drop-down menu located next to the button used to add a **derived measure**. In the menu that appears, select **Create a What-if measure…** The **What-if Analysis** dialog box opens.
243
244 [[image:predictive_tutorial_en_html_ed194e5c9e201179.png||height="251" width="554"]]
245
246
247 Enter the name of your What-if measure.
248
249 In the **Measure to analyze** group you need to select:
250
251 * The measure you want to analyze
252 * The aggregation of this measure
253
254 In the **Dependent measures** group, select the measures you want to include in the model of the measure to analyze. For each measure you need to select:
255
256 1. The aggregation of the measure
257 1. The variable associated with this measure. If you don't want to change the value of the measure, select **None**
258
259 The other fields of the **Dependent measures** group are not required, but their usefulness is explained below.
260
261 The **Automatic model** button automatically selects the measures used to model the measure to analyze. To use it, you need to select the dimension(s) along which the measures are explored to determine the model.
262
263 The **Compute quality of the model** button computes the adjusted R² coefficient, which measures the accuracy of the model while also taking its complexity into account. To compute it, you need to select the measures of your model and their aggregations.
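The adjusted R² can be sketched as follows (illustrative Python with hypothetical values; n is the number of observations and p the number of dependent measures):

```python
def r_squared(actual, fitted):
    """Classical R²: share of the variance explained by the model."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - f) ** 2 for a, f in zip(actual, fitted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """R² penalized for model complexity (p explanatory measures)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# A perfect fit has R² = 1, whatever the complexity penalty.
r2 = r_squared([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])
adj = adjusted_r_squared(0.9, n=100, p=5)   # hypothetical values
```

Unlike plain R², the adjusted version decreases when adding a measure that does not improve the fit enough, which is why it is useful to compare models of different sizes.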
264
265 == Example ==
266
267 Our example concerns a supermarket chain. We have the following data for each store:
268
269 * Annual net sales
270 * Surface
271 * Stocks
272 * Advertising budget
273 * Number of families in the area
274 * Number of competitors in the area
275
276 We want to study the effect of the advertising budget on the annual net sales. To do so, we create a what-if measure by completing the fields as shown in the picture below:
277
278 [[image:predictive_tutorial_en_html_889f2aa21b3aebdf.png||height="251" width="554"]]
279
280
281 We have selected all the measures because we want to use them all to model the annual net sales of our stores. However, since we only want to test the influence of the advertising budget, we leave **Associated variable** set to None for all measures except **k$ spend on ads**.
282
283 The variable associated with the advertising budget is **var ad budget**. It is a classical variable, like those used in derived measures.
284
285 Once the what-if measure has been created, we can create a bar chart displaying the measure **annual net sales (k$)** and our new what-if measure, grouped by store. In the dashboard we add our variable **var ad budget** so that we can change its value and observe its influence on **Estimation net sales (ad budget)**.
286
287 [[image:predictive_tutorial_en_html_796771fc10f44f4c.png||height="216" width="554"]]
288
289
290 == Mathematical model ==
291
292 What-if measures use multiple linear regression to model the relationship between the measure to analyze and the dependent measures. This means that DigDash Enterprise cannot automatically model measures whose relationship is not linear. However, the user can still manually apply transformations to the data by using a data transformer or by creating a derived measure.
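To make the model concrete, here is a minimal sketch of ordinary least squares via the normal equations (illustrative Python with fabricated store data; not DigDash's implementation):

```python
def fit_multiple_linear_regression(X, y):
    """Ordinary least squares via the normal equations (A^T A) b = A^T y.
    X: rows of explanatory values; an intercept column is added."""
    A = [[1.0] + list(row) for row in X]
    n, m = len(A), len(A[0])
    # Build the normal equations.
    M = [[sum(A[k][i] * A[k][j] for k in range(n)) for j in range(m)]
         for i in range(m)]
    v = [sum(A[k][i] * y[k] for k in range(n)) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for i in range(m):
        piv = max(range(i, m), key=lambda r: abs(M[r][i]))
        M[i], M[piv] = M[piv], M[i]
        v[i], v[piv] = v[piv], v[i]
        for r in range(i + 1, m):
            f = M[r][i] / M[i][i]
            for c in range(i, m):
                M[r][c] -= f * M[i][c]
            v[r] -= f * v[i]
    # Back substitution.
    coeffs = [0.0] * m
    for i in range(m - 1, -1, -1):
        coeffs[i] = (v[i] - sum(M[i][j] * coeffs[j]
                                for j in range(i + 1, m))) / M[i][i]
    return coeffs  # [intercept, b1, b2, ...]

# Fabricated stores: net sales = 50 + 2*ad_budget + 0.1*surface.
X = [[10, 100], [20, 200], [30, 150], [40, 300], [15, 250]]
y = [50 + 2 * a + 0.1 * s for a, s in X]
intercept, b_ad, b_surface = fit_multiple_linear_regression(X, y)
```

Each fitted coefficient answers a what-if question: here, every extra unit of advertising budget adds about `b_ad` units of net sales, other measures being held constant.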
293
294 Moreover, as with predictive measures, it is important to understand that the algorithms presented above use data from the past in order to predict future values. This implies several points:
295
296 * The more data there is, the better the prediction
297 * Algorithms are based on the past and assume the future will be similar; therefore, structural shocks cannot be predicted
298 * The accuracy of the prediction cannot be guaranteed and the prediction must not be considered deterministic
299
300 = Smart grouping (clustering) =
301
302 == Creation of a smart grouping ==
303
304 Smart grouping (clustering) groups the members of a dimension according to several measures. The goal is to group “similar” members together, where similarity is meant in the mathematical sense: members whose values are close for the given measures.
305
306 In the **Hierarchies** section, click **Add** then **Smart grouping**. The **Smart grouping editor** dialog box opens:
307
308 [[image:predictive_tutorial_en_html_75a563b04bc9ee46.png||height="315" width="554"]]
309
310
311 The upper part of the window is similar to the one used to create a manual hierarchy. We can define the identifier of the hierarchy (Group 0 by default), add levels, and change their identifiers.
312
313 The differences lie in the **Level definition** and **Details** groups.
314
315 In the **Level definition** group, we define the measures according to which the members of the current level are grouped. You need to select the measures and, for each of them, an aggregation.
316
317 The **Details** group lets you specify:
318
319 1. The number of clusters (groups) of the current level
320 1. The name of these clusters
321
322 The names of these clusters can be composed of several keywords, added with the **Add keyword** button located above the **Cancel** button.
323
324 The keywords available to name the clusters are the following:
325
326 * **${bestMeasures}**, which names the clusters after the 3 most discriminative measures. Each cluster name has the form **Measure1Position,Measure2Position,Measure3Position**, where the position is between 0 and the number of clusters of the level and indicates how the mean of the measure ranks compared to the other clusters (for more details, see the example at the end of this section).
327 * **${Measure(MeasureName)}**, which lets you choose the measures to show in the cluster names.
328
329 By default, the cluster name is **${bestMeasures}**.
330
331 (% class="box warningmessage" %)
332 (((
333 Be careful, it is not possible to combine **${bestMeasures}** with **${Measure(MeasureName)}**.
334 )))
335
336 The user can also add custom text to the cluster names.
337
338 To finalize the first step, click the **Next** button. This triggers the computation of the groups and opens a second window, identical to the one used to create a **manual hierarchy**, but with the groups already built.
339
340 This second step is thus identical to the creation of a manual hierarchy, with exactly the same possibilities. For further details, please refer to the section on the **creation of a manual hierarchy**.
341
342 == Examples ==
343
344 __Example 1:__
345
346 We have data about the stores of a supermarket chain. We want to create 5 groups of stores that are similar in terms of advertising budget and annual net sales. We select **${bestMeasures}** to name our groups of stores.
347
348 [[image:predictive_tutorial_en_html_b10c6cf531cabf2c.png||height="317" width="554"]]
349
350
351 By clicking the **Next** button, we get the following result:
352
353 [[image:predictive_tutorial_en_html_214849af462ca32.png||height="315" width="554"]]
354
355
356 We get 5 groups of stores, as requested in the previous step. The group names follow the pattern **annual net sales k$Position,k$ spend on adsPosition**.
357
358 The group **annual net sales k$0,k$ spend on ads0** contains the stores whose annual net sales and advertising budget are the lowest (rank 0). Conversely, the group **annual net sales k$4,k$ spend on ads4** contains the stores whose annual net sales and advertising budget are the highest.
359
360 __Example 2:__
361
362 We have data about wines:
363
364 * Quality of the wine (average user rating)
365 * Chemical indicators (pH, alcohol content, density, etc.)
366
367 We want to group our wines into four groups depending on their quality and their alcohol content.
368
369 To do so, we complete the **Smart grouping** interface as shown in the picture below, then click the **Next** button:
370
371 [[image:predictive_tutorial_en_html_ec43ac17c6fab1f7.png||height="315" width="554"]]
372 \\
373
374 In the second window we validate the creation of our hierarchy.
375
376 Then we create a **scatter plot**: the alcohol content on the **Y axis** and the quality of the wines on the **X axis**. In the **Bubbles** field we set our dimension id wine at the root level. In the **Cycle color** field we set id wine again, this time at level 0, so that each group gets a different color.
377
378 We get the following graph:
379
380 [[image:predictive_tutorial_en_html_95e3ca2d12161461.png||height="278" width="519"]]
381
382
383 == Mathematical considerations ==
384
385 Groups (clusters) are determined using the k-means++ algorithm, whose result depends on its (random) initialization. Thus, if you create a smart grouping several times (even with the same parameters), you may not always get the same results.
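To illustrate the principle, here is a simplified Python sketch of k-means with a k-means++-style seeding, run on made-up store data (not DigDash's implementation):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means with k-means++-style seeding (simplified sketch).
    points: list of coordinate tuples. Returns one cluster index per point."""
    rng = random.Random(seed)

    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # k-means++ seeding: each new centre is drawn with probability
    # proportional to its squared distance to the nearest existing centre.
    centres = [rng.choice(points)]
    while len(centres) < k:
        d2 = [min(dist2(p, c) for c in centres) for p in points]
        centres.append(rng.choices(points, weights=d2, k=1)[0])

    for _ in range(iters):
        # Assign each point to its nearest centre, then move each centre
        # to the mean of its members.
        labels = [min(range(k), key=lambda i: dist2(p, centres[i]))
                  for p in points]
        for i in range(k):
            members = [p for p, l in zip(points, labels) if l == i]
            if members:
                centres[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

# Two well-separated groups of "stores": (net sales, ad budget).
data = [(1, 1), (1.2, 0.9), (0.8, 1.1), (10, 10), (9.5, 10.2), (10.3, 9.8)]
labels = kmeans(data, k=2)
```

Because the seeding draws centres at random, two runs with different seeds can label or split the groups differently, which is exactly the non-determinism mentioned above.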