Load Data and look at data structure

shipped_raw <- read.csv("POS Output v3.csv",stringsAsFactors=FALSE,fileEncoding="UTF-8-BOM")
shipped_data <- shipped_raw
str(shipped_data)
## 'data.frame':	7195307 obs. of  15 variables:
##  $ Date.Fiscal.Year.     : int  2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 ...
##  $ Date.Fiscal.Week.     : int  45 45 45 45 45 43 43 43 43 43 ...
##  $ Product.Sales.Code.   : chr  "LUB10" "LUB10" "LUB10" "LUB10" ...
##  $ Product.Macrocategory.: chr  "Foliage" "Foliage" "Foliage" "Foliage" ...
##  $ Product.Genus.        : chr  "Calathea" "Calathea" "Calathea" "Calathea" ...
##  $ X.Sales_Qty.          : int  114 103 25 16 18 2 20 66 62 14 ...
##  $ X.Shipped_Qty.        : int  0 0 10 NA 0 NA NA 25 30 50 ...
##  $ X.COGS.               : num  890 805 195 125 143 ...
##  $ Customer.Customer.    : chr  "Lowe's" "Lowe's" "Lowe's" "Lowe's" ...
##  $ Store.Macro.Region.   : chr  "SOUTHEAST" "SOUTHEAST" "SOUTHEAST" "SOUTHCENTRAL" ...
##  $ Store.Region.         : chr  "Region 2" "Region 2" "Region 2" "Region 7" ...
##  $ Store.Sub.Region.     : chr  "8" "7" "8.1" "9" ...
##  $ Product.Category.     : chr  "FLG - Large Indoor" "FLG - Large Indoor" "FLG - Large Indoor" "FLG - Large Indoor" ...
##  $ X.Sales_Amt.          : num  1139 803 221 177 133 ...
##  $ X.LY_Shipped_Qty.     : int  NA NA NA NA NA NA NA 15 38 NA ...
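As a side note, a file with roughly 7.2 million rows loads noticeably faster with data.table::fread than with read.csv. A minimal sketch, assuming the data.table package is installed and the same file name:

library(data.table)
# fread auto-detects the delimiter and strips the UTF-8 BOM; convert back to a
# plain data.frame so the rest of the code is unchanged
shipped_raw <- fread("POS Output v3.csv", encoding = "UTF-8")
shipped_data <- as.data.frame(shipped_raw)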

Check null values by column

colSums(sapply(shipped_data,is.na))
##      Date.Fiscal.Year.      Date.Fiscal.Week.    Product.Sales.Code. 
##                  50159                  50159                      0 
## Product.Macrocategory.         Product.Genus.           X.Sales_Qty. 
##                      0                      0                 428379 
##         X.Shipped_Qty.                X.COGS.     Customer.Customer. 
##                5988538                 436720                      0 
##    Store.Macro.Region.          Store.Region.      Store.Sub.Region. 
##                      0                      0                      0 
##      Product.Category.           X.Sales_Amt.      X.LY_Shipped_Qty. 
##                      0                 428379                5938308

Drop rows that have null values for Year/Week and check nulls again

shipped_data <- shipped_data[!is.na(shipped_data$Date.Fiscal.Year.), ] 
colSums(sapply(shipped_data,is.na))
##      Date.Fiscal.Year.      Date.Fiscal.Week.    Product.Sales.Code. 
##                      0                      0                      0 
## Product.Macrocategory.         Product.Genus.           X.Sales_Qty. 
##                      0                      0                 378220 
##         X.Shipped_Qty.                X.COGS.     Customer.Customer. 
##                5938401                 386561                      0 
##    Store.Macro.Region.          Store.Region.      Store.Sub.Region. 
##                      0                      0                      0 
##      Product.Category.           X.Sales_Amt.      X.LY_Shipped_Qty. 
##                      0                 378220                5938308

Find the number of rows where Shipped Quantity is null for both this year and the prior year.

nrow(shipped_data[is.na(shipped_data$X.Shipped_Qty.) & is.na(shipped_data$X.LY_Shipped_Qty.),])
## [1] 5217147

Of the rows missing the current-year Shipped Quantity, find what percentage are also missing the prior-year value.

nrow(shipped_data[is.na(shipped_data$X.Shipped_Qty.) & is.na(shipped_data$X.LY_Shipped_Qty.),])/nrow(shipped_data[is.na(shipped_data$X.Shipped_Qty.),])
## [1] 0.8785441

It looks like the Shipped Quantity data was joined to a larger data set, which resulted in nulls for the column combinations that didn’t join one-to-one.

For now I will simply replace these null values with zeroes in order to keep the additional information stored in those rows.

shipped_data$X.Shipped_Qty.[is.na(shipped_data$X.Shipped_Qty.)] <- 0
shipped_data$X.LY_Shipped_Qty.[is.na(shipped_data$X.LY_Shipped_Qty.)] <- 0
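An equivalent, slightly more compact way to do the same replacement, assuming dplyr and tidyr are loaded:

# Replace NA with 0 in both shipped-quantity columns in one call
shipped_data <- shipped_data %>%
  mutate(across(c(X.Shipped_Qty., X.LY_Shipped_Qty.), ~ replace_na(.x, 0)))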

Find the number of rows that are missing both Sales Qty and Sales Amount. They turn out to be the same rows.

nrow(shipped_data[is.na(shipped_data$X.Sales_Qty.) & is.na(shipped_data$X.Sales_Amt.),])
## [1] 378220

Find the number of rows that are missing both Sales Qty and COGS. COGS has a few additional missing rows, but for the most part the two columns are missing the same rows.

nrow(shipped_data[is.na(shipped_data$X.Sales_Qty.) & is.na(shipped_data$X.COGS.),])
## [1] 378220
nrow(shipped_data[is.na(shipped_data$X.Sales_Qty.) & is.na(shipped_data$X.COGS.),])/nrow(shipped_data[is.na(shipped_data$X.COGS.),])
## [1] 0.9784226

For now I will just change all of these remaining nulls to zero.

shipped_data[is.na(shipped_data)] <- 0

Create a date variable for viewing the time series. First, I pad week numbers 1 through 9 with a leading zero so that the year-week strings sort and convert correctly.

shipped_data$Date.Fiscal.Week.[shipped_data$Date.Fiscal.Week.<10] <- paste0("0",shipped_data$Date.Fiscal.Week.[shipped_data$Date.Fiscal.Week.<10])
shipped_data$Date <- paste(shipped_data$Date.Fiscal.Year.,shipped_data$Date.Fiscal.Week.,sep = "-")
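An alternative sketch that does the padding and concatenation in one step with sprintf, avoiding the partial assignment and the implicit coercion of the week column to character:

# Zero-pad the fiscal week to two digits while building the year-week key
shipped_data$Date <- paste(shipped_data$Date.Fiscal.Year.,
                           sprintf("%02d", as.integer(shipped_data$Date.Fiscal.Week.)),
                           sep = "-")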

Change variables to the correct data types

rapply(shipped_data,function(x)length(unique(x)))
##      Date.Fiscal.Year.      Date.Fiscal.Week.    Product.Sales.Code. 
##                      6                     53                   5980 
## Product.Macrocategory.         Product.Genus.           X.Sales_Qty. 
##                     11                    542                   8563 
##         X.Shipped_Qty.                X.COGS.     Customer.Customer. 
##                   7777                 705971                      3 
##    Store.Macro.Region.          Store.Region.      Store.Sub.Region. 
##                      8                     14                     18 
##      Product.Category.           X.Sales_Amt.      X.LY_Shipped_Qty. 
##                     45                 861752                   7776 
##                   Date 
##                    310
act_factors <- names(shipped_data)[which(sapply(shipped_data, function(x) length(unique(x)) <= 100) |
                                           sapply(shipped_data, is.character))]
shipped_data[act_factors]<- lapply(shipped_data[act_factors],factor)
str(shipped_data)
## 'data.frame':	7145148 obs. of  16 variables:
##  $ Date.Fiscal.Year.     : Factor w/ 6 levels "2015","2016",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Date.Fiscal.Week.     : Factor w/ 53 levels "01","02","03",..: 45 45 45 45 45 43 43 43 43 43 ...
##  $ Product.Sales.Code.   : Factor w/ 5980 levels "","1.00ACHFRV",..: 4360 4360 4360 4360 4360 4360 4360 4360 4360 4360 ...
##  $ Product.Macrocategory.: Factor w/ 11 levels "","Annuals","Blooming Tropicals",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Product.Genus.        : Factor w/ 542 levels "","Acalypha",..: 79 79 79 79 79 79 79 79 79 79 ...
##  $ X.Sales_Qty.          : num  114 103 25 16 18 2 20 66 62 14 ...
##  $ X.Shipped_Qty.        : num  0 0 10 0 0 0 0 25 30 50 ...
##  $ X.COGS.               : num  890 805 195 125 143 ...
##  $ Customer.Customer.    : Factor w/ 3 levels "Home Depot","Lowe's",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Store.Macro.Region.   : Factor w/ 8 levels "","N/A","NORTHEAST/CENTRAL",..: 6 6 6 5 5 6 5 6 6 6 ...
##  $ Store.Region.         : Factor w/ 14 levels "","N/A","Region 1",..: 7 7 7 12 12 7 12 7 7 7 ...
##  $ Store.Sub.Region.     : Factor w/ 18 levels "","#N/A","10",..: 14 13 15 16 14 12 16 13 14 15 ...
##  $ Product.Category.     : Factor w/ 45 levels "","ANS - Annuals",..: 23 23 23 23 23 23 23 23 23 23 ...
##  $ X.Sales_Amt.          : num  1139 803 221 177 133 ...
##  $ X.LY_Shipped_Qty.     : num  0 0 0 0 0 0 0 15 38 0 ...
##  $ Date                  : Factor w/ 310 levels "2015-01","2015-02",..: 204 204 204 204 204 202 202 202 202 202 ...

Next, I need to aggregate my data by the group of variables at which I want to build a forecast.

shipped_agg <- shipped_data%>%
  group_by(Date.Fiscal.Year.,Date.Fiscal.Week.,Customer.Customer.,Store.Sub.Region.,Product.Category.,Date)%>%
  summarize(CY_Shipped = sum(X.Shipped_Qty.),
            PY_Shipped = sum(X.LY_Shipped_Qty.),
            COGS = sum(X.COGS.),
            Sales_Qty = sum(X.Sales_Qty.),
            Sales_Amt = sum(X.Sales_Amt.))

shipped_agg <- as.data.frame(shipped_agg)
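A quick sanity check that the aggregation did not drop anything: the total shipped quantity should be the same before and after grouping.

# Both numbers should match, since summing the group totals equals summing all rows
c(raw = sum(shipped_data$X.Shipped_Qty.), aggregated = sum(shipped_agg$CY_Shipped))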

Now I will inspect the data visually to see if anything abnormal catches my attention. First, I will inspect the Date variable I built earlier.

my_breaks <- seq(1,length(unique(shipped_agg$Date)),by=10)

ggplot(shipped_agg, aes(x = Date, y=CY_Shipped )) +
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle=65, vjust=0.6))+
  scale_x_discrete(
    breaks=levels(shipped_agg$Date)[my_breaks])

It’s hard to tell, but it looks like we are missing data at the end of the Date range. Let’s check it out.

kable(shipped_agg%>%
  group_by(Date)%>%
  summarize(count=n(),
            total = sum(CY_Shipped)))%>%
  kable_styling() %>%
  scroll_box(width = "500px", height = "200px")
Date count total
2015-01 1104 176314
2015-02 1025 475680
2015-03 1025 718916
2015-04 1034 768818
2015-05 1002 1265184
2015-06 1010 590407
2015-07 1040 589550
2015-08 1012 858661
2015-09 1024 1334855
2015-10 1022 1842944
2015-11 1034 2208468
2015-12 1043 3725464
2015-13 1059 3351484
2015-14 1074 3972595
2015-15 1080 3560389
2015-16 1100 3600819
2015-17 1117 3460655
2015-18 1117 3585257
2015-19 1120 2972426
2015-20 1129 3073670
2015-21 1125 2738860
2015-22 1112 2226360
2015-23 1118 1497526
2015-24 1112 1441926
2015-25 1128 1176588
2015-26 1106 1153929
2015-27 1114 776074
2015-28 1120 813471
2015-29 1128 868441
2015-30 1126 799869
2015-31 1110 1054440
2015-32 1111 594522
2015-33 1102 961393
2015-34 1114 1006242
2015-35 1095 900457
2015-36 1096 1056685
2015-37 1094 673180
2015-38 1102 1019734
2015-39 1106 1085217
2015-40 1106 943417
2015-41 1103 688043
2015-42 1107 508911
2015-43 1123 845380
2015-44 1111 1109528
2015-45 1115 1237812
2015-46 1106 756133
2015-47 1113 1096805
2015-48 1099 820066
2015-49 1096 553715
2015-50 1092 787819
2015-51 1096 734825
2015-52 1093 196751
2015-53 1090 922020
2016-01 1087 37166
2016-02 1087 651738
2016-03 1078 800541
2016-04 1053 547300
2016-05 1041 1266935
2016-06 1039 654449
2016-07 1022 943461
2016-08 1042 936292
2016-09 1049 1712264
2016-10 1063 1735562
2016-11 1066 3207721
2016-12 1092 3794607
2016-13 1091 4047196
2016-14 1105 3482648
2016-15 1122 3944928
2016-16 1141 3671362
2016-17 1154 3831141
2016-18 1154 3929474
2016-19 1159 3175573
2016-20 1168 3318393
2016-21 1168 2721664
2016-22 1168 2454912
2016-23 1166 1492611
2016-24 1172 1595307
2016-25 1171 1340768
2016-26 1166 1237812
2016-27 1160 1241507
2016-28 1160 676702
2016-29 1162 997467
2016-30 1153 813261
2016-31 1155 1560370
2016-32 1149 933326
2016-33 1140 782842
2016-34 1140 1013357
2016-35 1144 827290
2016-36 1149 1191819
2016-37 1146 908669
2016-38 1148 1151976
2016-39 1145 1014722
2016-40 1162 841974
2016-41 1142 547914
2016-42 1154 891777
2016-43 1157 806714
2016-44 1158 1171225
2016-45 1151 1077085
2016-46 1155 745533
2016-47 1149 812291
2016-48 1144 954714
2016-49 1140 647225
2016-50 1136 492482
2016-51 1131 857453
2016-52 1121 651736
2016-53 1116 1055611
2017-01 1104 507358
2017-02 1090 720732
2017-03 1077 1036649
2017-04 1085 1386120
2017-05 1083 1279695
2017-06 1094 1518976
2017-07 1084 1504738
2017-08 1104 2248669
2017-09 1103 2727403
2017-10 1096 3387558
2017-11 1116 2110491
2017-12 1116 2870845
2017-13 1102 3972457
2017-14 1117 3938165
2017-15 1145 4077093
2017-16 1147 3777769
2017-17 1164 4243976
2017-18 1147 3438270
2017-19 1164 2436509
2017-20 1150 2473438
2017-21 1144 3048414
2017-22 1139 1494081
2017-23 1128 1642386
2017-24 1138 1475733
2017-25 1127 1772163
2017-26 1127 1256671
2017-27 1118 927627
2017-28 1123 1020671
2017-29 1099 974178
2017-30 1109 1454635
2017-31 1107 843826
2017-32 1101 706822
2017-33 1114 898326
2017-34 1106 797766
2017-35 1107 1414836
2017-36 1101 785003
2017-37 1106 587984
2017-38 1092 1112209
2017-39 1101 1269767
2017-40 1094 602126
2017-41 1109 705206
2017-42 1099 1006947
2017-43 1098 891128
2017-44 1085 869906
2017-45 1081 705248
2017-46 1095 1094282
2017-47 1089 1001326
2017-48 1093 887124
2017-49 1084 872929
2017-50 1077 953712
2017-51 1056 463291
2017-52 1053 901877
2017-53 1038 0
2018-01 1060 325576
2018-02 1065 677086
2018-03 1054 594192
2018-04 1058 966243
2018-05 1048 1153339
2018-06 1059 1227725
2018-07 1068 1538160
2018-08 1074 2376844
2018-09 1082 2746304
2018-10 1077 3073629
2018-11 1090 2633524
2018-12 1088 3246326
2018-13 1081 2972258
2018-14 1084 3434567
2018-15 1091 3879499
2018-16 1082 3324898
2018-17 1106 3812500
2018-18 1098 3615039
2018-19 1113 3608791
2018-20 1126 3433668
2018-21 1118 2769745
2018-22 1105 2148625
2018-23 1102 1584570
2018-24 1111 1679610
2018-25 1106 1265381
2018-26 1101 1137080
2018-27 1101 901467
2018-28 1111 925816
2018-29 1099 1023262
2018-30 1095 1440600
2018-31 1088 1209750
2018-32 1089 1021708
2018-33 1105 766932
2018-34 1087 1035260
2018-35 1085 1446093
2018-36 1078 1162404
2018-37 1078 1282269
2018-38 1079 1345707
2018-39 1082 1284394
2018-40 1080 920550
2018-41 1076 857666
2018-42 1076 922532
2018-43 1074 1017598
2018-44 1069 1056534
2018-45 1069 915206
2018-46 1063 1117478
2018-47 1058 1032044
2018-48 1060 879955
2018-49 1052 813069
2018-50 1050 715880
2018-51 1047 493746
2018-52 1057 764003
2018-53 1044 57252
2019-01 1048 747982
2019-02 1048 794494
2019-03 1050 975622
2019-04 1052 1129077
2019-05 1053 1220011
2019-06 1045 1253905
2019-07 1044 1541032
2019-08 1049 1803261
2019-09 1051 2636057
2019-10 1044 2370363
2019-11 1047 2961700
2019-12 1049 3730900
2019-13 1051 3904287
2019-14 1057 4309552
2019-15 1053 4676142
2019-16 1071 4897571
2019-17 1086 5194951
2019-18 1097 4291125
2019-19 1100 4127173
2019-20 1127 4065942
2019-21 1125 3623816
2019-22 1127 1523581
2019-23 1122 1822741
2019-24 1241 1480971
2019-25 1245 1700950
2019-26 1235 1492285
2019-27 1235 1098727
2019-28 1233 941500
2019-29 1230 893052
2019-30 1232 1025655
2019-31 1239 1186058
2019-32 1228 1189090
2019-33 1230 1106885
2019-34 1230 1274547
2019-35 1229 1322772
2019-36 1229 671083
2019-37 1225 1216611
2019-38 1228 1091560
2019-39 1226 1208336
2019-40 1228 935742
2019-41 1224 802683
2019-42 1230 1011143
2019-43 1234 1060517
2019-44 554 273411
2019-45 497 167728
2019-46 476 136748
2019-47 410 0
2019-48 442 0
2019-49 427 0
2019-50 368 0
2019-51 435 0
2019-52 457 0
2019-53 282 34
2020-01 355 0
2020-02 525 0
2020-03 537 0
2020-04 528 0
2020-05 563 0
2020-06 587 0
2020-07 620 0
2020-08 607 0
2020-09 594 0
2020-10 664 0
2020-11 679 0
2020-12 698 0
2020-13 724 0
2020-14 707 0
2020-15 709 0
2020-16 731 0
2020-17 710 0
2020-18 735 0
2020-19 714 0
2020-20 686 0
2020-21 700 0
2020-22 555 0
2020-23 612 0
2020-24 632 0
2020-25 647 0
2020-26 622 0
2020-27 580 0
2020-28 547 0
2020-29 559 0
2020-30 531 0
2020-31 577 0
2020-32 583 0
2020-33 561 0
2020-34 618 0
2020-35 620 0
2020-36 442 0
2020-37 525 0
2020-38 528 0
2020-39 517 0
2020-40 504 0
2020-41 462 0
2020-42 483 0
2020-43 487 0
2020-44 377 0
2020-45 42 0
shipped_agg <- shipped_agg%>%
  filter(as.integer(Date) %in% c(1:255))%>%
  droplevels
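The filter above keys off the integer codes of the Date factor, which works but silently depends on the level ordering. An equivalent, more explicit sketch filters on the label itself (level 255 corresponds to fiscal week 2019-43, the last week with complete shipment data):

# Same cut-off expressed on the Date label; string comparison works because the
# labels are zero-padded "YYYY-WW" values
shipped_agg <- shipped_agg %>%
  filter(as.character(Date) <= "2019-43") %>%
  droplevels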

Let’s check it out again. Looks good!

my_breaks <- seq(1,length(unique(shipped_agg$Date)),by=10)

ggplot(shipped_agg, aes(x = Date, y=CY_Shipped )) +
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle=65, vjust=0.6))+
  scale_x_discrete(
    breaks=levels(shipped_agg$Date)[my_breaks])

By Year

ggplot(shipped_agg, aes(x = Date.Fiscal.Year., y=CY_Shipped )) +
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle=65, vjust=0.6))

By Week

ggplot(shipped_agg, aes(x = Date.Fiscal.Week. , y= CY_Shipped)) +
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle=65, vjust=0.6))

By Customer

ggplot(shipped_agg, aes(x = Customer.Customer. , y= CY_Shipped)) +
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle=65, vjust=0.6))

By Store Sub Region

ggplot(shipped_agg, aes(x = Store.Sub.Region. , y= CY_Shipped)) +
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle=65, vjust=0.6))

It looks like we need to get rid of the blank and #N/A levels in Store Sub Region (along with the nearly empty level 2).

kable(shipped_agg%>%
  group_by(Store.Sub.Region.)%>%
  summarize(count=n(),
            total = sum(CY_Shipped),
            ave=mean(CY_Shipped)))%>%
  kable_styling() %>%
  scroll_box(width = "500px", height = "200px")
Store.Sub.Region. count total ave
157 0 0.000000
#N/A 174 0 0.000000
10 24832 32071188 1291.526579
10.1 22568 12992922 575.723236
11 7575 136303 17.993795
12 6878 540728 78.617040
13 7092 797166 112.403553
2 707 1187 1.678925
3 10708 202581 18.918659
4 17334 3433286 198.066574
5 19491 17632493 904.647940
6 22677 53715734 2368.731931
7 27946 93250596 3336.813712
8 28444 94140291 3309.671319
8.1 24132 21391450 886.435024
9 26022 53115353 2041.171048
9.1 23708 28201860 1189.550363
N/A 12238 795341 64.989459
shipped_agg <- shipped_agg%>%
  filter(!Store.Sub.Region. %in% c("","#N/A",'2'))%>%
  droplevels

By Product Category

ggplot(shipped_agg, aes(x = Product.Category. , y= CY_Shipped)) +
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle=65, vjust=0.6))

It looks like we need to get rid of the blank level in Product Category (along with the near-empty Blooming Tropicals level).

kable(shipped_agg%>%
  group_by(Product.Category.)%>%
  summarize(count=n(),
            total = sum(CY_Shipped)))%>%
  kable_styling() %>%
  scroll_box(width = "500px", height = "200px")
Product.Category. count total
130 0
ANS - Annuals 5255 112779933
ANS - Bulb 1236 160781
ANS - Combos 3740 1770680
ANS - Edible 1933 42887
ANS - Ferns 2283 511271
ANS - Mums 4079 8451513
ANS - Pansy 2914 7729403
Blooming Tropicals 20 918
BTS - Combos 6999 686589
BTS - Hibiscus 9239 11669884
BTS - Mandevilla 9284 13060827
BTS - Patio Tropicals 9039 3507474
BTS - Shrubs 8668 4502993
C&S - Cacti 9337 3014441
C&S - Dish Gardens 8032 978072
C&S - Landscape 8582 2410353
C&S - Succulents 8446 8130707
FLG - Baskets 9592 7747728
FLG - Combos 7848 602276
FLG - Indoor Palms 9671 9658999
FLG - Landscape Palms 6601 220809
FLG - Large Indoor 9704 13004025
FLG - Shrubs 9506 9114587
FLG - Small Indoor 9772 51483707
FNS - Boston 8684 6270254
HLY - Pines 8089 3220867
HLY - Poinsettias 3479 6173471
HLY - Zygo 5080 472828
N/A 9382 36745837
PNS - Combos 2700 191704
PNS - Ferns 1975 691902
PNS - Grass 3276 3640043
PNS - Ground Cover 2799 6649366
PNS - Hosta 2244 4909560
PNS - Lily 5154 7984826
PNS - Perennials 5218 21069047
PNS - Vining 1643 533286
SPY - Bonsai 9354 1624706
SPY - Bromeliad & Anthuriums 9403 4665225
SPY - Lucky Bamboo 9420 5440838
SPY - Orchids 7440 3881219
SPY - Specialty 9173 1060674
SPY - Specialty Gardens 8338 583502
Unknown 6884 25367280
shipped_agg <- shipped_agg%>%
  filter(!Product.Category. %in% c("","Blooming Tropicals"))%>%
  droplevels

A very important step in this project was to create lagged versions of all of my Shipping, Sales, and COGS data. First, I create a data frame with a row for every unique combination of Customer/Store Sub Region/Product Category/Date. Then, I join the aggregated data back onto this grid so that the lag function operates over a complete, evenly spaced sequence of weeks. Finally, I create four lagged variables for each of the Shipping, Sales, and COGS measures.

lagged_df = expand.grid(
  Customer.Customer. = sort(unique(shipped_agg$Customer.Customer.)),
  Store.Sub.Region. = sort(unique(shipped_agg$Store.Sub.Region.)),
  Product.Category. = sort(unique(shipped_agg$Product.Category.)),
  Date = sort(unique(shipped_agg$Date))) %>% 
  left_join(shipped_agg, by = c("Customer.Customer.","Store.Sub.Region.","Product.Category.","Date")) %>% 
  arrange(Customer.Customer.,Store.Sub.Region.,Product.Category.,Date) %>% 
  group_by(Customer.Customer.,Store.Sub.Region.,Product.Category.) %>% 
  mutate(CY_Shipped_Lag1 = lag(CY_Shipped,n = 1),
         CY_Shipped_Lag2 = lag(CY_Shipped,n = 2),
         CY_Shipped_Lag3 = lag(CY_Shipped,n = 3),
         CY_Shipped_Lag4 = lag(CY_Shipped,n = 4),
         PY_Shipped_Lag1 = lag(PY_Shipped,n = 1),
         PY_Shipped_Lag2 = lag(PY_Shipped,n = 2),
         PY_Shipped_Lag3 = lag(PY_Shipped,n = 3),
         PY_Shipped_Lag4 = lag(PY_Shipped,n = 4),
         Sales_Lag1 = lag(Sales_Amt,n = 1),
         Sales_Lag2 = lag(Sales_Amt,n = 2),
         Sales_Lag3 = lag(Sales_Amt,n = 3),
         Sales_Lag4 = lag(Sales_Amt,n = 4),
         COGS_Lag1 = lag(COGS,n = 1),
         COGS_Lag2 = lag(COGS,n = 2),
         COGS_Lag3 = lag(COGS,n = 3),
         COGS_Lag4 = lag(COGS,n = 4)) %>% 
  ungroup()
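The sixteen mutate lines can be generated more compactly with across and a list of lag functions. A sketch assuming dplyr >= 1.0, written to a hypothetical lagged_df2 so it can be compared with the manual version; the generated names differ slightly (e.g. Sales_Amt_Lag1 instead of Sales_Lag1):

# Build one lag function per horizon 1-4; force(k) pins each value in its closure
lag_fns <- setNames(
  lapply(1:4, function(k) { force(k); function(x) dplyr::lag(x, n = k) }),
  paste0("Lag", 1:4))

# Apply all four lags to each of the four series within every forecast group
lagged_df2 <- lagged_df %>%
  group_by(Customer.Customer., Store.Sub.Region., Product.Category.) %>%
  mutate(across(c(CY_Shipped, PY_Shipped, Sales_Amt, COGS), lag_fns,
                .names = "{.col}_{.fn}")) %>%
  ungroup()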

Determine how many null values are in the new data set. Year, Week, Sales, and COGS are all missing the same rows, which were created by the join in the previous step.

colSums(sapply(lagged_df,is.na))
## Customer.Customer.  Store.Sub.Region.  Product.Category. 
##                  0                  0                  0 
##               Date  Date.Fiscal.Year.  Date.Fiscal.Week. 
##                  0             211930             211930 
##         CY_Shipped         PY_Shipped               COGS 
##             211930             211930             211930 
##          Sales_Qty          Sales_Amt    CY_Shipped_Lag1 
##             211930             211930             213146 
##    CY_Shipped_Lag2    CY_Shipped_Lag3    CY_Shipped_Lag4 
##             214359             215564             216773 
##    PY_Shipped_Lag1    PY_Shipped_Lag2    PY_Shipped_Lag3 
##             213146             214359             215564 
##    PY_Shipped_Lag4         Sales_Lag1         Sales_Lag2 
##             216773             213146             214359 
##         Sales_Lag3         Sales_Lag4          COGS_Lag1 
##             215564             216773             213146 
##          COGS_Lag2          COGS_Lag3          COGS_Lag4 
##             214359             215564             216773

Drop the missing values that were created during the join.

lagged_df <- lagged_df[!is.na(lagged_df$Date.Fiscal.Week.),] 

Drop the missing values that were created by lagging, then drop the Date column since the fiscal Year and Week capture the same information.

lagged_df <- as.data.frame(lagged_df[complete.cases(lagged_df),])

lagged_df <- lagged_df%>%
  select(-c("Date"))

colSums(sapply(lagged_df,is.na))
## Customer.Customer.  Store.Sub.Region.  Product.Category. 
##                  0                  0                  0 
##  Date.Fiscal.Year.  Date.Fiscal.Week.         CY_Shipped 
##                  0                  0                  0 
##         PY_Shipped               COGS          Sales_Qty 
##                  0                  0                  0 
##          Sales_Amt    CY_Shipped_Lag1    CY_Shipped_Lag2 
##                  0                  0                  0 
##    CY_Shipped_Lag3    CY_Shipped_Lag4    PY_Shipped_Lag1 
##                  0                  0                  0 
##    PY_Shipped_Lag2    PY_Shipped_Lag3    PY_Shipped_Lag4 
##                  0                  0                  0 
##         Sales_Lag1         Sales_Lag2         Sales_Lag3 
##                  0                  0                  0 
##         Sales_Lag4          COGS_Lag1          COGS_Lag2 
##                  0                  0                  0 
##          COGS_Lag3          COGS_Lag4 
##                  0                  0

Correlations

I want to see how correlated the numeric variables currently are. This will give me a feel for what types of models I can use on this data set.

corr <- round(cor(lagged_df[6:26]), 1)

# Plot
library('ggcorrplot')
ggcorrplot(corr, hc.order = TRUE, 
           type = "lower", 
           lab = TRUE, 
           lab_size = 3, 
           method="circle", 
           colors = c("tomato2", "white", "springgreen3"), 
           title="Correlogram of Numeric Data", 
           ggtheme=theme_bw)
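Since several of the lagged series are strongly correlated with each other, caret::findCorrelation can flag columns that exceed a chosen correlation cutoff. A sketch, assuming caret is loaded (it is used for preProcess and train below); I keep all the columns here and let the penalized models handle the collinearity:

# Names of numeric columns with a pairwise correlation above 0.9
high_corr <- findCorrelation(cor(lagged_df[6:26]), cutoff = 0.9, names = TRUE)
high_corr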

Model creation

First, I scale all of the numeric data so that I can compare feature importance later. Then I dummy-code all factor variables and create training and test sets.

Scale data

preprocessParams<-preProcess(lagged_df, method = c("center", "scale"))
lagged_df <- predict(preprocessParams, lagged_df)

Dummy factor variables

factor_vars <- names(lagged_df)[which(sapply(lagged_df,is.factor))]
dummies <- dummyVars(~.,lagged_df[factor_vars])
hot_coded <- predict(dummies,lagged_df[factor_vars])

numeric_vars <- names(lagged_df)[which(sapply(lagged_df,is.numeric))]
num_df <- lagged_df[numeric_vars]
model_df_onehot <- cbind(num_df,hot_coded)
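If a full-rank design is preferred (one reference level dropped per factor, which avoids perfectly collinear dummy columns in the linear model), dummyVars accepts fullRank = TRUE. A sketch of that alternative; the analysis below keeps the full set and instead drops the year dummies before modeling:

# Alternative: full-rank dummy coding drops one level per factor
dummies_fr <- dummyVars(~ ., data = lagged_df[factor_vars], fullRank = TRUE)
hot_coded_fr <- predict(dummies_fr, newdata = lagged_df[factor_vars])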

Split the data based on year: 2015-2018 go in the training set and 2019 goes in the validation test set.

trainSet <- model_df_onehot[model_df_onehot$Date.Fiscal.Year..2019 == 0,]
testSet <- model_df_onehot[model_df_onehot$Date.Fiscal.Year..2019 == 1,]

Drop the Year variables so that we don’t get any errors when running the models

trainSet <- trainSet%>%
  select(-c(Date.Fiscal.Year..2015,Date.Fiscal.Year..2016, Date.Fiscal.Year..2017,Date.Fiscal.Year..2018,Date.Fiscal.Year..2019))

testSet <- testSet%>%
  select(-c(Date.Fiscal.Year..2015,Date.Fiscal.Year..2016, Date.Fiscal.Year..2017,Date.Fiscal.Year..2018,Date.Fiscal.Year..2019))

Create three models: a lasso, a ridge, and a linear model.

TARGET.VAR <- "CY_Shipped"
candidate.features <- setdiff(names(trainSet),c(TARGET.VAR))
set.seed(123)
control <- trainControl(method="cv", 
                        number=5)


parameters <- c(seq(0.1, 2, by =0.1) ,  seq(2, 5, 0.5) , seq(5, 100, 1))

lasso<-train(x= trainSet[,candidate.features],
             y = trainSet[,TARGET.VAR],
             method = 'glmnet', 
             trControl=control,
             tuneGrid = expand.grid(alpha = 1, lambda = parameters) ,
             metric =  "RMSE")

ridge<-train(x= trainSet[,candidate.features],
             y = trainSet[,TARGET.VAR],
             method = 'glmnet', 
             trControl=control,
             tuneGrid = expand.grid(alpha = 0, lambda = parameters),
             metric =  "RMSE")

linear<-train(x= trainSet[,candidate.features],
              y = trainSet[,TARGET.VAR],
              trControl=control, 
              method = 'lm',
              metric =  "RMSE")

Model Results

It looks like all three models did fairly well on the training data, with the linear model slightly outperforming the others in both RMSE and R-squared.

model_list <- list(lasso=lasso,ridge=ridge,linear=linear)
summary(resamples(model_list))
## 
## Call:
## summary.resamples(object = resamples(model_list))
## 
## Models: lasso, ridge, linear 
## Number of resamples: 5 
## 
## MAE 
##              Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
## lasso  0.09742777 0.09827569 0.09859958 0.09906175 0.09948065 0.10152508
## ridge  0.09296833 0.09415144 0.09491919 0.09456453 0.09495086 0.09583285
## linear 0.09348050 0.09440606 0.09516680 0.09531440 0.09595117 0.09756750
##        NA's
## lasso     0
## ridge     0
## linear    0
## 
## RMSE 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## lasso  0.4092295 0.4208671 0.4242020 0.4268525 0.4355914 0.4443723    0
## ridge  0.3787219 0.3960977 0.3982898 0.4045070 0.4206382 0.4287874    0
## linear 0.3891585 0.3911724 0.3996394 0.4005655 0.4110404 0.4118168    0
## 
## Rsquared 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## lasso  0.7877135 0.7920025 0.8000212 0.8052528 0.8015361 0.8449907    0
## ridge  0.7910908 0.8047406 0.8121815 0.8145672 0.8283172 0.8365058    0
## linear 0.8003835 0.8091260 0.8156204 0.8174309 0.8301179 0.8319066    0
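The metrics above come from the cross-validation folds on the training data only. A sketch of how the held-out 2019 test set could be scored with the same fitted models (postResample reports RMSE, R-squared, and MAE; note the values are on the centered and scaled CY_Shipped scale):

# Predict on the 2019 hold-out set and compute RMSE / R-squared / MAE per model
test_metrics <- sapply(model_list, function(m) {
  preds <- predict(m, newdata = testSet[, candidate.features])
  postResample(pred = preds, obs = testSet[, TARGET.VAR])
})
test_metrics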

Variable Importance

Both the linear and ridge models found similar variables important. However, the lasso model tossed out all but four variables. This is due to the multicollinearity in the data seen in the correlation plot above. I will test this data on more algorithms in a separate markdown.

plot(varImp(ridge),top=20)

plot(varImp(lasso),top=20)

plot(varImp(linear),top=20)

Return to my portfolio