The following is a project I worked on for a national wholesale plant distributor. The purpose of the project was to reduce the number of variables used to predict plant sales in Dallas. To do this, I used both Boruta analysis and Recursive Feature Elimination (RFE) to determine the most important data features. I then validated these two variable-reduction methods by testing multiple predictive models and comparing the results.
First, I load the raw data file, check the structure of the data to confirm each variable is on the correct measurement scale, and check for missing data.
setwd("C:\\Users\\e16582\\Documents\\Datalore Projects\\Weather Data\\")
Weather_Data_Raw <- read.csv("Foliage_Dallas_Chicago v2.csv")
str(Weather_Data_Raw)
## 'data.frame': 2460 obs. of 42 variables:
## $ Date..Year. : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ Date..Month. : Factor w/ 12 levels "Apr","Aug","Dec",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ WeekOfYear : int 1 1 1 1 1 1 2 2 2 2 ...
## $ DayOfWeek : Factor w/ 7 levels "Friday","Monday",..: 1 1 3 3 5 5 1 1 2 2 ...
## $ City : Factor w/ 2 levels "Chicago","Dallas": 1 2 1 2 1 2 1 2 1 2 ...
## $ Day : int 2 2 3 3 1 1 9 9 5 5 ...
## $ Date : Factor w/ 1230 levels "1/1/2015","1/1/2016",..: 45 45 89 89 1 1 121 121 105 105 ...
## $ HolidayText : Factor w/ 15 levels "(blank)","Christmas Day",..: 1 1 1 1 11 11 1 1 1 1 ...
## $ POS.Sales.. : num 434 633 408 1285 328 ...
## $ Maximum.of.fog : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Maximum.of.rain : int 0 1 1 1 0 1 0 0 0 0 ...
## $ Maximum.of.snow : int 0 0 1 0 0 0 1 0 1 0 ...
## $ Maximum.of.snow_fall : int 0 0 0 0 0 0 0 0 1 0 ...
## $ Maximum.of.mtd_snow : int 0 0 0 0 0 0 7 0 4 0 ...
## $ Maximum.of.since_jul_snow: int 2 0 3 0 2 0 10 0 7 0 ...
## $ Maximum.of.snow_depth : int 0 0 0 0 0 0 5 0 2 0 ...
## $ Maximum.of.hail : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Maximum.of.thunder : int 0 1 0 1 0 0 0 0 0 0 ...
## $ Maximum.of.tornado : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Maximum.of.mean_temp : int 25 39 32 46 23 35 6 33 2 34 ...
## $ Maximum.of.mean_dewp : int 18 38 30 37 10 33 -4 13 -4 19 ...
## $ Maximum.of.mean_pressure : int 30 30 29 29 30 30 30 30 30 30 ...
## $ Maximum.of.mean_wind_spd : int 5 6 6 9 15 4 17 11 10 9 ...
## $ Maximum.of.mean_wind_dir : int 184 5 59 267 244 26 284 21 254 139 ...
## $ Maximum.of.mean_visib : int 10 4 3 10 10 4 8 10 6 10 ...
## $ Maximum.of.humid : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Maximum.of.max_temp : int 34 41 34 54 32 37 11 39 6 46 ...
## $ Maximum.of.min_temp : int 15 37 29 38 14 33 0 27 -3 21 ...
## $ Maximum.of.max_humid : int 84 97 100 100 68 97 76 63 87 74 ...
## $ Maximum.of.min_humid : int 47 89 69 50 49 73 54 32 59 33 ...
## $ Maximum.of.max_dew_pt : int 21 40 34 39 15 36 5 21 2 27 ...
## $ Maximum.of.min_dew_pt : int 13 36 21 33 3 26 -10 6 -11 13 ...
## $ Maximum.of.max_pressure : int 30 30 30 30 30 30 30 30 30 30 ...
## $ Maximum.of.min_pressure : int 30 29 29 29 29 30 29 30 30 30 ...
## $ Maximum.of.max_wind_spd : int 12 12 13 15 24 8 30 22 22 20 ...
## $ Maximum.of.min_wind_spd : int 0 0 0 0 7 0 8 0 0 4 ...
## $ Maximum.of.max_vis : int 10 10 10 10 10 10 10 10 10 10 ...
## $ Maximum.of.min_vis : int 10 1 1 6 10 2 2 10 1 10 ...
## $ Maximum.of.gdegree : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Maximum.of.cooling_days : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Maximum.of.heating_days : int 40 26 33 19 42 30 59 32 63 31 ...
## $ Maximum.of.precip : int 0 0 0 0 0 1 0 0 0 0 ...
colSums(sapply(Weather_Data_Raw,is.na))
## Date..Year. Date..Month.
## 0 0
## WeekOfYear DayOfWeek
## 0 0
## City Day
## 0 0
## Date HolidayText
## 0 0
## POS.Sales.. Maximum.of.fog
## 0 0
## Maximum.of.rain Maximum.of.snow
## 0 0
## Maximum.of.snow_fall Maximum.of.mtd_snow
## 0 0
## Maximum.of.since_jul_snow Maximum.of.snow_depth
## 0 0
## Maximum.of.hail Maximum.of.thunder
## 0 0
## Maximum.of.tornado Maximum.of.mean_temp
## 0 0
## Maximum.of.mean_dewp Maximum.of.mean_pressure
## 0 0
## Maximum.of.mean_wind_spd Maximum.of.mean_wind_dir
## 0 0
## Maximum.of.mean_visib Maximum.of.humid
## 0 0
## Maximum.of.max_temp Maximum.of.min_temp
## 0 0
## Maximum.of.max_humid Maximum.of.min_humid
## 0 0
## Maximum.of.max_dew_pt Maximum.of.min_dew_pt
## 0 0
## Maximum.of.max_pressure Maximum.of.min_pressure
## 0 0
## Maximum.of.max_wind_spd Maximum.of.min_wind_spd
## 0 0
## Maximum.of.max_vis Maximum.of.min_vis
## 0 0
## Maximum.of.gdegree Maximum.of.cooling_days
## 0 0
## Maximum.of.heating_days Maximum.of.precip
## 0 0
It looks like there are no missing values. However, several variables that are currently read as integers, such as 'Year', should be converted to factors for modeling purposes. I would also like to check whether any binary variables need to be converted to factors. I can do this by looking at the number of unique levels for each variable: any integer variable with only two levels will be converted to a binary factor. I also notice that I need to rename a number of variables to remove the 'Maximum.of.' prefix.
Weather_Data_Raw <- Weather_Data_Raw %>%
rename_all(~ sub("^Maximum\\.of\\.", "", .x))
colnames(Weather_Data_Raw)[colnames(Weather_Data_Raw)=="POS.Sales.."]<-"Sales"
Weather_Data_Raw$Date..Year. <- as.factor(Weather_Data_Raw$Date..Year.)
Weather_Data_Raw$WeekOfYear <- as.factor(Weather_Data_Raw$WeekOfYear)
rapply(Weather_Data_Raw,function(x) length(unique(x)))
## Date..Year. Date..Month. WeekOfYear DayOfWeek City
## 4 12 53 7 2
## Day Date HolidayText Sales fog
## 31 1230 15 1598 2
## rain snow snow_fall mtd_snow since_jul_snow
## 2 2 9 22 40
## snow_depth hail thunder tornado mean_temp
## 18 2 2 1 97
## mean_dewp mean_pressure mean_wind_spd mean_wind_dir mean_visib
## 90 3 25 354 11
## humid max_temp min_temp max_humid min_humid
## 2 105 96 59 84
## max_dew_pt min_dew_pt max_pressure min_pressure max_wind_spd
## 88 92 3 3 37
## min_wind_spd max_vis min_vis gdegree cooling_days
## 14 10 11 45 30
## heating_days precip
## 68 6
act_factors <- names(Weather_Data_Raw)[which(sapply(Weather_Data_Raw,function(x)length(unique(x))==2))]
Weather_Data_Raw[act_factors]<- lapply(Weather_Data_Raw[act_factors],factor)
Currently the data set has data for both Dallas and Chicago. I want to look at Dallas first and check how sales vary over time.
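The subsetting step itself does not appear above; a minimal sketch of how the Dallas data frame could be created and the sales trend plotted (an assumed reconstruction, using the renamed columns from earlier and ggplot2) is:

```r
# Keep only the Dallas rows (hypothetical reconstruction of the missing step)
Dallas <- subset(Weather_Data_Raw, City == "Dallas")

# Plot daily sales over time; Date is stored as a factor like "1/1/2015",
# so parse it into a proper Date first
library(ggplot2)
ggplot(Dallas, aes(x = as.Date(as.character(Date), format = "%m/%d/%Y"),
                   y = Sales)) +
  geom_line() +
  xlab("Date") + ylab("Sales") +
  theme_light()
```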
I removed the Date and Day variables since they are redundant. The tornado variable has only one level across the data set, so it will not be helpful for modeling.
Dallas <- subset(Dallas, select = -c(Day,Date,tornado))
Next I want to take a look at some descriptive statistics for the numerical variables in my data set to make sure the values are sensible. Except for the 'pressure' variables, which have only three unique levels, the data looks reasonable. I don't expect the snow variables to be predictive since we are looking at Dallas, which, as the data shows, rarely gets snow.
Dal_num_vars <- names(Dallas)[which(sapply(Dallas,is.numeric))]
Dal_num_df <- Dallas[Dal_num_vars]
Dal_summary_stats_num <- data.frame(vars= colnames(Dal_num_df),
mean=rapply(Dal_num_df, function(x)round(mean(x,na.rm=T),3)),
stddev=rapply(Dal_num_df, function(x)round(sd(x,na.rm=T),3)),
unique=rapply(Dal_num_df, function(x)length(unique(x))),
skew=rapply(Dal_num_df, function(x)round(skewness(x,na.rm=T),3)),
nas=rapply(Dal_num_df, function(x)round(sum(is.na(x)),3)),stringsAsFactors=T)
kable(Dal_summary_stats_num[2:ncol(Dal_summary_stats_num)])%>%
kable_styling(bootstrap_options = c("striped", "hover"))
variable | mean | stddev | unique | skew | nas |
---|---|---|---|---|---|
Sales | 1976.093 | 1784.988 | 1027 | 2.830 | 0 |
snow_fall | 0.004 | 0.085 | 3 | 21.987 | 0 |
mtd_snow | 0.066 | 0.434 | 3 | 6.486 | 0 |
since_jul_snow | 0.235 | 1.045 | 4 | 4.281 | 0 |
snow_depth | 0.003 | 0.114 | 2 | 34.986 | 0 |
mean_temp | 67.560 | 15.858 | 71 | -0.615 | 0 |
mean_dewp | 52.123 | 16.107 | 72 | -0.662 | 0 |
mean_pressure | 29.378 | 1.751 | 3 | -15.337 | 0 |
mean_wind_spd | 10.500 | 4.069 | 23 | 0.564 | 0 |
mean_wind_dir | 165.239 | 94.475 | 305 | 0.260 | 0 |
mean_visib | 9.401 | 1.411 | 11 | -3.195 | 0 |
max_temp | 77.436 | 16.397 | 76 | -0.866 | 0 |
min_temp | 57.209 | 16.089 | 72 | -0.424 | 0 |
max_humid | 81.541 | 12.606 | 57 | -1.363 | 0 |
min_humid | 42.037 | 16.877 | 83 | 0.561 | 0 |
max_dew_pt | 57.293 | 15.198 | 70 | -0.894 | 0 |
min_dew_pt | 46.354 | 17.277 | 74 | -0.473 | 0 |
max_pressure | 29.589 | 1.753 | 3 | -15.634 | 0 |
min_pressure | 29.188 | 1.727 | 3 | -15.658 | 0 |
max_wind_spd | 21.204 | 6.835 | 34 | 0.618 | 0 |
min_wind_spd | 3.644 | 3.733 | 14 | 0.656 | 0 |
max_vis | 9.954 | 0.619 | 6 | -14.949 | 0 |
min_vis | 7.573 | 3.546 | 11 | -1.100 | 0 |
gdegree | 18.646 | 13.249 | 45 | 0.030 | 0 |
cooling_days | 7.928 | 8.934 | 30 | 0.667 | 0 |
heating_days | 5.158 | 8.655 | 41 | 1.833 | 0 |
precip | 0.111 | 0.445 | 6 | 5.246 | 0 |
Next I am going to take a look at the correlations between the numerical variables. There appears to be a cluster of seven weather variables that are highly correlated, and heating_days appears to be negatively correlated with that cluster.
Dal_Corrs <- cor(na.omit(Dal_num_df))
Dal_Corr_df <- as.data.frame(Dal_Corrs)
row_indic <- apply(Dal_Corrs, 1, function(x) sum(x > 0.2 | x < -0.2) > 1)
Dal_Corrs <- Dal_Corrs[row_indic ,row_indic ]
corrplot(Dal_Corrs,type = "upper", tl.pos = "td",
method = "circle", tl.cex = 0.5, tl.col = 'black',
order = "hclust", diag = FALSE)
Next I want to visualize how these variables vary with Sales.
doPlots <- function(data_in, fun, ii, ncol=3) {
pp <- list()
for (i in ii) {
p <- fun(data_in=data_in, i=i)
pp <- c(pp, list(p))
}
do.call("grid.arrange", c(pp, ncol=ncol))
}
plotScatter <- function(data_in, i) {
data <- data.frame(x=data_in[[i]])
p <- ggplot(data=data, aes(x=x,y=Dallas$Sales)) + geom_point() + xlab(colnames(data_in)[i]) + theme_light() +
theme(axis.text.x = element_text(angle = 90, hjust =1),legend.position="none")
return (p)
}
doPlots(Dallas, fun=plotScatter, ii=c(11:14), ncol=2)
doPlots(Dallas, fun=plotScatter, ii=c(17:22), ncol=2)
doPlots(Dallas, fun=plotScatter, ii=c(23:28), ncol=2)
doPlots(Dallas, fun=plotScatter, ii=c(29:34), ncol=2)
doPlots(Dallas, fun=plotScatter, ii=c(35:39), ncol=2)
Next, I want to create a summary of descriptive statistics for the categorical variables and visualize how they vary with sales.
Dal_fact_vars <- names(Dallas)[which(sapply(Dallas,is.factor))]
Dal_fact_df <- Dallas[Dal_fact_vars]
Dal_summary_stats_fact <- data.frame(vars= colnames(Dal_fact_df),
unique=rapply(Dal_fact_df, function(x)length(unique(x))),
nas=rapply(Dal_fact_df, function(x)round(sum(is.na(x)),3)),
stringsAsFactors=T)
kable(Dal_summary_stats_fact[2:ncol(Dal_summary_stats_fact)])%>%
kable_styling(bootstrap_options = c("striped", "hover"))
variable | unique | nas |
---|---|---|
Date..Year. | 4 | 0 |
Date..Month. | 12 | 0 |
WeekOfYear | 53 | 0 |
DayOfWeek | 7 | 0 |
City | 1 | 0 |
HolidayText | 15 | 0 |
fog | 2 | 0 |
rain | 2 | 0 |
snow | 2 | 0 |
hail | 1 | 0 |
thunder | 2 | 0 |
humid | 1 | 0 |
plotBox <- function(data_in, i) {
data <- data.frame(x=data_in[[i]])
p <- ggplot(data=data, aes(x=factor(x),y=Dallas$Sales, fill=factor(x))) + geom_boxplot() + xlab(colnames(data_in)[i]) + theme_light() +
theme(axis.text.x = element_text(angle = 90, hjust =1),legend.position="none")
return (p)
}
doPlots(Dallas, fun=plotBox, ii=c(1:4), ncol=2)
doPlots(Dallas, fun=plotBox, ii=c(8:10), ncol=2)
doPlots(Dallas, fun=plotBox, ii=c(6), ncol=1)
I have decided to drop variables that carry no information for modeling purposes (i.e., they do not vary across the data set). I have also dropped rows with zero sales, since these represent holidays when stores are closed.
Dallas <- subset(Dallas, select = -c(City,humid,mean_pressure,min_pressure,max_pressure,precip))
#Drop Holidays with Zero Sales
Dallas <- Dallas%>%
filter(Sales>0)
Modeling Section
Here I will use the Boruta and RFE algorithms to rank my independent variables by importance. I will later test the predictive power of the variables chosen by each technique against the full set of variables, and against each other, using Lasso regression and random forest models.
Boruta Analysis
This type of algorithm is quite useful for determining variable importance. It shuffles each predictor's values to create randomized "shadow" copies and joins them back with the original predictors, then builds random forests on the merged data set. The importance of each original variable is then compared with that of its randomized copies; only variables with higher importance than the randomized variables are considered important. I use a great R package called 'Boruta' that lets the user rank the variables by importance.
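The "shadow feature" idea at the heart of Boruta can be illustrated in a couple of lines of base R: permuting a column keeps its marginal distribution but destroys any relationship with the response, which gives an importance baseline that real predictors must beat.

```r
set.seed(13)
x <- c(5, 3, 8, 1, 9, 4)
shadow_x <- sample(x)  # same values, randomly permuted

# The shadow has exactly the same distribution as the original...
identical(sort(x), sort(shadow_x))  # TRUE
# ...but any association with an outcome is broken, so a genuinely useful
# predictor should earn a higher random-forest importance than its shadow.
```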
Based on the results below, the date variables Year, Week, Month, and Day are 4 of the 6 highest ranked variables. Since WeekOfYear and Month are redundant, I only want to use one; I will test them against each other later to see which predicts sales better. heating_days, gdegree, and since_jul_snow are the three highest ranked weather variables.
library(Boruta)
## Loading required package: ranger
TARGET.VAR <- "Sales"
candidate.features <- setdiff(names(Dallas),c(TARGET.VAR))
# response variables
Dallas.boruta.df <- Dallas[candidate.features]
response <- Dallas$Sales
set.seed(13)
per.bor.results <- Boruta(Dallas.boruta.df, response,
maxRuns = 101,
doTrace = 0)
per.bor.results
## Boruta performed 100 iterations in 2.379274 mins. 25 attributes confirmed important: cooling_days, Date..Month., Date..Year., DayOfWeek, gdegree and 20 more; 4 attributes confirmed unimportant: fog, hail, snow_depth, snow_fall; 3 tentative attributes left: max_vis, min_wind_spd, snow;
plot(per.bor.results, xlab = "", xaxt = "n")
k <-lapply(1:ncol(per.bor.results$ImpHistory),function(i)
per.bor.results$ImpHistory[is.finite(per.bor.results$ImpHistory[,i]),i])
names(k) <- colnames(per.bor.results$ImpHistory)
Labels <- sort(sapply(k,median))
axis(side = 1,las=2,labels = names(Labels),
at = 1:ncol(per.bor.results$ImpHistory), cex.axis = 0.7)
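The confirmed attributes can also be extracted programmatically. A short sketch using the Boruta package's accessor functions, assuming the fitted per.bor.results object above:

```r
# Names of the attributes Boruta confirmed as important
confirmed <- getSelectedAttributes(per.bor.results, withTentative = FALSE)
confirmed

# Per-attribute importance statistics (mean/median/min/max Z-scores,
# fraction of hits) along with the final decision for each variable
head(attStats(per.bor.results))
```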
RFE Analysis
I want to use RFE, or Recursive Feature Elimination, to see which features its algorithm ranks as most important. There appears to be a good bit of overlap with the Boruta results, which is encouraging. This algorithm ranked WeekOfYear, DayOfWeek, since_jul_snow, max_temp, and gdegree as the top 5 most important variables.
Dal_Model_df <- Dallas%>%
select(-c(Date..Month.))
set.seed(100)
Y <- Dal_Model_df$Sales
drop_sales <- c("Sales")
X_train <- Dal_Model_df[,!(names(Dal_Model_df) %in% drop_sales)]
##### Use Recursive Feature elimination function in Caret to select best features
control <- rfeControl(functions = rfFuncs,
method = "repeatedcv",
repeats = 3,
verbose = FALSE)
Dal_rfe <- rfe(x=X_train,y=Y,
rfeControl = control)
kable(round(Dal_rfe$result,2))%>%
kable_styling(bootstrap_options = c("striped", "hover"))
Variables | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
---|---|---|---|---|---|---|
4 | 943.01 | 0.78 | 576.42 | 191.56 | 0.05 | 54.72 |
8 | 866.71 | 0.80 | 524.58 | 185.80 | 0.05 | 55.70 |
16 | 804.85 | 0.83 | 481.62 | 186.66 | 0.04 | 48.24 |
31 | 793.60 | 0.83 | 475.42 | 187.47 | 0.04 | 50.31 |
varImp(Dal_rfe) %>%
as.data.frame() %>%
rownames_to_column() %>%
arrange(Overall) %>%
mutate(rowname = forcats::fct_inorder(rowname)) %>%
ggplot() +
geom_col(aes(x = rowname, y = Overall)) +
ylab("Variable Importance")+
coord_flip()+
theme(axis.title.y=element_blank())
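caret also exposes the chosen variable subset directly; assuming the Dal_rfe fit above, a quick sketch:

```r
# Variables in the best subset found by RFE
predictors(Dal_rfe)

# The subset size selected by repeated cross-validation
Dal_rfe$optsize
```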
For the next steps I plan to run two types of models (Lasso, random forest) on each of the following three scenarios to determine which set of variables best predicts Sales.
1) All independent variables 2) The top-ranked variables from the Boruta analysis 3) The top-ranked variables from the RFE analysis
Based on the results below, the random forest model significantly outperformed the Lasso regression model, with a mean R^2 of 82% vs. 71%.
Dal_Model_df <- Dallas%>%
select(-c(Date..Month.))
Dal_factor_vars <- names(Dal_Model_df)[which(sapply(Dal_Model_df,is.factor))]
Dal_dummies <- dummyVars(~.,Dal_Model_df[Dal_factor_vars])
Dal_1_hot <- predict(Dal_dummies,Dal_Model_df[Dal_factor_vars])
Dal_numeric_vars <- names(Dal_Model_df)[which(sapply(Dal_Model_df,is.numeric))]
Dal_num_df <- Dal_Model_df[Dal_numeric_vars]
Dal_Model_df1 <- cbind(Dal_num_df,Dal_1_hot)
Y_lasso <- Dal_Model_df1$Sales
X_train_lasso <- Dal_Model_df1[,!(names(Dal_Model_df1) %in% drop_sales)]
CARET.TRAIN.CTRL <- trainControl(method="repeatedcv",
number=5,
repeats=5,
verboseIter=FALSE)
##### Lasso #####
Dal_lasso_1 <- train(x=X_train_lasso,y=Y_lasso,
method="glmnet",
metric="RMSE",
maximize=FALSE,
trControl=CARET.TRAIN.CTRL,
tuneGrid=expand.grid(alpha=1,lambda=seq(0, 100, by = 0.1)))
##### Random Forest #####
tg <- data.frame(mtry= seq(2,30, by=5))
Dal_rf_1 <- train(x=X_train,y=Y,
method="rf",
metric="RMSE",
maximize=FALSE,
trControl=CARET.TRAIN.CTRL,
tuneGrid = tg)
##### Comparing Models #####
model_list <- list(lasso1=Dal_lasso_1,rf1=Dal_rf_1)
res<-resamples(model_list)
res_table <-as.data.frame(summary(res)$statistics)
model | RMSE.Min. | RMSE.1st.Qu. | RMSE.Median | RMSE.Mean | RMSE.3rd.Qu. | RMSE.Max. | RMSE.NA.s |
---|---|---|---|---|---|---|---|
lasso1 | 827.04 | 881.64 | 930.87 | 957.34 | 1021.80 | 1196.92 | 0 |
rf1 | 598.58 | 661.04 | 732.47 | 783.25 | 834.22 | 1195.26 | 0 |
model | Rsquared.Min. | Rsquared.1st.Qu. | Rsquared.Median | Rsquared.Mean | Rsquared.3rd.Qu. | Rsquared.Max. | Rsquared.NA.s |
---|---|---|---|---|---|---|---|
lasso1 | 0.60 | 0.69 | 0.73 | 0.71 | 0.75 | 0.77 | 0 |
rf1 | 0.72 | 0.81 | 0.82 | 0.82 | 0.85 | 0.87 | 0 |
The results below show how the Lasso and random forest models performed with the top 9 variables from the Boruta analysis. Based on these results, the random forest model again significantly outperformed the Lasso regression model, with a mean R^2 of 82% vs. 70%.
##### Use variables from the Boruta analysis to build models
Dal_Model_df <- Dallas%>%
select(c(Sales,WeekOfYear,since_jul_snow, Date..Year., heating_days, DayOfWeek,HolidayText,mean_temp,min_humid,max_dew_pt))
Dal_factor_vars <- names(Dal_Model_df)[which(sapply(Dal_Model_df,is.factor))]
Dal_dummies <- dummyVars(~.,Dal_Model_df[Dal_factor_vars])
Dal_1_hot <- predict(Dal_dummies,Dal_Model_df[Dal_factor_vars])
Dal_numeric_vars <- names(Dal_Model_df)[which(sapply(Dal_Model_df,is.numeric))]
Dal_num_df <- Dal_Model_df[Dal_numeric_vars]
Dal_Model_df1 <- cbind(Dal_num_df,Dal_1_hot)
Y <- Dal_Model_df$Sales
Y_lasso <- Dal_Model_df1$Sales
drop_sales <- c("Sales")
X_train <- Dal_Model_df[,!(names(Dal_Model_df) %in% drop_sales)]
X_train_lasso <- Dal_Model_df1[,!(names(Dal_Model_df1) %in% drop_sales)]
set.seed(100)
CARET.TRAIN.CTRL <- trainControl(method="repeatedcv",
number=5,
repeats=5,
verboseIter=FALSE)
##### Lasso #####
Dal_lasso_2 <- train(x=X_train_lasso,y=Y_lasso,
method="glmnet",
metric="RMSE",
maximize=FALSE,
trControl=CARET.TRAIN.CTRL,
tuneGrid=expand.grid(alpha=1,lambda=seq(0, 100, by = 0.1)))
##### Random Forest #####
tg <- data.frame(mtry= seq(2,10, by=2))
Dal_rf_2 <- train(x=X_train,y=Y,
method="rf",
metric="RMSE",
maximize=FALSE,
importance = TRUE,
trControl=CARET.TRAIN.CTRL,
tuneGrid = tg)
##### Comparing new model with Boruta features
model_list2 <- list(lasso2=Dal_lasso_2,rf2=Dal_rf_2)
res2<-resamples(model_list2)
res2_table <-as.data.frame(summary(res2)$statistics)
model | RMSE.Min. | RMSE.1st.Qu. | RMSE.Median | RMSE.Mean | RMSE.3rd.Qu. | RMSE.Max. | RMSE.NA.s |
---|---|---|---|---|---|---|---|
lasso2 | 731.52 | 897.65 | 965.50 | 986.15 | 1061.58 | 1225.57 | 0 |
rf2 | 528.38 | 647.83 | 764.86 | 782.63 | 919.65 | 1074.78 | 0 |
model | Rsquared.Min. | Rsquared.1st.Qu. | Rsquared.Median | Rsquared.Mean | Rsquared.3rd.Qu. | Rsquared.Max. | Rsquared.NA.s |
---|---|---|---|---|---|---|---|
lasso2 | 0.63 | 0.68 | 0.70 | 0.70 | 0.73 | 0.76 | 0 |
rf2 | 0.72 | 0.81 | 0.83 | 0.82 | 0.83 | 0.87 | 0 |
The results below show how the Lasso and random forest models performed with the top 9 variables from the RFE analysis, compared with the four models above. Based on these results, the random forest model again significantly outperformed the Lasso regression model, with a mean R^2 of 82% vs. 68%. The results also show that reducing the number of variables does not have a significant effect on the predictive power of the random forest model: the mean RMSE and mean R^2 values change only very slightly between the full set of variables and the subsets chosen by the Boruta and RFE algorithms.
##### Use variables from the RFE analysis to build models
Dal_Model_df <- Dallas%>%
select(c(Sales,since_jul_snow, Date..Year.,DayOfWeek,Date..Month.,max_temp,gdegree, min_humid,HolidayText,heating_days,max_dew_pt))
Dal_factor_vars <- names(Dal_Model_df)[which(sapply(Dal_Model_df,is.factor))]
Dal_dummies <- dummyVars(~.,Dal_Model_df[Dal_factor_vars])
Dal_1_hot <- predict(Dal_dummies,Dal_Model_df[Dal_factor_vars])
Dal_numeric_vars <- names(Dal_Model_df)[which(sapply(Dal_Model_df,is.numeric))]
Dal_num_df <- Dal_Model_df[Dal_numeric_vars]
Dal_Model_df1 <- cbind(Dal_num_df,Dal_1_hot)
Y <- Dal_Model_df$Sales
Y_lasso <- Dal_Model_df1$Sales
drop_sales <- c("Sales")
X_train <- Dal_Model_df[,!(names(Dal_Model_df) %in% drop_sales)]
X_train_lasso <- Dal_Model_df1[,!(names(Dal_Model_df1) %in% drop_sales)]
##### Lasso #####
set.seed(100)
CARET.TRAIN.CTRL <- trainControl(method="repeatedcv",
number=5,
repeats=5,
verboseIter=FALSE)
Dal_lasso_3 <- train(x=X_train_lasso,y=Y_lasso,
method="glmnet",
metric="RMSE",
maximize=FALSE,
trControl=CARET.TRAIN.CTRL,
tuneGrid=expand.grid(alpha=1,lambda=seq(0, 100, by = 0.1)))
##### Random Forest #####
tg <- data.frame(mtry= seq(1,10, by=2))
Dal_rf_3 <- train(x=X_train,y=Y,
method="rf",
metric="RMSE",
maximize=FALSE,
importance = TRUE,
trControl=CARET.TRAIN.CTRL,
tuneGrid = tg)
model_list5 <- list(lasso1=Dal_lasso_1,lasso2=Dal_lasso_2,lasso3=Dal_lasso_3,
rf1=Dal_rf_1,rf2=Dal_rf_2,rf3=Dal_rf_3)
res5<-resamples(model_list5)
res5_table <-as.data.frame(summary(res5)$statistics)
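Beyond the summary table, caret's resamples objects have lattice plotting methods, so the cross-validated distributions for all six models can also be compared visually. A minimal sketch, assuming the res5 object above:

```r
# Side-by-side boxplots of resampled RMSE and R^2 for each model
bwplot(res5, metric = c("RMSE", "Rsquared"))
```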
model | RMSE.Min. | RMSE.1st.Qu. | RMSE.Median | RMSE.Mean | RMSE.3rd.Qu. | RMSE.Max. | RMSE.NA.s |
---|---|---|---|---|---|---|---|
lasso1 | 827.04 | 881.64 | 930.87 | 957.34 | 1021.80 | 1196.92 | 0 |
lasso2 | 731.52 | 897.65 | 965.50 | 986.15 | 1061.58 | 1225.57 | 0 |
lasso3 | 718.19 | 925.72 | 1021.49 | 1023.59 | 1107.34 | 1309.14 | 0 |
rf1 | 598.58 | 661.04 | 732.47 | 783.25 | 834.22 | 1195.26 | 0 |
rf2 | 528.38 | 647.83 | 764.86 | 782.63 | 919.65 | 1074.78 | 0 |
rf3 | 563.08 | 673.85 | 796.30 | 769.26 | 837.02 | 1011.40 | 0 |
model | Rsquared.Min. | Rsquared.1st.Qu. | Rsquared.Median | Rsquared.Mean | Rsquared.3rd.Qu. | Rsquared.Max. | Rsquared.NA.s |
---|---|---|---|---|---|---|---|
lasso1 | 0.60 | 0.69 | 0.73 | 0.71 | 0.75 | 0.77 | 0 |
lasso2 | 0.63 | 0.68 | 0.70 | 0.70 | 0.73 | 0.76 | 0 |
lasso3 | 0.58 | 0.65 | 0.68 | 0.68 | 0.71 | 0.73 | 0 |
rf1 | 0.72 | 0.81 | 0.82 | 0.82 | 0.85 | 0.87 | 0 |
rf2 | 0.72 | 0.81 | 0.83 | 0.82 | 0.83 | 0.87 | 0 |
rf3 | 0.77 | 0.80 | 0.82 | 0.82 | 0.84 | 0.88 | 0 |