Kaggle X Avocado Prices (2/2)

Chia-Hui (Alice) Liu
6 min read · Oct 14, 2018

For a recap of what we've done with avocado prices so far, please check the previous post here.

So… a quick recap of the assumptions we made last time.

  1. The AveragePrice varies across regions (which suggests that region plays a critical role in predicting AveragePrice), and the AveragePrice of conventional avocados rose from 2015 to 2018 regardless of region.
  2. Organic avocados are more expensive than conventional ones.
  3. The AveragePrice of avocados is affected by year, region, and type.

With the above three assumptions in hand, let's build a couple of machine/statistical learning models and see how well they predict the average prices of avocados!

Step#1 Non-numerical data conversion

Before building models, we need to convert non-numerical data into numerical form. Computers are not as sophisticated as human beings, so we need to help them a bit to understand non-numerical data by converting it into dummy or categorical variables. I'll show both ways so you get an idea of how each works.

  1. Convert data into dummy variables.

In this case, I chose to convert the column “type” into dummies, since dummy variables work best when a variable takes binary values. You can check the code and the converted results in the red rectangle.

Code for dummy variables
Figure 1. Convert type into dummy variables
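
Since the code lives in the screenshot above, here's a minimal sketch of this step, assuming the dataset is loaded into a pandas DataFrame named df (the variable name is my assumption, not necessarily the one in the notebook):

    import pandas as pd

    # "type" holds exactly two values, "conventional" and "organic",
    # so get_dummies produces one 0/1 column per value.
    dummies = pd.get_dummies(df['type'])
    df = pd.concat([df.drop('type', axis=1), dummies], axis=1)
    df[['conventional', 'organic']].head()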

  2. Convert data into categorical variables

Just to give you an idea of what the data in the column “region” looks like:

Code for visualizing regions

Figure 2. Visualization of the column region
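
For reference, a plot like Figure 2 could be produced along these lines (a sketch assuming matplotlib and the same df; the original notebook may have used a different plotting call):

    import matplotlib.pyplot as plt

    # Count how many rows each of the 54 regions contributes.
    df['region'].value_counts().plot(kind='bar', figsize=(14, 4))
    plt.title('Number of records per region')
    plt.show()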

As you can see in Figure 2 above, the data is spread pretty evenly across the 54 regions, though WestTexNewMexico differs slightly. Now we need to convert this column into a categorical variable.

Here, I used a pretty simple two-line snippet to solve this.

First, I convert the column's data type to categorical. This does not mean the data is now treated as numerical; it only means we've successfully marked the data as categorical while the values keep their original data type (in the case of region, still object/string, for now).

Figure 3. Convert region into a categorical variable

Next, we encode the categorical data in a numerical format. After doing so, the column region is categorical and encoded numerically!

Figure 4. Convert region to numerical categorical data
Code for categorical data conversion
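
In code, the two-line conversion might look like this (a sketch, again assuming df):

    # Step 1: mark the column as categorical; the underlying values
    # are still strings at this point.
    df['region'] = df['region'].astype('category')

    # Step 2: replace each category with its integer code.
    df['region'] = df['region'].cat.codes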

Step#2 DateTime data conversion

Both we and computers can easily read DateTime values (though, of course, we interpret them differently); however, we still need some conversion to bring the DateTime-related columns to a scale suitable for the model.

P.S. In this dataset, we have the columns “year” and “Date”. It is fine to treat year as numerical data, but we still need to scale Date to a more reasonable level.

In this case, I’ll re-scale Date into quarters.

Figure 5. Convert column Date into quarters
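
A minimal sketch of this conversion, assuming the quarter column is named Date_Q (the name that shows up in the feature lists later):

    import pandas as pd

    # Parse the Date strings and keep only the quarter (1-4).
    df['Date'] = pd.to_datetime(df['Date'])
    df['Date_Q'] = df['Date'].dt.quarter
    df = df.drop('Date', axis=1)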

Now we can check the new correlation matrix.

Figure 6. New Correlation Matrix
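
One way to draw such a matrix (a sketch using seaborn, which may differ from the original notebook's styling):

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Every column is numerical now, so corr() covers the whole frame.
    plt.figure(figsize=(10, 8))
    sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='coolwarm')
    plt.show()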

Step#3 Building models

Before building models, we need to set the target/dependent variable (denoted as Y), which is the column “AveragePrice” in the dataset. In other words, we take the other columns as the input/independent variables (denoted as X) and come up with a function that estimates the average of AveragePrice given specific input values.

Code for splitting X and Y
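
A sketch of that split, assuming the same df:

    # Y is the target; X is everything else.
    Y = df['AveragePrice']
    X = df.drop('AveragePrice', axis=1)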

From the previous steps, we've already converted all the data types into numerical format. Now we have to split the data into train and test datasets. In this example, I randomly chose 1/3 of the data as test data and the rest as train data.

Code for splitting train and test
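
With scikit-learn, the split might look like this (random_state is my addition for reproducibility, not necessarily in the original code):

    from sklearn.model_selection import train_test_split

    # Hold out 1/3 of the rows for testing.
    X_train, X_test, Y_train, Y_test = train_test_split(
        X, Y, test_size=1/3, random_state=42)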

Now, I'm going to demonstrate several models and compare their results. First, I'll use a multiple linear regression model as a baseline to give y'all an idea of what a machine/statistical learning model looks like. Then, I'll introduce a feature selection function to help us get a more accurate model.

  1. A multiple regression model with all input variables
Code for multiple linear regression w/ all input variables
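
A sketch of the baseline using statsmodels, whose summary() output matches the style of the table in Figure 7 (the exact call in the notebook may differ):

    import statsmodels.api as sm

    # Fit OLS on all input variables, with an intercept added.
    baseline = sm.OLS(Y_train, sm.add_constant(X_train)).fit()
    print(baseline.summary())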

Here's the result of the linear regression model fitted with all input variables.

Figure 7. Multiple linear regression with all input variables

I'll pick out some of the important columns in Figure 7 to explain.

  • R-squared (R²): explains how well the model fits the data, on a scale of 0 to 1; the higher the R², the better the fit. (Caveat: if R² is very close to 1, we may suspect an overfitting issue.)
  • F-statistic: the larger the F-statistic, the stronger the evidence that the model as a whole is significant. How large it needs to be before we call the model good at prediction depends in part on the number of data points.
  • coef: how much the estimated average of the output variable changes when the corresponding input variable X increases by one unit.
  • t: the t-score, which tells us how many standard errors the estimated coefficient lies away from zero.
  • P > |t|: the p-value, which indicates how strong the evidence is that X relates to Y (the smaller, the stronger).

As you can see here, not all the variables contribute significantly to the output variable. Hence, let's do feature selection to see which variables really contribute to the AveragePrice of avocados.

(a) Model Selection: based on the results of the baseline model

We can simply choose the input variables that have a significant relationship with Y. Based on Figure 7, we notice that 4 variables have significant relationships with Y: ‘conventional’, ‘organic’, ‘Date_Q’, and ‘year’. Hence, we rebuild the regression model to see how the revised model looks.

Figure 8. Multiple linear regression with 4 features
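
Refitting with only those four columns is a small change to the baseline sketch above:

    import statsmodels.api as sm

    # Keep only the features that were significant in the baseline.
    features = ['conventional', 'organic', 'Date_Q', 'year']
    model_4 = sm.OLS(Y_train, sm.add_constant(X_train[features])).fit()
    print(model_4.summary())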

As you can see in Figure 8, the R² is lower than the baseline's (which means the model fits the data worse than the baseline), but the good signs are that the F-statistic is larger and all the variables have significant relationships with Y (AveragePrice).

(b) Model Selection: f_regression

Other than referencing the baseline model, we can use dedicated functions to select representative variables. For example, the f_regression function in scikit-learn returns the F-statistic and p-value of each variable.

Code for f_regression
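
A sketch of that analysis (assuming the X_train/Y_train split from above):

    import pandas as pd
    from sklearn.feature_selection import f_regression

    # f_regression returns one F-statistic and one p-value per feature.
    f_stats, p_values = f_regression(X_train, Y_train)
    scores = pd.DataFrame({'feature': X_train.columns,
                           'F': f_stats,
                           'p-value': p_values})
    print(scores.sort_values('F', ascending=False))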

Here are the results of f_regression analysis.

Figure 9. F_regression analysis

And we can re-build the regression model based on the top 4 informative variables: ‘conventional’, ‘organic’, ‘4046’, ‘Total Volume’.

Figure 10. Regression model of f_regression analysis

Though the four features have significant relationships with Y, the F-statistic is lower than that of the model in Figure 8, and the R² is lower than in the previous two models.

(For the other 3 multiple regression models I tried, you can check my Kaggle code here.)

At this point, we may suspect that multiple regression is not the ideal model for estimating Y. So I decided to try other machine learning models, such as XGBoost, to predict the average prices of avocados.

Figure 11. XGBoost Regression model
Code for XGBoosted Regression
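
A sketch of the XGBoost model (the hyperparameters here are placeholders of mine; the original notebook's settings may differ):

    from sklearn.metrics import explained_variance_score, r2_score
    from xgboost import XGBRegressor

    # Gradient-boosted trees on the same train/test split as before.
    xgb = XGBRegressor(n_estimators=300, learning_rate=0.1)
    xgb.fit(X_train, Y_train)

    pred = xgb.predict(X_test)
    print('R^2:', r2_score(Y_test, pred))
    print('Explained variance:', explained_variance_score(Y_test, pred))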

As shown in Figure 11, the R² we obtained from the XGBoost regression model is 0.89, clearly higher than the previous models'. The model also explains 87.80% of the variance in the data. Hence, we chose the XGBoost regression model as our winning solution!

P.S. You can check the full version of the code on my GitHub.
