Kaggle X Avocado Prices (1/2)

Sep 1, 2018 (7 min read)


This post walks through a data science workflow, using the Kaggle dataset Avocado Prices as a demonstration. After reading this series, you'll know:

How to describe data with Python.
How to choose a machine/statistical learning model based on the distribution of the data.
How to evaluate the performance of the model.

Each of these three topics involves a lot of detail, so I'll split them across separate posts. This post covers the first one: how to use Python to describe the data.

Required packages:
pandas, os, matplotlib, seaborn (install pandas, matplotlib, and seaborn with pip install; os is part of the Python standard library and needs no installation)
Column descriptions of the dataset (Avocado prices)
- 4046: Small Hass
- 4225: Large Hass
- 4770: Extra Large Hass

Step 1. Read the .csv file
Once you have imported the pandas package, you can use all the functions in it. The first function you need, to get the data out of the csv file, is read_csv().

After read_csv() has loaded your data into a dataframe (df), you can use other functions to help you understand the data.
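Here is a minimal sketch of this step. To keep it runnable without the download, it reads a few made-up rows from an in-memory string; the column names follow the dataset, but the values and the file name avocado.csv in the comment are assumptions for illustration.

```python
import io

import pandas as pd

# A few made-up rows laid out like some of the dataset's columns.
# With the real download you would pass the file path instead,
# e.g. pd.read_csv("avocado.csv") (file name assumed).
csv_text = """Date,AveragePrice,Total Volume,type,year,region
2015-01-04,1.10,50000.0,conventional,2015,RegionA
2015-01-11,1.25,48000.0,conventional,2015,RegionA
2016-01-03,1.60,1200.0,organic,2016,RegionB
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first rows of the dataframe
```

read_csv() accepts either a path or a file-like object, which is why the in-memory string works here.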

Step 2. Simple Exploratory Data Analysis with Pandas

For example, sample(n) randomly picks n rows from the dataframe (df). If n is not given, it defaults to 1.
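A small sketch of sample(), using a made-up dataframe in place of the real one. The random_state argument is optional; it is only there so the draw is repeatable.

```python
import pandas as pd

# Made-up values standing in for the real dataframe
df = pd.DataFrame({
    "AveragePrice": [1.10, 1.25, 1.40, 1.62, 1.75],
    "type": ["conventional", "conventional", "conventional", "organic", "organic"],
})

one_row = df.sample()                         # n defaults to 1
three_rows = df.sample(n=3, random_state=0)   # three distinct rows, repeatable draw

print(one_row)
print(three_rows)
```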

Also, the dtypes attribute (written without parentheses, since it is not a function) gives you the data type of each column.

The function info() is used to check the data types and the non-null counts of the columns.
As the screenshot below shows, there are 18,249 data points in the file, and none of the columns has a null value (which is nice). More directly, you can chain isnull().sum() to count the null values in each column.
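The three checks above can be sketched together like this, again on a tiny made-up dataframe rather than the real file.

```python
import pandas as pd

# Made-up values standing in for the real dataframe
df = pd.DataFrame({
    "Date": ["2015-01-04", "2015-01-11", "2015-01-18"],
    "AveragePrice": [1.10, 1.25, 1.40],
    "year": [2015, 2015, 2015],
})

print(df.dtypes)   # attribute: data type of each column

df.info()          # prints row count, non-null count and dtype per column

null_counts = df.isnull().sum()   # number of missing values per column
print(null_counts)
```

If every entry of null_counts is 0, the dataframe has no missing values.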

Once we know there are no null values in the data, we can dig deeper into it.

For instance, describe() gives you summary statistics (count, mean, standard deviation, min, quartiles, max) for the numerical columns.
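A short sketch of describe() on made-up values. Note that the text column is left out of the summary, because describe() only covers numeric columns by default when any are present.

```python
import pandas as pd

# Made-up values standing in for the real dataframe
df = pd.DataFrame({
    "AveragePrice": [1.0, 1.2, 1.4, 1.6, 1.8],
    "Total Volume": [100.0, 200.0, 300.0, 400.0, 500.0],
    "type": ["conventional", "conventional", "organic", "organic", "organic"],
})

summary = df.describe()   # one column of statistics per numeric column
print(summary)

# Individual statistics can be looked up by row label
mean_price = summary.loc["mean", "AveragePrice"]
print(mean_price)
```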

On the other hand, for non-numerical columns, such as textual columns, you can use value_counts() to see the data. (Please note, the function value_counts() can be used in numerical columns as well.)

In this case, I'll use just one of the textual columns to demonstrate the function.
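As a sketch, here is value_counts() applied to a made-up version of the "type" column. Picking "type" is my assumption for illustration; any textual column works the same way.

```python
import pandas as pd

# Made-up values standing in for a textual column of the real dataframe
df = pd.DataFrame({
    "type": ["conventional", "organic", "conventional", "conventional", "organic"],
})

counts = df["type"].value_counts()                 # rows per distinct value
shares = df["type"].value_counts(normalize=True)   # same, as fractions of the total

print(counts)
print(shares)
```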

Okay! I hope you are still with me right now. Now you've learned some basic functions in Pandas to help you do the analysis on the data.

Now we are going to make the analysis more intuitive to people. We'll use visualizations to demonstrate the distribution of the data!

Before analyzing the data, we need to change the data type of the column "Date". It is currently stored as Object (string), and it must be converted to datetime before we can do any time series analysis. The function to_datetime() does this conversion.
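A minimal sketch of the conversion, assuming the dates are stored as year-month-day strings. The dates and prices below are made up, and the extra "month" column is only there to show what the datetime type makes possible.

```python
import pandas as pd

# Made-up values; "Date" starts out as plain strings
df = pd.DataFrame({
    "Date": ["2015-12-27", "2016-01-03", "2017-06-11"],
    "AveragePrice": [1.10, 1.25, 1.40],
})

df["Date"] = pd.to_datetime(df["Date"])   # string -> datetime64
print(df.dtypes)

# Datetime columns expose date parts through the .dt accessor
df["month"] = df["Date"].dt.month
print(df)
```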

Since our final goal is to predict/analyze price, our visualizations will focus on the relationship between the column "AveragePrice" and the other columns.

For the visualizations, I mainly use two packages: seaborn and matplotlib.

1. The distribution of the column "AveragePrice"

Code:
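As a rough sketch of this step (not necessarily the code behind the figure), a histogram of AveragePrice could be drawn like this. It uses plain matplotlib instead of seaborn to keep the dependencies small, and a handful of made-up prices instead of the real file.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up prices standing in for the real "AveragePrice" column
df = pd.DataFrame({
    "AveragePrice": [0.9, 1.0, 1.1, 1.1, 1.2, 1.2, 1.3, 1.3, 1.4, 1.5, 1.8, 2.4],
})

fig, ax = plt.subplots(figsize=(8, 4))
# hist() returns the count per bin and the bin edges, plus the drawn bars
counts, bin_edges, _ = ax.hist(df["AveragePrice"], bins=6)
ax.set_xlabel("AveragePrice")
ax.set_ylabel("Number of rows")
ax.set_title("Distribution of AveragePrice")
```

Call plt.show() or fig.savefig(...) afterwards to see the result.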

Plot:

Explanation: Most AveragePrice values fall between 1.0 and 1.5, and the distribution looks roughly normal. We can also use a box plot to see the spread of the AveragePrice of avocados!

Code:
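One way this step could look, as a sketch on made-up prices using plain matplotlib. It also computes the usual 1.5 x IQR rule by hand, which is the rule matplotlib applies by default to decide which points are drawn outside the whiskers.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up prices; 3.0 is deliberately far from the rest
prices = pd.Series([0.9, 1.0, 1.1, 1.2, 1.2, 1.3, 1.3, 1.4, 1.5, 3.0],
                   name="AveragePrice")

# Box-plot rule: points more than 1.5 * IQR beyond the quartiles are outliers
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1
is_outlier = (prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)
outliers = prices[is_outlier]
print(outliers.tolist())

fig, ax = plt.subplots(figsize=(4, 6))
ax.boxplot(prices.to_numpy())
ax.set_ylabel("AveragePrice")
```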

Plot:

Explanation:
Now you can see that some data points fall outside the whiskers of the box plot (on the right side of Figure 12). There are many possible explanations for these outliers, so let's go one step further and see how AveragePrice relates to the other columns!

2. Type vs. AveragePrice

As you probably know, organic products are usually more expensive than non-organic ones. Let's see whether this holds for avocados!

Code:

Plot:

Explanation:
Our assumption holds! There are 2 types of avocado: conventional and organic. Overall, the average price of organic avocados (around 1.6) is higher than that of conventional avocados (around 1.2). However, there are still some outliers in both the conventional and the organic group, so let's keep checking other columns!
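The comparison can also be checked numerically with a groupby. This is a sketch on made-up prices, so the numbers it prints are illustrative and not the dataset's actual figures.

```python
import pandas as pd

# Made-up values standing in for the real dataframe
df = pd.DataFrame({
    "type": ["conventional"] * 3 + ["organic"] * 3,
    "AveragePrice": [1.0, 1.2, 1.4, 1.4, 1.6, 1.8],
})

# One row per type, one column per statistic
by_type = df.groupby("type")["AveragePrice"].agg(["mean", "median", "min", "max"])
print(by_type)

price_gap = by_type.loc["organic", "mean"] - by_type.loc["conventional", "mean"]
print(price_gap)
```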

3. Conventional Avocado X Region X Year

So... let's look at the AveragePrice distribution of conventional avocados across regions and years!

Code:
(P.S. I sorted by AveragePrice to make the visualization more readable.)

Plot:

Explanation:
The factor plot gives you more information. For example, you may notice the following.

(1) PhoenixTucson has the lowest AveragePrice of conventional avocados from 2015 to 2018.
(2) BuffaloRochester has the highest AveragePrice of conventional avocados from 2015 to 2018.
(3) The region with the most variance in AveragePrice between 2015 and 2016 is GrandRapids.
(4) Overall conclusion: the AveragePrice varies across regions (which suggests that region plays an important role in predicting AveragePrice), and conventional avocados were getting more expensive from 2015 to 2018 regardless of region.
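Observations like these can be cross-checked with a pivot table. The sketch below uses made-up rows and hypothetical region names (RegionA, RegionB, RegionC), and sorts by price in the same spirit as the plot.

```python
import pandas as pd

# Made-up values; region names are placeholders
df = pd.DataFrame({
    "type": ["conventional"] * 6 + ["organic"] * 2,
    "region": ["RegionA", "RegionA", "RegionB", "RegionB",
               "RegionC", "RegionC", "RegionA", "RegionB"],
    "year": [2015, 2016, 2015, 2016, 2015, 2016, 2015, 2015],
    "AveragePrice": [1.0, 1.1, 1.5, 1.7, 1.2, 1.6, 1.9, 2.0],
})

conventional = df[df["type"] == "conventional"]

# Rows: regions, columns: years, cells: mean AveragePrice
table = conventional.pivot_table(index="region", columns="year",
                                 values="AveragePrice", aggfunc="mean")
table["overall"] = conventional.groupby("region")["AveragePrice"].mean()
table = table.sort_values("overall")   # cheapest region first
print(table)

cheapest = table.index[0]
priciest = table.index[-1]
most_variable = conventional.groupby("region")["AveragePrice"].var().idxmax()
print(cheapest, priciest, most_variable)
```

Replacing "conventional" with "organic" in the filter gives the same table for organic avocados.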

4. Organic Avocado X Region X Year

Okay! Let's see whether there is any other interesting information lying in the organic avocado data in terms of region and year.

Code:

Plot:

Explanation:
Now it's your turn to try it out! Based on Figure 18, answer the following questions.

(1) Which region has the lowest AveragePrice from 2015 to 2018?
(2) Which region has the highest AveragePrice from 2015 to 2018?
(3) Which region has the most variance in AveragePrice from 2015 to 2018?
(4) Taking Figure 16 into consideration, what can you conclude?

In addition, we can also check the overall AveragePrice aggregated by year.

Code:

Plot:

Explanation:
Organic avocados are clearly more expensive than conventional ones. However, the shaded bands around the two lines suggest that the average price differs a lot across regions and years.
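The yearly aggregation behind such a plot can be sketched with a groupby on year and type. The values are made up, and the standard deviation is used here only as a simple stand-in for the spread that the shaded bands hint at; the plot itself may use a different interval.

```python
import pandas as pd

# Made-up values standing in for the real dataframe
df = pd.DataFrame({
    "year": [2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016],
    "type": ["conventional", "conventional", "organic", "organic"] * 2,
    "AveragePrice": [1.0, 1.2, 1.5, 1.7, 1.1, 1.5, 1.6, 2.0],
})

grouped = df.groupby(["year", "type"])["AveragePrice"]

# Rows: years, columns: types
yearly_mean = grouped.mean().unstack("type")
yearly_std = grouped.std().unstack("type")

print(yearly_mean)
print(yearly_std)
```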

We now have a rough idea of how the data is distributed in each column, so finally we turn to the correlations among columns.

Just a quick reminder: we check the correlations among columns to get hints about how the variables relate to each other. Please note that correlation does not imply causation. If two variables are highly correlated, we can only conclude that there may be some relationship between them, not that one causes the other. For more clarification, please read the article "Causation vs. Correlation."

You can use the heat map to reason about possible relationships between variables. This is important because knowing the possible relationships among variables helps us decide which model to build. So why not start exploring the relationships among the variables!
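A minimal sketch of a correlation heat map, assuming Pearson correlation on the numeric columns. It draws the matrix with matplotlib's imshow() rather than seaborn, and the values are made up, so the correlations it shows say nothing about the real data.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values standing in for the real dataframe
df = pd.DataFrame({
    "AveragePrice": [1.0, 1.2, 1.4, 1.6, 1.8],
    "Total Volume": [500.0, 420.0, 380.0, 250.0, 200.0],
    "Total Bags": [50.0, 45.0, 35.0, 30.0, 18.0],
    "type": ["conventional", "conventional", "organic", "organic", "organic"],
})

# Keep numeric columns only, then compute pairwise Pearson correlations
corr = df.select_dtypes(include="number").corr()
print(corr)

fig, ax = plt.subplots(figsize=(5, 4))
image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(image, ax=ax)   # color scale from -1 to 1
```

Values near 1 or -1 mark strongly related pairs, and values near 0 mark pairs with little linear relationship.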

Congrats! We have now finished the data analysis, and I believe you know a lot more about the data. The next article will show you how to build your own model to predict the price of avocados.

Feel free to leave comments below. I'm happy to discuss them with you!

Reference.
The original dataset: https://www.kaggle.com/neuromusic/avocado-prices


Written by Chia-Hui (Alice) Liu