Loading libraries
library(tidyverse)
library(plotly)
library(formattable)
library(DT)
library(RColorBrewer)
library(stringr)
Sample data
Checking dimensions and features.
## 4898 Items
## 13 Fields
## Names feature set:
## $integer
## [1] "X" "quality"
##
## $numeric
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol"
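The code behind this output is not shown. A minimal sketch of how such a listing could be produced follows, using a hypothetical three-row stand-in called `wine_toy` (the name and the reduced column set are assumptions; the values are copied from the structure listing in this notebook).

```r
# Hypothetical stand-in for the wine data frame (not the real 4898-row data).
wine_toy <- data.frame(
  X = 1:3,
  fixed.acidity = c(7, 6.3, 8.1),
  alcohol = c(8.8, 9.5, 10.1),
  quality = c(6L, 6L, 6L)
)

# Number of rows and columns, printed in the "Items / Fields" style.
cat(nrow(wine_toy), "Items\n")
cat(ncol(wine_toy), "Fields\n")

# Group the column names by class: one list entry per type.
features_by_class <- split(names(wine_toy), sapply(wine_toy, class))
print(features_by_class)
```

On the full data the same `split()` call would return the `$integer` and `$numeric` groups listed above.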
Structure
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
Summary statistics
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
Description of attributes:
- fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
- volatile acidity: the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste
- citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
- residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet
- chlorides: the amount of salt in the wine
- free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
- total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
- density: the density of wine is close to that of water, depending on the percent alcohol and sugar content
- pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
- sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
- alcohol: the percent alcohol content of the wine
- quality (score between 0 and 10)
Setting up custom functions and styles
# setting up plotly label, axis and text customizations
f1 <- list(
  family = "Old Standard TT, serif",
  size = 14,
  color = "grey"
)
f2 <- list(
  family = "Old Standard TT, serif",
  size = 10,
  color = "black"
)
a <- list(
  titlefont = f1,
  showticklabels = T,
  tickangle = -45,
  tickfont = f2
)
m <- list(
  l = 50,
  r = 50,
  b = 100,
  t = 100,
  pad = 4
)
# simple histogram
# x is a vector (e.g. data$quality); `data` is looked up outside the function
hist_plot <- function(x, bwidth,
                      xlabel, title,
                      fill = NULL,
                      color = NULL){
  # fall back to default colors when none are supplied
  if(is.null(fill)) fill <- 'orange'
  if(is.null(color)) color <- 'black'
  hp <- ggplot(data = data, mapping = aes(x = x))
  gp <- hp + geom_histogram(binwidth = bwidth,
                            fill = fill,
                            color = color,
                            size = 0.2,
                            alpha = 0.7,
                            show.legend = F) +
    xlab(xlabel) +
    ggtitle(title) +
    theme_minimal() +
    theme(legend.position = 'none',
          plot.title = element_text(family = 'Georgia',
                                    color = 'darkgrey',
                                    size = 14))
  # convert to an interactive plotly figure with the shared margin and axis styles
  ggplotly(gp) %>%
    layout(margin = m,
           xaxis = a,
           yaxis = a)
}
Univariate Analysis
In this section I am going to explore the distributions of features such as quality, residual sugar, volatile acidity, pH and more.
Plotting distribution for quality
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
Plotting distribution of volatile acidity
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
Plotting distribution of pH
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
Plotting distribution of total sulfur dioxide
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
Plotting distribution of alcohol
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
Plotting distribution of residual sugar
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
Plotting distribution of chlorides
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
Plotting distribution of density
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
Plotting distribution of sulphates
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
Bivariate Analysis
In this section I will check for trends and correlations between features. Comparing alcohol content across wine quality levels seems like an interesting comparison to begin with.
First, I’ll create the custom functions that I am going to use in this section.
# box plot of y grouped by x, filled by wine quality; `data` is looked up outside the function
box_plot <- function(x, y, xlabel, ylabel, title){
  gp <- ggplot(data = data, aes(x = x, y = y, fill = factor(quality)))
  bp <- gp + geom_boxplot(show.legend = F,
                          alpha = 0.7) +
    # `fun` replaces the deprecated `fun.y` argument (ggplot2 >= 3.3.0)
    stat_summary(fun = mean, geom = 'point',
                 shape = 23, show.legend = F) +
    xlab(xlabel) +
    ylab(ylabel) +
    ggtitle(title) +
    theme_minimal() +
    theme(legend.position = 'none',
          plot.title = element_text(family = 'Georgia',
                                    size = 14,
                                    color = 'darkgrey')) +
    scale_fill_brewer(palette = 'Spectral') +
    coord_flip()
  ggplotly(bp) %>%
    layout(margin = m,
           xaxis = a,
           yaxis = a)
}
scatter_plot <- function(x, y, xlabel, ylabel, title, alpha = NULL){
  if(is.null(alpha)) alpha <- 0.5
  gp <- ggplot(data = data, aes(x = x, y = y))
  sp <- gp + geom_jitter(shape = 21,
                         alpha = alpha,
                         stroke = 0.2) +
    xlab(xlabel) +
    ylab(ylabel) +
    ggtitle(title) +
    theme_minimal() +
    theme(plot.title = element_text(family = 'Georgia',
                                    size = 14,
                                    color = 'darkgrey'))
  ggplotly(sp) %>%
    layout(margin = m,
           xaxis = a,
           yaxis = a)
}
Plotting distribution of alcohol by wine quality
Checking the spread of alcohol values across all wine quality levels.
The table below shows the summary of alcohol content based on wine quality
Visualizing the table above. The plot shows a scatter plot of quality vs alcohol, overlaid with the mean (red line), the median (blue line), and the 25th and 75th quantiles (dotted lines).
There seems to be a positive relationship between wine quality and alcohol content: better-quality wines tend to have more alcohol.
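The code for the per-quality summary is not shown. A minimal base-R sketch of the idea follows: mean, median, and 25th and 75th quantiles of alcohol for each quality level. The data frame `wine_toy`, its values, and the helper `summarise_alcohol` are hypothetical and stand in for the full wine data.

```r
# Hypothetical toy data; the real analysis would use the full wine data frame.
wine_toy <- data.frame(
  quality = c(5L, 5L, 5L, 6L, 6L, 6L, 7L, 7L, 7L),
  alcohol = c(9.0, 9.4, 9.8, 10.0, 10.4, 10.8, 11.0, 11.6, 12.2)
)

# Summary statistics for one group of alcohol values.
summarise_alcohol <- function(v) {
  c(mean   = mean(v),
    median = median(v),
    q25    = unname(quantile(v, 0.25)),
    q75    = unname(quantile(v, 0.75)))
}

# Split alcohol by quality, summarise each group, and stack into a matrix
# with one row per quality level.
groups <- split(wine_toy$alcohol, wine_toy$quality)
by_quality <- do.call(rbind, lapply(groups, summarise_alcohol))
print(by_quality)
```

The mean, median, and quantile columns are the same quantities the red, blue, and dotted lines in the plot represent.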
Now, let's check the correlation between density and alcohol.
# checking pearson's r for density vs alcohol
cor(data$density, data$alcohol, method = 'pearson')
## [1] -0.7801376
Density shows a strong negative correlation with alcohol, meaning that wines with less alcohol are denser. Let's also check how density values are dispersed across the different wine quality levels.
Checking frequency of density separated by quality of wine
Better-quality wines show lower density. The table below summarises the range of density for the different quality levels.
The line plot below visualises quality vs density and shows clearly that better-quality wines are less dense. (The red line is mean density, the blue line is median density, and the dotted lines are the 25th and 75th quantiles.)
Now let's move on to examine the correlation between alcohol and residual sugar.
# pearson's r for alcohol vs residual sugar
cor(data$residual.sugar, data$alcohol, method = 'pearson')
## [1] -0.4506312
Residual sugar and alcohol show a moderate but meaningful negative correlation. In other words, as the sugar content of a wine increases, its alcohol content tends to decrease.
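Whether a correlation of this size is distinguishable from zero can be checked with `cor.test()`, which reports a p-value and a confidence interval alongside Pearson's r. The sketch below is illustrative only: it runs on synthetic data with a built-in negative relationship (the coefficients and variable names are assumptions), not on the wine data.

```r
set.seed(42)

# Synthetic stand-ins: alcohol falls as residual sugar rises, plus noise.
n <- 200
residual_sugar <- runif(n, min = 0.6, max = 20)
alcohol <- 12 - 0.15 * residual_sugar + rnorm(n, sd = 1)

# Point estimate of Pearson's r, as computed elsewhere in this notebook.
r <- cor(residual_sugar, alcohol, method = "pearson")

# Hypothesis test of r = 0, with a 95% confidence interval.
test <- cor.test(residual_sugar, alcohol, method = "pearson")

cat("r =", round(r, 3), "\n")
cat("p-value =", signif(test$p.value, 3), "\n")
cat("95% CI:", round(test$conf.int, 3), "\n")
```

On the real data the equivalent call would be `cor.test(data$residual.sugar, data$alcohol)`.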
Let's check how residual sugar values are spread across the different wine quality levels.
Summarising residual sugar values by quality in a datatable.
Residual sugar values trend downward as wine quality increases, meaning that better-quality wines tend to be less sweet.
Now let's check whether alcohol and total sulfur dioxide are related.
# pearson's r for alcohol vs total sulfur dioxide
cor(data$total.sulfur.dioxide, data$alcohol, method = 'pearson')
## [1] -0.4488921
Total sulfur dioxide and alcohol show a meaningful negative relationship: wines with more alcohol tend to have less total sulfur dioxide.
Let's check how total sulfur dioxide content varies with wine quality. The frequency plot shows how total SO2 is distributed for each quality level.
Checking the range of total SO2 values for each quality level.
Summarising the mean, median, min and max values of total SO2 by wine quality.
Summary
Density and alcohol showed the strongest correlation of all the variable pairs, with a Pearson's r of -0.78.
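One way to confirm which pair of variables is most strongly correlated is to compute the full correlation matrix and pick the off-diagonal entry with the largest absolute value. The sketch below shows this on synthetic data; the data frame `toy` and its coefficients are assumptions chosen so that density depends on alcohol, and the result says nothing about the real wine data.

```r
set.seed(1)

# Synthetic stand-ins: density is driven by alcohol, pH is independent noise.
n <- 300
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
toy <- data.frame(
  alcohol = alcohol,
  density = 1.02 - 0.0025 * alcohol + rnorm(n, sd = 0.001),
  pH      = rnorm(n, mean = 3.2, sd = 0.15)
)

# Pairwise Pearson correlations between all columns.
cor_mat <- cor(toy, method = "pearson")

# Blank out the diagonal and the duplicated lower triangle.
cor_mat[lower.tri(cor_mat, diag = TRUE)] <- NA

# Locate the entry with the largest absolute correlation.
idx <- which(abs(cor_mat) == max(abs(cor_mat), na.rm = TRUE), arr.ind = TRUE)
strongest <- c(rownames(cor_mat)[idx[1, "row"]],
               colnames(cor_mat)[idx[1, "col"]])

cat("Strongest pair:", strongest, "\n")
cat("r =", round(cor_mat[idx[1, "row"], idx[1, "col"]], 3), "\n")
```

Applied to the numeric columns of the wine data frame, the same steps would show whether density and alcohol are indeed the strongest pair.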