White Wine Quality

Loading libraries

library(tidyverse)
library(plotly)
library(formattable)
library(DT)
library(RColorBrewer)
library(stringr)

Sample data

Checking dimensions and features.

## 4898 Items
##  13 Fields
## Names feature set:
## $integer
## [1] "X"       "quality"
## 
## $numeric
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"
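Output of this shape can be produced with a few lines of base R. The following is a minimal sketch, not necessarily the code used here: `wine_head` is a hypothetical stand-in built from the first three rows of four columns shown in the structure listing, because the file name and loading call are not given.

```r
# Stand-in for the full data set (assumption: the real data is loaded
# elsewhere into a data frame). Values are the first three rows of
# four columns, as shown in the str() listing.
wine_head <- data.frame(
  X             = 1:3,
  fixed.acidity = c(7, 6.3, 8.1),
  alcohol       = c(8.8, 9.5, 10.1),
  quality       = c(6L, 6L, 6L)
)

# Dimensions
cat(nrow(wine_head), "Items\n")
cat(ncol(wine_head), "Fields\n")

# Group the column names by their storage class (integer vs numeric)
features_by_class <- split(names(wine_head), sapply(wine_head, class))
print(features_by_class)
```

Applied to the full data frame, the same `split()` call would list all eleven numeric features under `$numeric`.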

Structure

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

Summary statistics

##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

Description of attributes:

  • fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
  • volatile acidity: the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste
  • citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
  • residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet
  • chlorides: the amount of salt in the wine
  • free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
  • total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
  • density: the density of wine is close to that of water, depending on the percent alcohol and sugar content
  • pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
  • sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
  • alcohol: the percent alcohol content of the wine
  • quality (score between 0 and 10)

Setting up custom functions and styles

# setting up plotly label, axis and text customizations
f1 <- list(
  family = "Old Standard TT, serif",
  size = 14,
  color = "grey"
)
f2 <- list(
  family = "Old Standard TT, serif",
  size = 10,
  color = "black"
)
a <- list(
  titlefont = f1,
  showticklabels = TRUE,
  tickangle = -45,
  tickfont = f2
)

m <- list(
  l = 50,
  r = 50,
  b = 100,
  t = 100,
  pad = 4
)

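# simple histogram
# note: relies on a data frame named `data` in the calling environment;
# `x` is the column to plot (e.g. data$quality), `bwidth` the bin width
hist_plot <- function(x, bwidth, 
                      xlabel, title, 
                      fill = NULL, 
                      color = NULL){
  # fall back to default colours when none are supplied
  if(is.null(fill)) fill <- 'orange'
  if(is.null(color)) color <- 'black'
  
  hp <- ggplot(data = data, mapping = aes(x = x))
  gp <- hp + geom_histogram(binwidth = bwidth, 
                            fill = fill, 
                            color = color,
                            size = 0.2,
                            alpha = 0.7,
                            show.legend = FALSE) +
    xlab(xlabel) +
    ggtitle(title) +
    theme_minimal() +
    theme(legend.position = 'none',
          plot.title = element_text(family = 'Georgia',
                                    color = 'darkgrey',
                                    size = 14))
  
  # convert to an interactive plotly chart with the shared margin and axis styles
  ggplotly(gp) %>%
    layout(margin = m,
           xaxis = a,
           yaxis = a)
}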

Univariate Analysis

In this section I am going to explore the distributions of features such as quality, residual sugar, volatile acidity, pH and more.

Plotting distribution for quality

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

Plotting distribution of volatile acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

Plotting distribution of pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

Plotting distribution of total sulfur dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

Plotting distribution of alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20
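`hist_plot()` takes its bin width as an argument, and the value used for each plot is not shown here. One common rule of thumb, offered as an assumption rather than as the method used in this analysis, is the Freedman-Diaconis rule: width = 2 * IQR / n^(1/3). The sketch below applies it to the alcohol quartiles printed above and the 4898 rows reported earlier; `fd_binwidth` is a hypothetical helper name.

```r
# Freedman-Diaconis bin width: 2 * IQR / n^(1/3)
# (a rule of thumb, not necessarily what was used for the plots here)
fd_binwidth <- function(q1, q3, n) {
  2 * (q3 - q1) / n^(1 / 3)
}

# Quartiles of alcohol from the summary above; n from the dimensions check
bw_alcohol <- fd_binwidth(q1 = 9.5, q3 = 11.4, n = 4898)
print(round(bw_alcohol, 2))
```

With these inputs the rule suggests a width of roughly 0.22, so a round value such as 0.2 would be a reasonable starting point for `bwidth`.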

Plotting distribution of residual sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

Plotting distribution of chlorides

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

Plotting distribution of density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

Plotting distribution of sulphates

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

Bivariate Analysis

In this section I will check for trends and correlations between features. Comparing alcohol content across wine quality levels seems like an interesting comparison to begin with.

First, I’ll create the custom functions that I am going to use in this section.

# box plot of y grouped by x, filled by wine quality
# note: relies on a data frame named `data` in the calling environment
box_plot <- function(x, y, xlabel, ylabel, title){
  gp <- ggplot(data = data, aes(x = x, y = y, fill = factor(quality)))
  bp <- gp + geom_boxplot(show.legend = FALSE,
                          alpha = 0.7) +
    # mark the group mean; `fun` replaces the deprecated `fun.y` (ggplot2 >= 3.3.0)
    stat_summary(fun = mean, geom = 'point',
                 shape = 23, show.legend = FALSE) +
    xlab(xlabel) +
    ylab(ylabel) +
    ggtitle(title) +
    theme_minimal() +
    theme(legend.position = 'none',
          plot.title = element_text(family = 'Georgia',
                                    size = 14,
                                    color = 'darkgrey')) +
    scale_fill_brewer(palette = 'Spectral') +
    coord_flip()
  
  ggplotly(bp) %>%
    layout(margin = m,
           xaxis = a,
           yaxis = a)
}

scatter_plot <- function(x, y, xlabel, ylabel, title, alpha = NULL){
  if(is.null(alpha)) alpha <- 0.5
  gp <- ggplot(data = data, aes(x = x, y = y))
  sp <- gp + geom_jitter(shape = 21,
                         alpha = alpha,
                         stroke = 0.2) +
    xlab(xlabel) +
    ylab(ylabel) +
    ggtitle(title) +
    theme_minimal() +
    theme(plot.title = element_text(family = 'Georgia',
                                    size = 14,
                                    color = 'darkgrey'))
  
  ggplotly(sp) %>%
    layout(margin = m, 
           xaxis = a,
           yaxis = a)
}

Plotting distribution of alcohol by quality

Checking the spread of alcohol values across all wine quality levels.

The table below shows a summary of alcohol content by wine quality.
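A per-quality summary table of this kind can be built with `dplyr`. This is a minimal sketch under stated assumptions: `summarise_by_quality` is a hypothetical helper, the choice of statistics is a guess at what the table contains, and the `toy` rows are invented purely to make the example runnable (they are not taken from the wine data).

```r
library(dplyr)

# Summarise one column per quality level.
# `column` is passed unquoted, e.g. summarise_by_quality(data, alcohol).
summarise_by_quality <- function(df, column) {
  df %>%
    group_by(quality) %>%
    summarise(
      count        = n(),
      mean_value   = mean({{ column }}),
      median_value = median({{ column }}),
      q25          = quantile({{ column }}, 0.25, names = FALSE),
      q75          = quantile({{ column }}, 0.75, names = FALSE),
      .groups      = "drop"
    )
}

# Invented toy rows, for illustration only
toy <- data.frame(
  quality = c(5L, 5L, 6L, 6L, 7L, 7L),
  alcohol = c(9.0, 9.4, 10.0, 10.6, 11.5, 12.1)
)

alcohol_by_quality <- summarise_by_quality(toy, alcohol)
print(alcohol_by_quality)
```

The resulting data frame could then be passed to `DT::datatable()` for an interactive table, since `DT` is already loaded above.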

Visualizing the table above. The plot shows a scatter plot of quality vs alcohol with an added summary of the mean (red line), the median (blue line) and the 25th and 75th quantiles (dotted lines).
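A plot of this kind can be sketched with `stat_summary()` layers on top of jittered points. The function name `quality_trend_plot` and the `toy` rows are assumptions made for illustration (the rows are invented, not wine data), and the `fun` argument assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)

# Jittered points plus per-quality summary lines:
# mean (red), median (blue), 25th and 75th quantiles (dotted).
quality_trend_plot <- function(df, y_col, ylabel) {
  ggplot(df, aes(x = quality, y = .data[[y_col]])) +
    geom_jitter(alpha = 0.3, width = 0.2, height = 0) +
    stat_summary(fun = mean, geom = "line", colour = "red") +
    stat_summary(fun = median, geom = "line", colour = "blue") +
    stat_summary(fun = function(v) quantile(v, 0.25, names = FALSE),
                 geom = "line", linetype = "dotted") +
    stat_summary(fun = function(v) quantile(v, 0.75, names = FALSE),
                 geom = "line", linetype = "dotted") +
    xlab("Quality") +
    ylab(ylabel) +
    theme_minimal()
}

# Invented toy rows, for illustration only
toy <- data.frame(
  quality = c(5L, 5L, 6L, 6L, 7L, 7L),
  alcohol = c(9.0, 9.4, 10.0, 10.6, 11.5, 12.1)
)

p <- quality_trend_plot(toy, "alcohol", "Alcohol (%)")
```

The same function would serve for the density and residual sugar plots by changing `y_col`, and the result can be wrapped in `ggplotly()` like the other plots here.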

There seems to be a positive relationship between wine quality and alcohol content. Better quality wines tend to have a higher alcohol content.

Now, let's check the correlation between density and alcohol.

# checking pearson's r for density vs alcohol
cor(data$density, data$alcohol, method = 'pearson')
## [1] -0.7801376

Density shows a strong negative correlation with alcohol (r ≈ -0.78), meaning that wines with less alcohol tend to be denser. Let's also check how density values are spread across the different wine qualities.

Checking frequency of density separated by quality of wine

Better quality wines tend to have lower density. The table below summarises the range of density for each wine quality.

The line plot below visualises quality vs density and shows that better quality wines tend to be less dense (red line is mean density, blue line is median density, dotted lines are the 25th and 75th quantiles).

Now let's move on to examine the correlation between alcohol and residual sugar.

# pearson's r for residual sugar vs alcohol
cor(data$residual.sugar, data$alcohol, method = 'pearson')
## [1] -0.4506312

Residual sugar and alcohol show a moderate negative correlation (r ≈ -0.45). In other words, wines with more residual sugar tend to have less alcohol.
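To make clear what the `cor()` calls in this section compute, here is Pearson's r written out by hand and checked against the built-in. The input vectors are the first ten values of `residual.sugar` and `alcohol` copied from the structure listing; `pearson_r` is a hypothetical helper name. Ten rows will not reproduce the full-data value, so the sketch only confirms that the two computations agree.

```r
# First ten rows of two columns, copied from the str() listing
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)
alcohol        <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)

# Pearson's r: sum of products of deviations from the mean, divided by
# the square root of the product of the summed squared deviations
pearson_r <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

r_manual  <- pearson_r(residual_sugar, alcohol)
r_builtin <- cor(residual_sugar, alcohol, method = "pearson")
print(c(manual = r_manual, builtin = r_builtin))
```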

Let's check how residual sugar values are spread across the different wine qualities.

Summarising residual sugar values by quality in a datatable.

Residual sugar values tend to decline as the quality of wine increases, meaning that better quality wines seem to be less sweet.

Now let's check whether alcohol and total sulfur dioxide are related.

# pearson's r for alcohol vs total sulfur dioxide
cor(data$total.sulfur.dioxide, data$alcohol, method = 'pearson')
## [1] -0.4488921

Total sulfur dioxide and alcohol show a moderate negative correlation (r ≈ -0.45). Wines with more alcohol seem to have less total sulfur dioxide.

Let's check how total sulfur dioxide content varies with wine quality. The frequency plot shows how total SO2 is distributed for each quality of wine.

Checking range of Total SO2 values for each quality of wine.

Summarising the mean, median, min and max values of total SO2 by wine quality.

Summary

Of the pairs examined, density and alcohol showed the strongest correlation, with a Pearson's r of -0.78.
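Only three pairs were checked above, so one way to confirm which pair is strongest overall is to scan the full correlation matrix. The following is a sketch under stated assumptions: `strongest_pair` is a hypothetical helper, and `first_rows` holds only the first ten values of four columns from the structure listing, so its result says nothing about the full data set.

```r
# Return the pair of columns with the largest absolute Pearson correlation
strongest_pair <- function(df) {
  r <- cor(df, method = "pearson")
  r[upper.tri(r, diag = TRUE)] <- NA   # keep each pair once, drop the diagonal
  idx <- which(abs(r) == max(abs(r), na.rm = TRUE), arr.ind = TRUE)[1, ]
  list(
    pair = c(rownames(r)[idx[["row"]]], colnames(r)[idx[["col"]]]),
    r    = r[idx[["row"]], idx[["col"]]]
  )
}

# First ten rows of four columns, copied from the str() listing
first_rows <- data.frame(
  fixed.acidity  = c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  pH             = c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)

best <- strongest_pair(first_rows)
print(best)
```

On the full data the call would be along the lines of `strongest_pair(data[, -1])`, dropping the row index `X` so that it does not enter the comparison.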