Image credit: Youtube

Cocktail recipes analysis

Introduction

For this first blog post I thought I’d take a recent #TidyTuesday dataset and do some analysis of it. Where better to start than one of my favourite things? Cocktails! You can find the dataset and information about it here. Basically it’s a list of cocktail recipes from the most popular drink-mixing guide in America, called the “Mr Boston Offical Bartender’s Guide”, which first published in 1935. This is from the latest edition.

Firstly I’d like to acknowledge Julia Silge and David Robinson, whose approaches to analysing this dataset I enjoyed. In this post I’ve taken some of their ideas and then developed them for my own purposes.

I will first clean the data, do some exploratory analysis, then apply some unsupervised learning techniques to see what patterns we can find.

Data pre-processing

Let’s load the cocktails receipes dataset and take a quick look at it:

boston_cocktails <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-26/boston_cocktails.csv")

boston_cocktails %>% head()
## # A tibble: 6 x 6
##   name         category        row_id ingredient_number ingredient       measure
##   <chr>        <chr>            <dbl>             <dbl> <chr>            <chr>  
## 1 Gauguin      Cocktail Class…      1                 1 Light Rum        2 oz   
## 2 Gauguin      Cocktail Class…      1                 2 Passion Fruit S… 1 oz   
## 3 Gauguin      Cocktail Class…      1                 3 Lemon Juice      1 oz   
## 4 Gauguin      Cocktail Class…      1                 4 Lime Juice       1 oz   
## 5 Fort Lauder… Cocktail Class…      2                 1 Light Rum        1 1/2 …
## 6 Fort Lauder… Cocktail Class…      2                 2 Sweet Vermouth   1/2 oz

So basically on each line we have the name of the cocktail, its category, then each ingredient in it and the amount of it required. There are 989 drinks in this dataset featuring 569 unique ingredients.

Let’s see what the 20 most common ingredients are:

boston_cocktails %>%
  count(ingredient, sort = TRUE) %>%
  slice(1:20)
## # A tibble: 20 x 2
##    ingredient            n
##    <chr>             <int>
##  1 Gin                 176
##  2 Fresh lemon juice   138
##  3 Simple Syrup        115
##  4 Vodka               114
##  5 Light Rum           113
##  6 Dry Vermouth        107
##  7 Fresh Lime Juice    107
##  8 Triple Sec          107
##  9 Powdered Sugar       90
## 10 Grenadine            85
## 11 Sweet Vermouth       83
## 12 Brandy               80
## 13 Lemon Juice          70
## 14 Blanco tequila       69
## 15 Egg White            67
## 16 Angostura Bitters    63
## 17 Juice of a Lemon     60
## 18 Pineapple Juice      47
## 19 Bourbon whiskey      38
## 20 Orange Bitters       38

Look at entries 2, 13 and 17. It’s all just lemon juice with different names! We need to do some data cleaning. In the next section we modify the ingredient column to deal with various issues described in the in-line comments.

# Adapted from Julia Silge's code
cocktails_parsed <- boston_cocktails %>%
  mutate(
    ingredient = str_to_lower(ingredient),  # make everything lower case
    ingredient = str_replace_all(ingredient, "-", " "), # Remove dashes
    # Remove some stop words
    ingredient = str_remove(ingredient, " liqueur$"),
    ingredient = str_remove(ingredient, " (if desired)$"),
    # Consolidate all types of bitters into 'bitters', lemons to 'lemon juice', whiskies to 'whiskey', etc
    ingredient = case_when(
      str_detect(ingredient, "bitters") ~ "bitters",
      str_detect(ingredient, "lemon") ~ "lemon juice",
      str_detect(ingredient, "lime") ~ "lime juice",
      str_detect(ingredient, "grapefruit") ~ "grapefruit juice",
      str_detect(ingredient, "orange") ~ "orange juice",
      str_detect(ingredient, "whiskey") ~ "whiskey",
      str_detect(ingredient, "rum") ~ "rum",
      TRUE ~ ingredient
    ),
    # Looks like units of bitters are erroneously given as ounces - change this to 'dash'
    measure = case_when(
      str_detect(ingredient, "bitters") ~ str_replace(measure, "oz$", "dash"),
      TRUE ~ measure
    ),
    # Change fractions to decimals in the measure column
    measure = str_replace(measure, " ?1/2", ".5"),
    measure = str_replace(measure, " ?3/4", ".75"),
    measure = str_replace(measure, " ?1/4", ".25"),
    # parse the number out of the measure, we can use this later
    measure_number = parse_number(measure),
    measure_number = if_else(str_detect(measure, "dash$"),
      measure_number / 50,
      measure_number
    )
  )
cocktails_parsed
## # A tibble: 3,643 x 7
##    name    category   row_id ingredient_numb… ingredient  measure measure_number
##    <chr>   <chr>       <dbl>            <dbl> <chr>       <chr>            <dbl>
##  1 Gauguin Cocktail …      1                1 rum         2 oz              2   
##  2 Gauguin Cocktail …      1                2 passion fr… 1 oz              1   
##  3 Gauguin Cocktail …      1                3 lemon juice 1 oz              1   
##  4 Gauguin Cocktail …      1                4 lime juice  1 oz              1   
##  5 Fort L… Cocktail …      2                1 rum         1.5 oz            1.5 
##  6 Fort L… Cocktail …      2                2 sweet verm… .5 oz             0.5 
##  7 Fort L… Cocktail …      2                3 orange jui… .25 oz            0.25
##  8 Fort L… Cocktail …      2                4 lime juice  .25 oz            0.25
##  9 Apple … Cordials …      3                1 apple schn… 3 oz              3   
## 10 Apple … Cordials …      3                2 cinnamon s… 1 oz              1   
## # … with 3,633 more rows

Let’s have another look at the most frequent ingredients:

cocktails_parsed %>%
  count(ingredient, sort = TRUE)
## # A tibble: 405 x 2
##    ingredient       n
##    <chr>        <int>
##  1 lemon juice    311
##  2 rum            184
##  3 lime juice     180
##  4 gin            176
##  5 bitters        166
##  6 whiskey        138
##  7 orange juice   124
##  8 simple syrup   115
##  9 vodka          114
## 10 dry vermouth   107
## # … with 395 more rows

We now have 405 of them, which is still too many. Let’s filter for ingredients that appear at least 15 times:

cocktails_parsed_filtered <- cocktails_parsed %>%
  add_count(ingredient) %>% # get a count of ingredients so we can filter out rare ones
  filter(n >= 15) %>%  # ingredient must appear at least 15 times
  select(-n) %>%
  distinct(row_id, ingredient, .keep_all = TRUE) %>%    # to stop repetitions e.g. if we had lemon slice and lemon juice in a recipe then we will end up with 2 'lemon's as per our mutate() calls earlier
  na.omit()

num_ingreds <- cocktails_parsed_filtered %>%
  distinct(ingredient) %>%
  summarise(n = n()) %>%
  as.integer()

num_ingreds
## [1] 39

So we now have 39 different ingredients in our dataset. Time for some exploratory analysis.

Exploratory Data Analysis

Let’s take another look at the most common ingredients, this time as a graph:

cocktails_parsed_filtered %>%
        count(ingredient, sort = TRUE) %>%
        head(20) %>%
        mutate(ingredient = fct_reorder(ingredient, n)) %>%
        ggplot(aes(n, ingredient)) +
        geom_col() +
        labs(title = "Top 20 most common ingredients in cocktail recipes",
             x = "Count") +
        theme(axis.text = element_text(size = 13),
              axis.title = element_text(size = 15))

This looks much better. Unsurprisingly lemon juice is at the top, while the most common alcoholic ingredient is gin closely followed by rum. Bitters, whiskey and vodka are also very popular, which won’t come as any surprise if you go to bars quite frequently like I do.

Let’s see how these numbers are broken down by another of the variables, category of cocktail, shown in colour in the next plot. I’ve changed the x-axis to “Percentage of all recipes” so we get a better idea of how common they really are:

n_recipes <- n_distinct(cocktails_parsed_filtered$name)  # get the total no.recipes

cocktails_parsed_filtered %>%
        mutate(category = fct_lump_n(category, 8),  # Get the 6 most frequent categories and lump the rest into 'Other'
               ingredient = fct_lump_n(ingredient, 20)) %>%
        filter(ingredient != "Other") %>%
        count(category, ingredient, sort = TRUE) %>%  # D. Robinson included this line a few lines too early, e.g. vodka was erroneously being included in 'Other'
        mutate(ingredient = fct_reorder(ingredient, n, sum)) %>%
        ggplot(aes(n/n_recipes, ingredient, fill = category)) +   # turn x-axis into %
        geom_col() +
        scale_x_continuous(labels = percent_format()) +
        scale_fill_brewer(palette = "Set1") + # change default colour scheme
        labs(title = "Most common ingredients broken down by drink category",
             x = "% of all recipes",
             y = "Ingredient",
             fill = "Category") +
        theme(axis.text = element_text(size = 13),
              axis.title = element_text(size = 15))

I think this break-down by category isn’t particularly reliable - for example, most drinks containing gin aren’t in the ‘Gin’ category. The ‘Cocktail Classics’ category seems to include plenty of drinks which could be placed in other categories. Perhaps a better way to organise this data would have been to allow multiple categories per drink?…We’ll revisit this later.

Co-occurrence of ingredients

Let’s look at another question now: Which ingredients tend to appear together in the same recipe?

First we calculate correlation between ingredients:

library(widyr)
library(tidytext)

ingredient_pairs <- cocktails_parsed_filtered %>%
        add_count(ingredient) %>%   # like group_by then count
        filter(n >= 5) %>% # filter only for ingredients that appear in min. 5 recipes
        pairwise_cor(ingredient, name, sort = TRUE)

ingredient_pairs
## # A tibble: 1,482 x 3
##    item1          item2          correlation
##    <chr>          <chr>                <dbl>
##  1 whole egg      powdered sugar       0.361
##  2 powdered sugar whole egg            0.361
##  3 port           powdered sugar       0.278
##  4 powdered sugar port                 0.278
##  5 dry vermouth   gin                  0.276
##  6 gin            dry vermouth         0.276
##  7 egg yolk       powdered sugar       0.251
##  8 powdered sugar egg yolk             0.251
##  9 lime juice     rum                  0.251
## 10 rum            lime juice           0.251
## # … with 1,472 more rows

Now let’s see which ingredients are most closely correlated with some of the most popular ones:

ingredient_pairs %>%
        filter(item1 %in% c("lemon juice", "gin", "rum", "whiskey", "vodka")) %>%
        group_by(item1) %>%
        top_n(8, correlation) %>%
        ungroup() %>%   # wasn't included in D. Robinson's code, but is needed
        mutate(item2 = reorder_within(item2, by = correlation, within = item1)) %>%  # need the 'within' argument for faceting
        ggplot(aes(correlation, item2)) +
        geom_col() +
        facet_wrap(~ item1, scales = "free_y") +
        scale_y_reordered() +
        labs(title = "Which ingredients are most correlated with selected ones?",
             y = "Most correlated ingredients") +
        theme(text = element_text(size= 12))

These plots make sense: gin is often found with vermouth, which is how you make martinis and negronis; lemon juice is often found with sugar/syrup to balance out the acidity; while rum and lime juice is a classic combination as are vodka/cranberry juice and whiskey/bitters.

Incidentally I didn’t realise that gin is often combined with apricot-flavoured brandy but further investigation reveals that there are 18 such drinks!:

cocktails_parsed_filtered %>%
  group_by(name) %>%  
  filter(all(c("gin", "apricot flavored brandy") %in% ingredient)) %>%
  arrange(name) %>%
  distinct(name)
## # A tibble: 18 x 1
## # Groups:   name [18]
##    name                    
##    <chr>                   
##  1 Apricot Anise Collins   
##  2 Apricot Anisette Collins
##  3 Apricot Cocktail        
##  4 Bermuda Rose            
##  5 Fairy Belle Cocktail    
##  6 Favorite Cocktail       
##  7 Favourite Cocktail      
##  8 Frankenjack Cocktail    
##  9 Golden Dawn             
## 10 K. G. B. Cocktail       
## 11 K.G.B. Cocktail         
## 12 Resolute Cocktail       
## 13 Webster Cocktail        
## 14 Wembley Cocktail        
## 15 Wembly Cocktail         
## 16 Western Rose            
## 17 What The Hell           
## 18 Why Not?

Network analysis

R has a couple of nice packages for displaying network graphs, ggraph and igraph. Let’s use them to plot the ‘connectedness’ of our data. Although not formally an unsupervised learning technique this should give us an idea as to what kind of clusters exist in the data: ingredients which are often found in the same drink will be more closely connected.

library(ggraph)
library(igraph)
library(ggrepel)

ingredients_summarized <- cocktails_parsed_filtered %>%         
        group_by(name) %>%
        mutate(percentile = row_number() / n()) %>%  # Tells us relative position of ingredient in drink - accounts for fact that some recipes have many ingredients and some don't
        ungroup() %>%
        group_by(ingredient) %>%
        summarize(n = n(),
                  avg_position = mean(percentile), # get average position in drink of ingredient
                  avg_serving = mean(measure_number, na.rm = TRUE)) %>%  # get average serving measure of ingredient
        arrange(desc(n))

# Get top 70 correlated ingredients
top_cors <- ingredient_pairs %>% head(70)  

# Filter ingredients_summarised for only highly correlated ingredients
ingredient_info <- ingredients_summarized %>%
        filter(ingredient %in% top_cors$item1) %>%
        distinct()

# Generate network graph

set.seed(3)

top_cors %>%
        graph_from_data_frame(vertices = ingredient_info) %>%
        ggraph(layout = "fr") +
        geom_edge_link() +
        geom_node_text(aes(label = name), repel = TRUE) +
        geom_node_point(aes(size = 1.1* n)) +        
        geom_node_point(aes(size = n, color = avg_position)) +
        scale_color_gradient2(low = "blue", high = "red", midpoint = .63,
                              labels = scales::percent_format()) +
        labs(size = "# of recipes",
             color = "Avg position in drink",
             title = "Network of cocktail ingredients",
             subtitle = "Connected ingredients tend to appear in the same drink. A bluer data point indicates \nthat it tends to appear earlier in recipes.")

This is a nice plot! As before, we see popular combinations of ingredients (e.g. whiskey/bitters, gin/vermouth, lemon juice/sugar) but now we can see the full picture. Unsurprisingly the spirits tend to be quite far apart (except for gin and brandy) and surrounded by flavouring agents such as juices and soft liqueurs. The bluer colour of the spirits show that they tend to be earlier in the drink recipes.

If in doubt as to what kind of drink to make next time you’re hosting a party this might be a good starting point!

Unsupservised Learning

Lastly, we can try some more formal unsupervised learning techniques to investigate our data: Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbour Embedding (t-SNE).

Principal component analysis

PCA is a classic unsupervised learning technique. I won’t explain the details of it here, but for anyone unfamilar with it we can summarise the purpose of PCA as follows:

“When faced with a large set of correlated variables, PCA allows us to summarize this set with a smaller number of representative variables that collectively explain most of the variability in the original set…PCA also serves as a tool for data visualization.”
— An Introduction to Statistical Learning, by James et al

Essentially, PCA involves performing a linear mapping of the data to a lower-dimensional space such that it helps us understand where most of the variance came from in the original data. To illustrate this, see the following plot which shows the ingredients with the highest values (both positive and negative) in the first two principal components:

## Adapted from Julia Silge's code

# Pivot the data to get the ingredients as columns i.e. get a highly-dimensional dataset
cocktails_df <- cocktails_parsed_filtered %>%
  mutate(category = fct_lump_n(category, 7)) %>%
  select(-ingredient_number, -row_id, -measure) %>%
  pivot_wider(names_from = ingredient, values_from = measure_number, values_fill = 0) %>%
  janitor::clean_names() %>%
  arrange(name) %>%
  na.omit()


# Use new tidymodels package to run PCA
library(tidymodels)

pca_rec <- recipe(~., data = cocktails_df) %>%
  update_role(name, category, new_role = "id") %>%
  step_normalize(all_predictors()) %>%   # centers and scales all predictors
  step_pca(all_predictors())

pca_prep <- prep(pca_rec)

tidied_pca <- tidy(pca_prep, 2)

# Visualise the principal components
library(tidytext)

tidied_pca %>%
  filter(component %in% paste0("PC", 1:2)) %>%
  group_by(component) %>%
  top_n(8, abs(value)) %>%
  ungroup() %>%
  mutate(terms = reorder_within(terms, by = abs(value), within = component)) %>%  # from tidytext package - reorders column before plotting with faceting, such that the values are ordered within each facet...needs scale_x_reordered or scale_y_reordered to be used later
  ggplot(aes(value, terms, fill = value > 0)) +
  geom_col() +
  facet_wrap(~component, scales = "free_y") +
  scale_y_reordered() +  # see earlier comment
  labs(
    x = "Absolute value of contribution",
    y = NULL, fill = "Positive?"
  ) +
  guides(fill=FALSE) +
  theme(axis.text = element_text(size = 13),
        axis.title = element_text(size = 14))

We can infer from this that PC1 mainly tells us about the contrast between gin/sugar/egg drinks one one side versus the rum/lime/syrup drinks on the other. PC2 is mostly about the gin/vermouth drinks (i.e. martinis) versus the sweet/eggy/lemony drinks. This tallies with our network analysis from earlier, where you can make out these little clusters in the graph if you look closely:

t-SNE

Finally we come to t-SNE, a more modern technique for data dimensionality reduction and visualisation. Again I won’t go into the details here, but the key thing to know is that unlike PCA, which focusses on representing variance in the original data, t-SNE tries to find a two-dimensional represenation of the data that preserves the distances between points as best as possible. That is, instances which are close together in the original (high dimensional) feature space are kept close together, while instances which are far apart in the original feature space are kept far apart. We can use it to get an idea of what clusters exist in the original feature space.

Here we will apply t-SNE to a question that came up earlier - how much of an overlap is there between the Classic Cocktail category and the other ones? Below is the output of the t-SNE algorithm. Note that the x- and y-dimensions don’t have any intuitive meaning here, we are just interested in how close the data are to each other and whether we can discern any clusters.

library(Rtsne)

set.seed(42)

tsne_features <- cocktails_df %>% select(-c(1,2)) # Remove non-numeric features from the dataset, i.e. only data about the ingredient measures

tsne_obj <- Rtsne(tsne_features, perplexity = 20, check_duplicates = FALSE) # Run t-SNE

tt <- as_tibble(tsne_obj$Y)

ggplot(data = tt, mapping = aes(x = V1, y = V2)) +
        geom_point() +
      labs(title = "t-SNE plot of cocktail data. Each dot is a drink.",
           x = "Dimension 1",
           y = "Dimension 2")  

It looks like we do have some clusters here - do they correspond to the various categories of cocktail? In the next plot I have coloured each data point according to its category:

tt$name <- cocktails_df$name

ggplot(data = tt, mapping = aes(x = V1, y = V2, colour = cocktails_df$category, text = tt$name)) +
        geom_point() +
      scale_color_brewer(palette = "Set1") +
      labs(title = "t-SNE plot of cocktail data, overlaid by colour to show drink category",
          colour = "Category",
           x = "Dimension 1",
           y = "Dimension 2")  

This is another interesting plot. We see fairly distinct clusters of drinks for the whiskey, vodka, brandy, gin and rum-based cocktails, with a mixture in the middle around (0,0). Meanwhile the Cocktail Classics coloured in blue overlap with pretty much all the other categories, although not so much with the tequila-based drinks clustered in orange and the vodka-based drinks clustered in yellow. It turns out that only 7 of the Cocktail Classics contain tequila and only 12 contain vodka so clearly these are are considered less ‘classic’ ingredients for cocktails.

Conclusions

  • Lemon juice is by far the most common ingredient in cocktails.
  • Gin, rum and whiskey are the most common spirits.
  • Cocktails are all about tried-and-tested combinations of ingredients: gin with vermouth; citrus with egg and sugar; rum with lime; whiskey with bitters…
  • Out of the spirits, tequila and vodka are considered the least “classic” as ingredients.
Avatar
Data Scientist