Recipe and Ratings
Introduction and Question Identification
This dataset includes recipes and ratings from the website food.com. The question at the center of this analysis is: What recipes tend to be lower in calories? This question is important because understanding what types of recipes tend to be lower in calories can help individuals identify healthier food choices.
The original raw datasets include one for recipes, RAW_recipes.csv, and one for ratings, RAW_interactions.csv. RAW_recipes.csv has 83782 rows of data, and RAW_interactions.csv has 731927 rows of data. The relevant columns for this analysis are: nutrition, n_steps, n_ingredients, rating, and review. nutrition is the nutrition information in the form [calories (#), total fat (PDV), sugar (PDV), sodium (PDV), protein (PDV), saturated fat (PDV), carbohydrates (PDV)]; PDV stands for “percentage of daily value”. n_steps is the number of steps in the recipe. n_ingredients is the number of ingredients in the recipe. rating is the rating given. review is the review text.
Cleaning and EDA
Data Cleaning
The first step was to left merge the two datasets. This was done to explore potential relationships between the number of calories in a recipe and the ratings the recipe received. The next step was to replace all ratings of 0 with np.nan. Zeroes in the dataset indicate that the rating is missing, not that the user gave a rating of 0, since the lowest rating a user can give is 1; np.nan is therefore a better representation of the value. The next step was to add the average rating per recipe back to RAW_recipes. This collapses the ratings of a given recipe into a single value, so that each recipe has exactly one row: RAW_recipes has one row for every recipe, whereas merged has one row for every pair of recipe and rating.
The next step was splitting the nutrition column into separate columns, one per nutrient. This was done so that the relationships between the number of calories and the individual nutrients can be analyzed. The average ratings were also rounded, so that the rating could be represented as a categorical variable for each recipe, just as it appears in the original RAW_interactions.csv dataset.
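The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the two small dataframes are invented stand-ins for the raw files, the join keys id and recipe_id are assumed, and the nutrition column is assumed to be stored as a string that looks like a Python list.

```python
import ast

import numpy as np
import pandas as pd

# Toy stand-ins for the two raw files (all values invented for illustration).
recipes = pd.DataFrame({
    "id": [1, 2],
    "name": ["toy brownies", "toy salad"],
    "nutrition": ["[138.4, 10.0, 50.0, 3.0, 3.0, 19.0, 6.0]",
                  "[95.0, 4.0, 8.0, 12.0, 6.0, 2.0, 3.0]"],
})
interactions = pd.DataFrame({
    "recipe_id": [1, 1, 1, 2],
    "rating": [4, 5, 5, 0],  # 0 stands for "no rating given"
})

# 1. Left merge: one row per (recipe, rating) pair, keeping every recipe.
merged = recipes.merge(interactions, how="left",
                       left_on="id", right_on="recipe_id")

# 2. A rating of 0 is a placeholder for a missing value.
merged["rating"] = merged["rating"].replace(0, np.nan)

# 3. Average rating per recipe, attached back to the recipe-level table.
avg_rating = merged.groupby("id")["rating"].mean()
cleaned = recipes.assign(rating=recipes["id"].map(avg_rating))

# 4. Split the nutrition string into one numeric column per nutrient.
nutrition_cols = ["calories (#)", "total fat (PDV)", "sugar (PDV)",
                  "sodium (PDV)", "protein (PDV)", "saturated fat (PDV)",
                  "carbohydrates (PDV)"]
parsed = cleaned["nutrition"].apply(ast.literal_eval)
nutrients = pd.DataFrame(parsed.tolist(), columns=nutrition_cols,
                         index=cleaned.index)
cleaned = pd.concat([cleaned.drop(columns="nutrition"), nutrients], axis=1)

# 5. Round the average so it can be treated as a categorical rating.
cleaned["rating_rounded"] = cleaned["rating"].round()
```

In this sketch, merged keeps one row per recipe and rating pair, while cleaned has one row per recipe, matching the distinction drawn in the text.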
Below are the first few rows of the dataframe. Note that the tags, steps, description, ingredients, and review columns have been omitted because they were not used in the analysis and contain text that makes the table difficult to read.
| name | id | minutes | contributor_id | submitted | n_steps | n_ingredients | rating | calories (#) | total fat (PDV) | sugar (PDV) | sodium (PDV) | protein (PDV) | saturated fat (PDV) | carbohydrates (PDV) | rating_rounded |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 brownies in the world best ever | 333281 | 40 | 985201 | 2008-10-27 | 10 | 9 | 4.0 | 138.4 | 10.0 | 50.0 | 3.0 | 3.0 | 19.0 | 6.0 | 4.0 |
| 1 in canada chocolate chip cookies | 453467 | 45 | 1848091 | 2011-04-11 | 12 | 11 | 5.0 | 595.1 | 46.0 | 211.0 | 22.0 | 13.0 | 51.0 | 26.0 | 5.0 |
| 412 broccoli casserole | 306168 | 40 | 50969 | 2008-05-30 | 6 | 9 | 5.0 | 194.8 | 20.0 | 6.0 | 32.0 | 22.0 | 36.0 | 3.0 | 5.0 |
| millionaire pound cake | 286009 | 120 | 461724 | 2008-02-12 | 7 | 7 | 5.0 | 878.3 | 63.0 | 326.0 | 13.0 | 20.0 | 123.0 | 39.0 | 5.0 |
| 2000 meatloaf | 475785 | 90 | 2202916 | 2012-03-06 | 17 | 13 | 5.0 | 267.0 | 30.0 | 12.0 | 12.0 | 29.0 | 48.0 | 2.0 | 5.0 |
Univariate Analysis
This is the distribution of calories (#), restricted to values up to the 99th percentile. The distribution is heavily skewed right. Only data up to the 99th percentile is shown to improve readability, and the red line indicates the mean of the distribution.
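One way to apply such a percentile cutoff and check the skew numerically is sketched below. The calorie values are synthetic (drawn from a log-normal distribution purely for illustration), and the variable names are hypothetical rather than taken from the original analysis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed stand-in for the calories (#) column.
calories = pd.Series(rng.lognormal(mean=5.7, sigma=0.8, size=10_000),
                     name="calories (#)")

# Keep only values at or below the 99th percentile.
cutoff = calories.quantile(0.99)
trimmed = calories[calories <= cutoff]

# In a right-skewed distribution the mean sits above the median.
mean_cal = trimmed.mean()
median_cal = trimmed.median()
```

The mean exceeding the median even after trimming is consistent with a long right tail, which is why the median is preferred as a measure of center later in the analysis.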
Bivariate Analysis
This is a scatter plot of the number of calories versus the total fat, restricted to calorie values up to the 99th percentile for readability. It shows a generally linear, positive trend between the two variables, indicating that higher calories are correlated with higher fat.
Interesting Aggregates
| rating_rounded | median calories |
|---|---|
| 1.0 | 275.9 |
| 2.0 | 302.4 |
| 3.0 | 307.7 |
| 4.0 | 310.0 |
| 5.0 | 302.7 |
The table above shows the median calorie count for every rating. One thing worth noting is that the median calorie count is noticeably lower for recipes rated one star (275.9) than for the other ratings (302.4 to 310.0). This finding is later used to set up a hypothesis test.
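An aggregate like this can be computed with a groupby on the rounded rating. The sketch below uses a small invented dataframe; only the column names rating_rounded and calories (#) come from the table shown earlier.

```python
import pandas as pd

# Toy data (values invented); the last recipe has no rating.
toy = pd.DataFrame({
    "rating_rounded": [1.0, 1.0, 5.0, 5.0, 5.0, None],
    "calories (#)": [200.0, 300.0, 250.0, 310.0, 400.0, 150.0],
})

# Median calories per rating; rows with a missing rating are
# dropped from the grouping by default.
median_calories = (
    toy.groupby("rating_rounded")["calories (#)"]
       .median()
       .rename("median calories")
)
```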
Assessment of Missingness
NMAR Analysis
When the review column is missing a value, it is likely because the user did not have any comments on the recipe. This suggests that review is NMAR (not missing at random), as the missingness of review depends on the value itself. review could not be MD (missing by design), because no other column can be used to recover whether or not the user had additional comments. Additional data that could explain the missingness is a satisfied column with the values extremely satisfied, neutral, and extremely unsatisfied. review would then be MAR (missing at random) conditional on this column, since reviews are more likely to be missing when satisfied is neutral: such users are less likely to have additional comments.
Missingness Dependency
The column for which missingness was analyzed is the rating column.
The question asked was whether or not the missingness of rating depends on n_steps. Below is a plot of the distribution of n_steps when rating is missing and when rating is not missing.
The K-S statistic was used to calculate a p-value, which was 0. Since the p-value is lower than the significance level of 0.05, we reject the hypothesis that the missingness of rating is independent of n_steps; it is likely that the missingness of rating depends on n_steps.
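A test of this kind can be sketched as a permutation test that uses the two-sample K-S statistic to compare the distribution of n_steps when rating is missing against when it is not. The version below is an assumption about the general approach, not the original code: the data is synthetic, the function names are hypothetical, and the original analysis may instead have taken the p-value directly from a library K-S test.

```python
import numpy as np

def ks_statistic(a, b):
    """Largest vertical gap between the two empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

def missingness_permutation_test(values, is_missing,
                                 n_permutations=1000, seed=0):
    """Shuffle the missingness labels and recompute the K-S statistic."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    is_missing = np.asarray(is_missing, dtype=bool)
    observed = ks_statistic(values[is_missing], values[~is_missing])
    simulated = np.empty(n_permutations)
    for i in range(n_permutations):
        shuffled = rng.permutation(is_missing)
        simulated[i] = ks_statistic(values[shuffled], values[~shuffled])
    # Share of shuffles at least as extreme as the observed statistic.
    return observed, np.mean(simulated >= observed)

# Synthetic example: ratings are missing more often for long recipes.
rng = np.random.default_rng(1)
n_steps = rng.poisson(10, size=500)
rating_missing = rng.random(500) < np.where(n_steps > 10, 0.6, 0.1)
observed, p_value = missingness_permutation_test(n_steps, rating_missing)
```

The K-S statistic is a reasonable choice here because it responds to any difference in the shape of the two distributions, not only to a difference in their means.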
Hypothesis Testing
The null hypothesis was: rating and calories are not related; 1-star recipes have a lower median calorie count only by chance. The alternative hypothesis was: rating and calories are related; 1-star recipes have a lower median calorie count than chance alone would produce. The median of calories (#) was used as the test statistic because the distribution of calories (#) is heavily skewed, which makes the median a better measure of center than the mean. A significance level of 0.05 was chosen. This test helps determine whether certain ratings tend to be lower in calories, which directly addresses the overall question. Below is the empirical distribution of the median calories, along with the observed median calories, shown in red.
The resulting p-value was 0.006. At the significance level of 0.05, we reject the null hypothesis. This suggests that recipes rated 1 star tend to be lower in calories than other recipes, beyond what chance alone would explain.
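A minimal sketch of such a test is given below, assuming it was run as a permutation test in which the 1-star labels are shuffled across recipes and the median calorie count of the labelled group is recomputed each time. The data is synthetic with a deliberately built-in effect, and the function name and parameters are hypothetical.

```python
import numpy as np

def median_permutation_test(calories, is_one_star,
                            n_permutations=2000, seed=0):
    """One-sided test: is the 1-star median lower than chance suggests?"""
    rng = np.random.default_rng(seed)
    calories = np.asarray(calories, dtype=float)
    is_one_star = np.asarray(is_one_star, dtype=bool)
    observed = np.median(calories[is_one_star])
    simulated = np.empty(n_permutations)
    for i in range(n_permutations):
        # Shuffling the labels breaks any link between rating and calories.
        shuffled = rng.permutation(is_one_star)
        simulated[i] = np.median(calories[shuffled])
    # Share of shuffles with a median at least as low as the observed one.
    p_value = np.mean(simulated <= observed)
    return observed, simulated, p_value

# Synthetic example: the first 200 recipes are "1-star" and are
# deliberately scaled down so that an effect exists.
rng = np.random.default_rng(2)
calories = rng.lognormal(mean=5.7, sigma=0.8, size=2000)
is_one_star = np.zeros(2000, dtype=bool)
is_one_star[:200] = True
calories[:200] *= 0.5
observed, simulated, p_value = median_permutation_test(calories, is_one_star)
```

The array simulated plays the role of the empirical distribution described above, and observed corresponds to the red line.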