Preface

This case study is the capstone project for the Google Data Analytics Professional Certificate, which consists of 8 courses. Participants can learn key analytical skills (data cleaning, analysis, & visualization) and tools (spreadsheets, SQL, R programming, Tableau), and gain an immersive understanding of the practices and processes of a data analyst.

Table of Contents

  1. Introduction
    1.1 Company background
    1.2 Business task
    1.3 Key stakeholders
  2. Data Preparation
    2.1 Data Source
    2.2 Data limitations
  3. Data processing
    3.1 Files upload
    3.2 Data transformation using SQL
  4. Data analysis and visualization
    4.1 User engagement
    4.2 Calories
    4.3 Heart rate
    4.4 Time dimension
    4.5 Sleeping time
  5. Key findings
  6. Conclusion and recommendation


1. Introduction

1.1 Company background

Bellabeat, founded by Urška Sršen and Sando Mur, is a high-tech company that manufactures health-focused smart products for women. Collecting data on activity, sleep, stress, and reproductive health has allowed the company to empower women with knowledge about their own health and habits.

Bellabeat is a successful small company, and have the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help inform the company marketing strategy and unlock new growth opportunities.

1.2. Business task

The main objective of the case study is to identify any possible trends in the fitness data from non-Bellabeat smart devices in order to gain insight into consumer behaviour.

Recommendations will be provided for Bellabeat’s marketing strategy on how the insights could be applied to their products and hence help reveal opportunities for the company’s growth.

1.3 Key stakeholders

  • Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer

  • Sando Mur: Mathematician, Bellabeat’s cofounder and key member of the Bellabeat executive team

  • Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy.

2. Data preparation

2.1 Data Source

  • The FitBit Fitness Tracker Data(CC0: Public Domain, dataset made available through Mobius) on Kaggle is used.

  • The dataset is under CC0 licence, which means that the person who associated a work with this deed has dedicated the work to the public domain by waiving all of his or her rights to the work worldwide under copyright law, including all related and neighbouring rights, to the extent by law. The work can be copied, modified, distributed and performed, even for commercial purposes, all without asking permission.

  • The data originated from the Amazon Mechanical Turk who collected the data between 3rd December 2016 to 5th December 2016.

  • The dataset contains personal fitness trackers from thirty three fitbit users. These eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.

  • The dataset consists of total 18 csv files, in which 15 are long format and 3 are wide format.

2.2 Data limitations

  • Lack of demographics: Bellebeat’s products are focusing on women health while data such as gender and age are missing in the dataset.

  • Limited time frame: Data of 31 days may not be enough for providing solid high-level recommendations. There could be seasonal variations which affect the users’ physical activity such as the exercise choices.

  • Small sample size: Data of 33 users may be insufficient for providing an objective big picture of the overall situation.

  • The data was collected in year 2016 which is not up to date and may not reflect the current situation of users.

3. Data processing

3.1 Files upload

  • By analyzing the dailyActivity_merged.csv using pivot table in Microsoft Excel, it is observed that the file contains the same data as the dailyCalories_merged.csv, dailyIntensities_merged.csv, dailySteps_merged.csv, thus the latter 3 files will not be used.

  • minuteCaloriesWide_merged.csv, minuteIntensitiesWide_merged.csv, minuteStepsWide_merged.csv also have the same data as that in the minuteCaloriesNarrow_merged.csv, minuteIntensitiesNarrow_merged.csv, minuteStepsNarrow_merged.csv. The former 3 files in wide format will not be used.

  • By analyzing the weightLogInfo_merged.csv using pivot table in Microsoft Excel, the result in the following table shows that it contains only 67 observations with 8 different users. The numbers of record by each user are also too few to have significant representation for analysis.

    User Id

    Count of Id

    1503960366

    2

    1927972279

    1

    2873212765

    2

    4319703577

    2

    4558609924

    5

    5577150313

    1

    6962181067

    30

    8877689391

    24

    Grand Total

    67

    Out of the 67 records, 41 of them (equal to 61.2%) were input manually and it may be subject to errors. For all these reasons this file will not be used.

  • As the files contain more than 1 million rows that they are too large to be handled by Microsoft Excel, thus SQL queries on the data warehouse platform Google BigQuery is used.

  • When uploading to Google BigQuery platform, the following error message appeared for some files: Failed to create table: Error while reading data, error message: Could not parse ‘4/12/2016 12:00:00 AM’ as TIMESTAMP for field SleepDay (position 1) starting at location 65 with message ‘Invalid time zone: AM’

  • In order to upload successfully, the Schema was manually input with the data type of the error column using STRING instead of TIMESTAMP.

3.2 Data transformation using SQL

  • The tables related to hourly activity are merged and the result is saved into a new table hourlyActivity_merged:
CREATE TABLE
  `capstone-project-22.Bellabeat_data.hourlyActivity_merged` AS
SELECT
  h1.Id,
  CAST(FORMAT_TIMESTAMP( '%Y-%m-%d %H:%M:%S',
    PARSE_TIMESTAMP('%m/%d/%Y %I:%M:%S %p', h1.ActivityHour)) AS TIMESTAMP)
    AS ActivityHour,
  h1.StepTotal,
  h2.Calories,
  h3.AverageIntensity,
  h3.TotalIntensity
FROM
  `capstone-project-22.Bellabeat_data.hourlySteps_merged` AS h1
FULL JOIN
  `capstone-project-22.Bellabeat_data.hourlyCalories_merged` AS h2
ON
  h1.Id = h2.Id
  AND h1.ActivityHour = h2.ActivityHour
FULL JOIN
  `capstone-project-22.Bellabeat_data.hourlyIntensities_merged` AS h3
ON
  h1.Id = h3.Id
  AND h1.ActivityHour = h3.ActivityHour
ORDER BY
  Id,
  ActivityHour
  • The tables related to minute activity are merged and the result is saved into a new table minuteActivity_merged:
CREATE TABLE
  `capstone-project-22.Bellabeat_data.minuteActivity_merged` AS
SELECT
  COALESCE(t1.Id,t2.Id,t3.Id,t4.Id,t5.Id) AS Id,
  CAST(FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S',
    PARSE_TIMESTAMP('%m/%d/%Y %I:%M:%S %p',
      COALESCE(t1.ActivityMinute, t2.ActivityMinute, t3.ActivityMinute, t4.date, t5.ActivityMinute))) AS TIMESTAMP)
    AS ActivityMinute,
  t1.Calories,
  t2.Intensity,
  t3.METs,
  t4.logId AS sleep_Id,
  t4.value AS sleep_value,
  t5.Steps
FROM
  `capstone-project-22.Bellabeat_data.minuteCaloriesNarrow_merged` AS t1
FULL JOIN
  `capstone-project-22.Bellabeat_data.minuteIntensitiesNarrow_merged` AS t2
ON
  t1.Id = t2.Id
  AND t1.ActivityMinute = t2.ActivityMinute
FULL JOIN
  `capstone-project-22.Bellabeat_data.minuteMETsNarrow_merged` AS t3
ON
  t1.Id = t3.Id
  AND t1.ActivityMinute = t3.ActivityMinute
FULL JOIN
  `capstone-project-22.Bellabeat_data.minuteSleep_merged` AS t4
ON
  t1.Id = t4.Id
  AND t1.ActivityMinute = t4.date
FULL JOIN
  `capstone-project-22.Bellabeat_data.minuteStepsNarrow_merged` AS t5
ON
  t1.Id = t5.Id
  AND t1.ActivityMinute = t5.ActivityMinute
ORDER BY
  Id,
  ActivityMinute
  • The data type of the columns in the following tables (which were saved as STRING instead of TIMESTAMP in the previous step) is corrected and saved into new tables with revised_ added into the beginning of the table names.

    • heartrate_seconds_merged:
    CREATE TABLE
      `capstone-project-22.Bellabeat_data.revised_heartrate_seconds_merged` AS
    SELECT
      Id,
      CAST(FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S',
        PARSE_TIMESTAMP('%m/%d/%Y %I:%M:%S %p', Time)) AS TIMESTAMP)
      AS time,
      Value
    FROM
      `capstone-project-22.Bellabeat_data.heartrate_seconds_merged`
    ORDER BY
      Id,
      time
    • sleepDay_merged:
    CREATE TABLE
      `capstone-project-22.Bellabeat_data.revised_sleepDay_merged` AS
    SELECT
      Id,
      CAST(FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S',
        PARSE_TIMESTAMP('%m/%d/%Y %I:%M:%S %p', SleepDay)) AS TIMESTAMP)
      AS SleepDay,
      TotalSleepRecords,
      TotalMinutesAsleep,
      TotalTimeInBed
    FROM
      `capstone-project-22.Bellabeat_data.sleepDay_merged`
    ORDER BY
      Id,
      SleepDay
  • As a result, the following 5 tables will be used thereafter:

    • dailyActivity_merged
    • hourlyActivity_merged
    • minuteActivity_merged
    • revised_heartrate_seconds_merged
    • revised_sleepDay_merged

4. Data analysis and visualization

The following section will examine data based on different metrics, which are related to calories burned, heart rate, time dimension, and sleeping time.
SQL using Google BigQuery for queries and Tableau for visualization are used as the analytics platforms.

4.1 User engagement

  • Before inspecting different metrics, let’s take a look into the number of users engaged during the period of data collection.
  • It is observed that the number of user dropped from initially 33 to 21 on the last day.

4.2 Calories

4.2.1 Calories by daily total distance

  • The calories burned has a medium correlation with the distance travelled.

4.2.2 Calories by daily total step

  • Similar to the distance travelled, the calories burned also don’t have a strong correlation with the number of steps.

4.2.3 Calories by hourly intensity

  • While for the intensity, it has a relatively strong correlation with the calories burned, with the R-squared value 77.5%.

4.2.4 Calories burned per minute by different Metabolic equivalent of task (MET) values

  • MET stands for Metabolic Equivalent of Task. According to UK Chief Medical Officers’ Physical Activity Guidelines, it is the objective measure of the ratio of the rate at which a person expends energy, relative to the mass of that person, while performing some specific physical activity compared to the energy expended whilst sedentary. In short, it is a unit to measure the energy consumed of an activity when compared to being at rest.

  • As suggested by Harvard T.H. Chan School of Public Health, physical activities could be rated using METs to indicate their intensity:

    • Sedentary: METs <= 1.5. Examples are sitting, reclining, or lying down.
    • Light intensity: 1.5 < METs < 3. Examples are walking at a leisurely pace or standing in line at the store.
    • Moderate intensity: 3 <= METs < 6. Examples are walking briskly, vacuuming, or raking leaves.
    • Vigorous intensity: METs >= 6. Examples are walking very quickly, running, taking an aerobics class, or shovelling snow.
  • There is an even stronger correlation of calories burned with the METs of R-squared value 85.6%.

4.3 Heart rate

4.3.1 Heart rate of different users

  • Heart rate of different users
SELECT
  AVG(max) AS avg_max,
  AVG(min) AS avg_min
FROM (
  SELECT
    MAX(Value) AS max,
    MIN(Value) AS min
  FROM
    `capstone-project-22.Bellabeat_data.revised_heartrate_seconds_merged`
  GROUP BY
    Id)
  • According to American Heart Association, for a normal person the resting heart rate is between 60 bpm and 100 bpm, but it can be affected by factors such as stress, anxiety, hormones, medication, and the activeness of a person. For example, an athlete can have a 40 bpm resting heart rate.
  • And normally the maximum heart rate is about 220 minus age. The target heart rate zones of different age groups are shown in the following table:
Age Target Heart Rate Zone 50-85% Average Maximum Heart Rate, 100%
20 years 100-170 beats per minute (bpm) 200 bpm
30 years 95-162 bpm 190 bpm
35 years 93-157 bpm 185 bpm
40 years 90-153 bpm 180 bpm
45 years 88-149 bpm 175 bpm
50 years 85-145 bpm 170 bpm
55 years 83-140 bpm 165 bpm
60 years 80-136 bpm 160 bpm
65 years 78-132 bpm 155 bpm
70 years 75-128 bpm 150 bpm

Source: American Heart Association

  • As observed the heart rates of different users are ranging from 36 to 203 beats per minute (bpm).
  • Due to the lack of user background information such as gender, age, occupation e.t.c. it could not be determined whether their heart rates fall at normal or danger zone.

4.3.2 Average heart rates of steps per minute

SELECT
  t1.Id,
  t1.Steps AS step,
  AVG(t2.value) AS heartrate
FROM
  `capstone-project-22.Bellabeat_data.minuteActivity_merged` AS t1
LEFT JOIN
  `capstone-project-22.Bellabeat_data.revised_heartrate_seconds_merged` AS t2
ON
  t1.Id = t2.Id
  AND FORMAT_TIMESTAMP('%Y-%m-%d %H:%M', t1.ActivityMinute) = FORMAT_TIMESTAMP('%Y-%m-%d %H:%M', t2.time)
WHERE
  t1.Steps IS NOT NULL
GROUP BY
  1,
  2
HAVING
  AVG(t2.value) IS NOT NULL
ORDER BY
  2
  • As the number of steps per minute goes up, the heart rate rises as well.

4.3.3 Average heart rates of steps per minute by different users

  • The range of average heart rates for some users are relatively large, with the 5th user (counting from left) being the largest.
  • As mentioned, factors such as user lifestyle, occupation, have to be taken into account to see if the heart rates are within normal level or not.

4.3.4 Calories burned by average heart rate (per minute)

SELECT
  t1.Id,
  AVG(t1.value) AS heartrate,
  t2.Calories
FROM
  `capstone-project-22.Bellabeat_data.revised_heartrate_seconds_merged` AS t1
LEFT JOIN
  `capstone-project-22.Bellabeat_data.minuteActivity_merged` AS t2
ON
  t1.Id = t2.Id
  AND FORMAT_TIMESTAMP('%Y-%m-%d %H:%M', t1.Time) = FORMAT_TIMESTAMP('%Y-%m-%d %H:%M', t2.ActivityMinute)
WHERE
  t2.Calories IS NOT NULL
GROUP BY
  1,
  3
ORDER BY
  3
  • It is not surprising to see that if users want to burn more calories, they need to exercise to achieve a higher heart rate.
  • From the chart, generally if the users need to burn 10 calories in a minute, on average they need to perform exercises which can keep the heart rate at around 130 for 1 minute.

4.4 Time dimension

4.4.1 Average active distance by day of week

  • There is not much difference between the average active distances across the week, ranging from around 5km to 5.8km.
  • Saturday is the most active day, whereas Sunday is the day with the least activeness.
  • But it is noted that the longest very active distances happen on Tuesday and Wednesday.

4.4.2 Average active minutes by day of week

  • Much the same as the average active distance, Saturday has got the longest length of active minute and Sunday the shortest.
  • Unlike day of week of the longest very active distances, Monday and Tuesday got the highest very active minutes.
  • The average total of all the days is 227.6 mins, corresponding to 3.8 hours.

4.4.3 Average intensity by hour

SELECT
  FORMAT_TIMESTAMP('%I %p', ActivityHour) AS hour,
  AVG(TotalIntensity) as average_intensity
FROM
  `capstone-project-22.Bellabeat_data.hourlyActivity_merged`
GROUP BY
  hour
  • When looking into the intensity of the users by different time in a day, there is an increasing trend starting from 4am, and arriving at peak level at 6pm to 7pm.

4.5 Sleeping time

4.5.1 Average minute of asleep and time in bed by day

SELECT
  DATE(SleepDay) AS date,
  AVG(TotalMinutesAsleep) AS average_asleep_min,
  AVG(TotalTimeInBed) AS average_bed_min,
  AVG(TotalMinutesAsleep)/AVG(TotalTimeInBed)*100 AS asleep_bed_percentage
FROM
  `capstone-project-22.Bellabeat_data.revised_sleepDay_merged`
GROUP BY
  date
ORDER BY
  date
SELECT
  AVG(asleep_bed_percentage) as avg_asleep_bed_percentage
FROM (
  SELECT
    DATE(SleepDay) AS date,
    AVG(TotalMinutesAsleep) AS average_asleep_min,
    AVG(TotalTimeInBed) AS average_bed_min,
    AVG(TotalMinutesAsleep)/AVG(TotalTimeInBed)*100 AS asleep_bed_percentage
  FROM
    `capstone-project-22.Bellabeat_data.revised_sleepDay_merged`
  GROUP BY
    date
  ORDER BY
    date)
  • The overall asleep time vs time in bed has a ratio of 91.6% and it is considered as adequate for being a healthy sleeper which requires 90% or above. *While for the recommended hours of sleep per day which is 7 hours for adult, on average there are only 16 days, or 51.6% of all days, meet the suggested time.

4.5.2 Average minute of asleep and time in bed by day of week

SELECT
  FORMAT_DATE('%A', SleepDay) AS day_of_week,
  AVG(TotalMinutesAsleep) AS avg_asleep_min,
  AVG(TotalTimeInBed) AS avg_bed_min
FROM
  `capstone-project-22.Bellabeat_data.revised_sleepDay_merged`
GROUP BY
  1
ORDER BY
  (
    CASE
      WHEN day_of_week="Monday" THEN 1
      WHEN day_of_week="Tuesday" THEN 2
      WHEN day_of_week="Wednesday" THEN 3
      WHEN day_of_week="Thursday" THEN 4
      WHEN day_of_week="Friday" THEN 5
      WHEN day_of_week="Saturday" THEN 6
    ELSE
    7
  END
    )
  • The asleep time is generally above 90% of the time in bed on the corresponding day of week.
  • Users get the longest average time of sleep on Sundays.

4.5.3 Total minute of sleep by hour

SELECT
  FORMAT_TIMESTAMP('%I %p', ActivityMinute) AS hour,
  COUNT(sleep_Id) AS total_sleep_min,
FROM
  `capstone-project-22.Bellabeat_data.minuteActivity_merged`
GROUP BY
  hour
  • Users stay awake mainly between 8 am and 10 pm, and are mostly sleeping between 11pm to 7am.

5. Key findings

6. Conclusion and recommendation



Thank you very much for your time!