This case study is the capstone project for the Google Data Analytics
Professional Certificate, which consists of 8 courses. Participants can
learn key analytical skills (data cleaning, analysis, &
visualization) and tools (spreadsheets, SQL, R programming, Tableau),
and gain an immersive understanding of the practices and processes of a
data analyst.
Bellabeat, founded by Urška Sršen and Sando Mur, is a high-tech company that manufactures health-focused smart products for women. Collecting data on activity, sleep, stress, and reproductive health has allowed the company to empower women with knowledge about their own health and habits.
Bellabeat is a successful small company, and have the potential to
become a larger player in the global smart device market. Urška Sršen,
cofounder and Chief Creative Officer of Bellabeat, believes that
analyzing smart device fitness data could help inform the company
marketing strategy and unlock new growth opportunities.
The main objective of the case study is to identify any possible trends in the fitness data from non-Bellabeat smart devices in order to gain insight into consumer behaviour.
Recommendations will be provided for Bellabeat’s marketing strategy
on how the insights could be applied to their products and hence help
reveal opportunities for the company’s growth.
Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
Sando Mur: Mathematician, Bellabeat’s cofounder and key member of the Bellabeat executive team
Bellabeat marketing analytics team: A team of
data analysts responsible for collecting, analyzing, and reporting data
that helps guide Bellabeat’s marketing strategy.
The FitBit Fitness Tracker Data(CC0: Public Domain, dataset made available through Mobius) on Kaggle is used.
The dataset is under CC0 licence, which means that the person who associated a work with this deed has dedicated the work to the public domain by waiving all of his or her rights to the work worldwide under copyright law, including all related and neighbouring rights, to the extent by law. The work can be copied, modified, distributed and performed, even for commercial purposes, all without asking permission.
The data originated from the Amazon Mechanical Turk who collected the data between 3rd December 2016 to 5th December 2016.
The dataset contains personal fitness trackers from thirty three fitbit users. These eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.
The dataset consists of total 18 csv files, in which 15 are long
format and 3 are wide format.
Lack of demographics: Bellebeat’s products are focusing on women health while data such as gender and age are missing in the dataset.
Limited time frame: Data of 31 days may not be enough for providing solid high-level recommendations. There could be seasonal variations which affect the users’ physical activity such as the exercise choices.
Small sample size: Data of 33 users may be insufficient for providing an objective big picture of the overall situation.
The data was collected in year 2016 which is not up to date and
may not reflect the current situation of users.
By analyzing the dailyActivity_merged.csv using pivot table in Microsoft Excel, it is observed that the file contains the same data as the dailyCalories_merged.csv, dailyIntensities_merged.csv, dailySteps_merged.csv, thus the latter 3 files will not be used.
minuteCaloriesWide_merged.csv, minuteIntensitiesWide_merged.csv, minuteStepsWide_merged.csv also have the same data as that in the minuteCaloriesNarrow_merged.csv, minuteIntensitiesNarrow_merged.csv, minuteStepsNarrow_merged.csv. The former 3 files in wide format will not be used.
By analyzing the weightLogInfo_merged.csv using
pivot table in Microsoft Excel, the result in the following table shows
that it contains only 67 observations with 8 different users. The
numbers of record by each user are also too few to have significant
representation for analysis.
User Id |
Count of Id |
---|---|
1503960366 |
2 |
1927972279 |
1 |
2873212765 |
2 |
4319703577 |
2 |
4558609924 |
5 |
5577150313 |
1 |
6962181067 |
30 |
8877689391 |
24 |
Grand Total |
67 |
Out of the 67 records, 41 of them (equal to 61.2%) were input manually and it may be subject to errors. For all these reasons this file will not be used.
As the files contain more than 1 million rows that they are too large to be handled by Microsoft Excel, thus SQL queries on the data warehouse platform Google BigQuery is used.
When uploading to Google BigQuery platform, the following error message appeared for some files: Failed to create table: Error while reading data, error message: Could not parse ‘4/12/2016 12:00:00 AM’ as TIMESTAMP for field SleepDay (position 1) starting at location 65 with message ‘Invalid time zone: AM’
In order to upload successfully, the Schema was manually input
with the data type of the error column using STRING instead of
TIMESTAMP.
CREATE TABLE
`capstone-project-22.Bellabeat_data.hourlyActivity_merged` AS
SELECT
h1.Id,
CAST(FORMAT_TIMESTAMP( '%Y-%m-%d %H:%M:%S',
PARSE_TIMESTAMP('%m/%d/%Y %I:%M:%S %p', h1.ActivityHour)) AS TIMESTAMP)
AS ActivityHour,
h1.StepTotal,
h2.Calories,
h3.AverageIntensity,
h3.TotalIntensity
FROM
`capstone-project-22.Bellabeat_data.hourlySteps_merged` AS h1
FULL JOIN
`capstone-project-22.Bellabeat_data.hourlyCalories_merged` AS h2
ON
h1.Id = h2.Id
AND h1.ActivityHour = h2.ActivityHour
FULL JOIN
`capstone-project-22.Bellabeat_data.hourlyIntensities_merged` AS h3
ON
h1.Id = h3.Id
AND h1.ActivityHour = h3.ActivityHour
ORDER BY
Id,
ActivityHour
CREATE TABLE
`capstone-project-22.Bellabeat_data.minuteActivity_merged` AS
SELECT
COALESCE(t1.Id,t2.Id,t3.Id,t4.Id,t5.Id) AS Id,
CAST(FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S',
PARSE_TIMESTAMP('%m/%d/%Y %I:%M:%S %p',
COALESCE(t1.ActivityMinute, t2.ActivityMinute, t3.ActivityMinute, t4.date, t5.ActivityMinute))) AS TIMESTAMP)
AS ActivityMinute,
t1.Calories,
t2.Intensity,
t3.METs,
t4.logId AS sleep_Id,
t4.value AS sleep_value,
t5.Steps
FROM
`capstone-project-22.Bellabeat_data.minuteCaloriesNarrow_merged` AS t1
FULL JOIN
`capstone-project-22.Bellabeat_data.minuteIntensitiesNarrow_merged` AS t2
ON
t1.Id = t2.Id
AND t1.ActivityMinute = t2.ActivityMinute
FULL JOIN
`capstone-project-22.Bellabeat_data.minuteMETsNarrow_merged` AS t3
ON
t1.Id = t3.Id
AND t1.ActivityMinute = t3.ActivityMinute
FULL JOIN
`capstone-project-22.Bellabeat_data.minuteSleep_merged` AS t4
ON
t1.Id = t4.Id
AND t1.ActivityMinute = t4.date
FULL JOIN
`capstone-project-22.Bellabeat_data.minuteStepsNarrow_merged` AS t5
ON
t1.Id = t5.Id
AND t1.ActivityMinute = t5.ActivityMinute
ORDER BY
Id,
ActivityMinute
The data type of the columns in the following tables (which were saved as STRING instead of TIMESTAMP in the previous step) is corrected and saved into new tables with revised_ added into the beginning of the table names.
CREATE TABLE
`capstone-project-22.Bellabeat_data.revised_heartrate_seconds_merged` AS
SELECT
Id,
CAST(FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S',
PARSE_TIMESTAMP('%m/%d/%Y %I:%M:%S %p', Time)) AS TIMESTAMP)
AS time,
Value
FROM
`capstone-project-22.Bellabeat_data.heartrate_seconds_merged`
ORDER BY
Id,
time
CREATE TABLE
`capstone-project-22.Bellabeat_data.revised_sleepDay_merged` AS
SELECT
Id,
CAST(FORMAT_TIMESTAMP('%Y-%m-%d %H:%M:%S',
PARSE_TIMESTAMP('%m/%d/%Y %I:%M:%S %p', SleepDay)) AS TIMESTAMP)
AS SleepDay,
TotalSleepRecords,
TotalMinutesAsleep,
TotalTimeInBed
FROM
`capstone-project-22.Bellabeat_data.sleepDay_merged`
ORDER BY
Id,
SleepDay
As a result, the following 5 tables will be used thereafter:
The following section will examine data based on different metrics,
which are related to calories burned, heart rate, time dimension, and
sleeping time.
SQL using Google BigQuery for queries and Tableau for visualization are
used as the analytics platforms.
MET stands for Metabolic Equivalent of Task. According to UK Chief Medical Officers’ Physical Activity Guidelines, it is the objective measure of the ratio of the rate at which a person expends energy, relative to the mass of that person, while performing some specific physical activity compared to the energy expended whilst sedentary. In short, it is a unit to measure the energy consumed of an activity when compared to being at rest.
As suggested by Harvard T.H. Chan School of Public Health, physical activities could be rated using METs to indicate their intensity:
There is an even stronger correlation of calories burned with the
METs of R-squared value 85.6%.
SELECT
AVG(max) AS avg_max,
AVG(min) AS avg_min
FROM (
SELECT
MAX(Value) AS max,
MIN(Value) AS min
FROM
`capstone-project-22.Bellabeat_data.revised_heartrate_seconds_merged`
GROUP BY
Id)
Age | Target Heart Rate Zone 50-85% | Average Maximum Heart Rate, 100% |
---|---|---|
20 years | 100-170 beats per minute (bpm) | 200 bpm |
30 years | 95-162 bpm | 190 bpm |
35 years | 93-157 bpm | 185 bpm |
40 years | 90-153 bpm | 180 bpm |
45 years | 88-149 bpm | 175 bpm |
50 years | 85-145 bpm | 170 bpm |
55 years | 83-140 bpm | 165 bpm |
60 years | 80-136 bpm | 160 bpm |
65 years | 78-132 bpm | 155 bpm |
70 years | 75-128 bpm | 150 bpm |
Source: American Heart Association
SELECT
t1.Id,
t1.Steps AS step,
AVG(t2.value) AS heartrate
FROM
`capstone-project-22.Bellabeat_data.minuteActivity_merged` AS t1
LEFT JOIN
`capstone-project-22.Bellabeat_data.revised_heartrate_seconds_merged` AS t2
ON
t1.Id = t2.Id
AND FORMAT_TIMESTAMP('%Y-%m-%d %H:%M', t1.ActivityMinute) = FORMAT_TIMESTAMP('%Y-%m-%d %H:%M', t2.time)
WHERE
t1.Steps IS NOT NULL
GROUP BY
1,
2
HAVING
AVG(t2.value) IS NOT NULL
ORDER BY
2
SELECT
t1.Id,
AVG(t1.value) AS heartrate,
t2.Calories
FROM
`capstone-project-22.Bellabeat_data.revised_heartrate_seconds_merged` AS t1
LEFT JOIN
`capstone-project-22.Bellabeat_data.minuteActivity_merged` AS t2
ON
t1.Id = t2.Id
AND FORMAT_TIMESTAMP('%Y-%m-%d %H:%M', t1.Time) = FORMAT_TIMESTAMP('%Y-%m-%d %H:%M', t2.ActivityMinute)
WHERE
t2.Calories IS NOT NULL
GROUP BY
1,
3
ORDER BY
3
SELECT
FORMAT_TIMESTAMP('%I %p', ActivityHour) AS hour,
AVG(TotalIntensity) as average_intensity
FROM
`capstone-project-22.Bellabeat_data.hourlyActivity_merged`
GROUP BY
hour
SELECT
DATE(SleepDay) AS date,
AVG(TotalMinutesAsleep) AS average_asleep_min,
AVG(TotalTimeInBed) AS average_bed_min,
AVG(TotalMinutesAsleep)/AVG(TotalTimeInBed)*100 AS asleep_bed_percentage
FROM
`capstone-project-22.Bellabeat_data.revised_sleepDay_merged`
GROUP BY
date
ORDER BY
date
SELECT
AVG(asleep_bed_percentage) as avg_asleep_bed_percentage
FROM (
SELECT
DATE(SleepDay) AS date,
AVG(TotalMinutesAsleep) AS average_asleep_min,
AVG(TotalTimeInBed) AS average_bed_min,
AVG(TotalMinutesAsleep)/AVG(TotalTimeInBed)*100 AS asleep_bed_percentage
FROM
`capstone-project-22.Bellabeat_data.revised_sleepDay_merged`
GROUP BY
date
ORDER BY
date)
SELECT
FORMAT_DATE('%A', SleepDay) AS day_of_week,
AVG(TotalMinutesAsleep) AS avg_asleep_min,
AVG(TotalTimeInBed) AS avg_bed_min
FROM
`capstone-project-22.Bellabeat_data.revised_sleepDay_merged`
GROUP BY
1
ORDER BY
(
CASE
WHEN day_of_week="Monday" THEN 1
WHEN day_of_week="Tuesday" THEN 2
WHEN day_of_week="Wednesday" THEN 3
WHEN day_of_week="Thursday" THEN 4
WHEN day_of_week="Friday" THEN 5
WHEN day_of_week="Saturday" THEN 6
ELSE
7
END
)
SELECT
FORMAT_TIMESTAMP('%I %p', ActivityMinute) AS hour,
COUNT(sleep_Id) AS total_sleep_min,
FROM
`capstone-project-22.Bellabeat_data.minuteActivity_merged`
GROUP BY
hour
User engagement dropped by 36% throughout the whole period.
The calories burned has strong correlations with both the MET level and exercise intensity.
Some users have a relatively larger range of heart rate, from below 40 bpm to over 200 bpm.
As the number of steps per minute goes up, the average heart rate rises as well.
Heart rate and the calories burned have positive correlation.
In terms of day of week, Saturday is the most active day with an average distance of 5.8km while Sunday is the least with 5km. This is the same for the average active time by day of week, on average 3.8 hours per day of week.
During a day, the average intensities from 7am to 10pm are above the total average.
On average, users meet the requirement as healthy sleepers, whereas only a little more than half of the total days reach the recommended hours of sleep.
Generally users sleep between 11pm to 7am.
Larger data sample size is necessary for more accurate analysis results. And more different types of data are also needed in order to have a broader understanding in the analysis result, this may include user background information such as gender, age, occupation, e.t.c.
More recently collected data can also increase the reliability to learn about the up-to-date situation.
Bellabeat can increase the user engagement over time by encouraging systems such as a scoring system for user to keep track of how healthy is their exercise habit, how close they are to their health goals, or some rewards/discounts on other Bellabeat products or services if they meet certain requirements. This could eventually increase customer loyalty.
Bellabeat could also try to make the products as comfortable as possible for wearing such as lightweight, safe for skin, and hypoallergenic. Features like waterproof and long-lasting battery are also important for outdoor usage.
Daily or weekly tasks/challenges could be introduced to motivate users for more active movement in terms of intensity or length of exercise time.
Introduce notification which can be set by users or the system to remind them if they are not active for a certain period of time, do not meet a certain level of movement, heart rate is unusual/having irregular rhythm/in danger zone according to their age or an adjustable threshold bpm.
Bellabeat can also improve user sleep quality by sending notifications based on the user’s historic wake-up time on the next day to remind user the time to sleep to meet the recommended sleep hours.
New features such as blood Oxygen Level or fall detection could also be considered to provide users more extensive insights of their health condition.
Allow user to set the device to call for emergency services for specific conditions such as the heart rate exceed the range of values for certain amount of time.
Avoid manually input by users as much as possible to prevent the irregular frequency of data collection and human error.