tags: [plotly, pandas, SQL, pandasql, yammer, case-studies, hypothesis-testing, data-visualisation, python, ipynb]
In this study, we will be using a few common technologies:
# prerequisites
!pip install pandasql
!pip install plotly
# data tools
import pandas as pd
import pandasql # for querying pandas dataframes
from pandasql import sqldf
# plotting tools
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio # output Plotly charts to HTML
This notebook features the first case study in a series of three that dives into the Yammer dataset. For our analysis we will be considering Yammer's core metrics (user engagement, retention and growth) and analyzing product-specific usage metrics (e.g. the number of times someone views another user's profile). For clarification, Yammer is a social network for the workplace; our goal in this study is to drive better product and business decisions using data. While this specific dataset is fabricated for privacy and security reasons, it is similar in structure to Yammer's actual data. With all that said, let's begin our study.
emails_path = '../src/yammer_emails.csv'
events_path = '../src/yammer_events.csv'
users_path = '../src/yammer_users.csv'
yammer_emails = pd.read_csv(emails_path, sep=',', header=0)
yammer_emails
 | user_id | occurred_at | action | user_type |
---|---|---|---|---|
0 | 0.0 | 2014-05-06 09:30:00 | sent_weekly_digest | 1.0 |
1 | 0.0 | 2014-05-13 09:30:00 | sent_weekly_digest | 1.0 |
2 | 0.0 | 2014-05-20 09:30:00 | sent_weekly_digest | 1.0 |
3 | 0.0 | 2014-05-27 09:30:00 | sent_weekly_digest | 1.0 |
4 | 0.0 | 2014-06-03 09:30:00 | sent_weekly_digest | 1.0 |
... | ... | ... | ... | ... |
90384 | 18814.0 | 2014-08-31 12:12:26 | email_open | 3.0 |
90385 | 18814.0 | 2014-08-31 12:12:57 | email_clickthrough | 3.0 |
90386 | 18815.0 | 2014-08-31 13:39:56 | sent_reengagement_email | 2.0 |
90387 | 18815.0 | 2014-08-31 13:40:14 | email_open | 2.0 |
90388 | 18815.0 | 2014-08-31 13:40:47 | email_clickthrough | 2.0 |
90389 rows × 4 columns
yammer_events = pd.read_csv(events_path, sep=',', header=0)
yammer_events
 | user_id | occurred_at | event_type | event_name | location | device | user_type |
---|---|---|---|---|---|---|---|
0 | 10522.0 | 2014-05-02 11:02:39 | engagement | login | Japan | dell inspiron notebook | 3.0 |
1 | 10522.0 | 2014-05-02 11:02:53 | engagement | home_page | Japan | dell inspiron notebook | 3.0 |
2 | 10522.0 | 2014-05-02 11:03:28 | engagement | like_message | Japan | dell inspiron notebook | 3.0 |
3 | 10522.0 | 2014-05-02 11:04:09 | engagement | view_inbox | Japan | dell inspiron notebook | 3.0 |
4 | 10522.0 | 2014-05-02 11:03:16 | engagement | search_run | Japan | dell inspiron notebook | 3.0 |
... | ... | ... | ... | ... | ... | ... | ... |
340827 | 18815.0 | 2014-08-31 13:41:46 | engagement | like_message | Ireland | dell inspiron notebook | 2.0 |
340828 | 18815.0 | 2014-08-31 13:42:11 | engagement | home_page | Ireland | dell inspiron notebook | 2.0 |
340829 | 18815.0 | 2014-08-31 13:42:43 | engagement | send_message | Ireland | dell inspiron notebook | 2.0 |
340830 | 18815.0 | 2014-08-31 13:43:07 | engagement | home_page | Ireland | dell inspiron notebook | 2.0 |
340831 | 18815.0 | 2014-08-31 13:43:42 | engagement | like_message | Ireland | dell inspiron notebook | 2.0 |
340832 rows × 7 columns
yammer_users = pd.read_csv(users_path, sep=',', header=0)
yammer_users
 | user_id | created_at | company_id | language | activated_at | state |
---|---|---|---|---|---|---|
0 | 0.0 | 2013-01-01 20:59:39 | 5737.0 | english | 2013-01-01 21:01:07 | active |
1 | 1.0 | 2013-01-01 13:07:46 | 28.0 | english | NaN | pending |
2 | 2.0 | 2013-01-01 10:59:05 | 51.0 | english | NaN | pending |
3 | 3.0 | 2013-01-01 18:40:36 | 2800.0 | german | 2013-01-01 18:42:02 | active |
4 | 4.0 | 2013-01-01 14:37:51 | 5110.0 | indian | 2013-01-01 14:39:05 | active |
... | ... | ... | ... | ... | ... | ... |
19061 | 19061.0 | 2014-08-31 13:21:16 | 2156.0 | chinese | 2014-08-31 13:22:50 | active |
19062 | 19062.0 | 2014-08-31 19:21:23 | 7520.0 | spanish | NaN | pending |
19063 | 19063.0 | 2014-08-31 07:10:41 | 72.0 | spanish | 2014-08-31 07:12:09 | active |
19064 | 19064.0 | 2014-08-31 17:45:18 | 2.0 | english | NaN | pending |
19065 | 19065.0 | 2014-08-31 19:29:19 | 8352.0 | italian | NaN | pending |
19066 rows × 6 columns
It’s Tuesday, September 2, 2014. You’ve just been informed by the head of the Yammer Product team that user engagement is down. Here’s the chart of weekly active users that you’ve been provided:
q = """
SELECT strftime('%Y-%m-%d', occurred_at, 'weekday 0', '-6 days') AS week, COUNT(DISTINCT user_id) AS weekly_active_users
FROM yammer_events
WHERE occurred_at >= '2014-05-01 00:00:00'
AND occurred_at < '2014-09-01 00:00:00'
AND event_type = 'engagement'
GROUP BY 1
ORDER BY 1;
"""
weekly_active_users = pandasql.sqldf(q, globals())
week = weekly_active_users['week']
count_users = weekly_active_users['weekly_active_users']
fig1 = go.Figure()
fig1.add_trace(go.Scatter(x=week, y=count_users, mode='lines', name='num_users'))
fig1.update_layout(title='Weekly Active Users', xaxis_title='Week of', yaxis_title='Number of users')
pio.write_html(fig1, file='figures/fig1.html', auto_open=True)
The above chart shows the number of engaged users each week. Yammer defines engagement as having made some type of server call by interacting with the product (shown in the data as events of type `engagement`). Any point in this chart can be interpreted as “the number of users who logged at least one engagement event during the week starting on that date”.
We are responsible for determining what caused the dip at the end of the chart shown above, and, if appropriate, recommending solutions for the problem.
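As a sanity check on the SQL, the same weekly roll-up can be sketched in pure pandas. The frame below is a minimal, hypothetical sample; with the real data you would use `yammer_events` instead:

```python
import pandas as pd

# hypothetical stand-in for yammer_events
events = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 1, 2],
    'occurred_at': pd.to_datetime([
        '2014-05-05 10:00', '2014-05-06 11:00', '2014-05-07 09:00',
        '2014-05-12 14:00', '2014-05-13 08:00', '2014-05-13 20:00']),
    'event_type': ['engagement'] * 6,
})

engaged = events[events['event_type'] == 'engagement']
# bucket each event into its Monday-start week, then count distinct users per week
week_start = engaged['occurred_at'].dt.to_period('W-SUN').dt.start_time
weekly_active_users = engaged.groupby(week_start)['user_id'].nunique()
```

Counting *distinct* users per week (`nunique`, mirroring `COUNT(DISTINCT user_id)`) is what makes this "active users" rather than raw event volume.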
In this study, we will work through a series of hypotheses, beginning with:
“Do we see engagement levels gradually decreasing over time, or only a one-time sudden decrease? Is the decrease specific to certain days of the week or certain times of the day?”
# engagement rate = number of engagement events / number of active users
q = """
SELECT strftime('%Y-%m-%d', occurred_at) AS day, COUNT(*) * 1.00 / COUNT(DISTINCT user_id) AS engagement_rate
FROM yammer_events
WHERE occurred_at >= '2014-05-01 00:00:00'
AND occurred_at < '2014-09-01 00:00:00'
AND event_type = 'engagement'
GROUP BY 1
ORDER BY 1
"""
daily_engagement = pandasql.sqldf(q, globals())
daily_engagement.head()
 | day | engagement_rate |
---|---|---|
0 | 2014-05-01 | 9.808874 |
1 | 2014-05-02 | 10.765363 |
2 | 2014-05-03 | 8.903448 |
3 | 2014-05-04 | 8.734177 |
4 | 2014-05-05 | 10.050584 |
day = daily_engagement['day']
engagement_rate = daily_engagement['engagement_rate']
fig2 = go.Figure()
fig2.add_trace(go.Scatter(x=day, y=engagement_rate, mode='lines', name='engagement_rate'))
fig2.update_layout(title='Daily Engagement', xaxis_title='Day', yaxis_title='Engagement rate')
pio.write_html(fig2, file='figures/fig2.html', auto_open=True)
q = """
SELECT strftime('%w', occurred_at) AS dow,
CASE strftime('%w', occurred_at)
WHEN '0' THEN 'Sun'
WHEN '1' THEN 'Mon'
WHEN '2' THEN 'Tues'
WHEN '3' THEN 'Wed'
WHEN '4' THEN 'Thurs'
WHEN '5' THEN 'Fri'
WHEN '6' THEN 'Sat'
ELSE NULL END AS day_of_week,
COUNT(*) * 1.00 / COUNT(DISTINCT user_id) AS active_users
FROM yammer_events
WHERE occurred_at >= '2014-05-01 00:00:00'
AND occurred_at < '2014-09-01 00:00:00'
AND event_type = 'engagement'
GROUP BY 1
ORDER BY 1
"""
dow_engagement = pandasql.sqldf(q, globals())
dow_engagement.head()
 | dow | day_of_week | active_users |
---|---|---|---|
0 | 0 | Sun | 9.633461 |
1 | 1 | Mon | 14.766006 |
2 | 2 | Tues | 16.405758 |
3 | 3 | Wed | 17.235331 |
4 | 4 | Thurs | 17.961189 |
day = dow_engagement['day_of_week']
engagement_rate = dow_engagement['active_users']
fig3 = go.Figure()
fig3.add_trace(go.Bar(x=day, y=engagement_rate))
fig3.update_layout(title='Engagement by Day of Week', xaxis_title='Day of Week', yaxis_title='Engagement rate')
pio.write_html(fig3, file='figures/fig3.html', auto_open=True)
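One gotcha worth flagging here: SQLite's `strftime('%w', ...)` numbers days 0 = Sunday through 6 = Saturday, while pandas' `Series.dt.dayofweek` numbers them 0 = Monday through 6 = Sunday. A quick sketch (with hypothetical dates) makes the difference concrete:

```python
import pandas as pd

# 2014-05-04 was a Sunday, 2014-05-05 a Monday
ts = pd.to_datetime(pd.Series(['2014-05-04', '2014-05-05']))

# pandas convention: Monday = 0 ... Sunday = 6
print(ts.dt.dayofweek.tolist())   # [6, 0]
print(ts.dt.day_name().tolist())  # ['Sunday', 'Monday']
```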
q = """
SELECT strftime('%H', occurred_at) AS hour,
COUNT(*) * 1.00 / COUNT(DISTINCT user_id) AS engagement_rate
FROM yammer_events
WHERE occurred_at >= '2014-05-01 00:00:00'
AND occurred_at < '2014-09-01 00:00:00'
AND event_type = 'engagement'
GROUP BY 1
ORDER BY 1
"""
time_day = pandasql.sqldf(q, globals())
time_day.head()
 | hour | engagement_rate |
---|---|---|
0 | 00 | 9.787736 |
1 | 01 | 9.410628 |
2 | 02 | 8.578125 |
3 | 03 | 9.088710 |
4 | 04 | 9.169014 |
hour = time_day['hour']
engagement_rate = time_day['engagement_rate']
fig4 = go.Figure()
fig4.add_trace(go.Bar(x=hour, y=engagement_rate))
fig4.update_layout(title='Engagement by Time of Day', xaxis_title='Hour', yaxis_title='Engagement rate')
pio.write_html(fig4, file='figures/fig4.html', auto_open=True)
Conclusions:
“Since the drop in users was gradual over time and not localized to a particular day of the week or time of day, we should now move on and identify whether there is an issue with acquiring or activating new users.”
q = """
SELECT
strftime('%Y-%m-%d', created_at) AS day,
COUNT(*) AS all_users,
COUNT(CASE WHEN activated_at IS NOT NULL THEN user_id ELSE NULL END) AS activated_users
FROM yammer_users
WHERE created_at >= '2014-05-01 00:00:00'
AND created_at < '2014-09-01 00:00:00'
GROUP BY 1
ORDER BY 1
"""
growth = pandasql.sqldf(q, globals())
day = growth['day']
all_users = growth['all_users']
activated_users = growth['activated_users']
growth.head()
 | day | all_users | activated_users |
---|---|---|---|
0 | 2014-05-01 | 73 | 34 |
1 | 2014-05-02 | 57 | 30 |
2 | 2014-05-03 | 19 | 8 |
3 | 2014-05-04 | 22 | 9 |
4 | 2014-05-05 | 58 | 24 |
fig5 = go.Figure()
fig5.add_trace(go.Scatter(x=day, y=all_users, mode='lines', name='all_users'))
fig5.add_trace(go.Scatter(x=day, y=activated_users, mode='lines', name='activated_users'))
fig5.update_layout(title='Daily signups', xaxis_title='Date', yaxis_title='Number of users')
pio.write_html(fig5, file='figures/fig5.html', auto_open=True)
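The query above counts signups and activations per day; the activation rate itself is just their ratio. A small sketch, re-keying the first two `head()` rows by hand, so the numbers are illustrative:

```python
import pandas as pd

growth = pd.DataFrame({
    'day': ['2014-05-01', '2014-05-02'],
    'all_users': [73, 57],
    'activated_users': [34, 30],
})
# activation rate = activated signups / all signups, per day
growth['activation_rate'] = growth['activated_users'] / growth['all_users']
```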
Conclusions:
By looking at the assumed UX flow, there are four major types of engagement events: signup-funnel events, login/home-page events, inbox and messaging events, and search events.
We want to group these together and see if there is a more pronounced drop in engagement for any one group.
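One lightweight way to express this grouping in pandas is a name-to-group lookup; the mapping below mirrors the CASE buckets used in this section's weekly query, applied to a hypothetical sample frame:

```python
import pandas as pd

# event_name -> funnel group (search_* events are handled separately below)
GROUPS = {
    'create_user': 'signup', 'enter_email': 'signup',
    'enter_info': 'signup', 'complete_signup': 'signup',
    'login': 'login', 'home_page': 'login',
    'view_inbox': 'inbox', 'like_message': 'inbox', 'send_message': 'inbox',
}

events = pd.DataFrame({
    'user_id': [1, 2, 2, 3],
    'event_name': ['login', 'enter_email', 'search_run', 'like_message'],
})
# all search_* events collapse into one bucket; everything else uses the lookup
events['group'] = events['event_name'].map(
    lambda n: 'search_and_click' if n.startswith('search') else GROUPS.get(n))
```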
q = """
SELECT DISTINCT event_name
FROM yammer_events
"""
events = pandasql.sqldf(q, globals())
events.head(50)
 | event_name |
---|---|
0 | login |
1 | home_page |
2 | like_message |
3 | view_inbox |
4 | search_run |
5 | send_message |
6 | search_autocomplete |
7 | search_click_result_10 |
8 | create_user |
9 | enter_email |
10 | enter_info |
11 | complete_signup |
12 | search_click_result_7 |
13 | search_click_result_8 |
14 | search_click_result_1 |
15 | search_click_result_3 |
16 | search_click_result_2 |
17 | search_click_result_5 |
18 | search_click_result_6 |
19 | search_click_result_9 |
20 | search_click_result_4 |
q = """
SELECT
strftime('%Y-%m-%d', occurred_at, 'weekday 0', '-6 days') AS week_of,
COUNT(DISTINCT CASE WHEN event_name IN ('create_user', 'enter_email', 'enter_info', 'complete_signup') THEN user_id ELSE NULL END) AS signup,
COUNT(DISTINCT CASE WHEN event_name IN ('login', 'home_page') THEN user_id ELSE NULL END) AS login,
COUNT(DISTINCT CASE WHEN event_name IN ('view_inbox', 'like_message', 'send_message') THEN user_id ELSE NULL END) AS inbox,
COUNT(DISTINCT CASE WHEN event_name IN ('search_run', 'search_autocomplete', 'search_click_result_1', 'search_click_result_2', 'search_click_result_3', 'search_click_result_4', 'search_click_result_5', 'search_click_result_6', 'search_click_result_7', 'search_click_result_8', 'search_click_result_9', 'search_click_result_10') THEN user_id ELSE NULL END) AS search_and_click
FROM yammer_events
WHERE occurred_at >= '2014-05-01 00:00:00'
AND occurred_at < '2014-09-01 00:00:00'
GROUP BY 1
ORDER BY 1
"""
event_engagement = pandasql.sqldf(q, globals())
week_of = event_engagement['week_of']
signup = event_engagement['signup']
login = event_engagement['login']
inbox = event_engagement['inbox']
search_and_click = event_engagement['search_and_click']
fig6 = go.Figure()
fig6.add_trace(go.Scatter(x=week_of, y=signup, mode='lines', name='signup'))
fig6.add_trace(go.Scatter(x=week_of, y=login, mode='lines', name='login'))
fig6.add_trace(go.Scatter(x=week_of, y=inbox, mode='lines', name='inbox'))
fig6.add_trace(go.Scatter(x=week_of, y=search_and_click, mode='lines', name='search and click'))
fig6.update_layout(title='Engagement by Event Type', xaxis_title='Week of', yaxis_title='Number of users')
pio.write_html(fig6, file='figures/fig6.html', auto_open=True)
Conclusions:
“If the drop is localized to a particular region, there could be an issue specific to that region (e.g. an outage or a localized release).”
q = """
-- selects the top 5 countries (by engagement events) of all time
WITH region AS (
    SELECT location, COUNT(user_id) AS num_users
    FROM yammer_events
    WHERE occurred_at >= '2014-05-01 00:00:00'
      AND occurred_at < '2014-09-01 00:00:00'
      AND event_type = 'engagement'
    GROUP BY 1
    ORDER BY 2 DESC),
top AS (
    SELECT region.location, region.num_users,
           ROW_NUMBER() OVER (ORDER BY num_users DESC) AS rank
    FROM region)
SELECT strftime('%Y-%m-%d', e.occurred_at, 'weekday 0', '-6 days') AS week_of,
e.location,
COUNT(DISTINCT e.user_id) AS num_users
FROM yammer_events e
INNER JOIN top
ON e.location = top.location
WHERE occurred_at >= '2014-05-01 00:00:00'
AND occurred_at < '2014-09-01 00:00:00'
AND event_type = 'engagement'
AND top.rank <= 5
GROUP BY 1, 2
ORDER BY 1, 3 DESC
"""
top_regions = pandasql.sqldf(q, globals())
top_regions.head(10)
 | week_of | location | num_users |
---|---|---|---|
0 | 2014-04-28 | United States | 203 |
1 | 2014-04-28 | Japan | 47 |
2 | 2014-04-28 | Germany | 42 |
3 | 2014-04-28 | France | 34 |
4 | 2014-04-28 | United Kingdom | 33 |
5 | 2014-05-05 | United States | 294 |
6 | 2014-05-05 | Japan | 76 |
7 | 2014-05-05 | Germany | 72 |
8 | 2014-05-05 | France | 53 |
9 | 2014-05-05 | United Kingdom | 49 |
fig7 = px.line(top_regions, x="week_of", y="num_users", color="location", title='Weekly Engagement by Region')
pio.write_html(fig7, file='figures/fig7.html', auto_open=True)
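For reference, the top-5 filter implemented with `ROW_NUMBER()` above has a compact pandas analogue; a sketch on a hypothetical sample:

```python
import pandas as pd

events = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5, 6],
    'location': ['United States', 'United States', 'United States',
                 'Japan', 'Japan', 'Germany'],
})
# rank locations by total event count (descending) and keep the top 5
top5 = events['location'].value_counts().head(5).index
filtered = events[events['location'].isin(top5)]
```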
Conclusions:
“It is also good to check whether the drop in engagement is localized to a type of device. We'd want to compare laptops vs. phones vs. tablets, and then perhaps iOS vs. Android, or Macs vs. PCs.”
We’ll categorize our users by their device type, grouping devices into desktop, phone, and tablet:
q = """
SELECT DISTINCT device
FROM yammer_events
"""
devices = pandasql.sqldf(q, globals())
devices.head(50)
 | device |
---|---|
0 | dell inspiron notebook |
1 | iphone 5 |
2 | iphone 4s |
3 | windows surface |
4 | macbook air |
5 | iphone 5s |
6 | macbook pro |
7 | kindle fire |
8 | ipad mini |
9 | nexus 7 |
10 | nexus 5 |
11 | samsung galaxy s4 |
12 | lenovo thinkpad |
13 | samsumg galaxy tablet |
14 | acer aspire notebook |
15 | asus chromebook |
16 | htc one |
17 | nokia lumia 635 |
18 | samsung galaxy note |
19 | acer aspire desktop |
20 | mac mini |
21 | hp pavilion desktop |
22 | dell inspiron desktop |
23 | ipad air |
24 | amazon fire phone |
25 | nexus 10 |
q = """
SELECT
strftime('%Y-%m-%d', occurred_at, 'weekday 0', '-6 days') AS week_of,
COUNT(DISTINCT user_id) AS weekly_active_users,
COUNT(DISTINCT CASE WHEN device IN ('dell inspiron notebook', 'macbook air', 'macbook pro', 'lenovo thinkpad', 'acer aspire notebook', 'asus chromebook', 'acer aspire desktop', 'mac mini', 'hp pavilion desktop', 'dell inspiron desktop') THEN user_id ELSE NULL END) AS desktop,
COUNT(DISTINCT CASE WHEN device IN ('iphone 5', 'iphone 4s', 'iphone 5s', 'nexus 5', 'samsung galaxy s4', 'htc one', 'nokia lumia 635', 'samsung galaxy note', 'amazon fire phone') THEN user_id ELSE NULL END) AS phone,
COUNT(DISTINCT CASE WHEN device IN ('windows surface', 'kindle fire', 'ipad mini', 'nexus 7', 'samsumg galaxy tablet', 'ipad air', 'nexus 10') THEN user_id ELSE NULL END) AS tablet -- 'samsumg' [sic] matches the device string as it appears in the data
FROM yammer_events
WHERE occurred_at >= '2014-05-01 00:00:00'
AND occurred_at < '2014-09-01 00:00:00'
AND event_type = 'engagement'
GROUP BY 1
ORDER BY 1
"""
device_engagement = pandasql.sqldf(q, globals())
week_of = device_engagement['week_of']
weekly_active_users = device_engagement['weekly_active_users']
desktop = device_engagement['desktop']
phone = device_engagement['phone']
tablet = device_engagement['tablet']
fig8 = go.Figure()
fig8.add_trace(go.Scatter(x=week_of, y=weekly_active_users, mode='lines', name='weekly active users'))
fig8.add_trace(go.Scatter(x=week_of, y=desktop, mode='lines', name='desktop'))
fig8.add_trace(go.Scatter(x=week_of, y=phone, mode='lines', name='phone'))
fig8.add_trace(go.Scatter(x=week_of, y=tablet, mode='lines', name='tablet'))
fig8.update_layout(title='Engagement by Device Type', xaxis_title='Week of', yaxis_title='Number of users')
pio.write_html(fig8, file='figures/fig8.html', auto_open=True)
From the above plot, we can see that our engagement decreased most significantly on the mobile (“phone”) Yammer platform.
We’ll further investigate by looking at the differences in engagement across two mobile OS types:
q = """
SELECT
strftime('%Y-%m-%d', occurred_at, 'weekday 0', '-6 days') AS week_of,
COUNT(DISTINCT CASE WHEN device IN ('iphone 4s', 'iphone 5', 'iphone 5s') THEN user_id ELSE NULL END) AS "iOS",
COUNT(DISTINCT CASE WHEN device IN ('nexus 5', 'samsung galaxy s4', 'htc one', 'nokia lumia 635', 'samsung galaxy note', 'amazon fire phone') THEN user_id ELSE NULL END) AS "Other"
FROM yammer_events
WHERE occurred_at >= '2014-05-01 00:00:00'
AND occurred_at < '2014-09-01 00:00:00'
AND event_type = 'engagement'
GROUP BY 1
ORDER BY 1
"""
os_engagement = pandasql.sqldf(q, globals())
week_of = os_engagement['week_of']
iOS = os_engagement['iOS']
Other = os_engagement['Other']
fig9 = go.Figure()
fig9.add_trace(go.Scatter(x=week_of, y=iOS, mode='lines', name='iOS'))
fig9.add_trace(go.Scatter(x=week_of, y=Other, mode='lines', name='Other'))
fig9.update_layout(title='Engagement by Phone OS', xaxis_title='Week of', yaxis_title='Number of users')
pio.write_html(fig9, file='figures/fig9.html', auto_open=True)
Conclusions:
The drop in engagement is concentrated among `phone` users.

“Since activation rate is normal, we know the issue is not associated with growth (acquiring new users). The drop is likely due to disengagement from existing users, so we should do a cohort analysis, allowing us to compare users at different life stages with Yammer.”
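The query below buckets each user by age in days, measured from activation to the end of the observation window (2014-09-01). The same bucketing can be sketched in pandas with `pd.cut` (hypothetical activation dates):

```python
import pandas as pd

users = pd.DataFrame({
    'user_id': [1, 2, 3],
    'activated_at': pd.to_datetime(['2014-08-28', '2014-08-10', '2014-05-01']),
})
cutoff = pd.Timestamp('2014-09-01')
age_days = (cutoff - users['activated_at']).dt.days

# 7-day buckets: [0, 7), [7, 14), ..., [63, 70), then 70+
bins = list(range(0, 71, 7)) + [10**6]
labels = (['<1 week old']
          + [f'{k} week{"s" if k > 1 else ""} old' for k in range(1, 10)]
          + ['10+ weeks old'])
users['cohort'] = pd.cut(age_days, bins=bins, labels=labels, right=False)
```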
q = """
SELECT
strftime('%Y-%m-%d', ue.occurred_at, 'weekday 0', '-6 days') AS week_of,
COUNT(DISTINCT CASE WHEN ue.user_age < 7 THEN ue.user_id ELSE NULL END) AS '<1 week old',
COUNT(DISTINCT CASE WHEN ue.user_age >= 7 AND ue.user_age < 14 THEN ue.user_id ELSE NULL END) AS '1 week old',
COUNT(DISTINCT CASE WHEN ue.user_age >= 14 AND ue.user_age < 21 THEN ue.user_id ELSE NULL END) AS '2 weeks old',
COUNT(DISTINCT CASE WHEN ue.user_age >= 21 AND ue.user_age < 28 THEN ue.user_id ELSE NULL END) AS '3 weeks old',
COUNT(DISTINCT CASE WHEN ue.user_age >= 28 AND ue.user_age < 35 THEN ue.user_id ELSE NULL END) AS '4 weeks old',
COUNT(DISTINCT CASE WHEN ue.user_age >= 35 AND ue.user_age < 42 THEN ue.user_id ELSE NULL END) AS '5 weeks old',
COUNT(DISTINCT CASE WHEN ue.user_age >= 42 AND ue.user_age < 49 THEN ue.user_id ELSE NULL END) AS '6 weeks old',
COUNT(DISTINCT CASE WHEN ue.user_age >= 49 AND ue.user_age < 56 THEN ue.user_id ELSE NULL END) AS '7 weeks old',
COUNT(DISTINCT CASE WHEN ue.user_age >= 56 AND ue.user_age < 63 THEN ue.user_id ELSE NULL END) AS '8 weeks old',
COUNT(DISTINCT CASE WHEN ue.user_age >= 63 AND ue.user_age < 70 THEN ue.user_id ELSE NULL END) AS '9 weeks old',
COUNT(DISTINCT CASE WHEN ue.user_age >= 70 THEN ue.user_id ELSE NULL END) AS '10+ weeks old'
FROM (SELECT e.occurred_at, u.user_id, CAST((julianday('2014-09-01') - julianday(u.activated_at)) AS INT) AS user_age
FROM yammer_users u
JOIN yammer_events e
ON e.user_id = u.user_id
WHERE e.occurred_at >= '2014-05-01 00:00:00'
AND e.occurred_at < '2014-09-01 00:00:00'
AND e.event_type = 'engagement'
AND u.activated_at IS NOT NULL) ue
GROUP BY 1
ORDER BY 1
"""
age_cohort = pandasql.sqldf(q, globals())
age_cohort.head(50)
 | week_of | <1 week old | 1 week old | 2 weeks old | 3 weeks old | 4 weeks old | 5 weeks old | 6 weeks old | 7 weeks old | 8 weeks old | 9 weeks old | 10+ weeks old |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2014-04-28 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 701 |
1 | 2014-05-05 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1054 |
2 | 2014-05-12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1094 |
3 | 2014-05-19 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1147 |
4 | 2014-05-26 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1113 |
5 | 2014-06-02 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1173 |
6 | 2014-06-09 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1219 |
7 | 2014-06-16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1263 |
8 | 2014-06-23 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 210 | 1039 |
9 | 2014-06-30 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 199 | 151 | 921 |
10 | 2014-07-07 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 223 | 130 | 100 | 902 |
11 | 2014-07-14 | 0 | 0 | 0 | 0 | 0 | 0 | 215 | 152 | 82 | 62 | 834 |
12 | 2014-07-21 | 0 | 0 | 0 | 0 | 0 | 228 | 144 | 95 | 60 | 44 | 792 |
13 | 2014-07-28 | 0 | 0 | 0 | 0 | 234 | 156 | 91 | 83 | 43 | 30 | 806 |
14 | 2014-08-04 | 0 | 0 | 0 | 189 | 154 | 82 | 52 | 52 | 34 | 24 | 679 |
15 | 2014-08-11 | 0 | 0 | 250 | 126 | 94 | 59 | 33 | 39 | 33 | 19 | 562 |
16 | 2014-08-18 | 0 | 259 | 163 | 69 | 64 | 40 | 19 | 26 | 26 | 15 | 522 |
17 | 2014-08-25 | 266 | 173 | 82 | 48 | 47 | 31 | 20 | 23 | 14 | 15 | 475 |
week_of = age_cohort['week_of']
y0 = age_cohort['<1 week old']
y1 = age_cohort['1 week old']
y2 = age_cohort['2 weeks old']
y3 = age_cohort['3 weeks old']
y4 = age_cohort['4 weeks old']
y5 = age_cohort['5 weeks old']
y6 = age_cohort['6 weeks old']
y7 = age_cohort['7 weeks old']
y8 = age_cohort['8 weeks old']
y9 = age_cohort['9 weeks old']
y10 = age_cohort['10+ weeks old']
fig10 = go.Figure()
fig10.add_trace(go.Scatter(x=week_of, y=y0, mode='lines', name='<1 week old'))
fig10.add_trace(go.Scatter(x=week_of, y=y1, mode='lines', name='1 week old'))
fig10.add_trace(go.Scatter(x=week_of, y=y2, mode='lines', name='2 weeks old'))
fig10.add_trace(go.Scatter(x=week_of, y=y3, mode='lines', name='3 weeks old'))
fig10.add_trace(go.Scatter(x=week_of, y=y4, mode='lines', name='4 weeks old'))
fig10.add_trace(go.Scatter(x=week_of, y=y5, mode='lines', name='5 weeks old'))
fig10.add_trace(go.Scatter(x=week_of, y=y6, mode='lines', name='6 weeks old'))
fig10.add_trace(go.Scatter(x=week_of, y=y7, mode='lines', name='7 weeks old'))
fig10.add_trace(go.Scatter(x=week_of, y=y8, mode='lines', name='8 weeks old'))
fig10.add_trace(go.Scatter(x=week_of, y=y9, mode='lines', name='9 weeks old'))
fig10.add_trace(go.Scatter(x=week_of, y=y10, mode='lines', name='10+ weeks old'))
fig10.update_layout(title='Engagement by User Age Cohort', xaxis_title='Week of', yaxis_title='Number of Users')
pio.write_html(fig10, file='figures/fig10.html', auto_open=True)
Conclusions:
Yammer sends users two types of email to increase engagement:

* a weekly digest (`sent_weekly_digest`)
* a reengagement email (`sent_reengagement_email`)

We want to see if something has gone wrong with these emails by looking into their open and click-through rates.
q = """
SELECT DISTINCT action
FROM yammer_emails
"""
emails = pandasql.sqldf(q, globals())
emails.head()
 | action |
---|---|
0 | sent_weekly_digest |
1 | email_open |
2 | email_clickthrough |
3 | sent_reengagement_email |
Note:
Some preliminary analysis done on `email_open` rates suggests that >97% of all engagement emails sent are opened within the first hour. Furthermore, less than 2% of engagement emails are opened after the 24-hour mark. Because of this, we’ll be looking at `email_open` and `email_clickthrough` events that occur within a 24-hour window of the send.
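The 24-hour window in the self-join below boils down to a timedelta comparison; here is a pandas sketch of the same check on a hypothetical sends/opens sample:

```python
import pandas as pd

sends = pd.DataFrame({'user_id': [1, 2],
                      'sent_at': pd.to_datetime(['2014-08-01 09:00',
                                                 '2014-08-01 09:00'])})
opens = pd.DataFrame({'user_id': [1, 2],
                      'opened_at': pd.to_datetime(['2014-08-01 10:30',
                                                   '2014-08-03 12:00'])})

m = sends.merge(opens, on='user_id', how='left')
delta = m['opened_at'] - m['sent_at']
# an open counts only if it lands within [send, send + 24h)
m['opened_within_24h'] = (delta >= pd.Timedelta(0)) & (delta < pd.Timedelta(hours=24))
```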
q = """
SELECT
strftime('%Y-%m-%d', e.occurred_at, 'weekday 0', '-6 days') AS week_of,
COUNT(CASE WHEN e.action = 'sent_weekly_digest' THEN e.user_id ELSE NULL END) AS emails_sent,
COUNT(CASE WHEN e.action = 'sent_weekly_digest' THEN e1.user_id ELSE NULL END) AS emails_opened,
COUNT(CASE WHEN e.action = 'sent_weekly_digest' THEN e2.user_id ELSE NULL END) AS emails_ct,
COUNT(CASE WHEN e.action = 'sent_reengagement_email' THEN e.user_id ELSE NULL END) AS reengagements_sent,
COUNT(CASE WHEN e.action = 'sent_reengagement_email' THEN e1.user_id ELSE NULL END) AS reengagements_opened,
COUNT(CASE WHEN e.action = 'sent_reengagement_email' THEN e2.user_id ELSE NULL END) AS reengagements_ct
FROM yammer_emails e
LEFT JOIN yammer_emails e1
ON e.user_id = e1.user_id
AND e1.action = 'email_open'
AND e1.occurred_at >= e.occurred_at
AND e1.occurred_at < datetime(e.occurred_at, '+24 hours')
LEFT JOIN yammer_emails e2
ON e.user_id = e2.user_id
AND e2.action = 'email_clickthrough'
AND e2.occurred_at >= e.occurred_at
AND e2.occurred_at < datetime(e.occurred_at, '+24 hours')
WHERE e.occurred_at >= '2014-05-01 00:00:00'
AND e.occurred_at < '2014-09-01 00:00:00'
GROUP BY 1
ORDER BY 1
"""
email_counts = pandasql.sqldf(q, globals())
email_counts.head()
 | week_of | emails_sent | emails_opened | emails_ct | reengagements_sent | reengagements_opened | reengagements_ct |
---|---|---|---|---|---|---|---|
0 | 2014-04-28 | 908 | 246 | 105 | 98 | 86 | 82 |
1 | 2014-05-05 | 2602 | 776 | 306 | 164 | 143 | 128 |
2 | 2014-05-12 | 2665 | 809 | 332 | 175 | 162 | 147 |
3 | 2014-05-19 | 2733 | 834 | 348 | 179 | 161 | 150 |
4 | 2014-05-26 | 2822 | 868 | 314 | 179 | 158 | 139 |
week_of = email_counts['week_of']
emails_sent = email_counts['emails_sent']
emails_opened = email_counts['emails_opened']
emails_ct = email_counts['emails_ct']
reengagements_sent = email_counts['reengagements_sent']
reengagements_opened = email_counts['reengagements_opened']
reengagements_ct = email_counts['reengagements_ct']
fig11 = go.Figure()
fig11.add_trace(go.Scatter(x=week_of, y=emails_sent, mode='lines', name='emails sent'))
fig11.add_trace(go.Scatter(x=week_of, y=emails_opened, mode='lines', name='emails opened'))
fig11.add_trace(go.Scatter(x=week_of, y=emails_ct, mode='lines', name='emails clicked-through'))
fig11.add_trace(go.Scatter(x=week_of, y=reengagements_sent, mode='lines', name='reengagements sent'))
fig11.add_trace(go.Scatter(x=week_of, y=reengagements_opened, mode='lines', name='reengagements opened'))
fig11.add_trace(go.Scatter(x=week_of, y=reengagements_ct, mode='lines', name='reengagements clicked-through'))
fig11.update_layout(title='Weekly Email Activity (sent, opened, clicked-through)', xaxis_title='Week of', yaxis_title='Number of emails')
pio.write_html(fig11, file='figures/fig11.html', auto_open=True)
Conclusions:
From our analysis, we find that our mobile users are becoming less engaged with the platform. We also see email click-through rates decreasing, which tells us that our email reengagement and retention campaigns aren’t going so well. Our next step is to reach out to the Engineering and Marketing teams to check for potential issues (e.g. product releases, marketing campaigns) that may have contributed to the decline in engagement. We’ve successfully ruled out major bugs (e.g. platform outages, regional differences, signup funnel issues) and have a good understanding of what steps to take next (A/B testing, local segmentation, feature roll-backs, improved email campaigns).
The data and problem description were provided by Mode Analytics. The approach was based on the work of Jodi Zhang. A lot of time and energy went into investigating each query and analysing our goal and value metrics.
2021-05-26 00:00:00 -0700 - Written by Jonathan Logan Moran