[ plotly pandas SQL pandasql yammer case-studies hypothesis-testing data-visualisation python ipynb ]

Investigating a Drop in User Engagement

In this study, we will be using a few common technologies:

SQL (SQLite) for data manipulation/exploration
Python (pandas) for data frames
Plotly/Jupyter notebook for interactive charts/data visualization

Programming

# prerequisites
!pip install pandasql
!pip install plotly

# data tools
import pandas as pd
import pandasql					# for querying pandas dataframes
from pandasql import sqldf

# plotting tools
import plotly.graph_objects as go
import plotly.express as px

import plotly.io as pio				# output Plotly charts to HMTL

Task overview

This notebook features the first case study in a series of three that dives into the Yammer dataset. For our analysis we will be considering Yammer’s core metrics (user engagement, retention and growth) and analyzing product-specific usage metrics (e.g. number of times someone views another user’s profile). For clarification, Yammer is a social network for the workplace whose primary goal is to drive better product and business decisions using data. While this specific dataset is fabricated due to privacy and security reasons, it is similar in structure to Yammer’s actual data. With all that said–let’s begin our study…

Collecting our dataset

emails_path = '../src/yammer_emails.csv'
events_path = '../src/yammer_events.csv'
users_path = '../src/yammer_users.csv'

yammer_emails = pd.read_csv(emails_path, sep=',', header=0)
yammer_emails

	user_id	occurred_at	action	user_type
0	0.0	2014-05-06 09:30:00	sent_weekly_digest	1.0
1	0.0	2014-05-13 09:30:00	sent_weekly_digest	1.0
2	0.0	2014-05-20 09:30:00	sent_weekly_digest	1.0
3	0.0	2014-05-27 09:30:00	sent_weekly_digest	1.0
4	0.0	2014-06-03 09:30:00	sent_weekly_digest	1.0
...	...	...	...	...
90384	18814.0	2014-08-31 12:12:26	email_open	3.0
90385	18814.0	2014-08-31 12:12:57	email_clickthrough	3.0
90386	18815.0	2014-08-31 13:39:56	sent_reengagement_email	2.0
90387	18815.0	2014-08-31 13:40:14	email_open	2.0
90388	18815.0	2014-08-31 13:40:47	email_clickthrough	2.0

90389 rows × 4 columns

yammer_events = pd.read_csv(events_path, sep=',', header=0)
yammer_events

	user_id	occurred_at	event_type	event_name	location	device	user_type
0	10522.0	2014-05-02 11:02:39	engagement	login	Japan	dell inspiron notebook	3.0
1	10522.0	2014-05-02 11:02:53	engagement	home_page	Japan	dell inspiron notebook	3.0
2	10522.0	2014-05-02 11:03:28	engagement	like_message	Japan	dell inspiron notebook	3.0
3	10522.0	2014-05-02 11:04:09	engagement	view_inbox	Japan	dell inspiron notebook	3.0
4	10522.0	2014-05-02 11:03:16	engagement	search_run	Japan	dell inspiron notebook	3.0
...	...	...	...	...	...	...	...
340827	18815.0	2014-08-31 13:41:46	engagement	like_message	Ireland	dell inspiron notebook	2.0
340828	18815.0	2014-08-31 13:42:11	engagement	home_page	Ireland	dell inspiron notebook	2.0
340829	18815.0	2014-08-31 13:42:43	engagement	send_message	Ireland	dell inspiron notebook	2.0
340830	18815.0	2014-08-31 13:43:07	engagement	home_page	Ireland	dell inspiron notebook	2.0
340831	18815.0	2014-08-31 13:43:42	engagement	like_message	Ireland	dell inspiron notebook	2.0

340832 rows × 7 columns

yammer_users = pd.read_csv(users_path, sep=',', header=0)
yammer_users

	user_id	created_at	company_id	language	activated_at	state
0	0.0	2013-01-01 20:59:39	5737.0	english	2013-01-01 21:01:07	active
1	1.0	2013-01-01 13:07:46	28.0	english	NaN	pending
2	2.0	2013-01-01 10:59:05	51.0	english	NaN	pending
3	3.0	2013-01-01 18:40:36	2800.0	german	2013-01-01 18:42:02	active
4	4.0	2013-01-01 14:37:51	5110.0	indian	2013-01-01 14:39:05	active
...	...	...	...	...	...	...
19061	19061.0	2014-08-31 13:21:16	2156.0	chinese	2014-08-31 13:22:50	active
19062	19062.0	2014-08-31 19:21:23	7520.0	spanish	NaN	pending
19063	19063.0	2014-08-31 07:10:41	72.0	spanish	2014-08-31 07:12:09	active
19064	19064.0	2014-08-31 17:45:18	2.0	english	NaN	pending
19065	19065.0	2014-08-31 19:29:19	8352.0	italian	NaN	pending

19066 rows × 6 columns

The problem

It’s Tuesday, September 2, 2014. You’ve just been informed by the head of the Yammer Product team that user engagement is down. Here’s the chart of weekly active users that you’ve been provided:

q = """
SELECT strftime('%Y-%m-%d', occurred_at, 'weekday 0', '-6 days') AS week, COUNT(DISTINCT user_id) AS weekly_active_users
FROM yammer_events
WHERE occurred_at >= '2014-05-01 00:00:00'
AND occurred_at < '2014-09-01 00:00:00'
AND event_type = 'engagement'
GROUP BY 1
ORDER BY 1;
"""

weekly_active_users = pandasql.sqldf(q, globals())

week = weekly_active_users['week']
count_users = weekly_active_users['weekly_active_users']

fig1 = go.Figure()
fig1.add_trace(go.Scatter(x=week, y=count_users, mode='lines', name='num_users'))
fig1.update_layout(title='Weekly Active Users', xaxis_title='Week of', yaxis_title='Number of users')
pio.write_html(fig1, file='figures/fig1.html', auto_open=True)

The above chart shows the number of engaged users each week. Yammer defines engagement as having made some type of server call by interacting with the product (showed in the data as events of type engagement). Any point in this chart can be interpreted as “the number of users who logged at least one engagement event during the week starting on that date”.

Analysis

We are responsible for determining what caused the dip at the end of the chart shown above, and, if appropriate, recommending solutions for the problem.

In this study, we will be considering the following metrics:

Engagement by Factors of Time
New Users (Activation/Growth Rate)
Engagement by Event Type
Engagement by Region
Engagement by Product/Device Type
Email Open and Click-Through Rate

1. Engagement by Factors of Time

“Do we see the engagement levels gradually decreasing over time or only a one time sudden decrease? Is the decrease specific to certain days of the week or certain times of the day?”

Daily Engagement Rate

# engagement rate = num of engagments / num of users
q = """
SELECT strftime('%Y-%m-%d', occurred_at) AS day, COUNT(*) * 1.00 / COUNT(DISTINCT user_id) AS engagement_rate
FROM yammer_events
WHERE occurred_at >= '2014-05-01 00:00:00'
AND occurred_at < '2014-09-01 00:00:00'
AND event_type = 'engagement'
GROUP BY 1
ORDER BY 1
"""

daily_engagement = pandasql.sqldf(q, globals())

daily_engagement.head()

	day	engagement_rate
0	2014-05-01	9.808874
1	2014-05-02	10.765363
2	2014-05-03	8.903448
3	2014-05-04	8.734177
4	2014-05-05	10.050584

day = daily_engagement['day']
engagement_rate = daily_engagement['engagement_rate']

fig2 = go.Figure()
fig2.add_trace(go.Scatter(x=day, y=engagement_rate, mode='lines', name='engagement_rate'))
fig2.update_layout(title='Daily Engagement', xaxis_title='Day', yaxis_title='Engagement rate')
pio.write_html(fig2, file='figures/fig2.html', auto_open=True)

Engagement by Week

q = """
SELECT strftime('%w', occurred_at) AS dow,
       CASE strftime('%w', occurred_at)
       WHEN '0' THEN 'Sun'
       WHEN '1' THEN 'Mon'
       WHEN '2' THEN 'Tues'
       WHEN '3' THEN 'Wed'
       WHEN '4' THEN 'Thurs'
       WHEN '5' THEN 'Fri'
       WHEN '6' THEN 'Sat'
       ELSE NULL END AS day_of_week,
       COUNT(*) * 1.00 / COUNT(DISTINCT user_id) AS active_users
FROM yammer_events
WHERE occurred_at >= '2014-05-01 00:00:00'
AND occurred_at < '2014-09-01 00:00:00'
AND event_type = 'engagement'
GROUP BY 1
ORDER BY 1
"""

dow_engagement = pandasql.sqldf(q, globals())

dow_engagement.head()

	dow	day_of_week	active_users
0	0	Sun	9.633461
1	1	Mon	14.766006
2	2	Tues	16.405758
3	3	Wed	17.235331
4	4	Thurs	17.961189

day = dow_engagement['day_of_week']
engagement_rate = dow_engagement['active_users']

fig3 = go.Figure()
fig3.add_trace(go.Bar(x=day, y=engagement_rate))
fig3.update_layout(title='Engagement by Day of Week', xaxis_title='Day of Week', yaxis_title='Engagement rate')
pio.write_html(fig3, file='figures/fig3.html', auto_open=True)

Engagement by Time of Day

q = """
SELECT strftime('%H', occurred_at) AS hour,
       COUNT(*) * 1.00 / COUNT(DISTINCT user_id) AS engagement_rate
FROM yammer_events
WHERE occurred_at >= '2014-05-01 00:00:00'
AND occurred_at < '2014-9-01 00:00:00'
AND event_type = 'engagement'
GROUP BY 1
ORDER BY 1
"""

time_day = pandasql.sqldf(q, globals())

time_day.head()

	hour	engagement_rate
0	00	9.787736
1	01	9.410628
2	02	8.578125
3	03	9.088710
4	04	9.169014

hour = time_day['hour']
engagement_rate = time_day['engagement_rate']

fig4 = go.Figure()
fig4.add_trace(go.Bar(x=hour, y=engagement_rate))
fig4.update_layout(title='Engagement by Time of Day', xaxis_title='Hour', yaxis_title='Engagement rate')
pio.write_html(fig4, file='figures/fig4.html', auto_open=True)

Conclusions:

From the above graphs, we can see that the drop in engagement was gradual over the span of a few months and not localized to a particular day or time. We can assume that this is not likely due to a one-time server/technical outage.
The Time of Day chart seems consistent in engagement aside from one outlier–9 AM, when users appear to have the most acitivty at this time.
Engagement by Day of Week tells us that our users have the least activty on the days of Saturday and Sunday, which make sense since Yammer is typically used during work hours.

2. New Users (Activation and Growth Rate)

“Since the drop in users were gradual over time and not localized to a particular day of the week or time of day, we should now move on and identify if there is an issue with acquiring new users or activating new users”

q = """
SELECT
  strftime('%Y-%m-%d', created_at) AS day,
  COUNT(*) AS all_users,
  COUNT(CASE WHEN activated_at IS NOT NULL THEN user_id ELSE NULL END) AS activated_users
FROM yammer_users
WHERE created_at >= '2014-05-01 00:00:00'
AND created_at < '2014-09-01 00:00:00'
GROUP BY 1
ORDER BY 1
"""

growth = pandasql.sqldf(q, globals())

day = growth['day']
all_users = growth['all_users']
activated_users = growth['activated_users']

growth.head()

	day	all_users	activated_users
0	2014-05-01	73	34
1	2014-05-02	57	30
2	2014-05-03	19	8
3	2014-05-04	22	9
4	2014-05-05	58	24

fig5 = go.Figure()
fig5.add_trace(go.Scatter(x=day, y=all_users, mode='lines', name='all_users'))
fig5.add_trace(go.Scatter(x=day, y=activated_users, mode='lines', name='activated_users'))
fig5.update_layout(title='Daily signups', xaxis_title='Date', yaxis_title='Number of users')
pio.write_html(fig5, file='figures/fig5.html', auto_open=True)

Conclusions:

The chart above indicates that the growth rate appears normal. We can assume that there are no sign up issues.
We will look into the drop in engagement from our existing users, as Yammer’s goal metric is satisfied (Yammer is providing the intended value for new customers).

3. Engagement by Event Type

By looking at the assumed UX flow, there are four major types of engagement events:

signing up as a new user
logging in to homepage as an existing user
inbox and message related engagement
search and click related engagement

We want to group these together, and see if there is a more pronounced drop in engagement for either groups

q = """
SELECT DISTINCT event_name
FROM yammer_events
"""

events = pandasql.sqldf(q, globals())

events.head(50)

	event_name
0	login
1	home_page
2	like_message
3	view_inbox
4	search_run
5	send_message
6	search_autocomplete
7	search_click_result_10
8	create_user
9	enter_email
10	enter_info
11	complete_signup
12	search_click_result_7
13	search_click_result_8
14	search_click_result_1
15	search_click_result_3
16	search_click_result_2
17	search_click_result_5
18	search_click_result_6
19	search_click_result_9
20	search_click_result_4

q = """
SELECT
  strftime('%Y-%m-%d', occurred_at, 'weekday 0', '-6 days') AS week_of,
  COUNT(DISTINCT CASE WHEN event_name IN ('create_user', 'enter_email', 'enter_info', 'complete_signup') THEN user_id ELSE NULL END) AS signup,
  COUNT(DISTINCT CASE WHEN event_name IN ('login', 'home_page') THEN user_id ELSE NULL END) AS login,
  COUNT(DISTINCT CASE WHEN event_name IN ('view_inbox', 'like_message', 'send_message') THEN user_id ELSE NULL END) AS inbox,
  COUNT(DISTINCT CASE WHEN event_name IN ('search_run', 'search_autocomplete', 'search_click_result_1', 'search_click_result_2', 'search_click_result_3', 'search_click_result_4', 'search_click_result_5', 'search_click_result_6', 'search_click_result_7', 'search_click_result_8', 'search_click_result_9', 'search_click_result_10') THEN user_id ELSE NULL END) AS search_and_click
FROM yammer_events
WHERE occurred_at >= '2014-05-01 00:00:00'
AND occurred_at < '2014-09-01 00:00:00'
GROUP BY 1
ORDER BY 1
"""

event_engagement = pandasql.sqldf(q, globals())

week_of = event_engagement['week_of']
signup = event_engagement['signup']
login = event_engagement['login']
inbox = event_engagement['inbox']
search_and_click = event_engagement['search_and_click']

fig6 = go.Figure()
fig6.add_trace(go.Scatter(x=week_of, y=signup, mode='lines', name='signup'))
fig6.add_trace(go.Scatter(x=week_of, y=login, mode='lines', name='login'))
fig6.add_trace(go.Scatter(x=week_of, y=inbox, mode='lines', name='inbox'))
fig6.add_trace(go.Scatter(x=week_of, y=search_and_click, mode='lines', name='search and click'))
fig6.update_layout(title='Engagement by Event Type', xaxis_title='Week of', yaxis_title='Number of users')
pio.write_html(fig6, file='figures/fig6.html', auto_open=True)

Conclusions:

We observe a decrease in engagement across all events, namely homepage and message activity.
Users appear to be navigating through the signup funnel, as we see no decrease in activty in this category.

4. Engagement by Region

“If the drop is localized to a particular region then there could be an issue related to that.”

q = """
-- Selects Top 5 countries of all-time
WITH region AS (SELECT location, COUNT(user_id) AS num_users FROM yammer_events WHERE occurred_at >= '2014-05-01 00:00:00' AND occurred_at < '2014-09-01 00:00:00' AND event_type = 'engagement' GROUP BY 1 ORDER BY 2 DESC),
top AS (SELECT region.location, region.num_users, ROW_NUMBER() OVER (ORDER BY num_users DESC) AS rank FROM region)
SELECT strftime('%Y-%m-%d', e.occurred_at, 'weekday 0', '-6 days') AS week_of,
       e.location,
       COUNT(DISTINCT e.user_id) AS num_users
FROM yammer_events e
INNER JOIN top
ON e.location = top.location
WHERE occurred_at >= '2014-05-01 00:00:00' 
AND occurred_at < '2014-09-01 00:00:00' 
AND event_type = 'engagement'
AND top.rank <=5
GROUP BY 1, 2
ORDER BY 1, 3 DESC
"""

top_regions = pandasql.sqldf(q, globals())

top_regions.head(10)

	week_of	location	num_users
0	2014-04-28	United States	203
1	2014-04-28	Japan	47
2	2014-04-28	Germany	42
3	2014-04-28	France	34
4	2014-04-28	United Kingdom	33
5	2014-05-05	United States	294
6	2014-05-05	Japan	76
7	2014-05-05	Germany	72
8	2014-05-05	France	53
9	2014-05-05	United Kingdom	49

fig7 = px.line(top_regions, x="week_of", y="num_users", color="location", title='Weekly Engagement by Region')
#iplot(fig7)
pio.write_html(fig7, file='figures/fig7.html', auto_open=True)

Conclusions:

We considered engagement across our top five regions. From the graph, we determine that the United States and the most signficant drop in activtiy.
Because of this observation, we will refer to the Engineering and Marketing teams to determine if a product release or campagin is responsible for the decrease in active U.S. users.

5. Engagement by Device/Product Type

“It is also good to check if the drop in engagement rate is localized to a type of device. We’d want to compare laptops vs phones vs tablets and then perhaps iOS vs androids, or macs vs PCs.”

Engagement by Device Type

We’ll categorize our users by their device type, specifically:

Desktop
Phone
Tablet

q = """
SELECT DISTINCT device
FROM yammer_events
"""

devices = pandasql.sqldf(q, globals())

devices.head(50)

	device
0	dell inspiron notebook
1	iphone 5
2	iphone 4s
3	windows surface
4	macbook air
5	iphone 5s
6	macbook pro
7	kindle fire
8	ipad mini
9	nexus 7
10	nexus 5
11	samsung galaxy s4
12	lenovo thinkpad
13	samsumg galaxy tablet
14	acer aspire notebook
15	asus chromebook
16	htc one
17	nokia lumia 635
18	samsung galaxy note
19	acer aspire desktop
20	mac mini
21	hp pavilion desktop
22	dell inspiron desktop
23	ipad air
24	amazon fire phone
25	nexus 10

q = """
SELECT 
  strftime('%Y-%m-%d', occurred_at, 'weekday 0', '-6 days') AS week_of,
  COUNT(DISTINCT user_id) AS weekly_active_users,
  COUNT(DISTINCT CASE WHEN device IN ('dell inspiron notebook', 'macbook air', 'macbook pro', 'lenovo thinkpad', 'acer aspire notebook', 'asus chromebook', 'acer aspire desktop', 'mac mini', 'hp pavillion desktop', 'dell inspiron desktop') THEN user_id ELSE NULL END) AS desktop,
  COUNT(DISTINCT CASE WHEN device IN ('iphone 5', 'iphone 4s', 'iphone 5s', 'nexus 5', 'samsung galaxy s4', 'htc one', 'nokia lumia 635', 'samsung galaxy note', 'amazon fire phone') THEN user_id ELSE NULL END) AS phone,
  COUNT(DISTINCT CASE WHEN device IN ('windows surface', 'kindle fire', 'ipad mini', 'nexus 7', 'samsung galaxy tablet', 'ipad air', 'nexus 10') THEN user_id ELSE NULL END) AS tablet
FROM yammer_events
WHERE occurred_at >= '2014-05-01 00:00:00'
AND occurred_at < '2014-09-01 00:00:00'
AND event_type = 'engagement'
GROUP BY 1
ORDER BY 1
"""

device_engagement = pandasql.sqldf(q, globals())

week_of = device_engagement['week_of']
weekly_active_users = device_engagement['weekly_active_users']
desktop = device_engagement['desktop']
phone = device_engagement['phone']
tablet = device_engagement['tablet']

fig8 = go.Figure()
fig8.add_trace(go.Scatter(x=week_of, y=weekly_active_users, mode='lines', name='weekly active users'))
fig8.add_trace(go.Scatter(x=week_of, y=desktop, mode='lines', name='desktop'))
fig8.add_trace(go.Scatter(x=week_of, y=phone, mode='lines', name='phone'))
fig8.add_trace(go.Scatter(x=week_of, y=tablet, mode='lines', name='tablet'))
fig8.update_layout(title='Engagement by Device Type', xaxis_title='Week of', yaxis_title='Number of users')
pio.write_html(fig8, file='figures/fig8.html', auto_open=True)

Engagement by Product Type

From the above plot, we can see that our engagement decreased most significantly on the mobile (“phone”) Yammer platform.

We’ll further investigate by looking at the differences in engagement across two mobile OS types:

iOS
Other (Android/Windows)

q = """
SELECT
  strftime('%Y-%m-%d', occurred_at, 'weekday 0', '-6 days') AS week_of,
  COUNT(DISTINCT CASE WHEN device IN ('iphone 4s', 'iphone 5', 'iphone 5s') THEN user_id ELSE NULL END) AS "iOS",
  COUNT(DISTINCT CASE WHEN device IN ('nexus 5', 'samsung galaxy s4', 'htc one', 'nokia lumia 635', 'samsung galaxy note', 'amazon fire phone') THEN user_id ELSE NULL END) AS "Other"
FROM yammer_events
WHERE occurred_at >= '2014-05-01 00:00:00' 
AND occurred_at < '2014-09-01 00:00:00'
AND event_type = 'engagement'
GROUP BY 1
ORDER BY 1
"""

os_engagement = pandasql.sqldf(q, globals())

week_of = os_engagement['week_of']
iOS = os_engagement['iOS']
Other = os_engagement['Other']

fig9 = go.Figure()
fig9.add_trace(go.Scatter(x=week_of, y=iOS, mode='lines', name='iOS'))
fig9.add_trace(go.Scatter(x=week_of, y=Other, mode='lines', name='Other'))
fig9.update_layout(title='Engagement by Phone OS', xaxis_title='Week of', yaxis_title='Number of users')
pio.write_html(fig9, file='figures/fig9.html', auto_open=True)

Conclusions:

Engagement appears to be down across all device types, but we observe a more significant drop in phone users.
Again, we need to check with the Engineering team to see if any recent product releases on the mobile platform are responsible for this decline in engagement.

5. Cohort Analysis

“Since activation rate is normal, we know the issue is not associated with growth (acquiring new users). The drop will likely be due to disengagement from existing users, so we should do a cohort analysis allowing us to compare users at different life stages with Yammer.”

q = """
SELECT 
  strftime('%Y-%m-%d', ue.occurred_at, 'weekday 0', '-6 days') AS week_of,
  COUNT(DISTINCT CASE WHEN ue.user_age < 7 THEN ue.user_id ELSE NULL END) AS '<1 week old',
  COUNT(DISTINCT CASE WHEN ue.user_age >= 7 AND ue.user_age < 14 THEN ue.user_id ELSE NULL END) AS '1 week old',
  COUNT(DISTINCT CASE WHEN ue.user_age >= 14 AND ue.user_age < 21 THEN ue.user_id ELSE NULL END) AS '2 weeks old',
  COUNT(DISTINCT CASE WHEN ue.user_age >= 21 AND ue.user_age < 28 THEN ue.user_id ELSE NULL END) AS '3 weeks old',
  COUNT(DISTINCT CASE WHEN ue.user_age >= 28 AND ue.user_age < 35 THEN ue.user_id ELSE NULL END) AS '4 weeks old',
  COUNT(DISTINCT CASE WHEN ue.user_age >= 35 AND ue.user_age < 42 THEN ue.user_id ELSE NULL END) AS '5 weeks old',
  COUNT(DISTINCT CASE WHEN ue.user_age >= 42 AND ue.user_age < 49 THEN ue.user_id ELSE NULL END) AS '6 weeks old',
  COUNT(DISTINCT CASE WHEN ue.user_age >= 49 AND ue.user_age < 56 THEN ue.user_id ELSE NULL END) AS '7 weeks old',
  COUNT(DISTINCT CASE WHEN ue.user_age >= 56 AND ue.user_age < 63 THEN ue.user_id ELSE NULL END) AS '8 weeks old',
  COUNT(DISTINCT CASE WHEN ue.user_age >= 63 AND ue.user_age < 70 THEN ue.user_id ELSE NULL END) AS '9 weeks old',
  COUNT(DISTINCT CASE WHEN ue.user_age >= 70 THEN ue.user_id ELSE NULL END) AS '10+ weeks old'
FROM (SELECT e.occurred_at, u.user_id, CAST((julianday('2014-09-01') - julianday(u.activated_at)) AS INT) AS user_age
      FROM yammer_users u
      JOIN yammer_events e
      ON e.user_id = u.user_id
      WHERE e.occurred_at >= '2014-05-01 00:00:00'
      AND e.occurred_at < '2014-09-01 00:00:00'
      AND e.event_type = 'engagement'
      AND u.activated_at IS NOT NULL) ue
GROUP BY 1
ORDER BY 1
"""

age_cohort = pandasql.sqldf(q, globals())

age_cohort.head(50)

	week_of	<1 week old	1 week old	2 weeks old	3 weeks old	4 weeks old	5 weeks old	6 weeks old	7 weeks old	8 weeks old	9 weeks old	10+ weeks old
0	2014-04-28	0	0	0	0	0	0	0	0	0	0	701
1	2014-05-05	0	0	0	0	0	0	0	0	0	0	1054
2	2014-05-12	0	0	0	0	0	0	0	0	0	0	1094
3	2014-05-19	0	0	0	0	0	0	0	0	0	0	1147
4	2014-05-26	0	0	0	0	0	0	0	0	0	0	1113
5	2014-06-02	0	0	0	0	0	0	0	0	0	0	1173
6	2014-06-09	0	0	0	0	0	0	0	0	0	0	1219
7	2014-06-16	0	0	0	0	0	0	0	0	0	0	1263
8	2014-06-23	0	0	0	0	0	0	0	0	0	210	1039
9	2014-06-30	0	0	0	0	0	0	0	0	199	151	921
10	2014-07-07	0	0	0	0	0	0	0	223	130	100	902
11	2014-07-14	0	0	0	0	0	0	215	152	82	62	834
12	2014-07-21	0	0	0	0	0	228	144	95	60	44	792
13	2014-07-28	0	0	0	0	234	156	91	83	43	30	806
14	2014-08-04	0	0	0	189	154	82	52	52	34	24	679
15	2014-08-11	0	0	250	126	94	59	33	39	33	19	562
16	2014-08-18	0	259	163	69	64	40	19	26	26	15	522
17	2014-08-25	266	173	82	48	47	31	20	23	14	15	475

week_of = age_cohort['week_of']
y0 = age_cohort['<1 week old']
y1 = age_cohort['1 week old']
y2 = age_cohort['2 weeks old']
y3 = age_cohort['3 weeks old']
y4 = age_cohort['4 weeks old']
y5 = age_cohort['5 weeks old']
y6 = age_cohort['6 weeks old']
y7 = age_cohort['7 weeks old']
y8 = age_cohort['8 weeks old']
y9 = age_cohort['9 weeks old']
y10 = age_cohort['10+ weeks old']

fig10 = go.Figure()
fig10.add_trace(go.Scatter(x=week_of, y=y0, mode='lines', name='<1 week old'))
fig10.add_trace(go.Scatter(x=week_of, y=y1, mode='lines', name='1 week old'))
fig10.add_trace(go.Scatter(x=week_of, y=y2, mode='lines', name='2 weeks old'))
fig10.add_trace(go.Scatter(x=week_of, y=y3, mode='lines', name='3 weeks old'))
fig10.add_trace(go.Scatter(x=week_of, y=y4, mode='lines', name='4 weeks old'))
fig10.add_trace(go.Scatter(x=week_of, y=y5, mode='lines', name='5 weeks old'))
fig10.add_trace(go.Scatter(x=week_of, y=y6, mode='lines', name='6 weeks old'))
fig10.add_trace(go.Scatter(x=week_of, y=y7, mode='lines', name='7 weeks old'))
fig10.add_trace(go.Scatter(x=week_of, y=y8, mode='lines', name='8 weeks old'))
fig10.add_trace(go.Scatter(x=week_of, y=y9, mode='lines', name='9 weeks old'))
fig10.add_trace(go.Scatter(x=week_of, y=y10, mode='lines', name='10+ weeks old'))
fig10.update_layout(title='Engagement by User Age Cohort', xaxis_title='Week of', yaxis_title='Number of Users')
pio.write_html(fig10, file='figures/fig10.html', auto_open=True)

Conclusions:

From this Cohort Analysis, we observe a decrease in engagement specific to our existing users in the 10+ weeks old account age group.
Looking at the engagement rate of new users, we don’t have any data to suggest that feature awareness is an issue.

6. Email Open and CTR

Yammer sends users two types of email to increase engagement:

Weekly digest emails (sent_weekly_digest)
Re-engagement emails (sent_reengagement_email)

We want to see if something has gone wrong with these emails by looking into their open and click-through rates.

q = """
SELECT DISTINCT action
FROM yammer_emails
"""

emails = pandasql.sqldf(q, globals())

emails.head()

	action
0	sent_weekly_digest
1	email_open
2	email_clickthrough
3	sent_reengagement_email

Note: Some preliminary analysis done on email_open rates suggest that >97% of all engagement emails sent are opened within the first hour. Furthermore, less than 2% of engagment emails are opened after the 24-hour mark. Because of this, we’ll be looking at email_open and email_clickthrough rates that occur within a 24-hour window.

q = """
SELECT 
  strftime('%Y-%m-%d', e.occurred_at, 'weekday 0', '-6 days') AS week_of,
  COUNT(CASE WHEN e.action = 'sent_weekly_digest' THEN e.user_id ELSE NULL END) AS emails_sent,
  COUNT(CASE WHEN e.action = 'sent_weekly_digest' THEN e1.user_id ELSE NULL END) AS emails_opened,
  COUNT(CASE WHEN e.action = 'sent_weekly_digest' THEN e2.user_id ELSE NULL END) AS emails_ct,
  COUNT(CASE WHEN e.action = 'sent_reengagement_email' THEN e.user_id ELSE NULL END) AS reengagements_sent,
  COUNT(CASE WHEN e.action = 'sent_reengagement_email' THEN e1.user_id ELSE NULL END) AS reengagements_opened,
  COUNT(CASE WHEN e.action = 'sent_reengagement_email' THEN e2.user_id ELSE NULL END) AS reengagements_ct
FROM yammer_emails e
LEFT JOIN yammer_emails e1
  ON e.user_id = e1.user_id
    AND e1.action = 'email_open'
    AND e1.occurred_at >= e.occurred_at
    AND e1.occurred_at < datetime(e.occurred_at, '+24 hours')
LEFT JOIN yammer_emails e2
  ON e.user_id = e2.user_id
    AND e2.action = 'email_clickthrough'
    AND e2.occurred_at >= e.occurred_at
    AND e2.occurred_at < datetime(e.occurred_at, '+24 hours')
WHERE e.occurred_at >= '2014-05-01 00:00:00'
AND e.occurred_at < '2014-09-01 00:00:00'
GROUP BY 1
ORDER BY 1
"""

email_counts = pandasql.sqldf(q, globals())

email_counts.head()

	week_of	emails_sent	emails_opened	emails_ct	reengagements_sent	reengagements_opened	reengagements_ct
0	2014-04-28	908	246	105	98	86	82
1	2014-05-05	2602	776	306	164	143	128
2	2014-05-12	2665	809	332	175	162	147
3	2014-05-19	2733	834	348	179	161	150
4	2014-05-26	2822	868	314	179	158	139

week_of = email_counts['week_of']
emails_sent = email_counts['emails_sent']
emails_opened = email_counts['emails_opened']
emails_ct = email_counts['emails_ct']
reengagements_sent = email_counts['reengagements_sent']
reengagements_opened = email_counts['reengagements_opened']
reengagements_ct = email_counts['reengagements_ct']

fig11 = go.Figure()
fig11.add_trace(go.Scatter(x=week_of, y=emails_sent, mode='lines', name='emails sent'))
fig11.add_trace(go.Scatter(x=week_of, y=emails_opened, mode='lines', name='emails opened'))
fig11.add_trace(go.Scatter(x=week_of, y=emails_ct, mode='lines', name='emails clicked-through'))
fig11.add_trace(go.Scatter(x=week_of, y=reengagements_sent, mode='lines', name='reengagements sent'))
fig11.add_trace(go.Scatter(x=week_of, y=reengagements_opened, mode='lines', name='reengagements opened'))
fig11.add_trace(go.Scatter(x=week_of, y=reengagements_ct, mode='lines', name='reengagements clicked-through'))
fig11.update_layout(title='Email Open and Click-Through Rates', xaxis_title='Week of', yaxis_title='Number of emails')
pio.write_html(fig11, file='figures/fig11.html', auto_open=True)

Conclusions:

Click-through rate appears to be OK (no sharp decline to indicate a broken feature).
While open rates appear to increase, a further look into the chart indicates that our click-through rate drops after July 28th. This divergent behavior suggests that our email content might not be relevant enough to our users.

Summary

From our analysis, we find that our mobile users are becoming less engaged with the platform. We also see email click-through rates decreasing, which tells us that our email reengagement and retention campaigns aren’t going so well. Our next step is to reach out to the Engineering and Marketing teams to check for potential issues (e.g. product releases, marketing campaigns) that have contributed to the decline in engageent. We’ve successfully ruled-out major bugs (e.g., platform outages, regional differences, signup funnel issues) and have a good understanding of what steps to take next (A/B testing, local segmentation, feature roll-backs, improved email campaigns).

Credit

The data and problem description was provided to us by Mode Analytics. The approach was based off the work of Jodi Zhang. A lot of time and energy was spent by all investigating each query and analysing goal and value metrics.

2021-05-26 00:00:00 -0700 - Written by Jonathan Logan Moran

Page design by Ankit Sultana