[ plotly  pandas  SQL  pandasql  yammer  case-studies  hypothesis-testing  data-visualisation  python  ipynb  ]

Investigating a Drop in User Engagement

In this study, we will be using a few common technologies:

Programming

# prerequisites
!pip install pandasql
!pip install plotly
# data tools
import pandas as pd
import pandasql					# for querying pandas dataframes
from pandasql import sqldf

# plotting tools
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio				# output Plotly charts to HMTL

Task overview

This notebook features the first case study in a series of three that dives into the Yammer dataset. For our analysis we will be considering Yammer’s core metrics (user engagement, retention and growth) and analyzing product-specific usage metrics (e.g. number of times someone views another user’s profile). For clarification, Yammer is a social network for the workplace whose primary goal is to drive better product and business decisions using data. While this specific dataset is fabricated due to privacy and security reasons, it is similar in structure to Yammer’s actual data. With all that said–let’s begin our study…

Collecting our dataset

emails_path = '../src/yammer_emails.csv'
events_path = '../src/yammer_events.csv'
users_path = '../src/yammer_users.csv'
yammer_emails = pd.read_csv(emails_path, sep=',', header=0)
yammer_emails
user_id occurred_at action user_type
0 0.0 2014-05-06 09:30:00 sent_weekly_digest 1.0
1 0.0 2014-05-13 09:30:00 sent_weekly_digest 1.0
2 0.0 2014-05-20 09:30:00 sent_weekly_digest 1.0
3 0.0 2014-05-27 09:30:00 sent_weekly_digest 1.0
4 0.0 2014-06-03 09:30:00 sent_weekly_digest 1.0
... ... ... ... ...
90384 18814.0 2014-08-31 12:12:26 email_open 3.0
90385 18814.0 2014-08-31 12:12:57 email_clickthrough 3.0
90386 18815.0 2014-08-31 13:39:56 sent_reengagement_email 2.0
90387 18815.0 2014-08-31 13:40:14 email_open 2.0
90388 18815.0 2014-08-31 13:40:47 email_clickthrough 2.0

90389 rows × 4 columns

yammer_events = pd.read_csv(events_path, sep=',', header=0)
yammer_events
user_id occurred_at event_type event_name location device user_type
0 10522.0 2014-05-02 11:02:39 engagement login Japan dell inspiron notebook 3.0
1 10522.0 2014-05-02 11:02:53 engagement home_page Japan dell inspiron notebook 3.0
2 10522.0 2014-05-02 11:03:28 engagement like_message Japan dell inspiron notebook 3.0
3 10522.0 2014-05-02 11:04:09 engagement view_inbox Japan dell inspiron notebook 3.0
4 10522.0 2014-05-02 11:03:16 engagement search_run Japan dell inspiron notebook 3.0
... ... ... ... ... ... ... ...
340827 18815.0 2014-08-31 13:41:46 engagement like_message Ireland dell inspiron notebook 2.0
340828 18815.0 2014-08-31 13:42:11 engagement home_page Ireland dell inspiron notebook 2.0
340829 18815.0 2014-08-31 13:42:43 engagement send_message Ireland dell inspiron notebook 2.0
340830 18815.0 2014-08-31 13:43:07 engagement home_page Ireland dell inspiron notebook 2.0
340831 18815.0 2014-08-31 13:43:42 engagement like_message Ireland dell inspiron notebook 2.0

340832 rows × 7 columns

yammer_users = pd.read_csv(users_path, sep=',', header=0)
yammer_users
user_id created_at company_id language activated_at state
0 0.0 2013-01-01 20:59:39 5737.0 english 2013-01-01 21:01:07 active
1 1.0 2013-01-01 13:07:46 28.0 english NaN pending
2 2.0 2013-01-01 10:59:05 51.0 english NaN pending
3 3.0 2013-01-01 18:40:36 2800.0 german 2013-01-01 18:42:02 active
4 4.0 2013-01-01 14:37:51 5110.0 indian 2013-01-01 14:39:05 active
... ... ... ... ... ... ...
19061 19061.0 2014-08-31 13:21:16 2156.0 chinese 2014-08-31 13:22:50 active
19062 19062.0 2014-08-31 19:21:23 7520.0 spanish NaN pending
19063 19063.0 2014-08-31 07:10:41 72.0 spanish 2014-08-31 07:12:09 active
19064 19064.0 2014-08-31 17:45:18 2.0 english NaN pending
19065 19065.0 2014-08-31 19:29:19 8352.0 italian NaN pending

19066 rows × 6 columns

The problem

It’s Tuesday, September 2, 2014. You’ve just been informed by the head of the Yammer Product team that user engagement is down. Here’s the chart of weekly active users that you’ve been provided:

q = """
SELECT strftime('%Y-%m-%d', occurred_at, 'weekday 0', '-6 days') AS week, COUNT(DISTINCT user_id) AS weekly_active_users
FROM yammer_events
WHERE occurred_at >= '2014-05-01 00:00:00'
AND occurred_at < '2014-09-01 00:00:00'
AND event_type = 'engagement'
GROUP BY 1
ORDER BY 1;
"""
weekly_active_users = pandasql.sqldf(q, globals())
week = weekly_active_users['week']
count_users = weekly_active_users['weekly_active_users']
fig1 = go.Figure()
fig1.add_trace(go.Scatter(x=week, y=count_users, mode='lines', name='num_users'))
fig1.update_layout(title='Weekly Active Users', xaxis_title='Week of', yaxis_title='Number of users')
pio.write_html(fig1, file='figures/fig1.html', auto_open=True)

The above chart shows the number of engaged users each week. Yammer defines engagement as having made some type of server call by interacting with the product (showed in the data as events of type engagement). Any point in this chart can be interpreted as “the number of users who logged at least one engagement event during the week starting on that date”.

Analysis

We are responsible for determining what caused the dip at the end of the chart shown above, and, if appropriate, recommending solutions for the problem.

In this study, we will be considering the following metrics:

1. Engagement by Factors of Time

“Do we see the engagement levels gradually decreasing over time or only a one time sudden decrease? Is the decrease specific to certain days of the week or certain times of the day?”

Daily Engagement Rate

# engagement rate = num of engagments / num of users
q = """
SELECT strftime('%Y-%m-%d', occurred_at) AS day, COUNT(*) * 1.00 / COUNT(DISTINCT user_id) AS engagement_rate
FROM yammer_events
WHERE occurred_at >= '2014-05-01 00:00:00'
AND occurred_at < '2014-09-01 00:00:00'
AND event_type = 'engagement'
GROUP BY 1
ORDER BY 1
"""
daily_engagement = pandasql.sqldf(q, globals())
daily_engagement.head()
day engagement_rate
0 2014-05-01 9.808874
1 2014-05-02 10.765363
2 2014-05-03 8.903448
3 2014-05-04 8.734177
4 2014-05-05 10.050584
day = daily_engagement['day']
engagement_rate = daily_engagement['engagement_rate']
fig2 = go.Figure()
fig2.add_trace(go.Scatter(x=day, y=engagement_rate, mode='lines', name='engagement_rate'))
fig2.update_layout(title='Daily Engagement', xaxis_title='Day', yaxis_title='Engagement rate')
pio.write_html(fig2, file='figures/fig2.html', auto_open=True)