plotly
pandas
SQL
pandasql
yammer
case-studies
hypothesis-testing
data-visualisation
python
ipynb
]
In this study, we will be using a few common technologies:
# prerequisites
!pip install pandasql
!pip install plotly
# data tools
import pandas as pd
import pandasql # for querying pandas dataframes
from pandasql import sqldf
# plotting tools
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio # output Plotly charts to HMTL
This notebook features the first case study in a series of three that dives into the Yammer dataset. For our analysis we will be considering Yammer’s core metrics (user engagement, retention and growth) and analyzing product-specific usage metrics (e.g. number of times someone views another user’s profile). For clarification, Yammer is a social network for the workplace whose primary goal is to drive better product and business decisions using data. While this specific dataset is fabricated due to privacy and security reasons, it is similar in structure to Yammer’s actual data. With all that said–let’s begin our study…
emails_path = '../src/yammer_emails.csv'
events_path = '../src/yammer_events.csv'
users_path = '../src/yammer_users.csv'
yammer_emails = pd.read_csv(emails_path, sep=',', header=0)
yammer_emails
| user_id | occurred_at | action | user_type | |
|---|---|---|---|---|
| 0 | 0.0 | 2014-05-06 09:30:00 | sent_weekly_digest | 1.0 |
| 1 | 0.0 | 2014-05-13 09:30:00 | sent_weekly_digest | 1.0 |
| 2 | 0.0 | 2014-05-20 09:30:00 | sent_weekly_digest | 1.0 |
| 3 | 0.0 | 2014-05-27 09:30:00 | sent_weekly_digest | 1.0 |
| 4 | 0.0 | 2014-06-03 09:30:00 | sent_weekly_digest | 1.0 |
| ... | ... | ... | ... | ... |
| 90384 | 18814.0 | 2014-08-31 12:12:26 | email_open | 3.0 |
| 90385 | 18814.0 | 2014-08-31 12:12:57 | email_clickthrough | 3.0 |
| 90386 | 18815.0 | 2014-08-31 13:39:56 | sent_reengagement_email | 2.0 |
| 90387 | 18815.0 | 2014-08-31 13:40:14 | email_open | 2.0 |
| 90388 | 18815.0 | 2014-08-31 13:40:47 | email_clickthrough | 2.0 |
90389 rows × 4 columns
yammer_events = pd.read_csv(events_path, sep=',', header=0)
yammer_events
| user_id | occurred_at | event_type | event_name | location | device | user_type | |
|---|---|---|---|---|---|---|---|
| 0 | 10522.0 | 2014-05-02 11:02:39 | engagement | login | Japan | dell inspiron notebook | 3.0 |
| 1 | 10522.0 | 2014-05-02 11:02:53 | engagement | home_page | Japan | dell inspiron notebook | 3.0 |
| 2 | 10522.0 | 2014-05-02 11:03:28 | engagement | like_message | Japan | dell inspiron notebook | 3.0 |
| 3 | 10522.0 | 2014-05-02 11:04:09 | engagement | view_inbox | Japan | dell inspiron notebook | 3.0 |
| 4 | 10522.0 | 2014-05-02 11:03:16 | engagement | search_run | Japan | dell inspiron notebook | 3.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 340827 | 18815.0 | 2014-08-31 13:41:46 | engagement | like_message | Ireland | dell inspiron notebook | 2.0 |
| 340828 | 18815.0 | 2014-08-31 13:42:11 | engagement | home_page | Ireland | dell inspiron notebook | 2.0 |
| 340829 | 18815.0 | 2014-08-31 13:42:43 | engagement | send_message | Ireland | dell inspiron notebook | 2.0 |
| 340830 | 18815.0 | 2014-08-31 13:43:07 | engagement | home_page | Ireland | dell inspiron notebook | 2.0 |
| 340831 | 18815.0 | 2014-08-31 13:43:42 | engagement | like_message | Ireland | dell inspiron notebook | 2.0 |
340832 rows × 7 columns
yammer_users = pd.read_csv(users_path, sep=',', header=0)
yammer_users
| user_id | created_at | company_id | language | activated_at | state | |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 2013-01-01 20:59:39 | 5737.0 | english | 2013-01-01 21:01:07 | active |
| 1 | 1.0 | 2013-01-01 13:07:46 | 28.0 | english | NaN | pending |
| 2 | 2.0 | 2013-01-01 10:59:05 | 51.0 | english | NaN | pending |
| 3 | 3.0 | 2013-01-01 18:40:36 | 2800.0 | german | 2013-01-01 18:42:02 | active |
| 4 | 4.0 | 2013-01-01 14:37:51 | 5110.0 | indian | 2013-01-01 14:39:05 | active |
| ... | ... | ... | ... | ... | ... | ... |
| 19061 | 19061.0 | 2014-08-31 13:21:16 | 2156.0 | chinese | 2014-08-31 13:22:50 | active |
| 19062 | 19062.0 | 2014-08-31 19:21:23 | 7520.0 | spanish | NaN | pending |
| 19063 | 19063.0 | 2014-08-31 07:10:41 | 72.0 | spanish | 2014-08-31 07:12:09 | active |
| 19064 | 19064.0 | 2014-08-31 17:45:18 | 2.0 | english | NaN | pending |
| 19065 | 19065.0 | 2014-08-31 19:29:19 | 8352.0 | italian | NaN | pending |
19066 rows × 6 columns
It’s Tuesday, September 2, 2014. You’ve just been informed by the head of the Yammer Product team that user engagement is down. Here’s the chart of weekly active users that you’ve been provided:
q = """
SELECT strftime('%Y-%m-%d', occurred_at, 'weekday 0', '-6 days') AS week, COUNT(DISTINCT user_id) AS weekly_active_users
FROM yammer_events
WHERE occurred_at >= '2014-05-01 00:00:00'
AND occurred_at < '2014-09-01 00:00:00'
AND event_type = 'engagement'
GROUP BY 1
ORDER BY 1;
"""
weekly_active_users = pandasql.sqldf(q, globals())
week = weekly_active_users['week']
count_users = weekly_active_users['weekly_active_users']
fig1 = go.Figure()
fig1.add_trace(go.Scatter(x=week, y=count_users, mode='lines', name='num_users'))
fig1.update_layout(title='Weekly Active Users', xaxis_title='Week of', yaxis_title='Number of users')
pio.write_html(fig1, file='figures/fig1.html', auto_open=True)
The above chart shows the number of engaged users each week. Yammer defines engagement as having made some type of server call by interacting with the product (showed in the data as events of type engagement). Any point in this chart can be interpreted as “the number of users who logged at least one engagement event during the week starting on that date”.
We are responsible for determining what caused the dip at the end of the chart shown above, and, if appropriate, recommending solutions for the problem.
In this study, we will be considering the following metrics:
“Do we see the engagement levels gradually decreasing over time or only a one time sudden decrease? Is the decrease specific to certain days of the week or certain times of the day?”
# engagement rate = num of engagments / num of users
q = """
SELECT strftime('%Y-%m-%d', occurred_at) AS day, COUNT(*) * 1.00 / COUNT(DISTINCT user_id) AS engagement_rate
FROM yammer_events
WHERE occurred_at >= '2014-05-01 00:00:00'
AND occurred_at < '2014-09-01 00:00:00'
AND event_type = 'engagement'
GROUP BY 1
ORDER BY 1
"""
daily_engagement = pandasql.sqldf(q, globals())
daily_engagement.head()
| day | engagement_rate | |
|---|---|---|
| 0 | 2014-05-01 | 9.808874 |
| 1 | 2014-05-02 | 10.765363 |
| 2 | 2014-05-03 | 8.903448 |
| 3 | 2014-05-04 | 8.734177 |
| 4 | 2014-05-05 | 10.050584 |
day = daily_engagement['day']
engagement_rate = daily_engagement['engagement_rate']
fig2 = go.Figure()
fig2.add_trace(go.Scatter(x=day, y=engagement_rate, mode='lines', name='engagement_rate'))
fig2.update_layout(title='Daily Engagement', xaxis_title='Day', yaxis_title='Engagement rate')
pio.write_html(fig2, file='figures/fig2.html', auto_open=True)