plotly
pandas
SQL
pandasql
yammer
case-studies
feasibility-studies
hypothesis-testing
data-visualisation
python
ipynb
]
In this Part 2 of three case studies investigating Yammer–a social network for the workplace, we will be helping the product team determine priorities for their next development cycle. They are considering improving the site’s search functionality, but first need our help deciding whether or not to work on the project in the first place. Thus, it’s our job to consider the impact that search has on users, and see if there’s any specific improvements that can be made. We’ll be making recommendations to the development team, so it’s best that we form quantitative hypotheses that seek to understand the general quality of an individual user’s search experience.
Let’s get started…
In this study, we will be using a few common technologies:
# prerequisites
!pip install pandasql
!pip install plotly
# data tools
import pandas as pd # for querying pandas dataframes
import pandasql
from pandasql import sqldf
# plotting tools
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio # output charts to HMTL
#init_notebook_mode()
There are two tables that are relevant to this problem, yammer_events and yammer_users.
It is going to be crucial that we look at the following events in our events table:
search_autocomplete: this event is logged when a user clicks on a search option from autocomplete.search_run: this event is logged when a user runs a search and sees the results page.search_click_X: this event is logged when a user clicks on a search result X, which ranges from 1 to 10.events_path = '../src/yammer_events.csv'
yammer_events = pd.read_csv(events_path, sep=',', header=0)
yammer_events
| user_id | occurred_at | event_type | event_name | location | device | user_type | |
|---|---|---|---|---|---|---|---|
| 0 | 10522.0 | 2014-05-02 11:02:39 | engagement | login | Japan | dell inspiron notebook | 3.0 |
| 1 | 10522.0 | 2014-05-02 11:02:53 | engagement | home_page | Japan | dell inspiron notebook | 3.0 |
| 2 | 10522.0 | 2014-05-02 11:03:28 | engagement | like_message | Japan | dell inspiron notebook | 3.0 |
| 3 | 10522.0 | 2014-05-02 11:04:09 | engagement | view_inbox | Japan | dell inspiron notebook | 3.0 |
| 4 | 10522.0 | 2014-05-02 11:03:16 | engagement | search_run | Japan | dell inspiron notebook | 3.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 340827 | 18815.0 | 2014-08-31 13:41:46 | engagement | like_message | Ireland | dell inspiron notebook | 2.0 |
| 340828 | 18815.0 | 2014-08-31 13:42:11 | engagement | home_page | Ireland | dell inspiron notebook | 2.0 |
| 340829 | 18815.0 | 2014-08-31 13:42:43 | engagement | send_message | Ireland | dell inspiron notebook | 2.0 |
| 340830 | 18815.0 | 2014-08-31 13:43:07 | engagement | home_page | Ireland | dell inspiron notebook | 2.0 |
| 340831 | 18815.0 | 2014-08-31 13:43:42 | engagement | like_message | Ireland | dell inspiron notebook | 2.0 |
340832 rows × 7 columns
Here’s a re-hash of the above: Yammer’s product team is determining priorities for the next development cycle. They’re considering improving the site’s search functionality. Before they proceed, they want to know whether they should even work on search in the first place, and, if so, how they should modify it.
In particular, you should seek to answer the following questions:
In order to form our hypotheses (by way of metrics we will investigate), we need to use the above questions to better understand the ultimate purpose of search at Yammer. For this study, we will consider a session to be a string of events that occur without a 10-minute break between any two events. That goes to say that if an active users fails to log an event within a 10-minute window, their session is considered over and the next engagement will mark a new session.
The first thing to understand is whether anyone even uses search at all.
q = """
SELECT
week_of,
(with_autocompletes * 1.00 / sessions) * 100 AS perct_autocompletes,
(with_runs * 1.00 / sessions) * 100 AS perct_runs
FROM (
SELECT
strftime('%Y-%m-%d', session_start, 'weekday 0', '-6 days') AS week_of,
COUNT(*) sessions,
COUNT(CASE WHEN autocompletes > 0 THEN session ELSE NULL END) AS with_autocompletes,
COUNT(CASE WHEN runs > 0 THEN session ELSE NULL END) AS with_runs
FROM (
SELECT
strftime('%Y-%m-%d %H:%M:%S', session_start) AS session_start,
session,
user_id,
COUNT(CASE WHEN event_name = 'search_autocomplete' THEN user_id ELSE NULL END) AS autocompletes,
COUNT(CASE WHEN event_name = 'search_run' THEN user_id ELSE NULL END) AS runs,
COUNT(CASE WHEN event_name LIKE 'search_click_result_%' THEN user_id ELSE NULL END) AS clicks
FROM (
SELECT
e.*,
sessions.session,
sessions.session_start
FROM yammer_events e
LEFT JOIN (
SELECT
user_id,
session,
MIN(occurred_at) AS session_start,
MAX(occurred_at) AS session_end
FROM (
SELECT
intervals.*,
CASE WHEN prev_event >= 10.0 THEN id
WHEN prev_event IS NULL THEN id
ELSE LAG(id, 1) OVER (PARTITION BY user_id ORDER BY occurred_at)
END AS session
FROM (
SELECT
user_id,
occurred_at,
event_type,
event_name,
ROUND((JULIANDAY(occurred_at) - JULIANDAY(LAG(occurred_at, 1) OVER (PARTITION BY user_id ORDER BY occurred_at))) * 1440) AS prev_event,
ROUND((JULIANDAY(LEAD(occurred_at, 1) OVER (PARTITION BY user_id ORDER BY occurred_at)) - JULIANDAY(occurred_at)) * 1440) AS next_event,
ROW_NUMBER() OVER () AS id
FROM yammer_events
WHERE event_type = 'engagement'
ORDER BY user_id, occurred_at
) intervals
WHERE prev_event >= 10.0
OR next_event >= 10.0
OR prev_event IS NULL
OR next_event IS NULL
) bounds
GROUP BY 1, 2
) sessions
ON e.user_id = sessions.user_id
AND e.occurred_at >= sessions.session_start
AND e.occurred_at <= sessions.session_end
WHERE e.event_type = 'engagement'
) events
GROUP BY 1, 2, 3
) counts
GROUP BY 1
ORDER BY 1
) counts
GROUP BY 1
ORDER BY 1
"""
searches = pandasql.sqldf(q, globals())
week_of = searches['week_of']
perct_autocompletes = searches['perct_autocompletes']
perct_runs = searches['perct_runs']
fig1 = go.Figure()
fig1.add_trace(go.Scatter(x=week_of, y=perct_autocompletes, mode='lines', name='searches with autocompletes'))
fig1.add_trace(go.Scatter(x=week_of, y=perct_runs, mode='lines', name='searches with runs'))
fig1.update_layout(title='Search Rate by Week', xaxis_title='Week of', yaxis_title='Percent of sessions')
pio.write_html(fig1, file='2021-06-01-Understanding-Search-Functionality-fig1.html', auto_open=True)
q = """
SELECT
autocompletes,
COUNT(*) AS sessions
FROM (
SELECT
strftime('%Y-%m-%d %H:%M:%S', session_start) AS session_start,
session,
user_id,
COUNT(CASE WHEN event_name = 'search_autocomplete' THEN user_id ELSE NULL END) AS autocompletes,
COUNT(CASE WHEN event_name = 'search_run' THEN user_id ELSE NULL END) AS runs,
COUNT(CASE WHEN event_name LIKE 'search_click_result_%' THEN user_id ELSE NULL END) AS clicks
FROM (
SELECT
e.*,
session,
session_start
FROM yammer_events e
LEFT JOIN (
SELECT
user_id,
session,
MIN(occurred_at) AS session_start,
MAX(occurred_at) AS session_end
FROM (
SELECT
intervals.*,
CASE WHEN prev_event >= 10.0 THEN id
WHEN prev_event IS NULL THEN id
ELSE LAG(id, 1) OVER (PARTITION BY user_id ORDER BY occurred_at)
END AS session
FROM (
SELECT
user_id,
occurred_at,
event_type,
event_name,
ROUND((JULIANDAY(occurred_at) - JULIANDAY(LAG(occurred_at, 1) OVER (PARTITION BY user_id ORDER BY occurred_at))) * 1440) AS prev_event,
ROUND((JULIANDAY(LEAD(occurred_at, 1) OVER (PARTITION BY user_id ORDER BY occurred_at)) - JULIANDAY(occurred_at)) * 1440) AS next_event,
ROW_NUMBER() OVER () AS id
FROM yammer_events
WHERE event_type = 'engagement'
) intervals
WHERE prev_event >= 10.0
OR next_event >= 10.0
OR prev_event IS NULL
OR next_event IS NULL
) bounds
GROUP BY 1, 2
) sessions
ON e.user_id = sessions.user_id
AND e.occurred_at >= sessions.session_start
AND e.occurred_at <= sessions.session_end
WHERE e.event_type = 'engagement'
) events
GROUP BY 1, 2, 3
) counts
WHERE autocompletes > 0
GROUP BY 1
ORDER BY 1
"""
autocompletes = pandasql.sqldf(q, globals())
num_autocompletes = autocompletes['autocompletes']
num_sessions = autocompletes['sessions']
fig2 = go.Figure()
fig2.add_trace(go.Bar(x=num_autocompletes, y=num_sessions, text=num_sessions, textposition='auto', name='sessions with autocomplete'))
fig2.update_yaxes(nticks=6)
fig2.update_layout(title='Number of Sessions with Autocompletes', xaxis_title='Autocompletes per session', yaxis_title='Number of sessions')
pio.write_html(fig2, file='2021-06-01-Understanding-Search-Functionality-fig2.html', auto_open=True)