[ plotly  pandas  SQL  pandasql  yammer  case-studies  feasibility-studies  hypothesis-testing  data-visualisation  python  ipynb  ]

Understanding Search Functionality

In this Part 2 of three case studies investigating Yammer–a social network for the workplace, we will be helping the product team determine priorities for their next development cycle. They are considering improving the site’s search functionality, but first need our help deciding whether or not to work on the project in the first place. Thus, it’s our job to consider the impact that search has on users, and see if there’s any specific improvements that can be made. We’ll be making recommendations to the development team, so it’s best that we form quantitative hypotheses that seek to understand the general quality of an individual user’s search experience.

Let’s get started…

Programming

In this study, we will be using a few common technologies:

# prerequisites
!pip install pandasql
!pip install plotly
# data tools
import pandas as pd					# for querying pandas dataframes
import pandasql
from pandasql import sqldf
# plotting tools
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio					# output charts to HMTL
#init_notebook_mode()

Collecting our datasets

There are two tables that are relevant to this problem, yammer_events and yammer_users.

It is going to be crucial that we look at the following events in our events table:

events_path = '../src/yammer_events.csv'
yammer_events = pd.read_csv(events_path, sep=',', header=0)
yammer_events
user_id occurred_at event_type event_name location device user_type
0 10522.0 2014-05-02 11:02:39 engagement login Japan dell inspiron notebook 3.0
1 10522.0 2014-05-02 11:02:53 engagement home_page Japan dell inspiron notebook 3.0
2 10522.0 2014-05-02 11:03:28 engagement like_message Japan dell inspiron notebook 3.0
3 10522.0 2014-05-02 11:04:09 engagement view_inbox Japan dell inspiron notebook 3.0
4 10522.0 2014-05-02 11:03:16 engagement search_run Japan dell inspiron notebook 3.0
... ... ... ... ... ... ... ...
340827 18815.0 2014-08-31 13:41:46 engagement like_message Ireland dell inspiron notebook 2.0
340828 18815.0 2014-08-31 13:42:11 engagement home_page Ireland dell inspiron notebook 2.0
340829 18815.0 2014-08-31 13:42:43 engagement send_message Ireland dell inspiron notebook 2.0
340830 18815.0 2014-08-31 13:43:07 engagement home_page Ireland dell inspiron notebook 2.0
340831 18815.0 2014-08-31 13:43:42 engagement like_message Ireland dell inspiron notebook 2.0

340832 rows × 7 columns

The problem

Here’s a re-hash of the above: Yammer’s product team is determining priorities for the next development cycle. They’re considering improving the site’s search functionality. Before they proceed, they want to know whether they should even work on search in the first place, and, if so, how they should modify it.

Making a recommendation

In particular, you should seek to answer the following questions:

Analysis

In order to form our hypotheses (by way of metrics we will investigate), we need to use the above questions to better understand the ultimate purpose of search at Yammer. For this study, we will consider a session to be a string of events that occur without a 10-minute break between any two events. That goes to say that if an active users fails to log an event within a 10-minute window, their session is considered over and the next engagement will mark a new session.

1. Search Use

The first thing to understand is whether anyone even uses search at all.

Search Use Over Time

q = """
SELECT
  week_of,
  (with_autocompletes * 1.00 / sessions) * 100 AS perct_autocompletes,
  (with_runs * 1.00 / sessions) * 100 AS perct_runs
FROM (
      SELECT
        strftime('%Y-%m-%d', session_start, 'weekday 0', '-6 days') AS week_of,
        COUNT(*) sessions,
        COUNT(CASE WHEN autocompletes > 0 THEN session ELSE NULL END) AS with_autocompletes,
        COUNT(CASE WHEN runs > 0 THEN session ELSE NULL END) AS with_runs
      FROM (
            SELECT
              strftime('%Y-%m-%d %H:%M:%S', session_start) AS session_start,
              session,
              user_id,
              COUNT(CASE WHEN event_name = 'search_autocomplete' THEN user_id ELSE NULL END) AS autocompletes,
              COUNT(CASE WHEN event_name = 'search_run' THEN user_id ELSE NULL END) AS runs,
              COUNT(CASE WHEN event_name LIKE 'search_click_result_%' THEN user_id ELSE NULL END) AS clicks
            FROM (
                  SELECT
                    e.*,
                    sessions.session,
                    sessions.session_start
                  FROM yammer_events e
                  LEFT JOIN (
                              SELECT
                                user_id,
                                session,
                                MIN(occurred_at) AS session_start,
                                MAX(occurred_at) AS session_end
                              FROM (
                                    SELECT
                                      intervals.*,
                                      CASE WHEN prev_event >= 10.0 THEN id
                                           WHEN prev_event IS NULL THEN id
                                           ELSE LAG(id, 1) OVER (PARTITION BY user_id ORDER BY occurred_at)
                                           END AS session
                                    FROM (
                                          SELECT
                                            user_id,
                                            occurred_at,
                                            event_type,
                                            event_name,
                                            ROUND((JULIANDAY(occurred_at) - JULIANDAY(LAG(occurred_at, 1) OVER (PARTITION BY user_id ORDER BY occurred_at))) * 1440) AS prev_event,
                                            ROUND((JULIANDAY(LEAD(occurred_at, 1) OVER (PARTITION BY user_id ORDER BY occurred_at)) - JULIANDAY(occurred_at)) * 1440) AS next_event,
                                            ROW_NUMBER() OVER () AS id
                                          FROM yammer_events
                                          WHERE event_type = 'engagement'
                                          ORDER BY user_id, occurred_at
                                    ) intervals
                                    WHERE prev_event >= 10.0
                                       OR next_event >= 10.0
                                       OR prev_event IS NULL
                                       OR next_event IS NULL
                              ) bounds
                              GROUP BY 1, 2
                  ) sessions
                  ON e.user_id = sessions.user_id
                  AND e.occurred_at >= sessions.session_start
                  AND e.occurred_at <= sessions.session_end
                  WHERE e.event_type = 'engagement'
            ) events
            GROUP BY 1, 2, 3
      ) counts
      GROUP BY 1
      ORDER BY 1
) counts
GROUP BY 1
ORDER BY 1
"""
searches = pandasql.sqldf(q, globals())
week_of = searches['week_of']
perct_autocompletes = searches['perct_autocompletes']
perct_runs = searches['perct_runs']
fig1 = go.Figure()
fig1.add_trace(go.Scatter(x=week_of, y=perct_autocompletes, mode='lines', name='searches with autocompletes'))
fig1.add_trace(go.Scatter(x=week_of, y=perct_runs, mode='lines', name='searches with runs'))
fig1.update_layout(title='Search Rate by Week', xaxis_title='Week of', yaxis_title='Percent of sessions')
pio.write_html(fig1, file='2021-06-01-Understanding-Search-Functionality-fig1.html', auto_open=True)

Search Use Per Session

q = """
SELECT
  autocompletes,
  COUNT(*) AS sessions
FROM (
      SELECT
        strftime('%Y-%m-%d %H:%M:%S', session_start) AS session_start,
        session,
        user_id,
        COUNT(CASE WHEN event_name = 'search_autocomplete' THEN user_id ELSE NULL END) AS autocompletes,
        COUNT(CASE WHEN event_name = 'search_run' THEN user_id ELSE NULL END) AS runs,
        COUNT(CASE WHEN event_name LIKE 'search_click_result_%' THEN user_id ELSE NULL END) AS clicks
      FROM (
            SELECT
              e.*,
              session,
              session_start
            FROM yammer_events e
            LEFT JOIN (
                        SELECT
                          user_id,
                          session,
                          MIN(occurred_at) AS session_start,
                          MAX(occurred_at) AS session_end
                        FROM (
                              SELECT
                                intervals.*,
                                CASE WHEN prev_event >= 10.0 THEN id
                                     WHEN prev_event IS NULL THEN id
                                     ELSE LAG(id, 1) OVER (PARTITION BY user_id ORDER BY occurred_at)
                                     END AS session
                              FROM (
                                    SELECT
                                      user_id,
                                      occurred_at,
                                      event_type,
                                      event_name,
                                      ROUND((JULIANDAY(occurred_at) - JULIANDAY(LAG(occurred_at, 1) OVER (PARTITION BY user_id ORDER BY occurred_at))) * 1440) AS prev_event,
                                      ROUND((JULIANDAY(LEAD(occurred_at, 1) OVER (PARTITION BY user_id ORDER BY occurred_at)) - JULIANDAY(occurred_at)) * 1440) AS next_event,
                                      ROW_NUMBER() OVER () AS id
                                    FROM yammer_events
                                    WHERE event_type = 'engagement'
                              ) intervals
                              WHERE prev_event >= 10.0
                                 OR next_event >= 10.0
                                 OR prev_event IS NULL
                                 OR next_event IS NULL
                        ) bounds
                        GROUP BY 1, 2
            ) sessions
            ON e.user_id = sessions.user_id
            AND e.occurred_at >= sessions.session_start
            AND e.occurred_at <= sessions.session_end
            WHERE e.event_type = 'engagement'
      ) events
      GROUP BY 1, 2, 3
) counts
WHERE autocompletes > 0
GROUP BY 1
ORDER BY 1
"""
autocompletes = pandasql.sqldf(q, globals())
num_autocompletes = autocompletes['autocompletes']
num_sessions = autocompletes['sessions']
fig2 = go.Figure()
fig2.add_trace(go.Bar(x=num_autocompletes, y=num_sessions, text=num_sessions, textposition='auto', name='sessions with autocomplete'))
fig2.update_yaxes(nticks=6)
fig2.update_layout(title='Number of Sessions with Autocompletes', xaxis_title='Autocompletes per session', yaxis_title='Number of sessions')
pio.write_html(fig2, file='2021-06-01-Understanding-Search-Functionality-fig2.html', auto_open=True)