A/B testing at Glassdoor - Vikas Sabnani @vsabnani
We help people everywhere find jobs and companies they love: 24M members, 19M unique visitors, 7M content, 12M jobs
[Screenshot: a Glassdoor company page for Facebook - 501 reviews, "Experience of a lifetime", 93% of employees recommend this company to a friend, pros & cons from a Software Engineer in Menlo Park, a Marketing interview question ("What are you least proud of on your resume?"), and a Product Designer salary of $124K based on 36 employee salaries] 23M members, 18M unique visitors, 6M content, 12M jobs
we'll discuss + why test + types of a/b tests @ Glassdoor + conducting a test + learnings - dos & don'ts
why test? “The fascinating thing about intuition is that a fair percentage of the time it’s fabulously, gloriously, achingly wrong” John Quarto-von Tivadar, FutureNow
why test? because, on the internet, we can • We have a tendency to build a product for ourselves • Less time debating… more time building • Inspires us to think of wildly different ideas • Forces us to clearly define our goals & metrics • Kill HiPPO culture
types of A/B tests @ Glassdoor + Traditional split tests + Fake data tests + 404 tests + Fake HTML tests + ML weights tests
UI tweaks - examples and what we have learned Obama test
+4.5%
-7%
Split tests - what have we learnt: sometimes, less is more. "Sign-up or sign-in to access your resume" -6%
Split tests - what have we learnt: links should look like links. "Sign up" vs "Sign up" (same text, different link styling) +5% - but be careful
Split tests - what have we learnt: users are extremely averse to losses. "Be the first to get new jobs like these" vs "Do not miss new jobs like these" +3%
split tests - what have we learnt: social proof is powerful +22%
split tests - what have we learnt: social proof is powerful -3%
split tests - what have we learnt: free is a totally different price point
split tests - what have we learnt colors don’t matter (generally) Google’s “41 shades of blue” Test
Fake Data Tests
should we build a real-time data stream?
Real-time click activity
404 tests
404 tests "a good way for a consumer facing, web-based business to capture what your visitors really want is to run a live test with a non-working link" Stephen Kaufer, CEO, TripAdvisor
404 tests are great for small feature ideas View results on a map
Fake HTML Tests
Fake HTML tests "If you clicked on a 42Floors ad for new york office space a while back, there's an 89% chance that you landed on one of these eight versions of the site. They're all fake: just throwaway static HTML mockups" http://blog.42floors.com/we-test-fake-versions-of-our-site/
42Floors tried wildly different variations
Site redesigns are hard to test • Very expensive to code and maintain different variants of the site • Limits number of variations we can test • Cannot control for consistency in user experience • SEO implications are hard to predict and impossible to test
what we did
1. Made several variations of radically different concepts; created mockups and translated them to static HTML (with real data)
2. Chose 2 pages to focus on - the Overview & Salaries pages
3. Selected a single metric to assess performance - bounce rate
4. Iterate on the winning version through traditional split tests - to be launched
Learning ML weights through testing
Traditional ML systems [Diagram: site logs → data cleansing → training data (x, e) → machine learning estimator y = f(x, v, B) → predictive ML models (model structure & weights)]
Traditional ML systems - where the effort goes: data extraction & cleansing ~50%, feature extraction & model training ~40%, tuning & learning ~10%
1. Site data is messy and imperfect. Make best assumptions about user behavior
2. Site data is biased, and we cannot simulate a lab environment at scale. Focus on de-biasing
3. Significant effort is spent on estimating the best model structure, and then on training weights
4. Constantly learning through prediction error
An alternative ML system
Site data is imperfect → Start with a good-enough data set
Site data is biased → Fine. We'll test & measure in the same environment
Focus on estimating the best model structure and parameters → Parameters are more important than model structure. Start with a flexible structure. Estimate initial parameters
Constant learning through re-fitting → Constantly A/B test parameters and learn through MAB
Minimize prediction error → Maximize revenue or conversion on site
A reduced ML system [Diagram: site logs → data cleansing → training data → machine learning estimator y = f(x, v, B), outputting y_hat together with the weights B → predictive ML models (model structure & weights)]
A reduced ML system [Chart: performance vs time across the phases data extraction & cleansing, feature extraction & model training, and learning / tuning]
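A minimal sketch of the "constantly A/B test parameters and learn through MAB" idea from the alternative system above: Thompson sampling over a few candidate weight values, optimizing observed conversion instead of offline prediction error. The candidate weights and their conversion rates are made-up numbers, not Glassdoor's.

```python
import random

# Hypothetical candidate values for one ranking weight; the "true" conversion
# rates are invented here and unknown to the algorithm in practice.
candidates = [0.2, 0.5, 0.8]
true_conv = {0.2: 0.020, 0.5: 0.024, 0.8: 0.022}

wins = {c: 1 for c in candidates}     # Beta(1, 1) prior on each arm's conversion
losses = {c: 1 for c in candidates}

random.seed(0)
for _ in range(50_000):               # each iteration = one user served on site
    # Thompson sampling: draw a plausible conversion rate per arm, serve the best draw
    arm = max(candidates, key=lambda c: random.betavariate(wins[c], losses[c]))
    converted = random.random() < true_conv[arm]
    wins[arm] += converted
    losses[arm] += not converted

for c in candidates:
    served = (wins[c] - 1) + (losses[c] - 1)
    conv = (wins[c] - 1) / served if served else 0.0
    print(f"weight {c}: served {served:>6} users, observed conversion {conv:.2%}")
# Traffic concentrates on the best-converting weight without a separate offline fit
```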
types of A/B tests @ Glassdoor + Traditional split tests + Fake data tests + 404 tests + Fake HTML tests + ML weights tests
So… a/b testing can do just about anything, but don’t make a mess of it
of course there are skeptics… "The ultimate outcome of all A/B tests is porn" - someone on twitter
what’s an A/B test
A/B testing at a high level: Site traffic → Control (A) / Test (B) → Instrumentation & tracking → Analyze results
Analyzing results Control ~ N(mean = 2.0%, stdev = 0.2%, n) [Chart: a normal curve - ~16% of the mass lies beyond 1σ on one side, ~2.5% beyond 2σ on each side]
Analyzing results Control ~ N(mean = 2.0%, stdev = 0.2%, n = 1000) vs Test ~ N(mean = 2.1%, stdev = 0.2%, n = 1000) [Chart: the two overlapping normal curves]
Analyzing results
+ Difference in means: 2.1% - 2.0% = 0.1%. "The expected improvement from the Test treatment is 0.1% points"
+ Stdev of the difference: sqrt(s_c² + s_t²) = sqrt(0.2² + 0.2²) ≈ 0.283%. "The difference of 0.1% is within 1σ of the mean, so it is not statistically significant"
+ For most site metrics, the decision variable is a Bernoulli (e.g. did the user buy? did the user bounce?). For large n, the sample mean of a Bernoulli variable is approximately Normal
+ Mean of a Bernoulli distribution (m) = simple average
+ Stdev (s) = sqrt(m·(1-m)/n)
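A minimal Python sketch of the arithmetic above (illustrative, not Glassdoor's tooling), using the slide's example numbers:

```python
from math import sqrt

def diff_vs_control(mean_c, s_c, mean_t, s_t):
    """Difference in means and the stdev of that difference.
    For a Bernoulli metric, each bucket's stdev is sqrt(m * (1 - m) / n)."""
    diff = mean_t - mean_c                 # expected improvement
    s_diff = sqrt(s_c**2 + s_t**2)         # stdev of the difference
    return diff, s_diff, diff / s_diff     # lift, its stdev, lift in sigmas

# Slide example: Control ~ N(2.0%, 0.2%), Test ~ N(2.1%, 0.2%)
diff, s_diff, sigmas = diff_vs_control(0.020, 0.002, 0.021, 0.002)
print(f"lift = {diff:.2%}, stdev of lift = {s_diff:.3%}, lift in sigmas = {sigmas:.2f}")
# lift = 0.10%, stdev of lift = 0.283%, ~0.35 sigma -> not statistically significant
```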
conducting a test
Conducting a Test 1. Clearly state your hypotheses ex. "By adding this feature, we expect conversions to improve & user experience to not be worse"
Conducting a test 2. State your metrics and goals
metrics - "this should generally be 'desired action' / input"
conversion = purchases / users
user experience = ??
goals - "what is the minimum improvement you'd like to see to make this worth building and testing"
improve conversion by 5%
Conducting a test 3. Define granularity of analysis Ex. we'll break results by country or by new vs repeat "The more we slice and dice, the more data we need to collect"
Conducting a test 4. Define α. α = probability that you will incorrectly adopt the test treatment. Choice of α depends on - (a) how much impact an incorrect choice would have on the business, (b) how difficult it is to find a good alternative, (c) how many test variants you will run
Conducting a test 4. α - how difficult is it to find a good alternative? Out of 100 tests conducted, suppose 20 have a truly good test variant and 80 do not. What we'll conclude: 18 of the 20 good variants win (2 are missed), while 8 of the 80 bad variants falsely win (72 correctly lose). "~30% (8/26) of treatments we'll adopt in production will be bad"
Conducting a test 4. α - how many variants are you testing? Assuming none of them is any better than control: P(test-1 wins) = 5%, P(test-2 wins) = 5%, P(one of them wins) = 1 - P(none wins) = 1 - 0.95 × 0.95 = 9.75%. "There's almost a 2x chance that we'll replace Control with one of the Test treatments." With 5 variants: ~23% chance of a type-1 error. With 40 variants: ~87% chance of a type-1 error.
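The variant math above is a one-liner; a quick sketch:

```python
def familywise_error(alpha, variants):
    """P(at least one variant falsely beats control) = 1 - (1 - alpha)^k."""
    return 1 - (1 - alpha) ** variants

for k in (2, 5, 40):
    print(f"{k:>2} variants: {familywise_error(0.05, k):.1%} chance of a type-1 error")
# 2 variants: 9.8%, 5 variants: 22.6%, 40 variants: 87.1%
```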
Conducting a test 5. Determine your test duration
What segments do you want to exclude? US only; organic only
What % of traffic do you want to include? 20%
What is the baseline conversion rate? 2.5%
What is the minimum improvement you'd like to see? 3%
How confident do you want to be before rejecting the Control treatment? 95%
How many treatments will you run? 5
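A hedged sketch of turning those inputs into a duration estimate, using the standard two-proportion sample-size formula with a Bonferroni correction for the number of treatments; the 80% power and the daily-traffic figure are assumptions the slide doesn't state:

```python
from statistics import NormalDist

def users_per_bucket(baseline, rel_mde, alpha=0.05, power=0.80, treatments=1):
    """Rough two-proportion sample size per bucket, with a Bonferroni
    correction for the number of test treatments."""
    z = NormalDist().inv_cdf
    delta = baseline * rel_mde                    # absolute lift we want to detect
    z_a = z(1 - (alpha / treatments) / 2)         # two-sided, corrected alpha
    z_b = z(power)
    return (z_a + z_b) ** 2 * 2 * baseline * (1 - baseline) / delta ** 2

def duration_days(n_per_bucket, treatments, daily_eligible, traffic_share):
    """Days needed to fill the control plus all test buckets from included traffic."""
    total_users = n_per_bucket * (treatments + 1)
    return total_users / (daily_eligible * traffic_share)

# Slide inputs: baseline 2.5%, minimum improvement 3%, 95% confidence, 5 treatments,
# 20% of traffic. daily_eligible is a placeholder -- plug in your own eligible count.
n = users_per_bucket(0.025, 0.03, alpha=0.05, treatments=5)
print(f"~{n:,.0f} users per bucket")
print(f"~{duration_days(n, treatments=5, daily_eligible=500_000, traffic_share=0.20):.0f} days")
```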
now… things to do and not
do - read results correctly
                     Control    Test-1    Test-2
# Visitors           100,000    100,000   100,000
# Transactions         2,000      2,200     2,080
Conversion rate        2.00%      2.20%     2.08%
est. s                0.044%     0.046%    0.045%
Conv. increase                     0.2%     0.08%
P (test > control)               99.9%*    89.7%
We can say -
+ We are more than 95% confident that Test-1 is better than Control
+ We are not 95% confident that Test-2 is better than Control
+ We are only 50% confident that Test-1 will increase conversion by 0.2% points
+ Our 50/50 estimate from adopting Test-1 is a 0.2% point improvement
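A quick check that reproduces the P(test > control) column above, assuming per-visitor conversion is Bernoulli:

```python
from math import sqrt
from statistics import NormalDist

def p_test_beats_control(conv_c, n_c, conv_t, n_t):
    """P(test > control) under a normal approximation to the difference in means."""
    s_c = sqrt(conv_c * (1 - conv_c) / n_c)
    s_t = sqrt(conv_t * (1 - conv_t) / n_t)
    z = (conv_t - conv_c) / sqrt(s_c**2 + s_t**2)
    return NormalDist().cdf(z)

print(f"Test-1: {p_test_beats_control(0.0200, 100_000, 0.0220, 100_000):.1%}")  # ~99.9%
print(f"Test-2: {p_test_beats_control(0.0200, 100_000, 0.0208, 100_000):.1%}")  # ~89.7%
```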
do - run the test for its full duration Once you decide the duration upfront, let the test run its full course + there is lots of noise up front + chances are, not all segments of the population have been properly represented
do - run the test for its full duration [Chart: cumulative win rate from random draws of a blackjack game {P(win) = 48.5%} - "Wait, we did it!", "Kill it!", "Oh no, it's down again!"]
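A tiny simulation in the same spirit (illustrative, not the chart's actual data): the true win probability is fixed at 48.5%, yet the running average swings widely for small samples.

```python
import random

random.seed(7)
P_WIN = 0.485                      # fixed true win rate for every draw
wins, history = 0, []
for n in range(1, 5001):
    wins += random.random() < P_WIN
    history.append(wins / n)       # cumulative observed win rate after n hands

for n in (50, 200, 1000, 5000):
    print(f"after {n:>4} hands: observed win rate = {history[n - 1]:.1%}")
# Early readings swing well above and below 48.5%; only the long run settles down
```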
do - be wary of tests that degrade over time [Chart: results from an A/A test - an apparently "significant?" difference early on] A large majority of tests eventually regress to the mean
do not - change treatment sizes midway
Changing bucket sizes midway changes the behavior of the test
Initially, 100 users are split 20 / 80 between the Test bucket (a new light-green Glassdoor) and the regular green site; of these, 4 Test users and 16 regular-green users return as repeat visitors
"WOW - Light Green is so much better. Let's bump it to 50%"
After the bump, new traffic splits 40 / 40 while repeat users stick to their original buckets:
Test: 4 + 40 = 44 users, 4 repeat users (~9% of total)
Regular green: 16 + 40 = 56 users, 16 repeat users (~29% of total)
do not - slice and dice data to find winners Stick to the grain you defined upfront. If you do find a grain that "appears" to win - retest at that level
we’ll discuss + why test + types of a/b tests @ Glassdoor + conducting a test + learnings - dos & don’t’s
what a/b testing isn't
• A/B testing is not a substitute for basic research or user testing. It perfects it
• Testing does not define strategy or direction. It helps get there faster and more efficiently
• A/B testing does not replace ignorance. It replaces ambiguity
• Not an excuse to test everything. Be curious, not indecisive
• Not a tool to piss off users
darwin - our internal A/B test framework Java-based framework • Population Selection • Treatment Allocation - Ensures stickiness, unbiased randomization, ramp-up & down • Multivariate testing and independent experiments • Bootstrapping & Logging through Google Analytics
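Darwin's code isn't shown in the deck, but sticky, unbiased treatment allocation is commonly done by hashing the user and experiment together; a minimal Python sketch of that idea (the function and keys are illustrative, not Darwin's actual API):

```python
import hashlib

def assign_bucket(user_id: str, experiment: str, weights: dict) -> str:
    """Deterministically map (user, experiment) to a treatment.
    The same user always lands in the same bucket (stickiness), and
    different experiments hash independently of each other."""
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    point = int(digest[:8], 16) / 0xFFFFFFFF          # uniform point in [0, 1]
    cumulative = 0.0
    for treatment, share in weights.items():
        cumulative += share
        if point <= cumulative:
            return treatment
    return "control"                                   # guard against rounding

print(assign_bucket("user-123", "obama-test", {"control": 0.8, "light-green": 0.2}))
print(assign_bucket("user-123", "obama-test", {"control": 0.8, "light-green": 0.2}))  # same bucket
```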
Vikas Sabnani @vsabnani www.glassdoor.com/careers