Learning Cheap and Novel Flight Itineraries - Dima Karamshuk Skyscanner - AI Ukraine

Page created by Crystal Lucas


Learning Cheap and Novel
Flight Itineraries
Dima Karamshuk
Skyscanner

How much time do you spend choosing a flight?

Planning a Trip

European travellers spend 3.5h on average finding a perfect flight, often longer than the flight itself (https://goo.gl/74CivT)

Planning a Trip

How many of you choose an airline by price?

Planning a Trip

37% of users choose an airline by competitive price; more want to see the cheapest price for comparison (https://goo.gl/8UX3vx)

Skyscanner

Skyscanner
in a Nutshell

[Diagram: Skyscanner partners (Airlines, Travel Agents, Online Travel Agents, Global Distribution Systems) and the End User]

Each user search triggers dozens or hundreds of requests to partners, resulting in a total of 7B quotes per day

• Repeated requests return the same price with 85% probability

Caching
Quotes

[Diagram: a Quote Cache between the partners (Airlines, Travel Agents, Online Travel Agents, Global Distribution Systems) and the End User]

Strong case for caching quotes:
• reduced load on partners
• faster results to end users

Problem
with Caching

Prices change dynamically, so caching may introduce inaccuracies

Analysis

[Plot: bookings vs. price accuracy]

Bookings drop significantly even if the prices are slightly inaccurate

Caching
Trade-off

Load on Partners and Response Time vs. Accuracy of Quotes

Optimal trade-off: update prices only, and always, when they change

Erlang’s Loss
Model

[Timeline of requests: cached, cached, not cached]

u – TTL of the cached quote
r – frequency of requests

$$\text{hit ratio} = \frac{r \cdot u}{1 + r \cdot u}$$

assuming “memoryless” Poisson arrivals

Simple
Strategy

[Plot: cache hit ratio vs. TTL * frequency; a large decrease of TTL gives only a little decrease of cache hit ratio]

Example:
• u = 8h, r = 1/h, hit ratio = 88%
• if we decrease TTL by half (u = 4h), the hit ratio decreases by only 8%
• at the same time we decrease (by half?) the average age of cached quotes served to users
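
The example can be checked numerically. This is a minimal sketch, assuming the hit ratio from the Erlang's Loss Model slide is r·u / (1 + r·u); the function name `hit_ratio` is illustrative and not from the talk.

```python
def hit_ratio(ttl_hours: float, requests_per_hour: float) -> float:
    """Cache hit ratio r*u / (1 + r*u) for TTL u and request frequency r."""
    load = requests_per_hour * ttl_hours
    return load / (1.0 + load)


if __name__ == "__main__":
    # Compare the slide's two TTL settings at one request per hour.
    for ttl in (8.0, 4.0):
        ratio = hit_ratio(ttl, 1.0)
        print(f"u = {ttl:.0f}h, r = 1/h -> hit ratio = {ratio:.1%}")
```

Under this formula the two settings give roughly 89% and 80%, in line with the slide's figures: halving the TTL costs less than ten points of hit ratio.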

Price Volatility
Not Easy

[Plots: From Moscow to Barcelona; From Moscow to Bangkok]

Predicting
Price Volatility

1. Approach N1: constant cache expiry times
   • simple to implement
   • does not accurately model price volatility
2. Approach N2: emulate pricing models of each individual partner
   • pricing models of some airlines are incredibly complex
3. Approach N3: machine learning approach
   • best trade-off between simplicity and accuracy

Model
Performance

Data Science
Product Cycle

Formulating Machine Learning Problem
Offline Evaluation

Product Cycle

Data Acquisition
Data Insights
Formulating Machine Learning Problem
Offline Evaluation
Production Prototype
Online Evaluation
Production
Model Versioning

Formulating Machine Learning Problem and Offline Evaluation: only 20% of work

Data Science
Structure

               + Great autonomy                + Ensured utilization
               - Risk of marginalization       - Lesser autonomy, focus on second-class tasks

                                       https://goo.gl/5cdPjP

Hybrid
Structures

• part-time embedded, part-time autonomous (https://goo.gl/WJv8TR)
• clusters of embedded data scientists focused on the same goal (https://goo.gl/mtQvyn)

New vs.
Optimizing
Old Features

• it’s easier to build a new ML feature than to optimize what already works OK

Second Try:
Constructed
Itineraries

Constructing mixed-carrier itineraries

[Diagram: legs Dublin to Barcelona and Barcelona to New York, with carriers Virgin, Ryanair, Aer Lingus, Delta and American]

Competitive
Itineraries

Competitive itineraries are the ones in the Top-10 cheapest search results

Potentially cheaper itineraries in more than half of all search results

Problem

• Combinations require more queries to ticket providers
• Most variants are not competitive

[Diagram: Traveler and ticket providers (Airlines, Online Travel Agencies, Global Distribution Systems)]

Solution: only choose combinations which are likely to be competitive

Logging

[Figure: first page results, distinguishing what users click on, what users see but do not click on, and what users do not see]

• it is important to log negative samples
• it is important to allow exploration
• non-trivial data pre-processing ETL jobs are needed, along with robust querying interfaces (e.g., Athena)

Competitive
Combinations

Tips for booking your next flight
• good for last minute booking
• average savings of 9% on return ticket
• 90% of competitive combinations are from top-30% airlines
• good deals when flying from US, UK, Spain, Germany, Italy and other origins

Metrics

Coverage: how many of all possible cheap itineraries we recall

Cost: how many queries for flight quotes are required
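
One way the two metrics could be computed for a given decision threshold is sketched below. The data layout (a score per partner combination and a set of combinations known to be competitive), the threshold rule, and all names are assumptions made for illustration; the talk does not specify them.

```python
from typing import Dict, Set, Tuple

Combination = Tuple[str, str]  # hypothetical (partner X, partner Y) pair


def coverage_and_cost(
    scores: Dict[Combination, float],
    competitive: Set[Combination],
    threshold: float,
) -> Tuple[float, float]:
    """Return (coverage, cost) when querying only combinations scored >= threshold.

    coverage: share of competitive combinations that were queried
    cost: share of all candidate combinations that were queried
    """
    queried = {combo for combo, score in scores.items() if score >= threshold}
    coverage = len(queried & competitive) / len(competitive) if competitive else 0.0
    cost = len(queried) / len(scores) if scores else 0.0
    return coverage, cost


if __name__ == "__main__":
    # Made-up scores for four candidate combinations, two of them competitive.
    scores = {("A", "B"): 0.9, ("A", "C"): 0.2, ("B", "C"): 0.7, ("C", "D"): 0.1}
    competitive = {("A", "B"), ("C", "D")}
    print(coverage_and_cost(scores, competitive, threshold=0.5))
```

Sweeping the threshold from high to low traces out a coverage-versus-cost curve, which is a convenient way to compare classifiers against a baseline.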

Supervised
Learning

Classify whether, for a query Q, a combination of partners (X and Y) is going to be in the Top-10 search results

Dataset
• sample all possible combinations for a share of searches
• collect examples of competitive and non-competitive combinations

Use your favorite classifier
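
The labelling step for such a dataset might look like the sketch below: a combination is a positive example if it lands among the Top-10 cheapest results of its search. The `Quote` record and its fields are hypothetical; the actual log schema is not given in the slides.

```python
from typing import Dict, List, NamedTuple


class Quote(NamedTuple):
    itinerary_id: str
    price: float
    is_combination: bool  # True for a constructed mixed-partner itinerary


def label_competitive(results: List[Quote], top_n: int = 10) -> Dict[str, int]:
    """Map each combination itinerary to 1 if it is among the top_n cheapest results, else 0."""
    cheapest = sorted(results, key=lambda quote: quote.price)[:top_n]
    top_ids = {quote.itinerary_id for quote in cheapest}
    return {
        quote.itinerary_id: int(quote.itinerary_id in top_ids)
        for quote in results
        if quote.is_combination
    }


if __name__ == "__main__":
    # Toy search with three results; top_n is lowered to 2 to keep it small.
    search = [
        Quote("direct-1", 100.0, False),
        Quote("combo-1", 90.0, True),
        Quote("combo-2", 200.0, True),
    ]
    print(label_competitive(search, top_n=2))
```

Non-combination results still take up slots in the ranking, so a combination is only labelled positive if it beats enough of them on price.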

Supervised
Learning

[Plot: Coverage vs. Cost, showing a perfect predictor and a heuristic-based baseline]

Supervised
Learning

In practice

[Plot: Coverage vs. Cost; 45% coverage at 5% cost]

Tree ensembles (Random Forest) achieve good performance

Lean
Prototyping

               simple model trained in a Jupyter Notebook
               very hacky setup in production on a tiny share of traffic
               proved the value of ML optimization

Model
Staleness

[Plot: everyday re-training vs. one-off training]

Model performance goes stale, hence the model needs to be updated regularly

Production
Pipeline

[Diagram: Serving Component and Training Component (AWS CF + AWS Data Pipeline)]
• Skyscanner Traffic: 90% → Current Model, 5% → Data Collection, 5% → Experiments with Challenger Model
• Data Collection (Apache Kafka) → Data Archive (AWS S3) → Data Querying (AWS Athena) → Pre-processing
• Training Data (7 recent days) → Model Training (scikit-learn)
• Validation Data (5% of the last day) → Model Validation
• Passed? → Update Model; otherwise → Report Failure

• re-train the model every day to counter model drift
• run on a single large machine vs. distributed cluster

                      •   sample all possible combinations on 5% of users’ traffic

             •     update the model if it passes the tests and serve it to 90% of the users
                         • leave 5% for A/B experiments with better models
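
A sketch of how such a 90% / 5% / 5% traffic split and the "Passed?" gate might be wired up. The hash-based bucketing, the bucket names, and the `should_promote` threshold rule are assumptions for illustration, not the pipeline described in the talk.

```python
import hashlib

# Bucket boundaries follow the 90% / 5% / 5% split on the slide;
# the bucket names and the hashing scheme are assumptions.
BUCKETS = (
    (90, "current_model"),
    (95, "challenger_experiment"),
    (100, "data_collection"),
)


def assign_bucket(user_id: str) -> str:
    """Deterministically map a user to a traffic bucket via a stable hash."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    slot = int(digest, 16) % 100  # integer in 0..99
    for upper_bound, name in BUCKETS:
        if slot < upper_bound:
            return name
    raise AssertionError("unreachable: slot is always below 100")


def should_promote(validation_score: float, min_score: float) -> bool:
    """Gate a freshly trained model: promote it only if it passes validation."""
    return validation_score >= min_score


if __name__ == "__main__":
    print(assign_bucket("user-42"))
    print(should_promote(validation_score=0.47, min_score=0.45))
```

Hashing the user identifier, rather than drawing a random number per request, keeps each user in the same bucket across searches, which matters when comparing the current and challenger models.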

Temporal
Stability

[Plot: (Origin, Destination, Provider) rules]

We need a mechanism to control temporal stability of the model
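
The talk does not say which mechanism was used. One simple way to at least measure stability, shown here purely as an assumption, is the Jaccard similarity between the sets of (origin, destination, provider) rules selected by the models of two consecutive days; the airport codes and provider names below are placeholders.

```python
from typing import Set, Tuple

Rule = Tuple[str, str, str]  # (origin, destination, provider)


def jaccard(rules_a: Set[Rule], rules_b: Set[Rule]) -> float:
    """Jaccard similarity |A & B| / |A | B| of two rule sets (1.0 if both are empty)."""
    union = rules_a | rules_b
    if not union:
        return 1.0
    return len(rules_a & rules_b) / len(union)


if __name__ == "__main__":
    yesterday = {("DUB", "BCN", "provider-1"), ("BCN", "JFK", "provider-2")}
    today = {("DUB", "BCN", "provider-1"), ("DUB", "BCN", "provider-3")}
    print(f"rule overlap between days: {jaccard(yesterday, today):.2f}")
```

A value near 1 means the daily re-trained model keeps selecting the same rules; a sharp drop could be used as a signal to hold back a model update.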

Online
Experiments

+ Test in the real world
+ Benchmark in equal conditions
- Some things are difficult to A/B test
- Online experiments might be expensive

Travelers First

• 45% of all competitive combinations for only 5% of the cost
• 22% of search results with cheaper itineraries
• 20% rel. increase in bookings on combination itineraries
• 0.74% rel. increase in user retention

Feature
Engineering

Can we improve performance with smart feature engineering?

Location          One-hot encoding   Better encoding (London, European, Trans-Atlantic, ...)
London Gatwick    [1 0 0 … 0]        [1.0 0.9 0.9 ...]
London Stansted   [0 1 0 … 0]        [1.0 0.9 0.1 …]
Barcelona         [0 0 1 … 0]        [0.0 1.0 0.5 …]
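
Why the denser encoding helps can be seen by comparing similarities between locations. The sketch below uses only the three components shown on the slide (the vectors there are truncated, so the remaining components are simply left out) and measures closeness with cosine similarity, which is a choice made here for illustration.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# First three components of each encoding, as shown on the slide.
ONE_HOT = {
    "London Gatwick": [1, 0, 0],
    "London Stansted": [0, 1, 0],
    "Barcelona": [0, 0, 1],
}
BETTER = {
    "London Gatwick": [1.0, 0.9, 0.9],
    "London Stansted": [1.0, 0.9, 0.1],
    "Barcelona": [0.0, 1.0, 0.5],
}

if __name__ == "__main__":
    for name, table in (("one-hot", ONE_HOT), ("better", BETTER)):
        near = cosine(table["London Gatwick"], table["London Stansted"])
        far = cosine(table["London Gatwick"], table["Barcelona"])
        print(f"{name}: Gatwick~Stansted = {near:.2f}, Gatwick~Barcelona = {far:.2f}")
```

With one-hot vectors every pair of distinct locations is equally dissimilar, while the denser encoding places the two London airports closer to each other than to Barcelona.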

Location
Embeddings

[London, Barcelona, Frankfurt am Main, New York, ….] is a sentence; each location is a word

• Option N1: Every user’s history is a sentence (think of Word2Vec)
• Option N2: Learn embeddings on graphs of locations (Perozzi et al., KDD, 2014)
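
For Option N1, a user's history can be turned into training examples the same way a Word2Vec skip-gram model treats a sentence: each location is paired with its neighbours inside a small window. The sketch below only generates those (target, context) pairs; the window size and function name are illustrative, and an actual embedding model would be trained on the resulting pairs.

```python
from typing import Iterator, List, Tuple


def skipgram_pairs(sentence: List[str], window: int = 1) -> Iterator[Tuple[str, str]]:
    """Yield (target, context) pairs for every location and its neighbours within `window`."""
    for i, target in enumerate(sentence):
        start = max(0, i - window)
        stop = min(len(sentence), i + window + 1)
        for j in range(start, stop):
            if j != i:
                yield target, sentence[j]


if __name__ == "__main__":
    history = ["London", "Barcelona", "Frankfurt am Main", "New York"]
    for pair in skipgram_pairs(history, window=1):
        print(pair)
```

Locations that tend to appear next to each other in many users' histories end up sharing contexts, which is what pulls their embeddings together during training.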

• Option N3: Train embeddings for the target problem

[Diagram: inputs origin, destination, …; output: competitive or not]

Location
Embeddings
             •   Capture geographical proximity (Europe vs. Asia)
             •   Learn function of the airport (Heathrow and Gatwick vs. Stansted)
             •   Produce a slight improvement in prediction performance

Learnings

• focus on the right problems: those which cannot be solved without ML or where ML gives a 10x improvement
• define the metrics and optimization objective at the start of the project and stick to them thereafter
• bootstrapping ML projects requires 20% modeling and 80% engineering; in the long run it should be vice versa
• lean online experiments are important at early stages to make sure users engage with the product
• ML behavior in production reveals interesting problems which are not visible during offline modeling (e.g., temporal stability)

Join our Team!
    Dima.Karamshuk@skyscanner.net
on Twitter: @karamshuk @SkyscannerEng
