Hotels.ng’s Machine Learning Hackathon: 50% of All No-shows Predicted Accurately!

Aug 14, 2015

Hotels.ng held a hackathon on Saturday, the first of August.

About 30 people showed up to hack on an interesting problem: can we predict which users are going to cancel a hotel booking? Are there factors that influence the decision to cancel?

Hotel booking no-shows are expensive for hotels because they have to hold a room for someone who may never show up. Even a 10% improvement in hotels knowing in time that a booking will be cancelled can save them tens of millions of naira!

So we – and a bunch of Nigerian developers – set out to hack on the problem.

So what was the result? Two teams pulled off an amazing feat! The Andela software team successfully predicted 50% of all the people who would cancel, while the MEST software team successfully predicted 33% of all cancellations!

What this means is that using their algorithms, we can basically reduce the number of unexpected no-shows by 50% (using the Andela algorithm) or 33% (using the MEST algorithm).

Think of the cost savings to us and the hotels: if 1000 people were going to cancel, these algorithms could flag 500 of them in advance. We can then double our effort to find out whether they will stay (e.g. with an additional phone call), and if they will not be using the room they booked, we can free it up for the hotel.
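
The 50% and 33% figures describe the share of actual cancellations that a model flags in advance, a measure usually called recall. Here is a minimal sketch of that arithmetic using the illustrative 1000-cancellation figure; the function name is ours and does not come from either team's code.

```python
def share_of_cancellations_caught(flagged_and_cancelled, total_cancelled):
    """Fraction of actual cancellations that were flagged in advance (recall)."""
    if total_cancelled == 0:
        raise ValueError("no cancellations to measure against")
    return flagged_and_cancelled / total_cancelled


# Illustrative numbers from the example in the text, not real booking data.
total_cancelled = 1000
andela_share = share_of_cancellations_caught(500, total_cancelled)
print(f"Cancellations flagged in advance: {andela_share:.0%}")
```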

Here come the technical bits – with each team explaining what they did:

The Andela Team

“Our first challenge was to identify attributes for a particular booking that are most likely to influence customer cancellation,” Bernard said, explaining the methods he and his team used to create their algorithm.

The Andela team took advantage of the qualitative nature of the attributes provided and created binary classifications of the dataset against each attribute. Their next step was to compare these classifications with global statistics, and the comparison showed that their output was in line with global figures on bookings and cancellations.

“To convert the qualitative data, we derived quantitative weights based off the appropriate class an attribute belongs to. This became our training set. Thereafter, we applied the decision tree learning algorithms to derive our predictions.”
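
The team does not spell out how the weights were derived. One plausible reading, shown here purely as an assumption, is to replace each qualitative value with the cancellation rate observed for that value. The attribute name and the bookings below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical bookings: (payment_method, cancelled). The attribute and its
# values are made up for illustration; they are not from the hackathon data.
bookings = [
    ("pay_at_hotel", True),
    ("pay_at_hotel", True),
    ("pay_at_hotel", False),
    ("prepaid", False),
    ("prepaid", False),
    ("prepaid", True),
    ("prepaid", False),
]


def cancellation_weights(rows):
    """Map each attribute value to the fraction of its bookings that cancelled."""
    totals = defaultdict(int)
    cancelled = defaultdict(int)
    for value, was_cancelled in rows:
        totals[value] += 1
        if was_cancelled:
            cancelled[value] += 1
    return {value: cancelled[value] / totals[value] for value in totals}


weights = cancellation_weights(bookings)
# Replace each qualitative value with its weight to get a numeric column.
numeric_column = [weights[value] for value, _ in bookings]
print(weights)
```

A numeric column like this could then be passed to a decision tree learner such as scikit-learn's `DecisionTreeClassifier`; the team does not say which implementation they used.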

The MEST Team

For the MEST team, the first step was to ask the Hotels.ng developers about the behavior of a typical Hotels.ng user.

The next step was to try to complete a hotel booking using the Hotels.ng platform. MEST developer Innocent says his team did this to “emphatically design a framework for gaining basic intuition about the real scenarios that would prompt a user to cancel a booking.”

With that out of the way, they copied the first 200 rows of data into a spreadsheet and ran a pivot table on them.

“It’s always good to look out for bulk insights,” said the bespectacled Innocent. “We discovered the sparse occurrence of cancelled bookings.”

Continuing, he said: “The first challenge was that we couldn’t find a trend from this data. From observing the data, it was difficult to tell if any variables were strong indicators that a booking would be cancelled.

“This made the problem more suited to a good machine learning algorithm. For machine learning, we used scikit-learn (sklearn), a Python-based machine learning package. We also used IPython Notebook for development.

“We tried several algorithms (Logistic Regression, KNN classifier, and Support Vector Machine). We chose Support Vector Machine because we discovered that it had the highest model score (0.915).”
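
A comparison like the one Innocent describes might look like the sketch below. It assumes scikit-learn is installed and uses a synthetic dataset as a stand-in for the booking data, so the scores it prints have nothing to do with the 0.915 the team reported. The train/test split and the default model settings are also our assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the booking data: roughly 10% "cancelled" rows.
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN classifier": KNeighborsClassifier(),
    "Support Vector Machine": SVC(),
}

# For classifiers, score() returns mean accuracy on the held-out rows.
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in candidates.items()
}
best = max(scores, key=scores.get)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
print("Highest score:", best)
```

Because cancelled bookings were sparse, accuracy can look high even when few cancellations are caught, so a model score is best read alongside the share of cancellations actually predicted.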

For the MEST team, the biggest challenge was cleaning the data. Between handling misfitted rows and handling invalid data, Innocent says the ‘data wrangling’ was the most time-consuming task of the hackathon.
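
To give a feel for what that wrangling can involve, here is a small sketch using only the Python standard library. The column names, the sample rows and the validity rules are all hypothetical; the actual hackathon dataset and the team's cleaning rules are not described in detail.

```python
import csv
import io

# Hypothetical raw export; column names and values are invented for illustration.
raw = """booking_id,price,rooms,cancelled
1,15000,1,0
2,not available,2,1
3,22000,1
4,8000,-1,0
5,30000,3,1
"""

EXPECTED_FIELDS = 4


def clean_rows(text):
    """Keep rows with the right field count and valid numeric values."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    kept = []
    for row in reader:
        if len(row) != EXPECTED_FIELDS:
            continue  # misfitted row: too few or too many fields
        try:
            booking_id, price, rooms, cancelled = (int(field) for field in row)
        except ValueError:
            continue  # invalid data: a field that is not a whole number
        if price <= 0 or rooms <= 0 or cancelled not in (0, 1):
            continue  # invalid data: value outside the plausible range
        kept.append((booking_id, price, rooms, cancelled))
    return kept


cleaned = clean_rows(raw)
print(cleaned)
```

Dropping bad rows is the simplest policy. Depending on how much data is lost, repairing or imputing values may be a better choice.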

Meanwhile, Segun Famisa – a developer from Truppr – wrote the best-performing code of all, but it was disqualified due to technical issues. His approach was also interesting. In his own words:

Once we figured it was a supervised learning problem, we went through the dataset and determined which features were possible to extract from the data we were given. We basically needed to transform the raw data into something our machine-learning algorithm could take as input.

Having played around with Weka in Java for machine learning problems, for a change I chose to use scikit-learn (http://scikit-learn.org/stable/index.html), which is a machine learning library for Python.

I was lucky to pair up at the event with a Python programmer, Tosin, who was very helpful in solving the challenge. After transforming the raw data into numeric values we could work with, we went on to normalize the data (https://en.wikipedia.org/wiki/Feature_scaling), which basically means rescaling the attributes to values between 0 and 1. This prevents one feature (e.g. the price of a hotel booking, which could be as high as N100,000) from dominating another (e.g. the number of rooms, which is usually 1 and at most around 50). We then had to do a dimension reduction, because by the time we had finished picking our features, we had about 18 of them. We used Principal Component Analysis (https://en.wikipedia.org/wiki/Principal_component_analysis) to do the dimension reduction and feature selection.

The algorithm basically determined which features had more impact in making a decision. The data was then passed into the machine-learning algorithm for training. The machine-learning algorithm we used was Support Vector Machine (https://en.wikipedia.org/wiki/Support_vector_machine), which is a supervised learning model useful for classification and regression problems. We thought it was suitable for this problem since we wanted to classify whether the customer cancelled a booking or not. We were able to train the system and generate a model; we ran tests on the model, got impressive results and made our submission.
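
The three steps Segun describes (rescaling to 0-1, Principal Component Analysis, then a Support Vector Machine) map naturally onto a scikit-learn pipeline. This sketch is not his code. It assumes scikit-learn is installed, uses synthetic data with 18 features as a stand-in for the booking data, and picks the number of components arbitrarily, since the write-up does not give one.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in with 18 numeric features, matching the count in the text.
X, y = make_classification(
    n_samples=400, n_features=18, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = make_pipeline(
    MinMaxScaler(),       # rescale every feature to the 0-1 range
    PCA(n_components=8),  # assumed component count; the write-up gives none
    SVC(),                # support vector classifier with default settings
)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print("Held-out accuracy:", model.score(X_test, y_test))
```

Note that PCA builds new components as combinations of the original features rather than picking a subset of them, so it reduces dimensions without strictly selecting features.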

The Andela and MEST teams took the first and second prize respectively.