<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://mikelsagardia.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://mikelsagardia.io/" rel="alternate" type="text/html" /><updated>2026-03-13T10:34:36+00:00</updated><id>https://mikelsagardia.io/feed.xml</id><title type="html">Mikel Sagardia</title><subtitle>This site chronicles my observations in the fast-evolving landscape of data science, covering topics related to  AI/ML, computer vision, NLP, 3D, robotics... and more!</subtitle><entry><title type="html">Applying Parameter-Efficient Fine-Tuning (PEFT) to a Large Language Model (LLM)</title><link href="https://mikelsagardia.io/blog/llm-peft-lora-fine-tuning.html" rel="alternate" type="text/html" title="Applying Parameter-Efficient Fine-Tuning (PEFT) to a Large Language Model (LLM)" /><published>2026-03-06T08:30:00+00:00</published><updated>2026-03-06T08:30:00+00:00</updated><id>https://mikelsagardia.io/blog/llm-peft-lora-fine-tuning</id><content type="html" xml:base="https://mikelsagardia.io/blog/llm-peft-lora-fine-tuning.html"><![CDATA[<!--
Blog Post 1: How Are Large Language Models (LLMs) Built?
Subtitle: A Conceptual Guide for Developers

Blog Post 2: Applying Parameter-Efficient Fine-Tuning (PEFT) to a Large Language Model (LLM)
Subtitle: When We Need to Adapt LLMs to Specific Tasks and Domains
-->

<p style="color: #777; font-style: italic; font-size: 1.5em; margin-top: 0.5em;">
  A Conceptual Guide for Developers &amp; ML Practitioners
</p>

<!--
<div style="line-height:150%;">
    <br>
</div>
-->

<p align="center">
<img src="/assets/llms/scifi_parrots_dalle3.png" alt="Two cheerful macaw parrots dressed in Star Wars and Star Trek outfits." width="1000" />
<small style="color:grey">Two <a href="https://dl.acm.org/doi/10.1145/3442188.3445922">stochastic parrots</a> dressed up like Star Wars and Star Trek characters; same parrot, different costumes and roles. Image generated using <a href="https://openai.com/index/dall-e-3/">Dall-E 3</a>; prompt: <i> Wide landscape cartoon illustration of two red-blue-yellow macaws with sunglasses on tree branches in a bright green forest. Left parrot dressed as a Jedi with robe and blue lightsaber, right parrot dressed as a classic Star Trek Vulcan officer in a gold uniform. Bold, vibrant vector style.</i>
</small>
</p>

<p>In my <a href="https://mikelsagardia.io/blog/how-are-llms-built.html">previous post</a> I explained how LLMs are built, and how they work. In this post, I will try to explain how to adapt LLMs easily to specific <em>tasks</em> and <em>domains</em> using <a href="https://github.com/huggingface/peft">HuggingFace’s <code class="language-plaintext highlighter-rouge">peft</code> library</a>. As explained on the official site, <a href="https://huggingface.co/docs/peft/en/index">PEFT (Parameter-Efficient Fine-Tuning)</a> is a family of techniques that</p>

<blockquote>
  <p>“only fine-tune a small number of (extra) model parameters — significantly decreasing computational and storage costs — while yielding performance comparable to a fully fine-tuned model. This makes it more accessible to train and store large language models (LLMs) on consumer hardware.”</p>
</blockquote>

<p>In summary, I cover the following topics in this post:</p>

<ul>
  <li>What <em>task</em> and <em>domain</em> adaptation of LLMs is, and which techniques are commonly used for it.</li>
  <li>How PEFT/LoRA works, and how it reduces the number of trainable parameters by orders of magnitude.</li>
  <li>Explanation of a <a href="https://github.com/mxagar/llm_peft_fine_tuning_example/blob/main/llm_peft.ipynb">Jupyter Notebook</a> that implements PEFT/LoRA on a <a href="https://huggingface.co/docs/transformers/en/model_doc/distilbert">DistilBERT model</a> for a text classification task, using the <a href="https://huggingface.co/datasets/fancyzhx/ag_news">AG News</a> dataset.</li>
</ul>

<p>Let’s start!</p>

<div style="height: 20px;"></div>
<div align="center" style="border: 1px solid #e4f312ff; background-color: #fcd361b9; padding: 1em; border-radius: 6px;">
<strong>
You can find this post's accompanying code in <a href="https://github.com/mxagar/llm_peft_fine_tuning_example">this GitHub repository</a>. If you are not familiar with how LLMs work or what embeddings are, I recommend reading my previous post <a href="https://mikelsagardia.io/blog/how-are-llms-built.html">How Are Large Language Models (LLMs) Built?</a> before diving into this one.
</strong>
</div>
<div style="height: 30px;"></div>

<h2 id="why-and-how-should-we-adapt-llms">Why and How Should We Adapt LLMs?</h2>

<p>First of all, we should define some terminology:</p>

<ul>
  <li>A <em>Task</em>: a specific problem we want to solve. The task is usually defined by the <em>input</em> and the <em>output</em> formats. Typically, LLMs are trained on the general task of <em>language modeling</em>: predicting the next word/token given an input sequence (i.e., the context); as such, they are able to generate coherent text related to the input. However, we can change their output layers (also known as <em>heads</em>) to perform other tasks, such as <em>text classification</em> (e.g., <em>sentiment analysis</em> and <em>topic classification</em>), <em>token classification</em> (e.g., <em>named entity recognition</em> or NER), etc.</li>
  <li>A <em>Domain</em>: the specific area or context to which the training texts belong and in which the task needs to be performed. Typically, LLMs are trained on a wide variety of texts from the Internet, which makes them generalists. However, we may want to adapt them to specific domains, such as <em>medicine</em>, <em>finance</em>, <em>legal</em>, etc. The more niche the domain, the more we may need to adapt the LLM to it to learn style, jargon, and specific knowledge.</li>
</ul>

<p>This <em>task</em> and <em>domain</em> adaptation, although referred to as <em>fine-tuning</em> in the LLM world, is known as <em>transfer learning</em> in the context of computer vision. <a href="https://arxiv.org/abs/1801.06146">Howard and Ruder (2018)</a> showed that a language model trained on a large corpus can be adapted to smaller corpora and other downstream tasks.</p>

<p>One common approach in the <a href="https://huggingface.co/docs/peft/en/index">PEFT</a> library is the <a href="https://arxiv.org/abs/2106.09685">Low-Rank Adaptation (or LoRA, introduced by Hu et al., 2021)</a>, which I cover in more detail in the next section. In a nutshell: LoRA freezes the pre-trained weight matrices $W$ and adds to them new matrices $dW$, which are the ones that are trained. These $dW$ matrices are factored as the multiplication of two low-rank matrices; that trick reduces trainable parameters by orders of magnitude and maintains or matches full fine-tuning performance on many benchmarks.</p>

<p>There are other ways to adapt LLMs which I won’t cover here, such as:</p>

<ul>
  <li><a href="https://arxiv.org/abs/2203.02155">RLHF (Reinforcement Learning from Human Feedback)</a>: This technique was used to align the initial ChatGPT model (GPT-3.5) with human preferences. Initially, human annotators ranked outputs of a GPT model. Then, these annotations were used to train a reward model (RM) to automatically predict the output score. And finally, the GPT model (<em>policy</em>) was trained using the <a href="https://en.wikipedia.org/wiki/Proximal_policy_optimization">Proximal Policy Optimization (PPO) algorithm</a>, based on the conversation history (<em>state</em>) and the outputs it produced (<em>actions</em>), and using the reward model (<em>reward</em>) as the evaluator.</li>
  <li><a href="https://arxiv.org/abs/2005.11401">RAG (Retrieval Augmented Generation)</a>: This method consists in outsourcing the domain-specific memory of LLMs. In an offline ingestion phase, the knowledge is chunked and indexed, often as embedding vectors. In the real-time generation phase, the user asks a question, which is encoded and used to retrieve the most similar indexed chunks; then, the LLM is prompted to answer the question by using the found similar chunks, i.e., the retrieved data is injected in the query. RAGs reduce hallucinations and have been extensively implemented recently.</li>
</ul>
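<p>The retrieval phase of RAG can be illustrated with a tiny sketch: below, the chunk texts and embedding values are purely made up, and in practice the embeddings would come from an embedding model rather than being hard-coded.</p>

```python
import numpy as np

# Toy "index": pre-computed embeddings for three document chunks
# (illustrative values; a real system embeds the chunks with a model)
chunks = ["LoRA freezes W and trains A, B.",
          "AG News has four topic classes.",
          "Transformers use self-attention."]
chunk_embeddings = np.array([[0.9, 0.1, 0.0],
                             [0.1, 0.9, 0.1],
                             [0.0, 0.2, 0.9]])

def retrieve(query_embedding, k=1):
    """Return the k chunks most similar to the query (cosine similarity)."""
    q = query_embedding / np.linalg.norm(query_embedding)
    c = chunk_embeddings / np.linalg.norm(chunk_embeddings, axis=1, keepdims=True)
    scores = c @ q                      # cosine similarities, one per chunk
    top = np.argsort(scores)[::-1][:k]  # indices of the best-scoring chunks
    return [chunks[i] for i in top]

# A query whose embedding points towards the first chunk retrieves it
print(retrieve(np.array([0.8, 0.2, 0.1])))
```

The retrieved chunks would then be injected into the LLM prompt alongside the user question.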

<p>In my experience, PEFT/LoRA and RAG are the most commonly used techniques, and they can be used in combination:</p>

<ul>
  <li>PEFT/LoRA makes sense when we need to approach a task different than <em>language modeling</em> (i.e., next token prediction), or when we have a very specific domain, such as <em>medicine</em> or <em>finance</em>, which is not well represented in the general training data of the LLM.</li>
  <li>RAG is more useful when we have a task that can be solved by retrieving specific information, such as <em>question answering</em> or <em>summarization</em>, and when we have a large amount of domain-specific data that changes constantly. Most chatbots that are used in production for customer support, for instance, are RAG-based.</li>
</ul>

<h3 id="how-does-peftlora-work">How Does PEFT/LoRA Work?</h3>

<p>When we apply Low-Rank Adaptation (LoRA), we basically decompose a weight matrix into a multiplication of low-rank matrices that have fewer parameters.</p>

<p>Let’s consider a pre-trained weight matrix $W$; instead of changing it directly, we add to it a weight offset $dW$ as follows:</p>

\[\hat{W} = W + dW,\]

<p>where</p>

<ul>
  <li>$\hat{W}$ represents the adapted weight matrix, of shape $(d, f)$,</li>
  <li>and $dW$ is a weight offset to be learned, of shape $(d, f)$.</li>
</ul>

<p>However, we do not operate directly with the weight offset $dW$; instead, we factor it as the multiplication of two low-rank matrices:</p>

\[dW = A \cdot B,\]

<p>where</p>

<ul>
  <li>$A$ is of shape $(d, r)$,</li>
  <li>$B$ is of shape $(r, f)$,</li>
  <li>and $r \ll d, f$.</li>
</ul>

<p>The key idea is that during training we freeze $W$ while we learn $dW$; however, instead of learning the full-sized $dW$, we learn the much smaller matrices $A$ and $B$. The forward pass of the model is modified as follows:</p>

\[y = x \cdot \hat{W} = x \cdot (W + dW) = x \cdot (W + A \cdot B).\]

<p>The proportion of weights in $dW$ as compared to $W$ is the following:</p>

<ul>
  <li>Weights of $W$: $d \cdot f$</li>
  <li>Weights of $A$ and $B$: $r \cdot (d + f)$</li>
  <li>Proportion: $r\cdot\frac{d + f}{d \cdot f}$</li>
</ul>

<p>Note that the number of trainable parameters is controlled by the rank $r$; for instance, for a weight matrix of size $(4096, 4096)$ and $r=4$, the reduction factor is $\frac{4096 \cdot 4096}{4 \cdot (4096 + 4096)} = 512$, i.e., more than a <code class="language-plaintext highlighter-rouge">100x</code> reduction in trainable parameters.</p>
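<p>The arithmetic behind the parameter-count reduction is easy to verify:</p>

```python
# Trainable-parameter reduction for a (4096, 4096) weight matrix with rank r=4
d, f, r = 4096, 4096, 4

full = d * f         # parameters of the full offset dW: 16,777,216
lora = r * (d + f)   # parameters of the factors A (d, r) and B (r, f): 32,768

print(full, lora, full / lora)  # 16777216 32768 512.0
```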

<p>LoRA is not applied to all weight matrices; we select the target modules ourselves (or rely on the <code class="language-plaintext highlighter-rouge">peft</code> library's per-architecture defaults), typically the query and value projection matrices $Q$ and $V$ in the attention blocks, and sometimes the MLP layers. And, after training, we can merge $W + dW$ into a single matrix, so there is no added inference latency!</p>

<p>In practice, LoRA assumes that the task-specific update to a large weight matrix lies in a low-dimensional subspace — and therefore can be efficiently represented with low-rank matrices.</p>
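<p>The modified forward pass can be sketched in a few lines of NumPy; the dimensions are arbitrary toy values. Note that LoRA initializes $B$ to zero, so at the start of training the adapted model behaves exactly like the frozen base model:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, f, r = 8, 6, 2   # toy dimensions, with r << d, f

W = rng.normal(size=(d, f))          # pre-trained weight: frozen during training
A = rng.normal(size=(d, r)) * 0.01   # low-rank factor, trainable
B = np.zeros((r, f))                 # zero-initialized, so dW = A @ B starts at 0

def forward(x):
    # y = x (W + A B): base output plus the low-rank adaptation
    return x @ (W + A @ B)

x = rng.normal(size=(1, d))
# With B = 0, the adapted model reproduces the frozen base model exactly
assert np.allclose(forward(x), x @ W)
```

During training, gradients flow only into <code class="language-plaintext highlighter-rouge">A</code> and <code class="language-plaintext highlighter-rouge">B</code>, while <code class="language-plaintext highlighter-rouge">W</code> stays untouched.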

<p>In addition to LoRA, <strong>quantization</strong> is often applied to further reduce the model size and speed up inference. Quantization consists in reducing the precision of the weights from 32-bit floating point values to 16-bit or even 4-bit representations (as in QLoRA); in other words, high-precision floats are approximated using only <code class="language-plaintext highlighter-rouge">k</code> bits. This is achieved by scaling and mapping the original values to a smaller discrete set, sometimes combined with truncating less significant information. Quantization can be easily applied using the library <a href="https://github.com/bitsandbytes-foundation/bitsandbytes">bitsandbytes</a>, which is very well integrated with the HuggingFace ecosystem.</p>
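<p>A deliberately simplistic sketch of the idea behind quantization follows: floats are affinely mapped to <code class="language-plaintext highlighter-rouge">k</code>-bit integer codes and back. This is <em>not</em> the <code class="language-plaintext highlighter-rouge">nf4</code> scheme that <code class="language-plaintext highlighter-rouge">bitsandbytes</code> implements, just an illustration of the scale-and-round principle:</p>

```python
import numpy as np

def quantize(w, k=4):
    """Map float weights to k-bit unsigned integer codes via affine scaling."""
    levels = 2 ** k - 1
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / levels
    q = np.round((w - lo) / scale).astype(np.uint8)  # codes in [0, 2^k - 1]
    return q, scale, lo

def dequantize(q, scale, lo):
    return q * scale + lo

w = np.array([-0.51, -0.10, 0.02, 0.33, 0.49])
q, scale, lo = quantize(w, k=4)
w_hat = dequantize(q, scale, lo)
# All codes fit in 4 bits; the reconstruction error is bounded by scale/2
assert q.max() <= 15
assert np.all(np.abs(w - w_hat) <= scale / 2 + 1e-9)
```

Storing the codes plus a scale and offset per block is what shrinks the model, at the cost of a bounded rounding error per weight.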

<h2 id="implementation-notebook">Implementation Notebook</h2>

<p>Thanks to the <a href="https://github.com/huggingface/peft"><code class="language-plaintext highlighter-rouge">peft</code></a> library, applying PEFT/LoRA to an LLM is very easy. The <a href="https://github.com/mxagar/llm_peft_fine_tuning_example">GitHub repository</a> I have prepared contains the Jupyter Notebook <a href="https://github.com/mxagar/llm_peft_fine_tuning_example/blob/main/llm_peft.ipynb"><code class="language-plaintext highlighter-rouge">llm_peft.ipynb</code></a>, in which I provide an example.</p>

<p>There, I fine-tune the <a href="https://arxiv.org/abs/1910.01108">DistilBERT</a> pre-trained model; DistilBERT is a smaller version of the encoder-only <a href="https://arxiv.org/abs/1810.04805">BERT</a> that has been distilled to reduce its size and computational requirements, while maintaining good performance. An alternative could have been <a href="https://arxiv.org/abs/1907.11692">RoBERTa</a>, which was trained roughly on <code class="language-plaintext highlighter-rouge">10x</code> more data than BERT and has approximately twice the parameters of DistilBERT. We could use other models, too, e.g., generative decoder transformers like <a href="https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf">GPT-2</a>, although in general RoBERTa seems to have better performance for classification tasks. GPT-2 is similar in size to RoBERTa.</p>

<p>The dataset I use is <a href="https://huggingface.co/datasets/fancyzhx/ag_news"><code class="language-plaintext highlighter-rouge">ag_news</code></a>, which consists of roughly 127,600 news texts, each of them with a label related to its associated topic: <code class="language-plaintext highlighter-rouge">'World', 'Sports', 'Business', 'Sci/Tech'</code> (perfectly balanced). Thus, the <em>task</em> head is <em>text classification</em> (with 4 mutually exclusive categories) and the <em>domain</em> is <em>news</em>.</p>
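<p>The four topic labels translate into the <code class="language-plaintext highlighter-rouge">id2label</code> / <code class="language-plaintext highlighter-rouge">label2id</code> mappings that a HuggingFace classification model expects; a minimal sketch:</p>

```python
# Label <-> id mappings for the four AG News topics, in the form expected
# by AutoModelForSequenceClassification
classes = ["World", "Sports", "Business", "Sci/Tech"]
id2label = {i: label for i, label in enumerate(classes)}
label2id = {label: i for i, label in enumerate(classes)}

print(id2label)            # {0: 'World', 1: 'Sports', 2: 'Business', 3: 'Sci/Tech'}
print(label2id["Sports"])  # 1
```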

<p>The notebook is structured in clear sections and comments, which I won’t fully reproduce here; the core steps are the following:</p>

<ul>
  <li>Dataset splitting: I divide the 127,600 samples into the sets <code class="language-plaintext highlighter-rouge">train</code> (108k samples), <code class="language-plaintext highlighter-rouge">test</code> (7.6k), and <code class="language-plaintext highlighter-rouge">validation</code> (12k).</li>
  <li>Tokenization: The <code class="language-plaintext highlighter-rouge">AutoTokenizer</code> is instantiated with the <code class="language-plaintext highlighter-rouge">distilbert-base-uncased</code> pre-trained subword tokenizer.</li>
  <li>Feature exploration: some exploratory data analysis is performed.</li>
  <li>Model setup: the <code class="language-plaintext highlighter-rouge">AutoModelForSequenceClassification</code> is instantiated with the <code class="language-plaintext highlighter-rouge">distilbert-base-uncased</code> pre-trained model, and the <code class="language-plaintext highlighter-rouge">PeftModel</code> is instantiated with the LoRA configuration.</li>
  <li>Training: the <code class="language-plaintext highlighter-rouge">Trainer</code> class is instantiated with the model, the <code class="language-plaintext highlighter-rouge">TrainingArguments</code>, and the datasets; then, the <code class="language-plaintext highlighter-rouge">train()</code> method is called to start training.</li>
  <li>Evaluation: we use the <code class="language-plaintext highlighter-rouge">evaluate()</code> method of the <code class="language-plaintext highlighter-rouge">Trainer</code> to evaluate the model on the test set, and we compute our custom metrics (accuracy, precision, recall, and F1), as defined in <code class="language-plaintext highlighter-rouge">compute_metrics()</code>.</li>
</ul>
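<p>The <code class="language-plaintext highlighter-rouge">compute_metrics()</code> step can be approximated with NumPy alone; this is a self-contained sketch matching the <code class="language-plaintext highlighter-rouge">Trainer</code> interface (a <code class="language-plaintext highlighter-rouge">(logits, labels)</code> pair in, a metrics dict out), not necessarily the exact implementation in the notebook, which may use dedicated metric libraries:</p>

```python
import numpy as np

def compute_metrics(eval_pred):
    """Accuracy and macro-averaged precision/recall/F1 from (logits, labels)."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    accuracy = (preds == labels).mean()
    precisions, recalls = [], []
    for c in np.unique(labels):
        tp = np.sum((preds == c) & (labels == c))
        precisions.append(tp / max(np.sum(preds == c), 1))  # guard: no predictions for c
        recalls.append(tp / max(np.sum(labels == c), 1))
    precision, recall = np.mean(precisions), np.mean(recalls)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy check: 3 of 4 predictions correct
logits = np.array([[2.0, 0.1], [0.2, 1.5], [1.0, 0.0], [0.1, 0.9]])
labels = np.array([0, 1, 0, 0])
print(compute_metrics((logits, labels))["accuracy"])  # 0.75
```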

<p>The feature exploration reveals that learning the classification task is going to be quite easy for the model. The function <code class="language-plaintext highlighter-rouge">extract_hidden_states()</code> is used to extract the last hidden states computed by the model, after each sample is passed through it. Then, these sample embeddings are mapped to 2D using <a href="https://umap-learn.readthedocs.io/en/latest/">UMAP</a>, and plotted in a hexagonal plot colored by class. As we can see, each class occupies a different region in the embedding space without any fine-tuning — that is, the model already has a good understanding of the differences between the classes.</p>

<p align="center">
<img src="/assets/llms/ag_news_embedding_class_plot.png" alt="Hexagonal plot of the AG News embeddings according to their classes." width="1000" />
<small style="color:grey">A hexagonal plot of the embeddings from the <a href="https://huggingface.co/datasets/fancyzhx/ag_news">AG News dataset</a> according to their classes. The embeddings are the last hidden states of the <a href="https://huggingface.co/docs/transformers/en/model_doc/distilbert">DistilBERT</a> model, and they were reduced to 2D using <a href="https://umap-learn.readthedocs.io/en/latest/">UMAP</a>. Image by the author.</small>
</p>

<p>The key aspect is the model setup for training, which is very straightforward thanks to the HuggingFace ecosystem. The code snippet below shows all the steps:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Quantization config (4-bit for minimal memory usage)
# WARNING: This requires the `bitsandbytes` library to be installed 
# and Intel CPU and/or 'cuda', 'mps', 'hpu', 'xpu', 'npu'
</span><span class="n">bnb_config</span> <span class="o">=</span> <span class="n">BitsAndBytesConfig</span><span class="p">(</span>
    <span class="n">load_in_4bit</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>                      <span class="c1"># Activate 4-bit quantization
</span>    <span class="n">bnb_4bit_use_double_quant</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>         <span class="c1"># Use double quantization for better accuracy
</span>    <span class="n">bnb_4bit_compute_dtype</span><span class="o">=</span><span class="s">"bfloat16"</span><span class="p">,</span>      <span class="c1"># Use bf16 if supported, else float16
</span>    <span class="n">bnb_4bit_quant_type</span><span class="o">=</span><span class="s">"nf4"</span><span class="p">,</span>              <span class="c1"># Quantization type: 'nf4' is best for LLMs
</span><span class="p">)</span>

<span class="c1"># Transformer model: we re-instantiate it to apply LoRA
# We should get a warning about the model weights not being initialized for some layers
# This is because we have appended the classifier head and we haven't trained the model yet
</span><span class="n">model</span> <span class="o">=</span> <span class="n">AutoModelForSequenceClassification</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span>
    <span class="s">"distilbert-base-uncased"</span><span class="p">,</span>
    <span class="n">num_labels</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">id2label</span><span class="p">),</span>
    <span class="n">id2label</span><span class="o">=</span><span class="n">id2label</span><span class="p">,</span>
    <span class="n">label2id</span><span class="o">=</span><span class="n">label2id</span><span class="p">,</span>
    <span class="n">quantization_config</span><span class="o">=</span><span class="n">bnb_config</span><span class="p">,</span>
    <span class="n">device_map</span><span class="o">=</span><span class="s">"auto"</span>  <span class="c1"># Optional: distributes across GPUs if available
</span><span class="p">)</span>

<span class="c1"># LoRA configuration
# We need to check the target modules for the specific model we are using (see below)
# - For distilbert-base-uncased, we use "q_lin" and "v_lin" for the attention layers
# - For bert-base-uncased, we would use "query" and "value"
# The A*B weights are scaled with lora_alpha/r
</span><span class="n">lora_config</span> <span class="o">=</span> <span class="n">LoraConfig</span><span class="p">(</span>
    <span class="n">r</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span>                                   <span class="c1"># Low-rank dimensionality
</span>    <span class="n">lora_alpha</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span>                          <span class="c1"># Scaling factor
</span>    <span class="n">target_modules</span><span class="o">=</span><span class="p">[</span><span class="s">"q_lin"</span><span class="p">,</span> <span class="s">"v_lin"</span><span class="p">],</span>      <span class="c1"># Submodules to apply LoRA to (model-specific)
</span>    <span class="n">lora_dropout</span><span class="o">=</span><span class="mf">0.1</span><span class="p">,</span>                       <span class="c1"># Dropout for LoRA layers
</span>    <span class="n">bias</span><span class="o">=</span><span class="s">"none"</span><span class="p">,</span>                            <span class="c1"># Do not train bias
</span>    <span class="n">task_type</span><span class="o">=</span><span class="n">TaskType</span><span class="p">.</span><span class="n">SEQ_CLS</span>              <span class="c1"># Task type: sequence classification
</span><span class="p">)</span>

<span class="c1"># Get the PEFT model with LoRA
</span><span class="n">lora_model</span> <span class="o">=</span> <span class="n">get_peft_model</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">lora_config</span><span class="p">)</span>

<span class="c1"># Define training arguments
</span><span class="n">training_args</span> <span class="o">=</span> <span class="n">TrainingArguments</span><span class="p">(</span>
    <span class="n">learning_rate</span><span class="o">=</span><span class="mf">2e-3</span><span class="p">,</span>
    <span class="n">weight_decay</span><span class="o">=</span><span class="mf">0.01</span><span class="p">,</span>
    <span class="n">num_train_epochs</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
    <span class="n">eval_strategy</span><span class="o">=</span><span class="s">"steps"</span><span class="p">,</span>
    <span class="n">save_strategy</span><span class="o">=</span><span class="s">"steps"</span><span class="p">,</span>
    <span class="n">eval_steps</span><span class="o">=</span><span class="mi">200</span><span class="p">,</span>
    <span class="n">save_steps</span><span class="o">=</span><span class="mi">200</span><span class="p">,</span>
    <span class="c1"># This seems to be a bug for PEFT models: we need to specify 'labels', not 'label'
</span>    <span class="c1"># as the explicit label column name
</span>    <span class="c1"># If we are not using PEFT, we can ignore this argument
</span>    <span class="n">label_names</span><span class="o">=</span><span class="p">[</span><span class="s">"labels"</span><span class="p">],</span>  <span class="c1"># explicitly specify label column name
</span>    <span class="n">output_dir</span><span class="o">=</span><span class="s">"./checkpoints"</span><span class="p">,</span>
    <span class="n">per_device_train_batch_size</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span>
    <span class="n">per_device_eval_batch_size</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span>
    <span class="n">load_best_model_at_end</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="n">logging_dir</span><span class="o">=</span><span class="s">"./logs"</span><span class="p">,</span>
    <span class="n">report_to</span><span class="o">=</span><span class="s">"tensorboard"</span><span class="p">,</span>  <span class="c1"># enable TensorBoard, if desired
</span><span class="p">)</span>

<span class="c1"># Initialize the Trainer
</span><span class="n">trainer</span> <span class="o">=</span> <span class="n">Trainer</span><span class="p">(</span>
    <span class="n">model</span><span class="o">=</span><span class="n">lora_model</span><span class="p">,</span>  <span class="c1"># Transformer + Adapter (LoRA)
</span>    <span class="n">args</span><span class="o">=</span><span class="n">training_args</span><span class="p">,</span>
    <span class="n">train_dataset</span><span class="o">=</span><span class="n">tokenized_dataset</span><span class="p">[</span><span class="s">"train"</span><span class="p">],</span>
    <span class="n">eval_dataset</span><span class="o">=</span><span class="n">tokenized_dataset</span><span class="p">[</span><span class="s">"validation"</span><span class="p">],</span>
    <span class="n">processing_class</span><span class="o">=</span><span class="n">tokenizer</span><span class="p">,</span>
    <span class="n">compute_metrics</span><span class="o">=</span><span class="n">compute_metrics</span><span class="p">,</span>
<span class="p">)</span>
</code></pre></div></div>

<p>After training, the model achieves an F1 score of <code class="language-plaintext highlighter-rouge">0.90</code> on the test set (compared to <code class="language-plaintext highlighter-rouge">0.16</code> before fine-tuning), which is a very good result for this task.</p>

<p>Other aspects are covered in the notebook, such as:</p>

<ul>
  <li>The training can be monitored using <a href="https://www.tensorflow.org/tensorboard">TensorBoard</a>.</li>
  <li>A <code class="language-plaintext highlighter-rouge">predict()</code> custom function is provided, which takes an input text, tokenizes it, passes it through the model, and decodes the predicted label.</li>
  <li>LoRA weights are merged and the model is persisted. Merging the LoRA weights consists in computing every $dW$ and adding them to the corresponding $W$; as mentioned before, after merging, the model can be used for inference without any latency increase.</li>
  <li>Some error analysis is performed by looking at the misclassified samples.</li>
  <li>Finally, model packaging is addressed using ONNX. This is also straightforward thanks to the HuggingFace &amp; PyTorch ecosystem, yet essential to be able to deploy the model in production.</li>
</ul>
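<p>The weight-merging step can be sketched in NumPy with toy dimensions: the adapter is folded into the base weight once, so inference afterwards is a single matmul, exactly as in the unadapted model. Note that <code class="language-plaintext highlighter-rouge">peft</code> scales the adapter by <code class="language-plaintext highlighter-rouge">lora_alpha / r</code>, as the notebook's comments point out:</p>

```python
import numpy as np

rng = np.random.default_rng(42)
d, f, r, alpha = 8, 8, 2, 32   # toy dimensions and LoRA scaling factor

W = rng.normal(size=(d, f))    # frozen pre-trained weight
A = rng.normal(size=(d, r))    # learned low-rank factors
B = rng.normal(size=(r, f))

# Merging folds the scaled adapter into the base weight
W_merged = W + (alpha / r) * (A @ B)

x = rng.normal(size=(1, d))
y_adapter = x @ W + (alpha / r) * (x @ A @ B)  # forward with a separate adapter
y_merged = x @ W_merged                        # forward after merging
assert np.allclose(y_adapter, y_merged)        # identical outputs, no extra latency
```

In the <code class="language-plaintext highlighter-rouge">peft</code> library, this is what calling <code class="language-plaintext highlighter-rouge">merge_and_unload()</code> on the <code class="language-plaintext highlighter-rouge">PeftModel</code> does for every adapted matrix.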

<h2 id="summary-and-conclusion">Summary and Conclusion</h2>

<p>In this post, I have explained how to adapt LLMs to specific tasks and domains using Parameter-Efficient Fine-Tuning (PEFT), and more concretely, <a href="https://arxiv.org/abs/2106.09685">Low-Rank Adaptation (or LoRA, introduced by Hu et al., 2021)</a>. This technique allows us to train only a small number of parameters while maintaining good performance, which makes it accessible to train and store large language models on consumer hardware.</p>

<p>I have used the classification task applied to the <a href="https://huggingface.co/datasets/fancyzhx/ag_news">AG News</a> dataset, but many more tasks are possible: token classification (e.g., named entity recognition), question answering, summarization, etc.</p>

<p><br /></p>

<blockquote>
  <p>Which task and domain would you like to adapt an LLM to?</p>
</blockquote>

<p><br /></p>

<p>I think that the <a href="https://huggingface.co/">HuggingFace</a> ecosystem is incredible, as it offers a plethora of pre-trained models, datasets, and libraries that make it very easy to work with LLMs, from research to production.</p>

<p>If you would like to deepen your understanding of the topic, consider checking these additional resources:</p>

<ul>
  <li><a href="https://arxiv.org/abs/2106.09685">LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021)</a></li>
  <li><a href="https://huggingface.co/docs/peft/main/en/conceptual_guides/lora">Hugging Face LoRA conceptual guide</a></li>
  <li><a href="https://github.com/mxagar/tool_guides/tree/master/hugging_face">HuggingFace Guide: <code class="language-plaintext highlighter-rouge">mxagar/tool_guides/hugging_face</code></a></li>
  <li>My personal notes on the O’Reilly book <a href="https://github.com/mxagar/nlp_with_transformers_nbs">Natural Language Processing with Transformers, by Lewis Tunstall, Leandro von Werra and Thomas Wolf (O’Reilly)</a></li>
  <li>My personal notes and guide for the <a href="https://github.com/mxagar/generative_ai_udacity/">Generative AI Nanodegree from Udacity</a></li>
</ul>

<p><br /></p>

<div id="disqus_thread"></div>
<script>
    /**
    *  RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS.
    *  LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables    */
    
    var disqus_config = function () {
    this.page.url = 'https://mikelsagardia.io/blog/llm-peft-lora-fine-tuning.html';  // Replace PAGE_URL with your page's canonical URL variable
    this.page.identifier = 'https://mikelsagardia.io/blog/llm-peft-lora-fine-tuning.html'; // Replace PAGE_IDENTIFIER with your page's unique identifier variable
    };
    
    (function() { // DON'T EDIT BELOW THIS LINE
    var d = document, s = d.createElement('script');
    s.src = 'https://mikelsagardia.disqus.com/embed.js';
    s.setAttribute('data-timestamp', +new Date());
    (d.head || d.body).appendChild(s);
    })();
</script>

<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>]]></content><author><name></name></author><category term="AI" /><category term="engineering," /><category term="large" /><category term="language" /><category term="models," /><category term="llm," /><category term="machine" /><category term="learning," /><category term="text" /><category term="generation," /><category term="generative" /><category term="AI," /><category term="deep" /><category term="attention," /><category term="fine-tuning," /><category term="PEFT," /><category term="LoRA," /><category term="quantization" /><summary type="html"><![CDATA[&lt;!– Blog Post 1: How Are Large Language Models (LLMs) Built? Subtitle: A Conceptual Guide for Developers]]></summary></entry><entry><title type="html">How Are Large Language Models (LLMs) Built?</title><link href="https://mikelsagardia.io/blog/how-are-llms-built.html" rel="alternate" type="text/html" title="How Are Large Language Models (LLMs) Built?" /><published>2026-02-28T08:30:00+00:00</published><updated>2026-02-28T08:30:00+00:00</updated><id>https://mikelsagardia.io/blog/how-are-llms-built</id><content type="html" xml:base="https://mikelsagardia.io/blog/how-are-llms-built.html"><![CDATA[<!--
Blog Post 1: How Are Large Language Models (LLMs) Built?
Subtitle: A Conceptual Guide for Developers

Blog Post 2: Applying Parameter-Efficient Fine-Tuning (PEFT) to a Large Language Model (LLM)
Subtitle: When We Need to Adapt LLMs to Specific Tasks and Domains
-->

<p style="color: #777; font-style: italic; font-size: 1.5em; margin-top: 0.5em;">
  A Conceptual Guide for Developers &amp; ML Practitioners
</p>

<!--
<div style="line-height:150%;">
    <br>
</div>
-->

<p align="center">
<img src="/assets/llms/stochastic_parrot_dalle3.png" alt="A cheerful macaw parrot wearing sunglasses says 42." width="1000" />
<small style="color:grey">Large Language Models (LLMs) have been called <a href="https://dl.acm.org/doi/10.1145/3442188.3445922">stochastic parrots</a> by some; in any case, they seem to be here to stay &mdash; and to be honest, I find them quite useful, if properly used. Image generated using <a href="https://openai.com/index/dall-e-3/">Dall-E 3</a>; prompt: <i> Wide, landscape cartoon illustration of a happy, confident red-blue-yellow macaw wearing black sunglasses, perched on a tree branch in a green forest, with a white comic speech bubble saying <a href="https://simple.wikipedia.org/wiki/42_(answer)">"42"</a>
.</i>
</small>
</p>

<p>The release of <a href="https://openai.com/blog/chatgpt">ChatGPT</a> in November 2022 revolutionized everyday life in much of the developed world. In a similar way that Google convinced us the Internet was truly useful — and that we needed their search engine — or Apple introduced the first genuinely usable smartphone that made the digital world ubiquitous, OpenAI came up with the next logical step: assistant chatbots based on Large Language Models (LLMs). Language models already existed, but OpenAI’s chat-based user interface, combined with the emergent capabilities of their huge models, led to the perfect killer app: an ever-ready genie that <em>seems</em> to confidently know the answer to everything.</p>

<p><br /></p>

<blockquote>
  <p>It feels like <em>“ask ChatGPT”</em> has become the new <em>“google it”</em>.</p>
</blockquote>

<p><br /></p>

<p>Current LLMs are based on the <strong>Transformer</strong> architecture, introduced by Google in the seminal work <a href="https://arxiv.org/abs/1706.03762"><em>Attention Is All You Need</em> (Vaswani et al. 2017)</a>. Before that, <a href="https://en.wikipedia.org/wiki/Long_short-term_memory">LSTMs or Long short-term memory networks (Hochreiter &amp; Schmidhuber, 1997)</a> were the state-of-the-art sequence models for Natural Language Processing (NLP). In fact, many of the concepts exploited by the Transformer were developed using LSTMs as the backbone, and one could argue that LSTMs are, in some respects, more sophisticated models than the Transformer itself — if you’d like an example of an LSTM-based language modeler, you can check this <a href="https://mikelsagardia.io/blog/text-generation-rnn.html">TV script generator of mine</a>.</p>

<p>However, the Transformer presented some major <em>practical advantages</em> that enabled a paradigm shift:</p>

<ul>
  <li>Its <em>self-attention</em> mechanism made it possible to convert inherently sequential tasks into <em>parallelizable</em> ones.</li>
  <li>Its uncomplicated, modular architecture made it easy to scale up and adapt to <em>many different tasks</em>.</li>
</ul>

<p>Simultaneously, <a href="https://arxiv.org/abs/1801.06146">Howard &amp; Ruder (2018)</a> demonstrated that <em>transfer learning</em> worked not only in computer vision, but also for NLP: they showed that a language model pre-trained on a large corpus could be fine-tuned for smaller corpora and other downstream tasks.</p>

<p>And that’s how the way to the current LLMs was paved. Nowadays, Transformer-based LLMs excel in <em>everything</em> NLP-related: text generation, summarization, question answering, code generation, translation, and so on.</p>

<h2 id="the-original-transformer-its-inputs-components-and-siblings">The Original Transformer: Its Inputs, Components and Siblings</h2>

<p>Before describing the components of the Transformer, we need to explain how text is represented for computers. In practice, text is converted into a <strong>sequence of feature vectors</strong> $\lbrace x_1, x_2, \dots \rbrace$, each of dimension $m$ (the <em>embedding size</em> or <em>dimension</em>). This is done in the following steps:</p>

<ol>
  <li><strong><a href="https://en.wikipedia.org/wiki/Large_language_model#Tokenization">Tokenization</a></strong>: The text is split into discrete elements called <em>tokens</em>. Tokens are units with an identifiable meaning for the model and typically include words or sub-words, as well as punctuation and special symbols.</li>
  <li><strong>Vocabulary construction</strong>: A vocabulary containing all $n$ unique tokens is defined. It provides a mapping between each token string and a numerical identifier (token ID).</li>
  <li><strong><a href="https://en.wikipedia.org/wiki/One-hot">One-hot vectors</a></strong>: Each token is mapped to its token ID. Conceptually, this corresponds to a one-hot vector of size $n$, although in practice models operate directly on token IDs. In a one-hot vector, all cells have the value $0$ except the cell which corresponds to the token ID of the represented word, which contains the value $1$.</li>
  <li><strong><a href="https://en.wikipedia.org/wiki/Word_embedding">Embedding vectors</a></strong>: Token IDs (i.e., one-hot vectors) are mapped to dense embedding vectors using an embedding layer. This layer acts as a learnable lookup table (or equivalently, a linear projection of a one-hot vector), producing vectors of size $m$, with $m \ll n$. These embedding vectors are simply arrays of floating-point values. Typical reference values are $n \approx 100{,}000$ and $m \approx 500$.</li>
</ol>
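<p>The four steps above can be sketched in a few lines of NumPy; the vocabulary, sentence, and sizes below are toy values chosen for illustration:</p>

```python
import numpy as np

# 1. + 2. Toy tokenizer and vocabulary (token string -> token ID);
# real tokenizers use sub-word schemes such as BPE or WordPiece
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5, ".": 6}
n = len(vocab)  # vocabulary size

def tokenize(text):
    return text.lower().replace(".", " .").split()

# 3. Map tokens to IDs (conceptually, one-hot vectors of size n)
tokens = tokenize("The cat sat on the mat.")
token_ids = [vocab.get(t, vocab["<unk>"]) for t in tokens]

# 4. Embedding layer: a learnable lookup table of shape (n, m);
# looking up a row is equivalent to multiplying a one-hot vector with the table
m = 4  # embedding size (toy value; hundreds to thousands in practice)
rng = np.random.default_rng(42)
embedding_table = rng.normal(size=(n, m))
embeddings = embedding_table[token_ids]  # shape: (seq_len, m) = (7, 4)
```

In a trained model, the embedding table is learned jointly with the rest of the network, so the rows end up encoding meaning.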

<p align="center">
<img src="/assets/llms/text_embeddings.png" alt="Text Embeddings" width="1000" />
<small style="color:grey">A word/token can be represented as a one-hot vector (sparse) or as an embedding vector (dense). Embedding vectors can capture semantics in their directions and enable more efficient processing. Image by the author.
</small>
</p>

<p>By the way, embeddings can be created for images, too, as I explain in <a href="https://mikelsagardia.io/blog/diffusion-for-developers.html">this post on diffusion models</a>. In general, they have some very nice properties:</p>

<ul>
  <li>They build up a compact space, in contrast to the sparse one-hot vector space.</li>
  <li>They are continuous and differentiable.</li>
  <li>If semantics are properly captured, words with similar meanings will point in similar directions. As a consequence, we can perform arithmetic operations with them, such that algebraic operations (<code class="language-plaintext highlighter-rouge">+, -</code>) can be applied to words; for instance, the word <code class="language-plaintext highlighter-rouge">queen</code> is expected to be close to <code class="language-plaintext highlighter-rouge">king - man + woman</code>.</li>
</ul>
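<p>A toy example of this arithmetic, with hand-crafted 3-D vectors (real embeddings are learned from data and have hundreds of dimensions):</p>

```python
import numpy as np

# Hand-crafted 3-D toy embeddings; the last axis plays the role of a
# "gender direction" (made up for illustration)
emb = {
    "king":  np.array([0.9, 0.8, -0.7]),
    "man":   np.array([0.1, 0.2, -0.8]),
    "woman": np.array([0.1, 0.2,  0.8]),
    "queen": np.array([0.9, 0.8,  0.7]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land closest to queen
result = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(result, emb[w]))
print(best)  # queen
```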

<p align="center">
<img src="/assets/llms/text_image_embeddings.png" alt="Arithmetics with Text and Image Embeddings" width="1000" />
<small style="color:grey">Embeddings can be computed for every modality (image, text, audio, video, etc.); we can even create multi-modal embedding spaces. If the embedding vectors capture meaning properly, similar concepts will have vectors pointing in similar directions. As a consequence, we can apply some algebraic operations to them. Image by the author.
</small>
</p>

<div style="height: 20px;"></div>
<p align="center">── ◆ ──</p>
<div style="height: 20px;"></div>

<p>The original Transformer was designed for language translation and has two parts:</p>

<ul>
  <li>The <strong>encoder</strong>, which converts the input sequence (e.g., a sentence in English) into hidden states or context.</li>
  <li>The <strong>decoder</strong>, which generates an output sequence (e.g., the translated sentence in Spanish) using as guidance some of the output hidden states of the encoder.</li>
</ul>

<p align="center">
<img src="/assets/llms/llm_simplified.png" alt="LLM Simplified Architecture" width="1000" />
<small style="color:grey">Simplified architecture of the original <a href="https://arxiv.org/abs/1706.03762">Transformer</a> designed for language translation. Highlighted: inputs (sentence in English), outputs (hidden states and translated sentence in Spanish), and main parts (the encoder and the decoder).
</small>
</p>

<p>Using as reference the figure above, here’s how the Transformer works:</p>

<ul>
  <li>
<p>The encoder and the decoder are subdivided into <code class="language-plaintext highlighter-rouge">N</code> <em>encoder/decoder blocks</em> each; each block passes its hidden-state outputs as inputs to the next one.</p>
  </li>
  <li>
<p>The inputs of the first encoder block are the embedding vectors of the input text sequence. <em>Positional encodings</em> are added at the beginning to inject information about token order, since the self-attention layers inside the blocks (see next section) are position-agnostic. In the original paper, positional encoding vectors were $\mathbf{R} \rightarrow \mathbf{R}^m$ sinusoidal mappings: each position (a scalar) is mapped to a unique vector by systematically applying sinusoidal functions to it. In practice, however, learned positional embeddings are often used instead.</p>
  </li>
  <li>
    <p>For the translation task the encoder input contains the representation of the full original text sequence; meanwhile, the decoder produces the output sequence token by token, but it always has access to the full and final encoder hidden states (the context).</p>
  </li>
  <li>
    <p>The <em>decoder blocks</em> work in a similar way as the <em>encoder blocks</em>; the last <em>decoder block</em> produces the final set of hidden states, which are mapped to output token probabilities using a linear layer followed by a softmax function (i.e., we have a classification head over the vocabulary).</p>
  </li>
</ul>
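<p>As a minimal sketch, the sinusoidal positional encodings of the original paper can be computed as follows (the sizes are example values):</p>

```python
import numpy as np

def positional_encoding(seq_len, m):
    """Sinusoidal positional encodings (Vaswani et al., 2017):
    PE[pos, 2i] = sin(pos / 10000^(2i/m)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, m, 2)[None, :]            # even dimension indices, (1, m/2)
    angles = pos / np.power(10000.0, i / m)    # (seq_len, m/2)
    pe = np.zeros((seq_len, m))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Each position gets a unique vector, which is simply added to the
# token embeddings: x = token_embeddings + pe
pe = positional_encoding(seq_len=128, m=512)
print(pe.shape)  # (128, 512)
```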

<p>Soon after the publication of the original encoder-decoder Transformer designed for the language translation task, two related, important Transformers were introduced:</p>

<ul>
  <li><a href="https://arxiv.org/abs/1810.04805"><strong>BERT</strong>: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2018)</a>, which is an implementation of the <strong>encoder-only</strong> part of the original Transformer.</li>
  <li><a href="https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf"><strong>GPT</strong>: Improving Language Understanding by Generative Pre-Training (Radford et al., 2018)</a>, an implementation of the <strong>decoder-only</strong> part of the original Transformer.</li>
</ul>

<p>BERT-like <em>encoder-only</em> transformers are commonly used to generate <em>feature vectors</em> $x$ of texts, which can be used in downstream applications such as text or token/word classification. When the encoder is trained on its own, the text sequence is shown to the model with some tokens masked, and the model must predict them. This scheme is called <em>masked language modeling</em>.</p>

<p>GPT-like <em>decoder-only</em> transformers are commonly used as <em>generative models</em> to predict the next token in a sequence, given all the previous tokens (i.e., the context, which includes the prompt). During training, the model is shown sequences of text and learns to predict each token based on the preceding ones.</p>

<p>The full <em>encoder-decoder</em> architecture is not as common as the other two currently, but it is used in some specific models for text-to-text tasks, such as summarization and translation. Examples include <a href="https://arxiv.org/abs/1910.10683">T5 (Raffel et al., 2019)</a> and <a href="https://arxiv.org/pdf/1910.13461">BART (Lewis et al., 2019)</a>.</p>

<h2 id="deep-dive-into-the-transformer-architecture">Deep Dive into the Transformer Architecture</h2>

<p>So far, we’ve seen the big picture of the Transformer architecture and its subtypes (encoder-decoder, encoder-only, decoder-only).</p>

<blockquote>
  <p>But what’s inside those encoder and decoder blocks? Just Attention, normalization, and linear mappings. Let’s see them in detail.</p>
</blockquote>

<p align="center">
<img src="/assets/llms/transformer_annotated.png" alt="Transformer Architecture, Annotated" width="1000" />
<small style="color:grey">The Transformer architecture with all its components. Image from the original paper by <a href="https://arxiv.org/abs/1706.03762">Vaswani et al. (2017)</a>, modified by the author.
</small>
</p>

<p>As we can see in the figure above, each of the <code class="language-plaintext highlighter-rouge">N</code> encoder and decoder blocks are composed of the following sub-components:</p>

<ul>
  <li><strong>Multi-Head Self-Attention modules</strong>: The core component of the Transformer. It allows the model to focus on different parts of the input sequence when processing each token. Multiple attention heads enable the model to capture various relationships and dependencies in the data. More on this below.</li>
  <li><strong>Skip connections, Add &amp; Norm</strong>: These are <a href="https://arxiv.org/abs/1512.03385">residual (skip) connections</a> followed by <a href="https://en.wikipedia.org/wiki/Normalization_(machine_learning)#Layer_normalization">layer normalization</a>. Residual connections help to avoid vanishing gradients in deep networks by allowing gradients to flow directly through the skip connections. Normalizing the inputs across the features dimension stabilizes and accelerates training.</li>
  <li><strong>Feed-Forward Neural Network</strong> (FFNN, i.e., several concatenated linear mappings): A fully connected feed-forward network applied independently to each position. It consists of two linear transformations with a <a href="https://en.wikipedia.org/wiki/Rectified_linear_unit">ReLU</a> activation in between, allowing the model to learn complex representations.</li>
</ul>

<p>The key contribution of the Transformer architecture is the <strong>Self-Attention</strong> mechanism. Attention was introduced by <a href="https://arxiv.org/abs/1409.0473">Bahdanau et al. (2014)</a> and it allows the model to weigh the importance of different tokens in the input sequence when processing each token. In the Transformer, the pairwise similarities of the tokens in the sequence are computed simultaneously (via dot products) and used to weight and sum the embeddings in successive steps.</p>

<p>We can see there are different types of attention modules in the Transformer:</p>

<ul>
  <li>Self-Attention in the encoder blocks: Each token attends to <em>all</em> tokens in the input sequence. It’s called self-attention because only the similarities among the input tokens themselves are used, i.e., without any interaction with the decoder. For more information, keep reading below.</li>
  <li>Masked Self-Attention in the decoder blocks: Each token attends to <em>all previous</em> tokens in the output sequence (masked to prevent attending to future tokens).</li>
  <li>Encoder-Decoder Cross-Attention in the decoder blocks: Each token in the output sequence attends to <em>all tokens in the encoder-input sequence</em>. In other words, all final hidden states from the encoder are used in the attention computation.</li>
</ul>

<p>Additionally, each attention module is implemented as a <strong>Multi-Head Attention</strong> mechanism. This means that multiple attention heads are used in parallel. The following figure shows a brief overview of how this works.</p>

<p align="center">
<img src="/assets/llms/llm_attention_architecture.png" alt="LLM Attention Architecture" width="1000" />
<small style="color:grey">The LLM (Self-)Attention module, annotated. Image by the author.
</small>
</p>

<p>The <strong>Self-Attention Head</strong> is the core implementation of the attention mechanism in the Transformer. Each multi-head attention module contains $n$ self-attention heads, which operate in parallel. The input embedding sequence $Z$ is passed to each of these $n$ self-attention heads, where the following occurs:</p>

<ul>
  <li>We transform the original embeddings $Z$ into $Q$ (query), $K$ (key), and $V$ (value). The transformation is performed by linear/dense layers ($W_Q$, $W_K$, $W_V$), which consist of the learned weights. These <em>query</em>, <em>key</em>, and <em>value</em> variables come from classical <a href="https://en.wikipedia.org/wiki/Information_retrieval">information retrieval</a>; as described in <a href="https://www.oreilly.com/library/view/natural-language-processing/9781098136789/">NLP with Transformers (Tunstall et al., 2022)</a>, using the analogy to a recipe they can be interpreted as follows:
    <ul>
      <li>$Q$, <em>queries</em>: ingredients in the recipe.</li>
      <li>$K$, <em>keys</em>: the shelf-labels in the supermarket.</li>
      <li>$V$, <em>values</em>: the items in the shelf.</li>
    </ul>
  </li>
  <li>$Q$ and $K$ are used to compute similarity scores between token embeddings (<em>self</em> dot-product), and then we use those similarity scores to weight the values $V$, so the relevant information is amplified. This can be expressed mathematically with the popular and simple <em>attention</em> formula:
\(Y = \mathrm{softmax}(\frac{QK^T}{\sqrt{d_k}})V,\)
where
    <ul>
      <li>$Y$ are the <em>contextualized embeddings</em>,</li>
      <li>and $d_k$ is the dimension of the key vectors (used for scaling), which is the same as the embedding size divided by the number of heads (head dimension).</li>
    </ul>
  </li>
</ul>

<p>Then, these $Y_1, \dots, Y_n$ contextualized embeddings are concatenated and linearly transformed to yield the final output of the multi-head attention module. The output of the first multi-head self-attention module is the input of the next one, and so on, until all $N$ blocks have processed the embedding sequence. Note that the output embeddings from each encoder block have the same size as the input embeddings, so the encoder block stack has the function of <em>transforming</em> those embeddings with the attention mechanism.</p>
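<p>The attention formula and the multi-head concatenation can be sketched in NumPy with toy sizes; here, random weight matrices stand in for the learned ones:</p>

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_head(Z, W_q, W_k, W_v):
    """One head: Y = softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = Z @ W_q, Z @ W_k, Z @ W_v        # project embeddings to Q, K, V
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # pairwise similarities, rows sum to 1
    return weights @ V                         # contextualized embeddings

# Toy sizes: 5 tokens, embedding size 8, 2 heads -> head dimension 4
rng = np.random.default_rng(0)
seq_len, m, n_heads = 5, 8, 2
d_head = m // n_heads
Z = rng.normal(size=(seq_len, m))

# Multi-head attention: run the heads, concatenate, and project back to size m
heads = []
for _ in range(n_heads):
    W_q, W_k, W_v = (rng.normal(size=(m, d_head)) for _ in range(3))
    heads.append(self_attention_head(Z, W_q, W_k, W_v))
W_o = rng.normal(size=(m, m))
Y = np.concatenate(heads, axis=-1) @ W_o  # same shape as the input Z: (5, 8)
```

Note how the output has the same shape as the input, which is what allows the blocks to be stacked.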

<p><br /></p>

<blockquote>
  <p>I hope it is now clear why the Transformer paper is titled <em>Attention Is All You Need</em>: It turns out that successively focusing and transforming the embeddings via the attention mechanism produces the magic in the LLMs.</p>
</blockquote>

<p><br /></p>

<p>Finally, let’s see some typical size values, for reference:</p>

<ul>
  <li>Embedding size: 768, 1024, …, 2048.</li>
  <li>Sequence length (context, number of tokens): 128, 256, …, 8192.</li>
  <li>Number of layers/blocks, $N$: 12, 24, 36, 48.</li>
  <li>Number of attention heads, $n$: 12, 16, 20, 32.</li>
  <li>Head dimension: typically, embedding size divided by number of heads.</li>
  <li>Feed-Forward Network (FFN) inner dimension: 2048, 4096, …, 10240.</li>
  <li>Vocabulary size: 30,000; 50,000; 100,000; 200,000.</li>
  <li>Total number of parameters: from 110 million (e.g., BERT-base) to 175 billion (e.g., GPT-3), and much more!</li>
</ul>
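<p>With these reference sizes, we can do a back-of-the-envelope parameter count for a BERT-base-like encoder; this simplified estimate ignores biases, positional embeddings, and layer-norm parameters:</p>

```python
# Back-of-the-envelope parameter count with BERT-base-like sizes
emb, n_blocks, ffn, vocab = 768, 12, 3072, 30000

embedding_params = vocab * emb           # token embedding table
attention_params = 4 * emb * emb         # W_Q, W_K, W_V + output projection, per block
ffn_params = 2 * emb * ffn               # two linear layers, per block
total = embedding_params + n_blocks * (attention_params + ffn_params)
print(f"{total / 1e6:.0f}M parameters")  # ~108M, close to BERT-base's ~110M
```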

<div style="height: 20px;"></div>
<div align="center" style="border: 1px solid #e4f312ff; background-color: #fcd361b9; padding: 1em; border-radius: 6px;">
<strong>
If you are interested in an implementation of the Transformer, you can check <a href="https://github.com/mxagar/nlp_with_transformers_nbs/blob/main/03_transformer-anatomy.ipynb">this notebook</a>, where I modified the code from the official repository of the book <a href="https://www.oreilly.com/library/view/natural-language-processing/9781098136789/">NLP with Transformers (Tunstall et al., 2022)</a>. In the same <a href="https://github.com/mxagar/nlp_with_transformers_nbs/">repository</a>, you'll find many other notebooks related to NLP with Transformers.
</strong>
</div>
<div style="height: 30px;"></div>

<h2 id="using-the-transformer-outputs">Using the Transformer Outputs</h2>

<p>There are many ways in which the outputs of the Transformer can be used, depending on the task and the architecture (some of these ways were mentioned above already):</p>

<ul>
  <li>Encoder-decoder models (e.g., <a href="https://arxiv.org/abs/1910.10683">T5 (Raffel et al., 2019)</a>) have been used for <em>text-to-text</em> tasks, such as <em>translation</em> and <em>summarization</em>. However, big enough decoder-only models (e.g., <a href="https://arxiv.org/abs/2005.14165">GPT-3 (Brown et al., 2020)</a>) have shown remarkable performance in these tasks, too, and have become more popular nowadays.</li>
  <li>Encoder-only models (e.g., <a href="https://arxiv.org/abs/1810.04805">BERT (Devlin et al., 2018)</a>) are commonly used to generate <em>feature vectors</em> of texts, which can be used in downstream applications such as text or token/word classification, or even regression. We just need to attach the proper mapping head to the output of the encoder (e.g., a linear layer for classification) and fine-tune the model on the specific task.</li>
  <li>Decoder-only models (e.g., <a href="https://arxiv.org/abs/2005.14165">GPT-3 (Brown et al., 2020)</a>) are commonly used as <em>generative models</em> to predict the next token in a sequence, given all the previous tokens (i.e., the context, which includes the prompt).</li>
</ul>

<p>Probably, the most common way to interact with LLMs for the layman user is the latter: decoder-only <em>generative models</em>. As mentioned, these models generate one word/token at a time, so we feed their outputs back as inputs for successive generations (hence, they are called <em>autoregressive</em>). In that scheme, we need to consider the following questions:</p>

<ol>
  <li><em>Which tokens are considered as candidates at each generation step?</em> (token sampling)</li>
  <li><em>What strategy is used to select and chain the tokens?</em> (token search during decoding)</li>
</ol>

<p>Recall that the output of the generative model is an array of probabilities, specifically, a float value $p \in [0,1]$ for each item in the vocabulary set $V$. A naive approach would be to</p>

<ol>
  <li>consider all token probabilities as candidates $\lbrace p_1, p_2, \dots \rbrace$ (full distribution sampling),</li>
  <li>and select the token with the highest probability at each generation step: $\mathrm{token} = V(\mathrm{argmax} \lbrace p_1, p_2, \dots \rbrace)$ (greedy search decoding).</li>
</ol>

<p>However, such a naive approach often leads to repetitive and dull text generation, as described by <a href="https://arxiv.org/abs/1904.09751">Holtzman et al. (2019)</a>. To mitigate this issue, these parameters and strategies are often used:</p>

<ul>
  <li>Temperature: we apply the <a href="https://en.wikipedia.org/wiki/Softmax_function">softmax</a> function to the output logits $z_i$ (log-probabilities up to a constant) using a <em>temperature</em> variable $T$ in the exponent (<a href="https://en.wikipedia.org/wiki/Boltzmann_distribution">Boltzmann distribution</a>): $p_i' = \mathrm{softmax}(z_i / T)$. That changes the $p$ values as follows:
    <ul>
      <li>$T = 1$: no change, same as in the original output.</li>
      <li>$T &lt; 1$: small $p$-s become smaller, larger $p$-s become larger; that means we get a more peaked distribution, i.e., less creativity and more coherence, because the most likely words are going to be chosen.</li>
      <li>$T &gt; 1$: small $p$-s become bigger, larger $p$-s become smaller; that yields a more homogeneous distribution, which leads to more creativity and diversity, because any word/token could be chosen.</li>
    </ul>
  </li>
  <li>Top-$k$ and top-$p$: instead of considering all tokens, each with its probability (with or without $T$), we restrict the candidates to the $k$ most likely ones and sample from them using the resulting distribution; similarly, with top-$p$, we select the smallest set of most likely tokens whose cumulative probability reaches the threshold $p$ and sample from those.</li>
  <li>Beam search decoding (as opposed to greedy search): we select a number of beams $b$ and keep track of the most probable next tokens, building a tree of options. The most likely paths/beams are chosen by ranking the beams with their summed log probabilities. The higher the number of beams, the better the quality, but the computational effort explodes. Beam search sometimes suffers from repetitive generation; one way to avoid that is using an n-gram penalty, i.e., penalizing the repetition of n-grams. This is commonly used in summarization and machine translation.</li>
</ul>
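<p>A minimal sketch of temperature, top-$k$, and top-$p$ sampling; the logits and vocabulary size below are made up for illustration:</p>

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Sample a token ID applying temperature, top-k, and/or top-p (nucleus)."""
    rng = rng or np.random.default_rng()
    z = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()

    if top_k is not None:  # keep only the k most likely tokens
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
    if top_p is not None:  # keep the smallest set with cumulative probability >= top_p
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        keep = order[: int(np.searchsorted(cum, top_p)) + 1]
        masked = np.zeros_like(probs)
        masked[keep] = probs[keep]
        probs = masked

    probs /= probs.sum()  # renormalize after truncation
    return int(rng.choice(len(probs), p=probs))

# Toy vocabulary of 5 tokens; low temperature -> (almost) always the argmax token
logits = [2.0, 1.0, 0.5, 0.1, -1.0]
token = sample_next_token(logits, temperature=0.05, top_k=3)
```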

<p align="center">
<img src="/assets/llms/token_sampling.png" alt="Token Sampling" width="1000" />
<small style="color:grey">Token sampling strategies in LLMs. The LLM outputs a probability for each of the tokens in the vocabulary. If we apply a temperature <i>T</i>, top-<i>k</i>, or top-<i>p</i> strategy, we modify the distribution from which the next token is sampled. With <i>T &gt; 1</i>, the distribution becomes more uniform, leading to more diverse outputs (since tokens have a more similar probability); in contrast, with <i>T &lt; 1</i>, the distribution becomes more peaked, leading to more focused and coherent outputs. If we set top-<i>k</i> to be 3, we only consider the three most likely tokens for sampling; similarly, with a top-<i>p</i> threshold of 80%, we consider the smallest set of tokens whose cumulative probability is at least 80%. Image by the author.
</small>
</p>

<h2 id="additional-relevant-concepts">Additional Relevant Concepts</h2>

<p>My goal with this post was to explain in plain but still technical words how LLMs work internally. In that sense, I guess I have already given the best I could and I should finish the text. However, there are some additional details that probably fit nicely as appendices here. Thus, I have decided to include them with a brief description and some references, for the readers who optionally want to go deeper into the topic.</p>

<div style="height: 20px;"></div>
<p align="center">── ◆ ──</p>
<div style="height: 20px;"></div>

<p><strong>Context Size</strong> — This refers to the maximum number of words/tokens that the model can consider as input at once, i.e., the input sequence length or <code class="language-plaintext highlighter-rouge">seq_len</code>. If we look at the attention mechanism figure above, we will see that the learned weight matrices are independent of the context size; however, the attention computation itself scales quadratically with sequence length due to the $QK^T$ operation. This is a major bottleneck in terms of memory and speed, and it’s the main reason why the initial LLMs had a fixed and shorter context size ($512$ - $4,096$ tokens). In recent years, the research community has explored new methods to alleviate that limitation, introducing techniques such as <a href="https://arxiv.org/abs/2004.05150">sparse attention</a>, <a href="https://arxiv.org/abs/2006.16236">linearized attention</a>, <a href="https://arxiv.org/abs/2006.04768">low-rank approximations</a>, and other mathematical/architectural/system tricks. These enable larger context sizes (up to $1,000,000$ tokens in the case of <a href="https://gemini.google.com/app">Gemini Pro</a>).</p>

<p><strong>Distillation and Quantization</strong> — As their name indicates, Large Language Models are <em>large</em>, and that makes them difficult to deploy in production environments. Two techniques to overcome that are <em>distillation</em> and <em>quantization</em>. When we distill a model, we train a smaller student model to mimic the behavior of a larger, slower but better performing teacher (i.e., the original LLM). This is achieved, among other techniques, by using the teacher’s output probabilities as soft labels when training the student. A notable example of distillation is <a href="https://arxiv.org/abs/1910.01108">DistilBERT (Sanh et al., 2019)</a>, which achieves around 97% of BERT’s performance, but with 40% less memory and 60% faster inference. On the other hand, <em>quantization</em> consists in representing the weights with lower precision, i.e., <code class="language-plaintext highlighter-rouge">float32 -&gt; int8</code> ($32/8 = 4$ times smaller models). The models not only become smaller, but the operations can be done faster (even 100x faster), and the accuracy is sometimes similar.</p>
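<p>The quantization idea can be sketched with symmetric 8-bit quantization of a random weight matrix; this is a simplified illustration, as production libraries use per-channel scales and calibration data:</p>

```python
import numpy as np

# Symmetric 8-bit quantization of a weight matrix (simplified sketch)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(1024, 1024)).astype(np.float32)  # fp32 weights

scale = np.abs(W).max() / 127.0               # map the largest weight to the int8 range
W_int8 = np.round(W / scale).astype(np.int8)  # quantize: 1 byte per weight
W_dequant = W_int8.astype(np.float32) * scale # dequantize for (or during) inference

print(W.nbytes // W_int8.nbytes)              # 4 -> 4x smaller
print(float(np.abs(W - W_dequant).max()))     # rounding error bounded by scale / 2
```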

<p><strong>Emergent Abilities</strong> — As described by <a href="https://arxiv.org/abs/2206.07682">Wei et al. (2022)</a>, <em>“emergent abilities are those that are not present in smaller models, but appear in larger ones”</em>. In other words, they are capabilities that arise without being explicitly trained for. This is often referred to as <em>zero-shot</em> or <em>few-shot</em> learning, because the model can perform tasks with no or very few examples, as demonstrated by <a href="https://arxiv.org/abs/2005.14165">GPT-3 (Brown et al., 2020)</a>. These abilities start to appear in the 10-100 billion parameter range (GPT-3 had 175 billion parameters). Examples of emergent abilities include arithmetic, commonsense reasoning, and even some forms of creativity.</p>

<p><strong>Scaling Laws</strong> — Kaplan et al. published in 2020 the interesting paper <a href="https://arxiv.org/abs/2001.08361">Scaling Laws for Neural Language Models</a>, which describes how the performance of language models scales. They discovered a power-law relationship between the model’s performance measured in terms of loss $L$, the required compute $C$, the dataset size $D$, and the model size $N$ (number of parameters): $L(X) \sim X^{-\alpha}$, with $X \in \lbrace N, C, D \rbrace$ and $\alpha \in [0.05, 0.1]$. In other words, when model size $N$, dataset size $D$, or training compute $C$ is scaled independently (and the others are not bottlenecks), the training loss $L$ decreases approximately as a power law of that quantity. This means we can use scaling laws to extrapolate model performance without actually training the models! Similarly, for a fixed compute budget, there is an optimal trade-off between model size and dataset size. These insights led to the development of more efficient training strategies and architectures, such as the ones explored in the <a href="https://arxiv.org/abs/2203.15556">Chinchilla study (Hoffmann et al., 2022)</a>, which suggests that smaller models trained on more data can achieve better performance than larger models trained on less data. Finally, note that training compute is roughly proportional to $6 \times N \times D$, while inference compute scales linearly with model size and generated sequence length.</p>
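<p>A quick worked consequence of the power law, using $\alpha \approx 0.076$ as an example exponent (of the order of magnitude reported for model size):</p>

```python
# If loss follows L(N) = (N_c / N)^alpha, doubling the model size N
# multiplies the loss by the constant factor 2^(-alpha)
alpha = 0.076                # example exponent (order of magnitude for model size)
factor = 2 ** (-alpha)       # loss multiplier per doubling of N
print(round(factor, 3))      # 0.949 -> ~5% lower loss per doubling of model size
```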

<p><strong>RLHF: Reinforcement Learning with Human Feedback</strong> — OpenAI presented <a href="https://arxiv.org/abs/2203.02155">InstructGPT (Ouyang et al., 2022)</a> shortly before releasing their popular <a href="https://chatgpt.com">ChatGPT</a>. This paper explains how the initial chatbot model GPT-3.5 was aligned with human preferences using <a href="https://en.wikipedia.org/wiki/Reinforcement_learning">reinforcement learning</a>. They followed 3 major steps: (1) First, a GPT model was fine-tuned with human-written conversation input-output pairs. (2) Then, the GPT model produced several answers to a set of prompts and human annotators ranked these outputs from best to worst. These annotations were used to train a reward model (RM) to automatically predict the output score. (3) Finally, the GPT model (<em>policy</em>) was trained using the <a href="https://en.wikipedia.org/wiki/Proximal_policy_optimization">Proximal Policy Optimization (PPO) algorithm</a>, based on the conversation history (<em>state</em>) and the outputs it produced (<em>actions</em>), and using the reward model (<em>reward</em>) as the evaluator.</p>

<p><strong>PEFT: Parameter-Efficient Fine-Tuning</strong> — <a href="https://arxiv.org/abs/2106.09685">Low-Rank Adaptation of Large Language Models (or LoRA by Hu et al., 2021)</a> consists in applying a mathematical trick during the fine-tuning of LLMs to make the process much more efficient. The pre-trained weight matrices $W$ are frozen, and we add to them a trainable update $dW = A \cdot B$, where the factors $A$ and $B$ have a much lower rank, and thus far fewer parameters. The trick reduces trainable parameters by orders of magnitude and maintains or matches full fine-tuning performance on many benchmarks. Therefore, it has become a standard method for domain adaptation and instruction tuning. One popular implementation is the <a href="https://github.com/huggingface/peft"><code class="language-plaintext highlighter-rouge">peft</code></a> library from <a href="https://huggingface.co/docs/peft/index">HuggingFace</a>.</p>
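<p>The LoRA trick can be sketched in a few lines; the sizes are toy values, and the zero initialization of one factor (so the update is zero at the start of training) follows the paper:</p>

```python
import numpy as np

# LoRA sketch: W stays frozen; only the low-rank factors A and B are trained
d, k, r = 1024, 1024, 8                # layer dimensions and rank, r << min(d, k)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, k))            # frozen pre-trained weights
A = rng.normal(size=(d, r)) * 0.01     # trainable factor, d x r
B = np.zeros((r, k))                   # trainable factor, r x k (zero init -> dW = 0)

def forward(x):
    # Effective weight is W + A @ B; computing (x @ A) @ B avoids
    # materializing the d x k update matrix
    return x @ W + (x @ A) @ B

full_params = d * k                    # parameters updated by full fine-tuning
lora_params = d * r + r * k            # parameters updated by LoRA
print(full_params // lora_params)      # 64 -> 64x fewer trainable parameters
```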

<p><strong>RAG: Retrieval Augmented Generation</strong> — LLMs have humongous amounts of general knowledge encoded in their parameters, but need to be fine-tuned for specific domains. That process is cumbersome and often inefficient, particularly when domain-specific information changes frequently. The work <a href="https://arxiv.org/abs/2005.11401">Retrieval-Augmented Generation (RAG) for Knowledge-Intensive NLP Tasks (Lewis et al., 2020)</a> addressed such settings by using non-parametric information, i.e., they outsource the domain-specific memory. It works as follows: in an offline ingestion phase, the knowledge is chunked and indexed, often as embedding vectors. In the real-time generation phase, the user asks a question, which is encoded and used to retrieve the most similar indexed chunks; then, the LLM is prompted to answer the question by using the found similar chunks, i.e., the retrieved data is injected in the query. RAGs reduce hallucinations and have been extensively implemented recently.</p>
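<p>Both RAG phases can be sketched with fake one-hot vectors standing in for a real embedding model (which a production system would pair with a vector database):</p>

```python
import numpy as np

# Offline ingestion: chunk the knowledge base and index one embedding per chunk
chunks = [
    "LoRA freezes W and trains low-rank updates.",
    "Transformers use self-attention.",
    "RAG retrieves relevant chunks and injects them into the prompt.",
]
index = np.eye(len(chunks), 16)  # fake chunk embeddings (one-hot rows)

def retrieve(query_embedding, k=1):
    # Cosine similarity between the query and every indexed chunk
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = index @ q
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

# Online generation: embed the question, retrieve, and augment the prompt
rng = np.random.default_rng(0)
question_embedding = index[2] + 0.01 * rng.normal(size=16)  # close to chunk 2
context = retrieve(question_embedding, k=1)
prompt = f"Answer using this context: {context[0]}\nQuestion: What is RAG?"
```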

<p><strong>Reasoning Models</strong> — Wei et al. showed that <a href="https://arxiv.org/abs/2201.11903">Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (2022)</a>. In other words, prompting the model to <em>think step by step</em> improves its performance on math and reasoning tasks, suggesting that reasoning abilities are partly latent in large models. That simple yet powerful idea sparked research into prompting strategies and drove fine-tuning with objectives that encourage multi-step inference, structured thinking, and tool use. One of the first popular open-source reasoning models was <a href="https://huggingface.co/deepseek-ai/DeepSeek-R1">DeepSeek-R1</a>, but most current models have improved reasoning capabilities, either through scale or via fine-tuning with reasoning objectives.</p>
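<p>Chain-of-thought is literally a change in the prompt, not in the model. A minimal sketch of the prompt variants (the actual model call is omitted; the question is an illustrative classic):</p>

```python
question = ("A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
            "than the ball. How much does the ball cost?")

# Direct prompting: ask for the answer immediately
direct_prompt = f"{question}\nAnswer:"

# Zero-shot chain-of-thought: nudge the model to reason before answering
cot_prompt = f"{question}\nLet's think step by step."

# Few-shot chain-of-thought: prepend a worked example showing the reasoning pattern
example = (
    "Q: I have 3 apples and buy 2 more. How many apples do I have?\n"
    "A: I start with 3 apples. Buying 2 more gives 3 + 2 = 5. The answer is 5.\n"
)
few_shot_cot_prompt = example + f"Q: {question}\nA:"
```

With the direct prompt, models often blurt out the intuitive (wrong) answer; the step-by-step variants make the intermediate arithmetic explicit, which is exactly the effect Wei et al. measured.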

<p><strong>Agents</strong> — The improvement of reasoning capabilities and tool usage boosted the development of the so-called <em>agentic workflows</em>. An agent is basically an LLM with tool access that is allowed to perform actions, e.g., read our emails and do some processing with them, such as classifying them or even answering the trivial ones. Libraries like <a href="https://www.langchain.com">LangChain</a> and <a href="https://www.langchain.com/langgraph">LangGraph</a> have made it relatively easy to build multi-agent systems that perform increasingly complex workflows, and frameworks like <a href="https://openclaw.ai">OpenClaw</a> have enabled the creation of personalized assistants. The future is being automated, and agents seem to be a key part of that — security issues aside.</p>
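<p>The control flow behind such an agent can be sketched as a simple loop in which the model either calls a tool or answers. In the sketch below, a hard-coded stub stands in for the LLM, and the email-counting tool is a hypothetical example:</p>

```python
def count_unread(mailbox: list) -> int:
    """A tool the agent is allowed to call."""
    return sum(1 for mail in mailbox if not mail["read"])

TOOLS = {"count_unread": count_unread}

def model(observation: str) -> dict:
    """Stub policy: a real agent would query an LLM here.
    It requests the tool once, then answers with the tool's result."""
    if "tool_result" not in observation:
        return {"action": "tool", "name": "count_unread"}
    return {"action": "answer",
            "text": f"You have {observation.split('=')[1]} unread emails."}

def run_agent(mailbox: list) -> str:
    observation = "task=summarize inbox"
    for _ in range(5):  # hard cap on steps, a common safeguard
        step = model(observation)
        if step["action"] == "tool":
            result = TOOLS[step["name"]](mailbox)   # execute the tool
            observation = f"tool_result={result}"   # feed the result back
        else:
            return step["text"]
    return "step limit reached"

inbox = [{"read": False}, {"read": True}, {"read": False}]
print(run_agent(inbox))  # -> "You have 2 unread emails."
```

Frameworks like LangGraph essentially generalize this loop into graphs of model calls, tools, and state.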

<h2 id="summary-and-some-final-thoughts">Summary and Some Final Thoughts</h2>

<p>Large Language Models are not magic. At their core, they are stacks of linear transformations, normalization layers, and attention mechanisms applied to sequences of embedding vectors. And yet, by scaling those simple components to unprecedented sizes (in terms of parameters, data, and compute) they exhibit capabilities that feel surprisingly powerful.</p>

<p>In this post, I have:</p>

<ul>
  <li>Reviewed how text is converted into embedding vectors.</li>
  <li>Described the original encoder-decoder Transformer and its encoder-only and decoder-only variants.</li>
  <li>Taken a closer look at the self-attention mechanism and multi-head attention.</li>
  <li>Discussed how decoding strategies such as temperature, top-$k$, top-$p$, and beam search influence text generation.</li>
  <li>Briefly touched on important side concepts such as context length, scaling laws, RLHF, PEFT/LoRA, RAG, reasoning models, and agents.</li>
</ul>

<p>If you want to go deeper into the topic, I recommend the following resources:</p>

<ul>
  <li><a href="https://arxiv.org/abs/1706.03762">The original paper: <em>Attention Is All You Need (Vaswani et al., 2017)</em></a></li>
  <li><a href="https://github.com/mxagar/nlp_with_transformers_nbs">My notes of the great book <em>NLP with Transformers (Tunstall et al., 2022)</em></a></li>
  <li><a href="https://jalammar.github.io/illustrated-transformer/">The Illustrated Transformer (Jay Alammar)</a></li>
  <li><a href="https://nlp.seas.harvard.edu/annotated-transformer/">The Annotated Transformer (Harvard NLP)</a></li>
  <li><a href="https://github.com/karpathy/minGPT">A minimal PyTorch re-implementation of the OpenAI GPT (Andrej Karpathy)</a></li>
</ul>

<div style="height: 20px;"></div>
<p align="center">── ◆ ──</p>
<div style="height: 20px;"></div>

<p>LLMs have increased my productivity significantly. I use them extensively for research, text editing, and programming. However, I still think that they are <em>expert systems <strong>for experts</strong></em>: when used without proper guidance, the quality of their output can be quite mediocre — and in some cases, even worse: I have seen many instances of dull AI-generated texts and bloated, unmaintainable code. I am aware, of course, that they are improving at a rapid pace.</p>

<p>Overall, I am optimistic. In the same way that the Internet increased our overall productivity — while replacing some jobs and creating new ones — I think LLMs will probably have a similar (or even greater) net-positive effect. For instance, I do not believe that Software Engineers will disappear. Rather, they will likely shift toward tasks related to architecture, orchestration, integration, and maintenance. Junior roles and inexperienced professionals seem to be the most affected at the moment, but they will also be able to learn faster with these tools than we did before. As the ecosystem stabilizes, they may end up being in even higher demand. And, at the end of the day, everyone will want a human responsible for any AI-generated outcome.</p>

<p>I am confident that we will find ways to mitigate risks such as <em>dependence</em>, <em>personal data harvesting</em> and <em>automated control/surveillance</em>, just as we invented gyms to stay fit and healthy, or engineered locks and cryptographic systems to protect our privacy. At the moment, it is hard for me to believe that a Transformer-based model can intentionally go rogue and cause harm on its own, because I cannot conceive of any <em>consciousness</em> in them — at least not in the sense of <em>“I am aware that I exist here and now, and I have some purpose and agency”</em>. I see LLMs as systems that simulate patterns from their training data. In contrast, humans maintain (and constantly update) a world model, and use language as a tool to interact with that world model. An LLM has no internal state and it can’t learn in real-time — it can only simulate <em>“small breaths”</em> by producing tokens; one could even argue that it <em>“dies”</em> after producing each token, or at least once the final <code class="language-plaintext highlighter-rouge">&lt;STOP&gt;</code> token is emitted.</p>

<p>At the same time, granting an LLM-based agent unrestricted access to personal information and powerful tools could indeed be irresponsible — perhaps comparable to giving a monkey a machine gun. Yet we have faced similar situations in the past: when technologies become powerful, we introduce safeguards, norms, and regulation to govern their use.</p>

<p>Let’s see what the future holds for us. I believe it will be exciting, and that we will find ways to navigate the limitations and risks of this technology, as we’ve done in the past.</p>

<p><br /></p>

<blockquote>
  <p>How have LLMs impacted your life? How do you think they will change the world in the next 5-10 years?</p>
</blockquote>

<p><br /></p>

<div id="disqus_thread"></div>
<script>
    /**
    *  RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS.
    *  LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables    */
    
    var disqus_config = function () {
    this.page.url = 'https://mikelsagardia.io/blog/how-are-llms-built.html';  // Replace PAGE_URL with your page's canonical URL variable
    this.page.identifier = 'https://mikelsagardia.io/blog/how-are-llms-built.html'; // Replace PAGE_IDENTIFIER with your page's unique identifier variable
    };
    
    (function() { // DON'T EDIT BELOW THIS LINE
    var d = document, s = d.createElement('script');
    s.src = 'https://mikelsagardia.disqus.com/embed.js';
    s.setAttribute('data-timestamp', +new Date());
    (d.head || d.body).appendChild(s);
    })();
</script>

<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>]]></content><author><name></name></author><category term="AI" /><category term="engineering," /><category term="large" /><category term="language" /><category term="models," /><category term="llm," /><category term="machine" /><category term="learning," /><category term="text" /><category term="generation," /><category term="generative" /><category term="AI," /><category term="deep" /><category term="attention" /><summary type="html"><![CDATA[&lt;!– Blog Post 1: How Are Large Language Models (LLMs) Built? Subtitle: A Conceptual Guide for Developers]]></summary></entry><entry><title type="html">An Introduction to Image Generation with Diffusion Models (2/2)</title><link href="https://mikelsagardia.io/blog/diffusion-hands-on.html" rel="alternate" type="text/html" title="An Introduction to Image Generation with Diffusion Models (2/2)" /><published>2026-01-22T10:30:00+00:00</published><updated>2026-01-22T10:30:00+00:00</updated><id>https://mikelsagardia.io/blog/diffusion-hands-on</id><content type="html" xml:base="https://mikelsagardia.io/blog/diffusion-hands-on.html"><![CDATA[<!--
Blog Post 1  
Title: An Introduction to Image Generation with Diffusion Models (1/2)  
Subtitle: A Conceptual Guide for Developers & ML Practitioners

Blog Post 2  
Title: An Introduction to Image Generation with Diffusion Models (2/2)  
Subtitle: Hands-On Examples with Hugging Face
-->

<p style="color: #777; font-style: italic; font-size: 1.5em; margin-top: 0.5em;">
  Hands-On Examples with HuggingFace
</p>

<!--
<div style="line-height:150%;">
    <br>
</div>
-->

<p align="center">
<img src="/assets/diffusion/ai_drawing_ai_dallev3.png" alt="An AI drawing an AI drawing an AI. Image generated using Dalle-E 3" width="1000" />
<small style="color:grey">An AI drawing an AI drawing an AI... Image generated using 
<a href="https://openai.com/index/dall-e-3/">Dall-E 3</a>. Prompt: <i>A friendly humanoid robot sits at a wooden table in a bright, sunlit room, happily drawing on a sketchbook. Soft light colors, landscape, peaceful, productive, and joyful atmosphere. The robot is drawing an image of itself drawing, creating a recursive effect. Large window in the background with greenery outside, warm natural lighting.</i>
</small>
</p>

<div style="height: 20px;"></div>
<div align="center" style="border: 1px solid #e4f312ff; background-color: #fcd361b9; padding: 1em; border-radius: 6px;">
<strong>
This is the second post of a series of two.
You can find the <a href="https://mikelsagardia.io/blog/diffusion-for-developers.html">first part here</a>.
Also, you can find the accompanying code in <a href="https://github.com/mxagar/diffusion-examples/tree/main/diffusers">this GitHub repository</a>.
</strong>
</div>
<div style="height: 30px;"></div>

<p>In just a few years, image generation has gone from <em>“cool demo”</em> to an almost ubiquitous tool. <a href="https://arxiv.org/abs/1312.6114">Variational Autoencoders (VAEs - Kingma &amp; Welling, 2013)</a> were followed by <a href="https://arxiv.org/abs/1406.2661">Generative Adversarial Networks (GANs - Goodfellow et al., 2014)</a>, and finally <a href="https://arxiv.org/abs/2006.11239">Denoising Diffusion Probabilistic Models (DDPMs - Ho et al., 2020)</a> became the dominant approach, leading to systems like <a href="https://arxiv.org/pdf/2307.01952">Stable Diffusion XL (Podell et al., 2023)</a> or <a href="https://arxiv.org/abs/2205.11487">Imagen &amp; Nano Banana</a>.</p>

<p>In the <a href="https://mikelsagardia.io/blog/diffusion-for-developers.html">first post of this series</a>, I explain how these model families work and I walk through a minimal DDPM implementation.
That DDPM is trained on car images and produces outputs like these:</p>

<p align="center">
<img src="/assets/diffusion/car_generation_best_model.png" alt="Eight Samples Generated by a DDPM" width="1000" />
<small style="color:grey">
Output of a <a href="https://arxiv.org/abs/2006.11239">Denoising Diffusion Probabilistic Model (Ho et al., 2020)</a> consisting of 54 million parameters, trained on the <a href="https://www.kaggle.com/datasets/eduardo4jesus/stanford-cars-dataset">Stanford Cars Dataset</a> (16,185 color images resized to <code>64x64</code> pixels) for 300 epochs. Check the complete implementation <a href="https://github.com/mxagar/diffusion-examples/tree/main/ddpm">here</a>.
</small>
</p>

<p>In this second part, I’ll focus on <strong>the practical use of diffusion models</strong>, specifically, on using the invaluable tools provided by <a href="https://huggingface.co/">HuggingFace</a>. To that end, I’ve divided the post into three parts:</p>

<ol>
  <li>A brief <a href="#a-very-brief-introduction-to-huggingface">introduction to HuggingFace</a>.</li>
  <li>A hands-on dive into some examples with <a href="#huggingface-diffusers-in-practice">HuggingFace Diffusers</a>.</li>
  <li>A small <a href="#in-painting-application">in-painting application</a> that puts everything together.</li>
</ol>

<p>Let’s go!</p>

<h2 id="a-very-brief-introduction-to-huggingface">A Very Brief Introduction to HuggingFace</h2>

<p><a href="https://huggingface.co">HuggingFace</a> has become one of the most important hubs in the machine learning community. It provides a collaborative environment where state-of-the-art <strong>datasets</strong> and <strong>models</strong> can be <strong>shared</strong>, <strong>explored</strong> and even <strong>tried</strong> directly from the browser (via <em>Spaces</em>). Beyond models and datasets, HuggingFace offers two particularly powerful resources:</p>

<ul>
  <li><a href="https://huggingface.co/learn"><strong>courses</strong></a>, covering key domains and techniques such as computer vision, natural language processing, audio, agents, 3D processing, reinforcement learning, and more;</li>
  <li>and a rich ecosystem of <strong>libraries</strong> that let us work with datasets and models end-to-end, across modalities. The most relevant ones for this post are:
    <ul>
      <li><a href="https://huggingface.co/docs/datasets/en/index"><code class="language-plaintext highlighter-rouge">datasets</code></a>: access, share, and process audio, text, and image datasets.</li>
      <li><a href="https://huggingface.co/docs/transformers/en/index"><code class="language-plaintext highlighter-rouge">transformers</code></a>: training and inference for text, vision, audio, video, and multimodal models.</li>
      <li><a href="https://huggingface.co/docs/diffusers/en/index"><code class="language-plaintext highlighter-rouge">diffusers</code></a>: pre-trained diffusion models for generating images, videos, and audio.</li>
    </ul>
  </li>
</ul>

<p>All of these can be installed easily in a Python environment:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>datasets transformers diffusers[<span class="s2">"torch"</span><span class="o">]</span> accelerate gradio
</code></pre></div></div>

<p align="center">
<img src="/assets/diffusion/hugging_face_screenshot.png" alt="HuggingFace Screenshot" width="1000" />
<small style="color:grey">
Screenshot of the <a href="https://huggingface.co">HuggingFace</a> portal, showing available models sorted by their popularity.
</small>
</p>

<p>In practice, <em>discriminative</em> models (across all modalities) and <em>generative</em> models for text are usually handled via the <code class="language-plaintext highlighter-rouge">transformers</code> library. On the other hand, generative <em>diffusion</em> models are managed through <code class="language-plaintext highlighter-rouge">diffusers</code>.</p>

<p>Models can be browsed and selected directly from the HuggingFace website, where they can be filtered by several criteria. One of the most useful is the <a href="https://huggingface.co/docs/transformers/main/main_classes/pipelines#transformers.pipeline.task">task</a>, for example:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">sentiment-analysis</code></li>
  <li><code class="language-plaintext highlighter-rouge">text-generation</code></li>
  <li><code class="language-plaintext highlighter-rouge">summarization</code></li>
  <li><code class="language-plaintext highlighter-rouge">translation</code></li>
  <li><code class="language-plaintext highlighter-rouge">audio-classification</code></li>
  <li><code class="language-plaintext highlighter-rouge">image-to-text</code></li>
  <li><code class="language-plaintext highlighter-rouge">object-detection</code></li>
  <li><code class="language-plaintext highlighter-rouge">image-segmentation</code></li>
  <li>…</li>
</ul>

<p>If we click on a model we will land on its <strong>model card</strong> page, which typically includes evaluation metrics, references, licensing information, and often a short code snippet showing how to load and run the model.</p>

<h3 id="pipelines">Pipelines</h3>

<p>The easiest way to run inference with most HuggingFace models is through the <code class="language-plaintext highlighter-rouge">pipeline</code> interface. While each task has its own specifics, the overall pattern is remarkably consistent. As an example, here’s how a <code class="language-plaintext highlighter-rouge">text-generation</code> pipeline looks:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">transformers</span>

<span class="c1"># Load the pipeline
</span><span class="n">pipe</span> <span class="o">=</span> <span class="n">transformers</span><span class="p">.</span><span class="n">pipeline</span><span class="p">(</span>
    <span class="s">"text-generation"</span><span class="p">,</span>  <span class="c1"># task
</span>    <span class="n">model</span><span class="o">=</span><span class="s">"Organization/ConcreteModel"</span><span class="p">,</span>  <span class="c1"># change to real model, e.g.: "openai-community/gpt2"
</span><span class="p">)</span>

<span class="c1"># Define the input (prompt)
</span><span class="n">messages</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">{</span><span class="s">"role"</span><span class="p">:</span> <span class="s">"system"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">:</span> <span class="s">"You are an AI who can draw AIs."</span><span class="p">},</span>
    <span class="p">{</span><span class="s">"role"</span><span class="p">:</span> <span class="s">"user"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">:</span> <span class="s">"What's the best technique to draw an AI?"</span><span class="p">},</span>
<span class="p">]</span>

<span class="c1"># Generate output (text)
</span><span class="n">outputs</span> <span class="o">=</span> <span class="n">pipe</span><span class="p">(</span>
    <span class="n">messages</span><span class="p">,</span>
<span class="p">)</span>

<span class="c1"># Display output (text)
</span><span class="k">print</span><span class="p">(</span><span class="n">outputs</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="s">"generated_text"</span><span class="p">][</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
</code></pre></div></div>

<p><br /></p>

<p>From this (deliberately simplified) example, we can extract a common workflow:</p>

<ul>
  <li>First, a model pipeline is loaded, by defining the task family (e.g., <code class="language-plaintext highlighter-rouge">text-generation</code>) as well as the concrete model name (e.g., <code class="language-plaintext highlighter-rouge">openai-community/gpt2</code>) we want to use.</li>
  <li>Then, we need to define the input to the pipeline; the input depends on the task at hand: if we want to classify an image, we need to load an image; if we want to generate text, we need an initial prompt or conversation history, etc.</li>
  <li>Finally, we pass the input to the pipeline and collect the output. The output format again depends on the task.</li>
</ul>

<p>Instead of relying on the generic <code class="language-plaintext highlighter-rouge">pipeline</code> abstraction, we can also load a specific model class directly. This is particularly common when working with diffusion models. For example, a typical <code class="language-plaintext highlighter-rouge">text-to-image</code> setup using <code class="language-plaintext highlighter-rouge">diffusers</code> looks like this:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">diffusers</span> <span class="kn">import</span> <span class="n">ConcreteModel</span>   <span class="c1"># change to real model, e.g.: AutoPipelineForText2Image
</span>
<span class="c1"># Load the pipeline
</span><span class="n">pipe</span> <span class="o">=</span> <span class="n">ConcreteModel</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span>
    <span class="s">"Organization/ConcreteModel"</span><span class="p">,</span>  <span class="c1"># change to real model, e.g.: "stabilityai/sdxl-turbo"
</span>    <span class="p">...</span>
<span class="p">)</span>

<span class="c1"># Define the input (prompt)
</span><span class="n">prompt</span> <span class="o">=</span> <span class="s">"An AI drawing an AI"</span>

<span class="c1"># Generate output (image)
</span><span class="n">image</span> <span class="o">=</span> <span class="n">pipe</span><span class="p">(</span>
    <span class="n">prompt</span><span class="o">=</span><span class="n">prompt</span><span class="p">,</span>
    <span class="p">...</span>
<span class="p">).</span><span class="n">images</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>

<span class="c1"># Save output (image)
</span><span class="n">image</span><span class="p">.</span><span class="n">save</span><span class="p">(</span><span class="s">"example.png"</span><span class="p">)</span>
</code></pre></div></div>

<p><br /></p>

<h3 id="more-information-on-the-huggingface-ecosystem">More Information on the HuggingFace Ecosystem</h3>

<p>This brief overview barely scratches the surface of the HuggingFace ecosystem. In the next sections, I’ll focus on concrete, ready-to-use examples that build directly on these ideas.</p>

<p>If you’d like to explore further, here are some additional resources:</p>

<ul>
  <li><a href="https://github.com/mxagar/tool_guides/tree/master/hugging_face">My guide on HuggingFace</a>, which covers topics such as:
    <ul>
      <li>Combining models with Pytorch/Tensorflow code.</li>
      <li>More complex pre- and post-processing steps for each task/modality, e.g.: tokenization, encoding, etc.</li>
      <li>Fine-tuning pre-trained models for different tasks by adding custom heads.</li>
      <li>Saving/loading fine-tuned models locally, as well as exporting them as ONNX for production.</li>
      <li>Examples with generative models of all modalities and conditioning types: <code class="language-plaintext highlighter-rouge">text-generation</code>, <code class="language-plaintext highlighter-rouge">text-to-image</code>, <code class="language-plaintext highlighter-rouge">text-to-video</code>, etc.</li>
    </ul>
  </li>
  <li>A comprehensive example in which I <a href="https://github.com/mxagar/llm_peft_fine_tuning_example">fine-tune a Large Language Model (LLM)</a> to perform a custom text classification task.</li>
  <li>My notes on the exceptional book <a href="https://github.com/mxagar/nlp_with_transformers_nbs">Natural Language Processing (NLP) with Transformers (Tunstall, von Werra &amp; Wolf — O’Reilly)</a>, written by the co-founders of HuggingFace — highly recommended if you want to use <code class="language-plaintext highlighter-rouge">transformers</code> effectively.</li>
</ul>

<h2 id="huggingface-diffusers-in-practice">HuggingFace Diffusers in Practice</h2>

<p>Let’s now move from concepts to code and run a few concrete examples using the <code class="language-plaintext highlighter-rouge">diffusers</code> library. For this section, I’ve prepared a companion notebook:</p>

<p>:point_right: <a href="https://github.com/mxagar/diffusion-examples/blob/main/diffusers/diffusers_and_co.ipynb"><code class="language-plaintext highlighter-rouge">diffusers/diffusers_and_co.ipynb</code></a></p>

<p>In this post, I’ll focus on showing and discussing the results produced by different models. If you want to see the full (and commented) code, I recommend opening the notebook alongside the article.</p>

<blockquote>
  <p>:warning: <strong>Hardware note</strong>: To run the notebook locally, you’ll need a <a href="https://mikelsagardia.io/blog/mac-os-ubuntu-nvidia-egpu.html">GPU setup with at least 12 GB of VRAM</a>. As an alternative, you can use a <a href="https://colab.research.google.com/">Google Colab instance</a> with a NVIDIA T4, or similar.</p>
</blockquote>

<h3 id="stable-diffusion-xl-turbo">Stable Diffusion XL Turbo</h3>

<p>The first example in the notebook covers a <em>conditioned</em> image generation task, specifically <code class="language-plaintext highlighter-rouge">text-to-image</code>, using the <a href="https://huggingface.co/stabilityai/sdxl-turbo">Stable Diffusion XL Turbo</a> model. The code closely follows the patterns introduced in the previous section:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">from</span> <span class="nn">diffusers</span> <span class="kn">import</span> <span class="n">AutoPipelineForText2Image</span>

<span class="c1"># Load the SDXL-Turbo text-to-image pipeline
</span><span class="n">pipe</span> <span class="o">=</span> <span class="n">AutoPipelineForText2Image</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span>
    <span class="s">"stabilityai/sdxl-turbo"</span><span class="p">,</span> 
    <span class="n">torch_dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">float16</span><span class="p">,</span> 
    <span class="n">variant</span><span class="o">=</span><span class="s">"fp16"</span>
<span class="p">)</span>

<span class="n">prompt</span> <span class="o">=</span> <span class="s">"""
A friendly humanoid robot sits at a wooden table in a bright, sunlit room, happily drawing on a sketchbook.
Soft light colors, landscape, peaceful, productive, and joyful atmosphere.
The robot is drawing an image of itself drawing, creating a recursive effect.
Large window in the background with greenery outside, warm natural lighting.
"""</span>

<span class="c1"># Seed for reproducibility
</span><span class="n">rand_gen</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">manual_seed</span><span class="p">(</span><span class="mi">148607185</span><span class="p">)</span>

<span class="c1"># Generate an image based on the text prompt
</span><span class="n">image</span> <span class="o">=</span> <span class="n">pipe</span><span class="p">(</span>
    <span class="n">prompt</span><span class="o">=</span><span class="n">prompt</span><span class="p">,</span> 
    <span class="n">num_inference_steps</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="c1"># 1 for sdxl-turbo, 25-50 for SD
</span>    <span class="n">guidance_scale</span><span class="o">=</span><span class="mf">1.0</span><span class="p">,</span> <span class="c1"># 1 for sdxl-turbo, 6-10 for SD
</span>    <span class="n">negative_prompt</span><span class="o">=</span><span class="p">[</span><span class="s">"overexposed"</span><span class="p">,</span> <span class="s">"underexposed"</span><span class="p">],</span> 
    <span class="n">generator</span><span class="o">=</span><span class="n">rand_gen</span>
<span class="p">).</span><span class="n">images</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</code></pre></div></div>

<p><br /></p>

<p>The result is already quite impressive, but it also clearly reveals its synthetic nature. Subtle artifacts appear in areas like eyes and fingers, and some mechanical structures lack global consistency or realism — typical issues when pushing generation speed to the extreme.</p>

<p align="center">
<img src="/assets/diffusion/robot_painting_sdxl_turbo.png" alt="A friendly humanoid robot drawing itself." width="1000" />
<small style="color:grey">
Image generated with <a href="https://huggingface.co/stabilityai/sdxl-turbo">SDXL Turbo</a>.
Prompt: <i>A friendly humanoid robot sits at a wooden table in a bright, sunlit room, happily drawing on a sketchbook. Soft light colors, landscape, peaceful, productive, and joyful atmosphere. The robot is drawing an image of itself drawing, creating a recursive effect. Large window in the background with greenery outside, warm natural lighting.</i>
</small>
</p>

<p><a href="https://huggingface.co/stabilityai/sdxl-turbo">Stable Diffusion XL Turbo</a> is a real-time <code class="language-plaintext highlighter-rouge">text-to-image</code> diffusion model derived from <a href="https://arxiv.org/pdf/2307.01952">Stable Diffusion XL (SDXL)</a>. Its key feature is that it can generate images in as few as one to four denoising steps. Unlike traditional diffusion models, which often require dozens of inference steps, SDXL Turbo prioritizes latency and interactivity, while still preserving much of SDXL’s visual quality.</p>

<p>This speedup is achieved through <a href="https://arxiv.org/abs/2311.17042">Adversarial Diffusion Distillation (ADD)</a>:</p>

<ul>
  <li>A large, high-quality SDXL model acts as a teacher.</li>
  <li>The Turbo model is trained to match the teacher’s output distribution.</li>
  <li>An adversarial objective helps close the quality gap introduced by aggressive step reduction.</li>
</ul>

<p>In short, a large model is distilled into a much faster one, enabling real-time image generation in creative tools and user interfaces.</p>

<h3 id="playground-v2">Playground V2</h3>

<p>An interesting alternative to SDXL Turbo is <a href="https://huggingface.co/playgroundai/playground-v2-1024px-aesthetic">Playground V2</a>. This model also targets high-quality image generation with fewer inference steps, but it takes a different approach: it prioritizes visual quality and aesthetics and it does not rely on distillation during training. Using the same prompt, Playground V2 produces a different output:</p>

<p align="center">
<img src="/assets/diffusion/robot_painting_playground_v2.png" alt="A friendly humanoid robot drawing itself." width="1000" />
<small style="color:grey">
Model: <a href="https://huggingface.co/playgroundai/playground-v2-1024px-aesthetic">Playground V2</a>.
Same prompt as before: <i>A friendly humanoid robot sits at a wooden table...</i>
</small>
</p>

<h3 id="combining-models">Combining Models</h3>

<p>Diffusion models don’t have to be used in isolation — they can also be chained together! In the next example, SDXL Turbo first generates an image of a puppy. That image is then used as conditioning input for the <code class="language-plaintext highlighter-rouge">image-to-image</code> model <a href="https://huggingface.co/kandinsky-community/kandinsky-2-2-prior">Kandinsky 2.2</a>. The result is an exaggerated image of a dog, but I think it showcases the potential of building such compositional pipelines.</p>

<p align="center">
<img src="/assets/diffusion/dog_drawing_sdlx_turbo_kandinsky.png" alt="A friendly dog, as a child-style painting and as a photorealistic photo." width="1000" />
<small style="color:grey">
Left image generated by <a href="https://huggingface.co/stabilityai/sdxl-turbo">SDXL Turbo</a>.
Right image generated by <a href="https://huggingface.co/kandinsky-community/kandinsky-2-2-prior">Kandinsky Prior 2.2</a>.
Left prompt: <i>A painting of a friendly dog painted by a child.</i>
Right prompt: <i>A photo of a friendly dog. High details, realistic (negative: low quality, bad quality).</i>
</small>
</p>

<p>Kandinsky is a multimodal diffusion model that separates <em>semantic understanding</em> from <em>image generation</em>. Unlike SDXL-style models, which directly condition image generation on text embeddings, Kandinsky uses a two-stage architecture:</p>

<ul>
  <li>Prior model, which maps text (and optionally images) into a shared latent space that represents high-level semantics.</li>
  <li>Decoder model (diffusion), which takes these semantic embeddings and generates the final image via a diffusion process.</li>
</ul>

<p>This explicit separation makes Kandinsky particularly well suited for compositional pipelines. One such pipeline is <em>in-painting</em>, i.e., we ask the model to regenerate a sub-region of a provided initial image. Here’s how it works:</p>

<ul>
  <li>Mask definition: A binary mask specifies which regions of the image should be regenerated (white) and which should remain fixed (black).</li>
  <li>Latent conditioning: The unmasked parts of the image are encoded and injected into the diffusion process, anchoring the generation spatially.</li>
  <li>Semantic guidance via the prior: Text prompts and optional image context guide what should appear in the masked regions.</li>
  <li>Diffusion-based regeneration: Noise is added only in the masked area, and the model denoises it while respecting both the surrounding visual context and the semantic intent from the prompt.</li>
</ul>
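<p>The masking mechanic in the list above can be sketched in a few lines of NumPy. This is a toy illustration of the idea, not Kandinsky's actual implementation, and all names are invented: noise is injected only where the mask is white, while the black region stays anchored to the original pixels.</p>

```python
import numpy as np

def noisy_masked_start(image: np.ndarray, mask: np.ndarray, rng=None) -> np.ndarray:
    """Toy in-painting initialization: replace the masked (white, mask=1)
    region with random noise while keeping the unmasked (black, mask=0)
    region fixed to the original pixels."""
    rng = rng or np.random.default_rng(0)
    noise = rng.standard_normal(image.shape).astype(image.dtype)
    mask3 = mask[..., None] if image.ndim == 3 else mask  # broadcast over channels
    return mask3 * noise + (1.0 - mask3) * image

# Example: a 4x4 grayscale "image" whose top-left 2x2 block should be regenerated
image = np.ones((4, 4), dtype=np.float64)
mask = np.zeros((4, 4), dtype=np.float64)
mask[:2, :2] = 1.0  # white = regenerate

x_start = noisy_masked_start(image, mask)
# Unmasked pixels are untouched; masked pixels now contain noise
# that the diffusion model would denoise, guided by the prompt.
```

In the real pipeline, this anchoring happens in latent space at every denoising step, which is what keeps lighting and perspective consistent with the untouched regions.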

<p>Because Kandinsky reasons at a semantic level first, inpainting results tend to be context-aware: lighting, perspective, and style are usually consistent with the original image, even when the prompt introduces new elements.</p>

<p>Here’s an example with the popular oil painting <a href="https://en.wikipedia.org/wiki/Girl_with_a_Pearl_Earring"><em>Girl with a Pearl Earring</em> by Vermeer</a>. Unfortunately, the <em>pearl earring</em> doesn’t survive the process :sweat_smile:</p>

<p align="center">
<img src="/assets/diffusion/vermeer_girl_mask_inpainting_kandinsky.png" alt="Vermeer's Girl with a Pearl Earring, in-painted to wear a surgical mask." width="1000" />
<small style="color:grey">
Model: <a href="https://huggingface.co/kandinsky-community/kandinsky-2-2-decoder-inpaint">Kandinsky Inpaint 2.2</a>.
Prompt: <i>Oil painting of a woman wearing a surgical mask, Vermeer (negative: bad anatomy, deformed, ugly, disfigured).</i>
I obtained the image from Wikipedia and drew the mask manually.
Check <a href="https://www.bbc.com/news/uk-england-bristol-52382500">this piece from Banksy</a> if you would like to know how this could be done differently.
</small>
</p>

<h2 id="building-proof-of-concept-applications-zero-shot-segmentation-and-in-painting">Building Proof-of-Concept Applications: Zero-Shot Segmentation and In-Painting</h2>

<p>As shown in the notebook <a href="https://github.com/mxagar/diffusion-examples/blob/main/diffusers/diffusers_and_co.ipynb"><code class="language-plaintext highlighter-rouge">diffusers/diffusers_and_co.ipynb</code></a>, running different models for isolated tasks is already quite straightforward. This naturally leads to the next question:</p>

<blockquote>
  <p>What if we combine several models to build small, interactive applications?</p>
</blockquote>

<p>Along these lines, I implemented a simple proof-of-concept: <a href="https://github.com/mxagar/diffusion-examples/tree/main/inpainting_app"><code class="language-plaintext highlighter-rouge">inpainting_app</code></a>. The idea behind it is to chain <strong>segmentation</strong> and <strong>diffusion-based in-painting</strong> into a single workflow:</p>

<ul>
  <li>First, we load an image and select a few points on the region we want to modify (typically the foreground).</li>
  <li>Next, the <a href="https://huggingface.co/docs/transformers/en/model_doc/sam">Segment Anything Model (SAM) from Meta</a> generates a segmentation mask for that region. Everything outside the mask is treated as background.
SAM is a vision transformer capable of zero-shot segmentation, but it still requires some minimal guidance (points or a bounding box) to specify the region of interest.</li>
  <li>Finally, we select either the foreground or the background region and run the <a href="https://huggingface.co/diffusers/stable-diffusion-xl-1.0-inpainting-0.1">in-painting version of SDXL</a>. The selected region is regenerated according to a text prompt, while remaining visually consistent with the rest of the image.</li>
</ul>

<p>As before, if you plan to run the app locally you’ll need a <a href="https://mikelsagardia.io/blog/mac-os-ubuntu-nvidia-egpu.html">GPU setup with at least 12 GB of VRAM</a> :sweat_smile:.</p>

<h3 id="ui-and-application-structure">UI and Application Structure</h3>

<p>The application is built using <a href="https://www.gradio.app/">Gradio</a>, a Python library similar to <a href="https://streamlit.io/">Streamlit</a> that builds nice-looking, web-based GUIs. Since Gradio is developed by HuggingFace, it integrates seamlessly with the models used here.</p>

<p>If you want a deeper introduction to Gradio, you can check my <a href="https://github.com/mxagar/tool_guides/tree/master/gradio">Gradio Quickstart Guide</a>, where I cover the basics and several advanced patterns.</p>

<p>The structure of the app is intentionally simple:</p>

<ul>
  <li>The GUI and the app structure are controlled by <a href="https://github.com/mxagar/diffusion-examples/blob/main/inpainting_app/app.py"><code class="language-plaintext highlighter-rouge">app.py</code></a>. The entry point is <code class="language-plaintext highlighter-rouge">app.generate_app()</code>, which takes two functions as inputs:
    <ul>
      <li>a function that performs image segmentation given a set of user-selected points,</li>
      <li>and a function that runs in-painting given an image, a mask, and a prompt.</li>
    </ul>
  </li>
  <li>The notebook <a href="https://github.com/mxagar/diffusion-examples/blob/main/inpainting_app/inpainting.ipynb"><code class="language-plaintext highlighter-rouge">inpainting.ipynb</code></a> defines and prepares those input functions:
    <ul>
      <li><code class="language-plaintext highlighter-rouge">run_segmentation(raw_image, input_points, processor, model, ...) -&gt; input_mask</code></li>
      <li><code class="language-plaintext highlighter-rouge">run_inpainting(raw_image, input_mask, prompt, pipeline, ...) -&gt; generated_image</code></li>
    </ul>
  </li>
  <li>Internally, <code class="language-plaintext highlighter-rouge">app.generate_app()</code> creates a <code class="language-plaintext highlighter-rouge">gradio.Blocks</code> layout, which is composed of <code class="language-plaintext highlighter-rouge">gradio.Row()</code> sections that contain the UI widgets: image canvases, sliders, text boxes, buttons, etc. These widgets are connected to callback functions; for instance: when we select points in the uploaded <code class="language-plaintext highlighter-rouge">raw_image</code>, the callback <code class="language-plaintext highlighter-rouge">on_select()</code> is invoked, which under the hood executes <code class="language-plaintext highlighter-rouge">run_segmentation()</code> using the uploaded <code class="language-plaintext highlighter-rouge">raw_image</code> and the selected <code class="language-plaintext highlighter-rouge">input_points</code>.</li>
</ul>

<p>While everything could be packaged into standalone modules, keeping part of the logic in a notebook makes experimentation much easier and encourages rapid iteration.</p>

<h3 id="the-result">The Result</h3>

<p>When the application is launched via <code class="language-plaintext highlighter-rouge">app.generate_app()</code>, the user sees the following UI at <code class="language-plaintext highlighter-rouge">http://localhost:8080</code>:</p>

<p align="center">
<img src="/assets/diffusion/app_gui.png" alt="App GUI." width="1000" />
<small style="color:grey">
The Graphical User Interface (GUI) of our application.
</small>
</p>

<p>So how does it perform? Let’s look at an example.</p>

<p align="center">
<img src="/assets/diffusion/monalisa_inpainting.png" alt="Mona Lisa In-Painting." width="1000" />
<small style="color:grey">
Mona Lisa re-imagined. <a href="https://huggingface.co/docs/transformers/en/model_doc/sam">SAM (Segment Anything Model)</a> is used to segment foreground (green) &amp; background (yellow), and <a href="https://huggingface.co/diffusers/stable-diffusion-xl-1.0-inpainting-0.1">Stable Diffusion XL Inpainting</a> to re-generate the selected region.
Prompt (applied to the background): <i>A fantasy landscape with flying dragons (negative: artifacts, low quality, distortion).</i>
</small>
</p>

<p>I think the result shows that the pipeline produces a visually coherent image: the new background blends naturally with the original painting’s lighting, perspective, and color palette.
Despite the strong semantic change introduced by the prompt, the Mona Lisa remains intact and consistent, which highlights how well segmentation and diffusion-based in-painting can work together even in artistic, non-photographic domains.</p>

<h2 id="wrap-up">Wrap Up</h2>

<p>In this second post, we moved from theory to practice and explored how modern diffusion models can be used out of the box with <a href="https://huggingface.co/">HuggingFace</a> tools. I covered how to run state-of-the-art <code class="language-plaintext highlighter-rouge">text-to-image</code> models with diffusers, how different diffusion architectures trade off speed and quality, and how combining models enables more powerful workflows such as segmentation-aware in-painting.</p>

<p>Beyond individual examples, the main takeaway is how composable today’s generative models have become. By chaining pre-trained components—segmentation, conditioning, and diffusion—we can quickly prototype creative and practical applications without training models from scratch.</p>

<p>If you want to dig deeper, here are some useful starting points:</p>

<ul>
  <li>
    <p>:point_right: <a href="https://mikelsagardia.io/blog/diffusion-for-developers.html">Conceptual background on diffusion models (Part 1 of this series)</a></p>
  </li>
  <li>
    <p>:point_right: <a href="https://github.com/mxagar/diffusion-examples/tree/main/diffusers">Code for this post (Diffusers examples)</a></p>
  </li>
  <li>
    <p>:point_right: <a href="https://github.com/mxagar/diffusion-examples/tree/main/inpainting_app">In-painting application (SAM + SDXL)</a></p>
  </li>
  <li>
    <p>:point_right: <a href="https://github.com/mxagar/tool_guides/tree/master/hugging_face">My guide on HuggingFace</a></p>
  </li>
</ul>

<p>I’m curious to hear your thoughts:</p>

<p><br /></p>

<blockquote>
  <p>What real-world or creative use case do you think would benefit most from this kind of segmentation-guided in-painting app? Do you know some businesses using similar pipelines?</p>
</blockquote>

<p><br /></p>

<div id="disqus_thread"></div>
<script>
    /**
    *  RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS.
    *  LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables    */
    
    var disqus_config = function () {
    this.page.url = 'https://mikelsagardia.io/blog/diffusion-hands-on.html';  // Replace PAGE_URL with your page's canonical URL variable
    this.page.identifier = 'https://mikelsagardia.io/blog/diffusion-hands-on.html'; // Replace PAGE_IDENTIFIER with your page's unique identifier variable
    };
    
    (function() { // DON'T EDIT BELOW THIS LINE
    var d = document, s = d.createElement('script');
    s.src = 'https://mikelsagardia.disqus.com/embed.js';
    s.setAttribute('data-timestamp', +new Date());
    (d.head || d.body).appendChild(s);
    })();
</script>

<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>]]></content><author><name></name></author><category term="AI" /><category term="engineering," /><category term="diffusion," /><category term="machine" /><category term="learning," /><category term="image" /><category term="generation," /><category term="generative" /><category term="AI," /><category term="deep" /><category term="model" /><category term="training," /><category term="inference," /><category term="hugging" /><category term="face," /><category term="diffusers," /><category term="sdxl," /><category term="stable" /><category term="in-painting," /><category term="sam," /><category term="segmentation" /><summary type="html"><![CDATA[&lt;!– Blog Post 1 Title: An Introduction to Image Generation with Diffusion Models (1/2) Subtitle: A Conceptual Guide for Developers &amp; ML Practitioners]]></summary></entry><entry><title type="html">An Introduction to Image Generation with Diffusion Models (1/2)</title><link href="https://mikelsagardia.io/blog/diffusion-for-developers.html" rel="alternate" type="text/html" title="An Introduction to Image Generation with Diffusion Models (1/2)" /><published>2026-01-20T10:30:00+00:00</published><updated>2026-01-20T10:30:00+00:00</updated><id>https://mikelsagardia.io/blog/diffusion-for-developers</id><content type="html" xml:base="https://mikelsagardia.io/blog/diffusion-for-developers.html"><![CDATA[<!--
Blog Post 1  
Title: An Introduction to Image Generation with Diffusion Models (1/2)  
Subtitle: A Conceptual Guide for Developers & ML Practitioners

Blog Post 2  
Title: An Introduction to Image Generation with Diffusion Models (2/2)  
Subtitle: Hands-On Examples with Hugging Face
-->

<p style="color: #777; font-style: italic; font-size: 1.5em; margin-top: 0.5em;">
  A Conceptual Guide for Developers &amp; ML Practitioners
</p>

<!--
<div style="line-height:150%;">
    <br>
</div>
-->

<p align="center">
<img src="/assets/diffusion/felix-rottmann-0S6kUgMT-l8-unsplash.jpg" alt="Milky way over Dolomites mountains covered in snow: Photo by @felixrottmann from Unsplash" width="1000" />
<small style="color:grey">Fascinating Milky Way over Dolomite mountains. Was all this spontaneous generation? Which is its <i>latent space</i>? Photo by <a href="https://unsplash.com/photos/a-snowy-mountain-with-stars-in-the-sky-0S6kUgMT-l8">@felixrottmann from Unsplash</a>.</small>
</p>

<div style="height: 20px;"></div>
<div align="center" style="border: 1px solid #e4f312ff; background-color: #fcd361b9; padding: 1em; border-radius: 6px;">
<strong>
This is the first post of a series of two.
You can find the <a href="https://mikelsagardia.io/blog/diffusion-hands-on.html">second part here</a>.
Also, you can find the accompanying code in <a href="https://github.com/mxagar/diffusion-examples/tree/main/ddpm">this GitHub repository</a>.
</strong>
</div>
<div style="height: 30px;"></div>

<p>I still find it fascinating that machine learning models are able to learn from examples, even after working with them for more than a decade.<br />
But what truly feels like magic to me are image and video generation models.</p>

<p>Driven by that fascination, I decided to write a short series of posts explaining how these models work, both in theory and in practice.</p>

<p>In <strong>this first post</strong>, you will:</p>

<ul>
  <li>Learn how image <em>generation</em> differs from image <em>discrimination</em> (e.g., image classification).</li>
  <li>Understand how <em>Diffusion</em> models compare to <em>Generative Adversarial Networks</em> (GANs) and <em>Variational Autoencoders</em> (VAEs).</li>
  <li>Learn how <em>Denoising Diffusion Probabilistic Models</em> (DDPMs) work at both conceptual and mathematical levels.</li>
  <li>See a full, minimal PyTorch implementation of a DDPM that generates car images using a consumer-grade GPU.</li>
</ul>

<p>In the <a href="https://mikelsagardia.io/blog/diffusion-hands-on.html"><strong>second and final post</strong></a>, I’ll move on to practical examples using the Hugging Face libraries.</p>

<p>Let’s get started!</p>

<div style="height: 20px;"></div>
<p align="center">── ◆ ──</p>
<div style="height: 20px;"></div>

<p>In machine learning, any sample or data instance can be represented as a feature vector $x$.
These features could be the RGB values of an image’s pixels or the words (tokens) of a text represented as vocabulary indices.
In deep learning, these vectors are often transformed into <em>embeddings</em> or <em>latent</em> vectors, which are compressed representations that still retain semantic meaning.</p>
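<p>As a toy illustration of how such embeddings support algebraic operations (the 2D vectors below are invented for the example, not real learned embeddings), we can reproduce the classic word-vector arithmetic with plain NumPy:</p>

```python
import numpy as np

# Toy 2D embeddings, invented for illustration:
# axis 0 roughly encodes "royalty", axis 1 roughly encodes "gender".
emb = {
    "king":  np.array([0.9,  0.8]),
    "queen": np.array([0.9, -0.8]),
    "man":   np.array([0.1,  0.8]),
    "woman": np.array([0.1, -0.8]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means same direction in embedding space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Classic embedding arithmetic: king - man + woman ~ queen
result = emb["king"] - emb["man"] + emb["woman"]
closest = max(emb, key=lambda w: cosine(result, emb[w]))  # -> "queen"
```

Real embedding spaces have hundreds or thousands of dimensions, but the principle is the same: semantics are captured as directions and distances in the latent space.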

<p align="center">
<img src="/assets/diffusion/embeddings.png" alt="Image and Text Embeddings" width="1000" />
<small style="color:grey">Any sample or data point of any modality can be represented as an <i>n</i>-dimensional vector <i>x</i> in machine learning; in the figure, images and words (tokens) are represented as 2D vector embeddings. These embeddings contain conceptual information in a compressed form. When semantics and/or similarities between samples are captured, algebraic operations can be used with the vectors, resulting in coherent, logical outputs. Image by the author.
</small>
</p>

<p>Up until recently, mainly <strong>discriminative models</strong> have been used, which predict properties of those embeddings.
These models are trained with annotated data, for instance class labels: $x$ belongs to class $y = $ <em>cat</em> or <em>dog</em>.
The model then learns decision boundaries that allow it to predict the class of new, unseen samples.
Mathematically, we can represent that as $y = f(x)$, or better as $p(y \mid x)$,
i.e., the probability $p$ of each class $y$ given the instance/sample $x$.</p>

<p>On the other hand, in recent years, <strong>generative models</strong> have gained popularity.
These models do not explicitly capture decision boundaries; instead, they learn the probability distribution of the data.
As a result, they can sample from these distributions and generate new, unseen examples.
Following the same mathematical notation, we can say that these models learn $p(x)$, or $p(x, y)$ if the classes are considered.</p>
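<p>To make the distinction concrete, here is a small NumPy sketch (my own toy example, not from any library): fitting one Gaussian per class models $p(x \mid y)$; Bayes' rule on the fitted densities then gives the discriminative prediction $p(y \mid x)$, while sampling from the fitted Gaussians generates new, unseen instances.</p>

```python
import numpy as np

rng = np.random.default_rng(42)

# Two classes of 2D samples x, drawn from well-separated Gaussians
x_cat = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))  # class y = "cat"
x_dog = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(200, 2))  # class y = "dog"

# Fit a Gaussian per class (mean + covariance): this models p(x | y)
stats = {}
for name, data in [("cat", x_cat), ("dog", x_dog)]:
    stats[name] = (data.mean(axis=0), np.cov(data.T))

def gauss_pdf(x, mean, cov):
    """Density of a 2D Gaussian at point x."""
    d = x - mean
    norm = 1.0 / (2 * np.pi * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)

def p_y_given_x(x):
    """Discriminative use: Bayes' rule with equal priors -> p(y | x)."""
    scores = {y: gauss_pdf(x, *stats[y]) for y in stats}
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}

def sample_x(y, n=5):
    """Generative use: draw new, unseen samples from the fitted p(x | y)."""
    mean, cov = stats[y]
    return rng.multivariate_normal(mean, cov, size=n)

probs = p_y_given_x(np.array([0.2, -0.1]))  # a point near the "cat" cluster
new_dogs = sample_x("dog")                  # brand-new "dog" samples
```

The same fitted model is used both ways: querying the densities discriminates, sampling them generates.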

<p align="center">
<img src="/assets/diffusion/discriminative_vs_generative.png" alt="Discriminative vs. Generative Models" width="1000" />
<small style="color:grey">A dataset of 2D samples $x$ (i.e., 2 features) used to fit a discriminative and a generative model.
Discriminative models learn decision boundaries and are able to predict the class $y$ of new, unseen instances.
Generative models learn the data distribution $p(x)$ and are able to sample new unseen instances.
Image by the author.
</small>
</p>

<p>In terms of <em>ease of control</em>, generative models can be of two main types:</p>

<ul>
  <li><em>Unconditional</em>, $p(x)$: These models learn the data distribution $p(x)$ and blindly create samples from it, without much control. You can check <a href="https://thispersondoesnotexist.com/">these artificial faces</a> as an example.</li>
  <li><em>Conditional</em>, $p(x \mid \textrm{condition})$: They generate new samples conditioned on an input we provide, e.g., a class, a text prompt or an image.
In the realm of the text modality, probably the most well-known generative model is OpenAI’s <a href="https://openai.com/index/chatgpt/">(Chat)GPT</a>, which is able to produce words (tokens), and subsequently conversations, conditioned by a prompt or user instruction.
When it comes to the modality of images, it’s difficult to point to a single winner, but common models are
<a href="https://openai.com/index/dall-e-3/">Dall-E</a>, <a href="https://www.midjourney.com/">Midjourney</a>, or <a href="https://en.wikipedia.org/wiki/Stable_Diffusion">Stable Diffusion</a> — all of them are <code class="language-plaintext highlighter-rouge">text-to-image</code> conditional models.</li>
</ul>

<p>In terms of the <em>modalities</em> they can work with, generative models can be:</p>

<ul>
  <li><em>Uni-modal</em>: These models can handle/produce samples of a single modality, e.g., text or images.</li>
  <li><em>Multi-modal</em>: They are able to work with instances of different modalities simultaneously.
They can achieve that by creating a common <em>latent space</em> for all modalities, or mappings between them.
Latent spaces are compressed vector spaces that capture the semantics of the vectors that form them.
As a result, given a text-image multimodal model, we can ask questions about the content of an image.
Notable examples are <a href="https://openai.com/research/gpt-4v-system-card">GPT4-Vision</a> and <a href="https://huggingface.co/spaces/badayvedat/LLaVA">LLaVA</a>.</li>
</ul>

<blockquote>
  <p>Discriminative models learn to predict specific properties of a data sample (e.g., a class or a value), whereas generative models learn the data distribution and are able to sample it.
Additionally, this sampling can often be conditioned by a prompt.</p>
</blockquote>

<h2 id="why-diffusion-replaced-gans-for-image-generation">Why Diffusion Replaced GANs for Image Generation</h2>

<p>There are three main families of generative approaches for image generation:</p>

<ul>
  <li>Variational Autoencoders (VAEs)</li>
  <li>Generative Adversarial Networks (GANs)</li>
  <li>Denoising Diffusion Probabilistic Models (Diffusers)</li>
</ul>

<p><a href="https://en.wikipedia.org/wiki/Autoencoder"><strong>Autoencoders</strong></a> are architectures that compress the input $x$ into a lower-dimensional latent vector $z$ and then they expand it again to try to recreate $x$. The compression side is called <em>encoder</em>, the middle layer which produces the latent vector is the <em>bottleneck</em>, and the expansion side is named the <em>decoder</em>. As mentioned, the final output $x’$ tries to approximate $x$ as closely as possible; the gradient of the reconstruction error is used to update the weights of all layers. Many types of layers and configurations can be used for the encoder &amp; decoder parts; e.g., with images often <a href="https://en.wikipedia.org/wiki/Convolutional_layer">convolutional layers</a>, <a href="https://en.wikipedia.org/wiki/Convolutional_neural_network">pooling</a>, <a href="https://arxiv.org/abs/1207.0580">dropout</a>, and <a href="https://en.wikipedia.org/wiki/Normalization_(machine_learning)">batch normalization</a> are used to compress the image, whereas the expansion usually is implemented with <a href="https://d2l.ai/chapter_computer-vision/transposed-conv.html">transpose convolutions</a>.</p>

<p><a href="https://arxiv.org/abs/1312.6114"><strong>Variational Autoencoders (VAEs)</strong></a> are autoencoders in which the elements of the latent $z$ vector are Gaussian distributions, i.e., for each latent element, they produce a mean and a variance, and then a value is sampled from that element distribution to produce the latent values. The practical effect is that VAEs produce latent spaces in which interpolation results in much smaller discontinuities than in non-variational autoencoders.</p>
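<p>A minimal PyTorch sketch of that Gaussian bottleneck, assuming made-up feature and latent sizes (the convolutional encoder and decoder bodies are omitted): the encoder head outputs a mean and a log-variance per latent element, and the latent value is sampled as $z = \mu + \sigma \epsilon$, the so-called reparameterization trick, which keeps the sampling step differentiable.</p>

```python
import torch

torch.manual_seed(0)

class VAEBottleneck(torch.nn.Module):
    """Minimal VAE bottleneck: maps encoder features to (mu, logvar)
    and samples z with the reparameterization trick."""
    def __init__(self, feat_dim=128, latent_dim=16):
        super().__init__()
        self.to_mu = torch.nn.Linear(feat_dim, latent_dim)
        self.to_logvar = torch.nn.Linear(feat_dim, latent_dim)

    def forward(self, h):
        mu = self.to_mu(h)
        logvar = self.to_logvar(h)
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)  # noise sampled outside the gradient path
        z = mu + std * eps           # differentiable w.r.t. mu and logvar
        return z, mu, logvar

bottleneck = VAEBottleneck()
h = torch.randn(4, 128)              # fake encoder features, batch of 4
z, mu, logvar = bottleneck(h)
# KL term of the VAE loss, pushing the latent distribution towards N(0, I);
# this is what makes the latent space smooth and interpolation-friendly.
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
```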

<p>VAEs have been typically implemented for compression, denoising and anomaly detection; even though they can generate new samples using only their decoder, they usually produce less realistic results. However, they are fundamental to understand generative models, since they intuitively introduce many of the concepts later revisited by subsequent approaches. If you’d like to know more practical details, you can have a look at <a href="https://github.com/mxagar/generative_ai_book/tree/main/notebooks/03_vae">these examples</a>.</p>

<p><a href="https://arxiv.org/abs/1406.2661"><strong>Generative Adversarial Networks (GANs)</strong></a> were presented by Goodfellow et al. in 2014 and they represented a significant advancement in realistic image generation. They have two components: a <em>generator</em> (decoder-like) and a <em>discriminator</em> (encoder-like), but they are arranged and trained differently, as shown in the figure below.</p>

<p>The <em>generator</em> $G$ tries to generate realistic images as if they belonged to the real data distribution, starting with latent vector $z$ expanded from a noise seed. On the other hand, the <em>discriminator</em> $D$ tries to determine whether an image $x$ is real or fake (i.e., generated: $x’ = G(z)$). Usually, $D$ and $G$ have mirrored architectures and their layers are equivalent to the ones used in VAEs.</p>

<p>The training phase looks as follows:</p>

<ul>
  <li>First, $D$ is trained: we create batches of real images $x$ and batches of fake images $x’ = G(z)$, pass them to the discriminator $D$ (i.e., we get $D(x), D(G(z))$), and compute the error with respect to the correct labels. That error is backpropagated to update only the weights of $D$.</li>
  <li>Second, $G$ is trained: we create new fake images with the generator $G$ and pass them to the discriminator $D$. The prediction error is backpropagated to update only the weights of $G$.</li>
  <li>Both steps are alternated for several iterations, until the error metrics no longer improve.</li>
</ul>

<p>Once the model is trained, the inference is done with the generator $G$ alone.</p>
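<p>The alternating training scheme can be sketched with tiny MLPs on 2D toy data; this is a didactic sketch with arbitrary layer sizes and learning rates, not a production GAN:</p>

```python
import torch

torch.manual_seed(0)
latent_dim, data_dim = 8, 2

# Generator G: noise z -> fake sample; Discriminator D: sample -> real/fake logit
G = torch.nn.Sequential(torch.nn.Linear(latent_dim, 32), torch.nn.ReLU(),
                        torch.nn.Linear(32, data_dim))
D = torch.nn.Sequential(torch.nn.Linear(data_dim, 32), torch.nn.ReLU(),
                        torch.nn.Linear(32, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = torch.nn.BCEWithLogitsLoss()

def real_batch(n=64):
    # "Real" data distribution: a Gaussian blob centered at (2, 2)
    return torch.randn(n, data_dim) * 0.3 + 2.0

for step in range(200):
    # --- 1) Train D: real -> 1, fake -> 0 (only D's weights are updated)
    x = real_batch()
    z = torch.randn(x.size(0), latent_dim)
    fake = G(z).detach()  # detach blocks gradients from flowing into G
    loss_d = bce(D(x), torch.ones(x.size(0), 1)) + \
             bce(D(fake), torch.zeros(x.size(0), 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # --- 2) Train G: fool D into predicting "real" (only G's weights update)
    z = torch.randn(64, latent_dim)
    loss_g = bce(D(G(z)), torch.ones(64, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

# Inference: the generator alone maps noise to new samples
samples = G(torch.randn(1000, latent_dim))
```

Note how `detach()` in step 1 and the optimizer choice in step 2 implement the "update only $D$" / "update only $G$" alternation described above.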

<p>GANs are notoriously difficult to train, <a href="https://github.com/soumith/ganhacks">due to several factors</a> that are out of the scope of this blog post. Fortunately, <a href="https://arxiv.org/abs/1606.03498">guidelines</a> which aid the training process have been proposed. Also, method improvements have been presented, such as the <a href="https://arxiv.org/abs/1704.00028">Wasserstein GAN with Gradient Penalty</a>, which alleviates the major training difficulties, and <a href="https://arxiv.org/abs/1411.1784">conditional GANs</a>, which provide control to the user during generation (e.g., create male or female faces).</p>

<p align="center">
<img src="/assets/diffusion/vae_and_gan.png" alt="VAEs and GANs" width="1000" />
<small style="color:grey">Variational Autoencoders (VAEs, left) and Generative Adversarial Networks (GANs, right) were the most popular generative models up until the advent of Diffusers in the past years. VAEs learn to compress and decompress inputs with an <i>encoder-decoder</i> architecture which produces a <i>latent</i> space with the compressed samples. GANs learn to produce realistic samples adversarially: they generate fake samples (with the <i>generator</i>) and try to fool a binary classifier (the <i>discriminator</i>) which needs to differentiate between real and fake samples.
</small>
</p>

<p>Finally, we arrive at the <a href="https://arxiv.org/abs/2006.11239"><strong>Denoising Diffusion Probabilistic Models (DDPMs)</strong></a>, presented by Ho et al. in 2020.
In just a few years, they have outperformed GANs for image generation and have become the standard method for the task. The core idea is that we train a model which takes</p>

<ul>
  <li>a noisy image $x_t$ (in the beginning it is a pure random noise map)</li>
  <li>and an associated noise variance $\beta_t$ (in the beginning it will be a high variance value)</li>
</ul>

<p>and it predicts the noise map $\epsilon_t$ overlaid on the image, so that we can subtract it from the noisy image and progressively recover a clean sample $x_0$.
The process is performed in small, gradual steps, following a noise schedule that decreases the value of $\beta_t$.</p>

<p>As we can see in the figure below, two iterative phases are distinguished, which consist each of them in $T$ steps:</p>

<ol>
  <li><strong>Forward diffusion, used during training</strong> — Starting with a real clean image $x_0$, we add a noise map $\epsilon$ to it, generated from a variance value $\beta$. Then, we pass the noisy image through a <em>U-Net</em> model, which should predict the added noise map $\epsilon$. The error is backpropagated to update the weights. The image at step $t$ contains not only the noise added in the previous step, but also the noise accumulated from prior steps. The forward process is done gradually in around $T = 1000$ steps. Early DDPMs used linear schedules, while cosine schedules later became standard due to improved stability.</li>
  <li><strong>Reverse diffusion, used during inference</strong> — We perform the inference starting with a pure, random noise map. In each step, we pass the noisy image through the <em>U-Net</em> to predict the step noise map $\epsilon_t$, subtract it from the image $x_t$ and obtain the next, less noisy image $x_{t-1}$. The process is repeated for around $T \in [20,100]$ steps, until we get a clear new image $x_0$.</li>
</ol>
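<p>A convenient property of the forward process is that it has a closed form: with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s \leq t} \alpha_s$, we can jump from $x_0$ to any noisy $x_t$ in one shot as $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon$, so training never needs to iterate through all intermediate steps. A minimal PyTorch sketch, assuming a linear schedule and made-up tensor sizes:</p>

```python
import torch

T = 1000
beta = torch.linspace(1e-4, 0.02, T)     # linear variance schedule
alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)  # cumulative product over steps

def q_sample(x0, t, eps):
    """Jump directly from the clean image x0 to the noisy image x_t."""
    ab = alpha_bar[t].view(-1, 1, 1, 1)  # broadcast per batch element
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps

torch.manual_seed(0)
x0 = torch.rand(2, 3, 32, 32)            # batch of 2 fake "clean" RGB images
t = torch.tensor([10, 900])              # one early and one late timestep
eps = torch.randn_like(x0)               # the noise the U-Net must predict
xt = q_sample(x0, t, eps)
# At t=10, x_t is still close to x0; at t=900 it is almost pure noise.
# During training, the loss would be the MSE between eps and the U-Net's
# prediction of it, given (xt, t).
```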

<p align="center">
<img src="/assets/diffusion/diffusion_idea.png" alt="Denoising Diffusion" width="1000" />
<small style="color:grey">In denoising diffusion models a <i>U-Net</i> encoder-decoder model is trained to predict the noise in an image. To that end, during training (forward diffusion), noise is gradually added to an image and we query the model to predict the noise map. During inference (reverse diffusion), we start with a pure noise map and query the model to remove the noise step by step &mdash; until we get a clean new image!
</small>
</p>

<div style="height: 20px;"></div>
<p align="center">── ◆ ──</p>
<div style="height: 20px;"></div>

<p>So which of these approaches should we use?</p>

<p>To answer that question, we need to consider that generative models are usually evaluated in terms of three competing properties, which lead to a so-called <a href="https://arxiv.org/pdf/2112.07804">generative learning trilemma (Xiao et al., 2022)</a>:</p>

<ul>
  <li><strong>Quality</strong>: if the distributions of the generated images and real images are close, the quality is considered good. In practice, pre-trained CNNs can be used to create image embeddings, leading to vector distributions. Then, the difference between the distributions is measured with the <a href="https://en.wikipedia.org/wiki/Wasserstein_metric">Wasserstein distance metric</a>. GANs and Diffusers achieve particularly good quality, whereas VAEs often lag behind.</li>
  <li><strong>Coverage</strong>: this measures how diverse the captured distributions are, i.e., the number of modes or peaks in the learned distribution; for instance, in a dataset of dog images, we would expect as many dog breeds as possible, which would be represented as many dense regions distinguishable from each other. VAEs and Diffusers have good coverage, whereas GANs tend to deliver less diverse results.</li>
  <li><strong>Speed</strong>: this refers to the sampling speed, i.e., how fast we can create new images. GANs and VAEs are the fastest approaches, while Diffusers require longer computation times.</li>
</ul>

<p>As we can see, there seems to be no all-powerful method that wins in all three metrics. However, <a href="https://arxiv.org/abs/2112.10752">Rombach et al. (2021)</a> and <a href="https://arxiv.org/abs/2307.01952">Podell et al. (2023)</a> presented and improved the <strong>Stable Diffusion</strong> approach, which is a very good trade-off (arguably the best so far). This method applies diffusion in the latent space, achieving much faster sampling — I explain more about it in the next section.</p>

<p align="center">
<img src="/assets/diffusion/impossible_triangle.png" alt="Impossible Triangle" width="1000" />
<small style="color:grey">Generative learning trilemma: sample diversity coverage, generation quality and generation speed are competing properties of generative methods &mdash; or is <a href="https://arxiv.org/abs/2307.01952">Stable Diffusion</a> the solution to that trilemma?
Image reproduced by the author, but based on the work by <a href="https://arxiv.org/pdf/2112.07804">Xiao et al., 2022</a>.
</small>
</p>

<h2 id="how-ddpms-actually-work-forward-noise-reverse-denoising">How DDPMs Actually Work: Forward Noise, Reverse Denoising</h2>

<p>Now, let’s go deeper into the topic of <a href="https://arxiv.org/abs/2006.11239"><strong>Denoising Diffusion Probabilistic Models (Ho et al., 2020)</strong></a>.</p>

<p>I have already introduced the three main components of diffusion models:</p>

<ol>
  <li>The denoising <em>U-Net</em>: a model that learns to extract the noise map $\epsilon_t$ of a noisy image $x_t$.</li>
  <li>The <em>forward diffusion</em> phase used during <em>training</em>, in which we start with a real noise-free image $x_0$ and add noise $\epsilon$ to it step by step. At each step $t$, we train the <em>U-Net</em> to learn how to predict the approximation $\epsilon_{\theta, t}$ of the ground-truth noise $\epsilon_t$ we have added to the image: $\epsilon_{\theta, t} \approx \epsilon_t$.</li>
  <li>The <em>reverse diffusion</em> phase used during <em>inference</em>, in which we start with a random noise map $x_{T}$ and remove noise $\epsilon_{\theta}$ from it step by step using the trained <em>U-Net</em>. It is intuitively easy to understand why small steps are required in the reverse phase, too: it is much easier to improve a slightly noisy image than to reconstruct a clean image from pure randomness.</li>
</ol>

<p>Let’s unpack each one of them to better understand how diffusion works.</p>

<div style="height: 20px;"></div>
<div align="center" style="border: 1px solid #e4f312ff; background-color: #fcd361b9; padding: 1em; border-radius: 6px;">
<strong>
Note that this section has a dedicated repository in which all the models and formulae are implemented: <a href="https://github.com/mxagar/diffusion-examples/tree/main/ddpm">github.com/mxagar/diffusion-examples/ddpm</a>. Some comments and examples from the implementation are provided in the last section.
</strong>
</div>
<div style="height: 30px;"></div>

<h4 id="denoising-u-net">Denoising <em>U-Net</em></h4>

<p>The <a href="https://arxiv.org/abs/1505.04597">U-Net (Ronneberger et al., 2015)</a> was originally created for image segmentation tasks (specifically in the medicine domain): the architecture contracts the input image into a latent tensor, which is then expanded using symmetric decoder layers; as a result, the model outputs a map with the same size as the input image (width and height) in which we obtain values for each pixel, e.g., pixel-wise classification or image segmentation.</p>

<p>In the particular case of the denoising <em>U-Net</em>, we have these two <em>inputs</em>:</p>

<ul>
  <li>The noisy image $x_t$ at step $t$.</li>
  <li>The variance scalar $\beta_t$ at step $t$. The variance scalar is expanded into a vector using sinusoidal embeddings. Sinusoidal embeddings can be seen as an $\mathbf{R} \rightarrow \mathbf{R}^n$ mapping in which for each unique scalar we obtain a unique and different vector, thanks to systematically applying sinusoidal functions to the scalar. It is related to the sinusoidal embedding from the <a href="https://arxiv.org/abs/1706.03762">Transformers paper (Vaswani et al., 2017)</a>.</li>
</ul>
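<p>As an illustration, a sinusoidal embedding along these lines can be sketched in a few lines of NumPy; the embedding dimension and frequency range below are assumptions for demonstration, not values taken from the implementation:</p>

```python
import numpy as np

def sinusoidal_embedding(t, dim=32):
    """Map each scalar (e.g., the variance beta_t) to a unique dim-dimensional
    vector using sin/cos at geometrically spaced frequencies."""
    half = dim // 2
    freqs = np.exp(np.linspace(0.0, np.log(1000.0), half))  # assumed frequency range
    angles = np.asarray(t, dtype=float)[:, None] * freqs[None, :]  # (batch, half)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)  # (batch, dim)

# two different variance scalars yield two clearly different embedding vectors
emb = sinusoidal_embedding([0.0001, 0.02], dim=32)
print(emb.shape)  # (2, 32)
```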

<p>On the other hand, the <em>output</em> of the model is the noise map at step $t$: $\epsilon_{\theta, t}$. Subtracting this estimate from the noisy image $x_t$ yields a denoised approximation that moves the sample one step closer to a clean image. Repeating this process over many small steps allows the model to gradually transform pure noise into a realistic sample $x_0$.</p>

<p align="center">
<img src="/assets/diffusion/denoising_unet.png" alt="Denoising U-Net" width="1000" />
<small style="color:grey">
Denoising <i>U-Net</i>.
Image reproduced by the author, but based on the book <a href="https://www.oreilly.com/library/view/generative-deep-learning/9781098134174/">Generative Deep Learning (O'Reilly)</a> by David Foster.
</small>
</p>

<p>As in every <em>U-Net</em>, the initial tensor is progressively reduced in spatial size while its channels are increased; then, the reduced tensor is expanded to have a larger spatial size but fewer channels. The final tensor has the same shape as the input image. The architecture consists of these blocks:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">ResidualBlock</code>: basic block used throughout the network that performs batch normalization and two convolutions, while adding a skip connection between input and output, as presented in the <a href="https://arxiv.org/abs/1512.03385">ResNet architecture (He et al., 2015)</a>. Residual blocks can learn the identity map and allow for deeper networks, since the vanishing gradient issue is alleviated.</li>
  <li><code class="language-plaintext highlighter-rouge">DownBlock</code>: two <code class="language-plaintext highlighter-rouge">ResidualBlocks</code> are used, as well as an average pooling so that the image size is decreased while increasing the number of channels.</li>
  <li><code class="language-plaintext highlighter-rouge">UpBlock</code>: upsampling is applied to the feature map to increase its spatial size and two <code class="language-plaintext highlighter-rouge">ResidualBlocks</code> are applied so that the channels are decreased.</li>
  <li>Skip connections: the output of each <code class="language-plaintext highlighter-rouge">ResidualBlock</code> in a <code class="language-plaintext highlighter-rouge">DownBlock</code> is passed to the associated <code class="language-plaintext highlighter-rouge">UpBlock</code> with the same tensor size, where the tensors are concatenated.</li>
</ul>
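<p>As an orientation, such a <code>ResidualBlock</code> can be sketched in PyTorch as follows — a simplified illustration, not the exact block from the linked implementation; the 1x1 convolution on the skip path is one common way to match channel counts:</p>

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Norm -> ReLU -> Conv, twice, plus a skip connection; a 1x1 conv
    adapts the skip path when the channel counts differ."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.norm1 = nn.BatchNorm2d(in_ch)
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.norm2 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)
        self.skip = nn.Conv2d(in_ch, out_ch, kernel_size=1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        h = self.conv1(torch.relu(self.norm1(x)))
        h = self.conv2(torch.relu(self.norm2(h)))
        return h + self.skip(x)  # skip connection between input and output

x = torch.randn(2, 32, 16, 16)   # (batch, channels, height, width)
y = ResidualBlock(32, 64)(x)
print(y.shape)  # torch.Size([2, 64, 16, 16])
```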

<p>Often two networks are maintained: the usual one with the weights computed during gradient descent and the <em>Exponential Moving Average (EMA)</em> network, which contains the EMA of the weights. The EMA network is less susceptible to training spikes and fluctuations.</p>
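<p>The EMA update itself is a simple interpolation applied after every optimizer step; here is a minimal sketch, where the decay value is a typical choice rather than one taken from any specific implementation:</p>

```python
import numpy as np

def ema_update(ema_params, params, decay=0.999):
    """After each optimizer step: ema <- decay * ema + (1 - decay) * current."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

ema = [np.zeros(3)]      # EMA weights start at 0
weights = [np.ones(3)]   # raw weights jump to 1 after a training step
ema = ema_update(ema, weights)
print(ema[0])  # [0.001 0.001 0.001] -- the EMA copy barely reacts to the spike
```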

<h4 id="forward-diffusion">Forward Diffusion</h4>

<p>In the <em>forward diffusion</em>, or training phase, we add noise $\epsilon_{t-1}$ to an image $x_{t-1}$ to obtain a noisier image $x_t$. The process is governed by this equation:</p>

\[x_t = q(x_t \mid x_{t-1})
= x_{t-1}\sqrt{1-\beta_t} + \epsilon_{t-1}\sqrt{\beta_t}
= \mathcal{N}(x_{t-1}\sqrt{1-\beta_t}, \beta_t I)\]

<p>where:</p>

<ul>
  <li>$\beta_t$ is the variance scalar at step $t$; typically $\beta \in [0.0001, 0.02]$,</li>
  <li>$\epsilon \sim \mathcal{N}(0,I)$, i.e., it is a 2D Gaussian map with mean 0 and variance 1,</li>
  <li>and $I$ is the identity matrix.</li>
</ul>

<p>As a further step, a <em>re-parametrization</em> of $\beta$ is carried out, which transforms the forward diffusion equation from $q(x_t \mid x_{t-1})$ into $q(x_t \mid x_0)$; that is, any noisy image $x_t$ can be computed directly from the original, noise-free image $x_0$.</p>

<p>That <em>re-parametrization</em> is defined as</p>

\[\bar{\alpha_t} = \prod_{i=0}^{t}{\alpha_i},\,\,\, \alpha_t = 1 - \beta_t\]

<p>Its interpretation is the following:</p>

<ul>
  <li>$\bar{\alpha}$ represents the fraction of variance due to the signal (the original image $x_0$);</li>
  <li>$1-\bar{\alpha}$ represents the fraction of variance due to the noise ($\epsilon$).</li>
</ul>

<p>By properly expressing $\beta$ as function of $\alpha$, we obtain the <strong><em>forward diffusion</em> equation used in practice</strong>:</p>

\[x_t = q(x_t \mid x_0)
= x_0\sqrt{\bar{\alpha_t}} + \epsilon\sqrt{1 - \bar{\alpha_t}}
= \mathcal{N}(x_0\sqrt{\bar{\alpha}_t}, (1-\bar{\alpha}_t) I)\]

<p>Given this equation:</p>

<ul>
  <li>We pick the real noise-free image $x_0$.</li>
  <li>We add noise at step $t$ using $\epsilon_t$ to obtain $x_t$.</li>
  <li>We let the <em>U-Net</em> predict $\epsilon_{\theta,t}$ as approximation to $\epsilon_t$, and backpropagate the error.</li>
</ul>
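<p>These steps can be sketched directly from the equation; the following NumPy illustration uses the schedule ranges mentioned in this post, while everything else (image size, step index) is for demonstration:</p>

```python
import numpy as np

def forward_diffusion(x0, t, alpha_bar, rng=None):
    """Jump from the clean image x0 directly to the noisy x_t:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    rng = rng if rng is not None else np.random.default_rng(0)
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps  # eps is the regression target for the U-Net

T = 512
betas = np.linspace(1e-4, 0.02, T)    # linear variance schedule
alpha_bar = np.cumprod(1.0 - betas)   # re-parametrization: product of alphas

x0 = np.zeros((64, 64))               # stand-in for a normalized image
xt, eps = forward_diffusion(x0, t=256, alpha_bar=alpha_bar)
print(xt.shape)  # (64, 64)
```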

<p>Finally, note that diffusion schedules control the signal and noise ratios in such a way that during training</p>

<ul>
  <li>the signal ratio decreases to 0 following a linear or cosine-based function,</li>
  <li>and the noise ratio increases up to 1 following the complementary function.</li>
</ul>
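<p>For illustration, the signal ratio $\sqrt{\bar{\alpha}_t}$ can be computed for both kinds of schedules; this is a sketch, where the linear schedule follows the $\beta$ range given above and the squared-cosine variant uses commonly cited parameter values:</p>

```python
import numpy as np

def linear_signal_ratio(T=512, beta_min=1e-4, beta_max=0.02):
    """Signal ratio sqrt(alpha_bar_t) under a linear beta schedule."""
    betas = np.linspace(beta_min, beta_max, T)
    return np.sqrt(np.cumprod(1.0 - betas))

def cosine_signal_ratio(T=512, s=0.008):
    """Squared-cosine schedule: the signal ratio decays more smoothly."""
    t = np.arange(T + 1) / T
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return np.sqrt(f[1:] / f[0])

lin = linear_signal_ratio()
cos_ = cosine_signal_ratio()
# the noise ratio is the complement: sqrt(1 - alpha_bar_t)
noise = np.sqrt(1.0 - lin ** 2)
print(float(lin[0]), float(lin[-1]))  # starts near 1, decays toward 0
```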

<h4 id="reverse-diffusion">Reverse Diffusion</h4>

<p>During the <em>reverse diffusion</em> phase or inference, we generate images iteratively following a reverse schedule analogous to the one introduced in the previous section. The equation of the reverse process can be obtained by inverting the forward equation and has the following form:</p>

\[x_{t-1} = p(x_{t-1} \mid x_t) = 
\frac{1}{\sqrt{\alpha_t}} (x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha_t}}}\epsilon_{\theta}) + \sigma_t z\]

<p>Here,</p>

<ul>
  <li>$\epsilon_{\theta}$ is the noise map predicted by the <em>U-Net</em> for the pair ($x_t$, $\beta_t$);</li>
  <li>$z$ is a 2D Gaussian defined as $z \sim \mathcal{N}(0,I)$;</li>
  <li>$\sigma_t^2 = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t$ is a noise variance that decreases as inference progresses.</li>
</ul>

<p>The term $\sigma_t z$ re-injects a controlled amount of random noise at each step, which provides control over the generation:</p>

<ul>
  <li>we allow more freedom to explore in the early steps, enabling a broader variation of images (more random noise),</li>
  <li>but then narrow down to the details.</li>
</ul>
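<p>A single reverse step can be sketched directly from this equation. The following NumPy illustration uses the same schedule as before; in practice $\epsilon_{\theta}$ is the output of the trained <em>U-Net</em>, which is replaced here by a placeholder:</p>

```python
import numpy as np

def reverse_step(xt, eps_pred, t, betas, alpha_bar, rng=None):
    """One DDPM sampling step: compute x_{t-1} from x_t and the predicted noise."""
    rng = rng if rng is not None else np.random.default_rng(0)
    alpha_t = 1.0 - betas[t]
    mean = (xt - (1.0 - alpha_t) / np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alpha_t)
    if t == 0:
        return mean  # no extra noise at the very last step
    sigma_t = np.sqrt((1.0 - alpha_bar[t - 1]) / (1.0 - alpha_bar[t]) * betas[t])
    return mean + sigma_t * rng.standard_normal(xt.shape)  # the sigma_t * z term

T = 512
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

xt = np.random.default_rng(1).standard_normal((64, 64))  # x_T: pure noise
eps_pred = np.zeros_like(xt)                             # placeholder for the U-Net output
x_prev = reverse_step(xt, eps_pred, t=T - 1, betas=betas, alpha_bar=alpha_bar)
print(x_prev.shape)  # (64, 64)
```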

<h4 id="conditioning">Conditioning</h4>

<p>If we fit the model to a dataset of car images, we will be able to generate random car images. But what if we would like to control the type of cars we want to obtain, for instance, <em>red sports cars</em>? That can be achieved with <strong>conditioning</strong>.</p>

<p>The most common form of <strong>conditioning</strong> is done with <em>text</em>: we provide a prompt/description of the image we want to obtain. As a first step, that text is converted into an embedding vector using a text encoder trained with paired image-text data using a contrastive objective (e.g., <a href="https://arxiv.org/abs/2103.00020">CLIP by Radford et al., 2021</a>). Then, the resulting vector is provided to the <em>U-Net</em> at several stages:</p>

<ul>
  <li>During <em>training</em>, we inject the embedding vector into different layers of the <em>U-Net</em> using cross attention, reinforcing the conditioning. Additionally, we remove the text conditioning in some random steps so that the model learns unconditional generation.</li>
  <li>
    <p>During <em>inference</em>, the <em>U-Net</em> produces the noise map $\epsilon$ with and without text conditioning: $\epsilon_{\textrm{cond}}, \epsilon_{\textrm{uncond}}$. The difference added by the conditioned noise map is amplified (by a factor $\lambda$) to push the final prediction in the direction of the conditioning; mathematically, considering $\epsilon$ is a vector/tensor, this is expressed (and implemented) as follows:</p>

    <p>$\epsilon_{\textrm{final}} = \epsilon_{\textrm{uncond}} + \lambda \cdot (\epsilon_{\textrm{cond}} - \epsilon_{\textrm{uncond}})$</p>
  </li>
</ul>
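<p>This inference-time combination is commonly known as classifier-free guidance and amounts to one line of code; a minimal sketch, where the guidance scale value is a common choice rather than one taken from the text:</p>

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, guidance_scale=7.5):
    """Classifier-free guidance: amplify the direction the text conditioning
    adds to the unconditional noise prediction."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_u = np.zeros((4, 4))   # placeholder unconditional U-Net output
eps_c = np.ones((4, 4))    # placeholder conditional U-Net output
eps_final = guided_noise(eps_u, eps_c, guidance_scale=7.5)
print(eps_final[0, 0])  # 7.5
# guidance_scale = 1 recovers the purely conditional prediction
```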

<p>Thanks to these modifications, we are able to obtain our red sports car instead of a green truck.</p>

<h4 id="stable-diffusion">Stable Diffusion</h4>

<p>Finally, let’s consider two practical aspects of diffusion models discussed so far:</p>

<ul>
  <li>Many forward passes are needed to generate our noise-free image.</li>
  <li>Denoising occurs in pixel-space, which has a relatively large dimensionality.</li>
</ul>

<p>What if we could apply forward and reverse diffusion in a smaller space to accelerate the process? That is exactly what is achieved by <a href="https://arxiv.org/abs/2112.10752">Rombach et al. (2021)</a> and <a href="https://arxiv.org/abs/2307.01952">Podell et al. (2023)</a>, who introduced and later improved a <strong>latent diffusion</strong> method, also known as <strong>Stable Diffusion</strong>. In essence, latent diffusion models are diffusion models wrapped by autoencoders:</p>

<ul>
  <li>The encoder creates a latent vector.</li>
  <li>In the forward diffusion phase, we add noise to the vector and learn to denoise it.</li>
  <li>In the reverse diffusion phase, we remove the noise with the trained <em>U-Net</em>.</li>
  <li>Then, finally, the decoder expands the denoised latent vector to get the image.</li>
</ul>

<p>Working in the latent space is much faster, because the sizes of the manipulated vectors are much smaller (around 16 times smaller, compared to images); thus, we also require smaller models.</p>

<p>Stable Diffusion is one of the most popular latent diffusion models; due to the <a href="https://stability.ai/news/stability-ai-sdxl-turbo">latest advances</a> by the team behind it, it has become one of the strongest approaches in terms of</p>

<ul>
  <li>Ease of conditioning</li>
  <li>Quality of output</li>
  <li>Diversity</li>
  <li>… and speed!</li>
</ul>

<h4 id="example-implementation-of-ddpm">Example Implementation of DDPM</h4>

<p>The implementation <a href="https://github.com/mxagar/diffusion-examples/tree/main/ddpm">repository</a> contains the code necessary to train a diffuser and use it to generate new images.</p>

<p>The dataset used in the example is the <a href="https://www.kaggle.com/datasets/eduardo4jesus/stanford-cars-dataset">Stanford Cars Dataset</a>. It contains 16,185 color images categorized into 196 classes, which are resized to <code class="language-plaintext highlighter-rouge">64x64</code>.</p>

<p align="center">
<img src="/assets/diffusion/cars_dataset_samples.png" alt="Cars Dataset Samples" width="1000" />
<small style="color:grey">
In the example implementation, the <a href="https://www.kaggle.com/datasets/eduardo4jesus/stanford-cars-dataset">Stanford Cars Dataset</a> is used.
The dataset consists of 16,185 color images across 196 classes; however, class labels are ignored and the images are resized to <code>64x64</code> pixels.
The figure shows 8 resized samples.
</small>
</p>

<p>The mini-project is composed of two main files:</p>

<ul>
  <li>The module <a href="https://github.com/mxagar/diffusion-examples/blob/main/ddpm/unet.py"><code class="language-plaintext highlighter-rouge">unet.py</code></a>, taken from <a href="https://github.com/labmlai/annotated_deep_learning_paper_implementations">labmlai/annotated_deep_learning_paper_implementations</a>. This module defines the <em>U-Net</em> model which is able to predict the noise of an image after training.</li>
  <li>The notebook <a href="https://github.com/mxagar/diffusion-examples/blob/main/ddpm/ddpm.ipynb"><code class="language-plaintext highlighter-rouge">ddpm.ipynb</code></a>, where the dataset preparation, model setup, and training are implemented. Some parts were modified from the course material of the <a href="https://www.udacity.com/course/generative-ai--nd608">Udacity Generative AI Nanodegree</a>.</li>
</ul>

<p>The formulas of the DDPM paper, as well as the forward and reverse diffusion algorithms, are implemented in a modular fashion and with plenty of comments and references.</p>

<p>As an example, the function <code class="language-plaintext highlighter-rouge">visualize_forward_diffusion()</code> produces these noisy images on a single car sample:</p>

<p align="center">
<img src="/assets/diffusion/cars_forward_diffusion.png" alt="Forward Diffusion on Car Sample" width="1000" />
<small style="color:grey">
A total of <code>T=512</code> steps are taken to iteratively add noise to a sample and train the <i>U-Net</i> to predict the added noise map. The figure shows 7 equally spaced stages of those steps.
</small>
</p>

<p>I trained a <em>U-Net</em> model of 54 million parameters using the following configuration:</p>

<ul>
  <li>Device: <a href="https://mikelsagardia.io/blog/mac-os-ubuntu-nvidia-egpu.html">NVIDIA RTX 3060</a></li>
  <li>300 epochs (10 warm-up)</li>
  <li>A base learning rate of <code class="language-plaintext highlighter-rouge">0.0001</code> and cosine scheduling</li>
  <li>Batch size of 64</li>
  <li><code class="language-plaintext highlighter-rouge">T=512</code> diffusion steps</li>
  <li>A linearly increased noise variance $\beta$ in the range of <code class="language-plaintext highlighter-rouge">[0.0001, 0.02]</code></li>
</ul>

<p>The training process, run by <code class="language-plaintext highlighter-rouge">train()</code>, produces a denoised image strip at every epoch, generated from the same fixed noise map. In the following, the image strips of epochs 1, 5, 10, 100, 200 and 300 are shown.</p>

<p align="center">
<img src="/assets/diffusion/car_sample_epoch_001.png" alt="Inference at Epoch 1: Reverse Diffusion on Car Sample" width="1000" />
<img src="/assets/diffusion/car_sample_epoch_005.png" alt="Inference at Epoch 5: Reverse Diffusion on Car Sample" width="1000" />
<img src="/assets/diffusion/car_sample_epoch_010.png" alt="Inference at Epoch 10: Reverse Diffusion on Car Sample" width="1000" />
<img src="/assets/diffusion/car_sample_epoch_100.png" alt="Inference at Epoch 100: Reverse Diffusion on Car Sample" width="1000" />
<img src="/assets/diffusion/car_sample_epoch_200.png" alt="Inference at Epoch 200: Reverse Diffusion on Car Sample" width="1000" />
<img src="/assets/diffusion/car_sample_epoch_300.png" alt="Inference at Epoch 300: Reverse Diffusion on Car Sample" width="1000" />
<small style="color:grey">
Inference or reverse diffusion during training; the performance for the same noise input is shown for epochs 1, 5, 10, 100, 200 and 300 (last epoch).
A total of <code>T=512</code> steps are taken to iteratively remove noise. The figures show 9 equally spaced stages of those steps at each epoch.
</small>
</p>

<p>The final model is able to generate new samples, as shown below:</p>

<p align="center">
<img src="/assets/diffusion/car_generation_best_model.png" alt="Eight Samples Generated by a DDPM" width="1000" />
<small style="color:grey">
Eight generated samples after 300 epochs of training.
</small>
</p>

<h2 id="wrap-up">Wrap Up</h2>

<p>Diffusion models have become the standard approach for image generation by combining high-quality samples, good coverage, and relatively stable training. By framing generation as a gradual denoising process, DDPMs avoid many of the pitfalls of earlier generative models while remaining surprisingly intuitive.</p>

<p>This post focused on building intuition and connecting theory to practice through a minimal DDPM implementation. While the underlying math is rather simple, I still find it fascinating that such models can learn a representation of images rich enough to generate entirely new samples from pure noise — it often feels a bit like magic.</p>

<p>If you want to deepen your understanding, the best next step is to <a href="https://github.com/mxagar/diffusion-examples/blob/main/ddpm/ddpm.ipynb">run the notebook yourself</a>, visualize the diffusion process, and experiment with the model’s components. Small changes in schedules, architectures, or datasets can lead to very different behaviors.</p>

<p><a href="https://mikelsagardia.io/blog/diffusion-hands-on.html">In the next post</a>, I’ll move toward more practical diffusion workflows using Hugging Face Diffusers and modern text-to-image models. As always, comments, questions, and suggestions are more than welcome :smile:</p>

<p><br /></p>

<blockquote>
  <p>Does image generation (still) feel a bit magical to you? Were my technical explanations clear enough to understand what’s going on under the hood?</p>
</blockquote>

<p><br /></p>

<div id="disqus_thread"></div>
<script>
    /**
    *  RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS.
    *  LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables    */
    
    var disqus_config = function () {
    this.page.url = 'https://mikelsagardia.io/blog/diffusion-for-developers.html';  // Replace PAGE_URL with your page's canonical URL variable
    this.page.identifier = 'https://mikelsagardia.io/blog/diffusion-for-developers.html'; // Replace PAGE_IDENTIFIER with your page's unique identifier variable
    };
    
    (function() { // DON'T EDIT BELOW THIS LINE
    var d = document, s = d.createElement('script');
    s.src = 'https://mikelsagardia.disqus.com/embed.js';
    s.setAttribute('data-timestamp', +new Date());
    (d.head || d.body).appendChild(s);
    })();
</script>

<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>]]></content><author><name></name></author><category term="AI" /><category term="engineering," /><category term="diffusion," /><category term="machine" /><category term="learning," /><category term="image" /><category term="generation," /><category term="generative" /><category term="AI," /><category term="deep" /><category term="model" /><category term="training," /><category term="inference" /><summary type="html"><![CDATA[&lt;!– Blog Post 1 Title: An Introduction to Image Generation with Diffusion Models (1/2) Subtitle: A Conceptual Guide for Developers &amp; ML Practitioners]]></summary></entry><entry><title type="html">My Personal eGPU Server Setup</title><link href="https://mikelsagardia.io/blog/mac-os-ubuntu-nvidia-egpu.html" rel="alternate" type="text/html" title="My Personal eGPU Server Setup" /><published>2025-10-21T10:30:00+00:00</published><updated>2025-10-21T10:30:00+00:00</updated><id>https://mikelsagardia.io/blog/mac-os-ubuntu-nvidia-egpu</id><content type="html" xml:base="https://mikelsagardia.io/blog/mac-os-ubuntu-nvidia-egpu.html"><![CDATA[<p style="color: #777; font-style: italic; font-size: 1.5em; margin-top: 0.5em;">
  How to Run and Train LLMs Locally with NVIDIA Chips from a Mac &amp; Linux Setup
</p>

<!--
<div style="line-height:150%;">
    <br>
</div>
-->

<p align="center">
<img src="/assets/linux_nvidia_egpu/workstation-dgx-spark-nvidia.jpg" alt="NVIDIA DGX Spark" width="1000" />
<small style="color:grey">This blog post is not about the <a href="https://www.nvidia.com/en-us/products/workstations/dgx-spark/">NVIDIA DGX Spark</a>. Instead, it's about my eGPU setup, the <i>personal supercomputer</i> I've been using the past 2 years. Image from <a href="https://nvidianews.nvidia.com/news/nvidia-dgx-spark-arrives-for-worlds-ai-developers">NVIDIA</a>.
</small>
</p>

<div style="height: 20px;"></div>
<div align="center" style="border: 1px solid #e4f312ff; background-color: #fcd361b9; padding: 1em; border-radius: 6px;">
<strong>
For a detailed setup guide, check <a href="https://github.com/mxagar/linux_nvidia_egpu">this GitHub repository</a>.
</strong>
</div>
<div style="height: 30px;"></div>

<p>You may have seen the release of the <a href="https://www.nvidia.com/en-us/products/workstations/dgx-spark/">NVIDIA DGX Spark</a>, the new <em>personal supercomputer</em> from NVIDIA.
With 128 GB of memory, 20 CPU cores, and a price tag of USD $3,999, it’s sure to land on many AI enthusiasts’ wish lists this Christmas.</p>

<p>This post presents my own, more modest alternative.
For the past two years, I’ve been using an NVIDIA eGPU (external GPU) connected to my MacBook Pro M1 — but running through a Linux machine that acts as a dedicated server.
After several colleagues and friends showed interest, I decided to document the entire setup on <a href="https://github.com/mxagar/linux_nvidia_egpu">GitHub</a> as the guide I once looked for but never fully found.
In this post, I’ll introduce the overall setup and explain the motivation behind it.</p>

<p>Here’s the schematic of my personal <em>supercomputer</em>:</p>

<p align="center">
<img src="/assets/linux_nvidia_egpu/egpu_linux.png" alt="eGPU Linux &amp; Mac Setup" width="1000" />
<small style="color:grey">My eGPU setup consists of a MacBook M1 and a Linux server with an NVIDIA eGPU.
</small>
</p>

<p>I mainly use the eGPU to train general Deep Learning models (with <a href="https://code.visualstudio.com/docs/remote/ssh">VS Code Remote Development</a>) and to run LLMs locally (with <a href="https://ollama.com/">Ollama</a>); as illustrated in the figure above:</p>

<ul>
  <li>I have a <a href="https://www.lenovo.com/gb/en/p/laptops/thinkpad/thinkpadp/p14s-amd-g1/22wsp144sa1">Lenovo ThinkPad P14s</a> with an integrated NVIDIA Quadro T500 graphics card, running Ubuntu.</li>
  <li>I attach to a Thunderbolt port of the Lenovo a <a href="https://www.razer.com/mena-en/gaming-laptops/razer-core-x">Razer Core X External Case</a>, which contains a <a href="https://www.gigabyte.com/Graphics-Card/GV-N3060GAMING-OC-12GD-rev-20">NVIDIA GeForce RTX 3060</a> (12GB of memory).</li>
  <li>I run applications which require GPU power on the Lenovo/Ubuntu but interface with them via my MacBook Pro M1.</li>
</ul>

<p>You might ask <em>why I would want to run and train models locally</em>, since we have many cloud services available that spare us the hassle. Here are my answers:</p>

<ul>
  <li>Many models (LLMs or any other DL networks) can be used locally for a <strong>fraction of the cost</strong> required by cloud providers; in fact, the <a href="https://www.nvidia.com/en-us/geforce/graphics-cards/30-series/rtx-3060-3060ti/">NVIDIA RTX 3060</a> with 12GB is quite similar to the <a href="https://www.nvidia.com/en-us/data-center/tesla-t4/">NVIDIA T4</a>, a commonly offered low-tier cloud GPU. Model deployment often requires private or public cloud services, but experimentation, prototyping, and small-scale training can be done locally.</li>
  <li>Local models allow you to process data <strong>confidentially</strong>: running models locally allows you to process sensitive or proprietary data (e.g., personal notes, internal reports, or corporate documents) without uploading them to third-party servers. This means full control over your data lifecycle, compliance with privacy policies, and peace of mind knowing that no external provider logs or stores your content.</li>
  <li>We <strong>avoid dependence</strong> on cloud services if we run models locally: while cloud platforms provide flexibility, they also create a point of failure and an ongoing dependency on external infrastructure and pricing. Outages like the <a href="https://techcrunch.com/2021/12/07/amazon-web-services-went-down-and-took-a-bunch-of-the-internet-with-it/">AWS downtime of December 2021</a> or the more recent <a href="https://www.wired.com/story/what-that-huge-aws-outage-reveals-about-the-internet/">AWS outage in October 2025</a> show how fragile these systems can be.</li>
  <li>Tinkering locally, we <strong>learn</strong> how to set up hardware, firmware, and software: managing your own GPU infrastructure provides a deeper understanding of the systems that power modern AI. From BIOS configuration and driver setup to Docker and Conda environments, each layer teaches valuable skills that translate directly into real-world MLOps and engineering practice.</li>
</ul>

<p><br /></p>

<blockquote>
  <p>Running models locally offers major advantages: it’s far cheaper than using cloud GPUs, keeps your data fully private, and works even when cloud services fail. It also helps you build hands-on expertise with the hardware and software stack that powers modern AI.</p>
</blockquote>

<p><br /></p>

<p>You might also ask <em>why not stick to a single computer, Ubuntu or MacOS, with an attached eGPU</em>. That question has several layers:</p>

<ul>
  <li>Even though I really like Ubuntu, in my opinion MacOS offers a more polished user experience overall.</li>
  <li>In the past, Intel-based Macs supported AMD eGPUs, but since the introduction of the Apple M1, that option seems to have vanished.</li>
  <li>Ideally, I’d use MacOS with NVIDIA eGPU support, because NVIDIA chips are the industry standard.</li>
  <li>Another option would be to upgrade my MacBook Pro M1 to a MacStudio M3 Ultra or similar, which comes with a very powerful processor — but why abandon a perfectly capable MacBook Pro M1?</li>
</ul>

<h2 id="setup-guide-a-summary">Setup Guide: A Summary</h2>

<p>The <a href="https://github.com/mxagar/linux_nvidia_egpu">GitHub repository I have created</a> answers all the key questions and walks you through the complete setup process for getting an NVIDIA eGPU up and running. It includes detailed guidance on:</p>

<ul>
  <li><a href="https://github.com/mxagar/linux_nvidia_egpu/tree/main?tab=readme-ov-file#step-0-hardware-requirements">Hardware requirements</a>: <em>What components do you need for an eGPU setup? Which GPUs and enclosures are compatible? How much VRAM do typical ML models require?</em></li>
  <li><a href="https://github.com/mxagar/linux_nvidia_egpu/tree/main?tab=readme-ov-file#step-1-install-ubuntu">Installation of Ubuntu</a> and <a href="https://github.com/mxagar/linux_nvidia_egpu/tree/main?tab=readme-ov-file#step-3-install-and-configure-nvidia-and-gpu-related-libraries">NVIDIA libraries</a>: <em>How do you install and configure Ubuntu so it works seamlessly with my external NVIDIA GPU?</em></li>
</ul>

<p>Beyond the essentials, the guide also covers some practical extras that make the setup truly usable day to day:</p>

<ul>
  <li><a href="https://github.com/mxagar/linux_nvidia_egpu/tree/main?tab=readme-ov-file#step-5-install-docker-with-nvidia-gpu-support">Installation of Docker with GPU support</a>: Containerization is now a must in AI/ML workflows. Unfortunately, enabling full GPU acceleration inside Docker images can be tricky — this section provides a simple, reliable recipe that works.</li>
  <li><a href="https://github.com/mxagar/linux_nvidia_egpu/tree/main?tab=readme-ov-file#step-6-remote-access-configuration">Remote access configuration</a>: This section explains how to securely connect to the Ubuntu GPU machine from another device (e.g., a MacBook) within the same local network.</li>
</ul>

<p>After you’ve completed the setup, you can verify that your eGPU is correctly recognized by running a quick check in the Mac’s terminal.</p>

<p align="center">
<img src="/assets/linux_nvidia_egpu/mac_nvidia_smi.png" alt="MacOS NVIDIA SMI" width="1000" />
<small style="color:grey">Snapshot of the <code>nvidia-smi</code> output on the Ubuntu machine (hostname: <code>urgull</code>) but executed from my MacBook (hostname: <code>kasiopeia</code>). We can see the eGPU and its load: NVIDIA GeForce RTX 3060, 14W / 170W used, 26MiB / 12288MiB used.
</small>
</p>

<h2 id="using-the-egpu-remote-vs-code-and-ollama">Using the eGPU: Remote VS Code and Ollama</h2>

<p>Once we get the correct output from <code class="language-plaintext highlighter-rouge">nvidia-smi</code>, we can start using the eGPU. To that end, the <a href="https://github.com/mxagar/linux_nvidia_egpu">guide GitHub repository</a> contains a simple Jupyter notebook we can run: <a href="https://github.com/mxagar/linux_nvidia_egpu/blob/main/test_gpu.ipynb">test_gpu.ipynb</a>.</p>
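<p>For instance, a first cell along these lines verifies that PyTorch sees the GPU — a sketch of the kind of check performed in the notebook, not its exact code:</p>

```python
import torch

# Pick the GPU if PyTorch can see it (on the Ubuntu machine, the eGPU);
# otherwise, fall back to the CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(torch.cuda.get_device_name(0))  # e.g., "NVIDIA GeForce RTX 3060"
else:
    device = torch.device("cpu")
    print("CUDA not available; falling back to CPU")

# a small matrix multiplication exercises the selected device
x = torch.randn(1024, 1024, device=device)
y = x @ x
print(y.shape)  # torch.Size([1024, 1024])
```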

<p>The way I prefer to run complete repositories remotely (i.e., on the Ubuntu machine but interfaced from the MacBook) is using a <a href="https://code.visualstudio.com/docs/remote/ssh"><strong>Remote VS Code</strong></a> instance. To start one, these are the preliminary steps we need to follow:</p>

<ol>
  <li>Open the MacBook Terminal (make sure no VPN connections are active).</li>
  <li>SSH to the Ubuntu machine with our credentials.</li>
  <li>Clone the <a href="https://github.com/mxagar/linux_nvidia_egpu">GitHub repository</a> with the notebook <a href="https://github.com/mxagar/linux_nvidia_egpu/blob/main/test_gpu.ipynb">test_gpu.ipynb</a>.</li>
  <li>Install the GPU Conda environment.</li>
</ol>

<p>Steps 1-4 are carried out with these commands:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># -- MacBook</span>
ssh &lt;username&gt;@&lt;hostname-ubuntu&gt;.local
ssh mikel@urgull.local

<span class="c"># -- Ubuntu via MacBook</span>
<span class="nb">cd</span> <span class="o">&amp;&amp;</span> <span class="nb">mkdir</span> <span class="nt">-p</span> git_repositories <span class="o">&amp;&amp;</span> <span class="nb">cd </span>git_repositories
git clone https://github.com/mxagar/linux_nvidia_egpu.git
conda <span class="nb">env </span>create <span class="nt">-f</span> conda.yaml  <span class="c"># Create the 'gpu' environment</span>
</code></pre></div></div>

<p>Then, we can start a remote VS Code instance:</p>

<ol>
  <li>We open VS Code on our MacBook.</li>
  <li>Click on <em>Open Remote Window</em> (bottom left corner) &gt; <em>Connect to Host</em>.</li>
  <li>We enter the user and host as in <code class="language-plaintext highlighter-rouge">&lt;username&gt;@&lt;hostname-ubuntu&gt;.local</code>, followed by the password.</li>
</ol>

<p>… <em>et voilà</em>: we already have a VS Code instance running on the Ubuntu machine, but interfaced through the MacBook UI! Now, we can open any folder, including the folder containing the notebook:</p>

<ol>
  <li>Click on <em>Explorer menu</em> (left menu bar) &gt; <em>Open Folder</em>.</li>
  <li>And, finally, we load our repository cloned in <code class="language-plaintext highlighter-rouge">~/git_repositories/linux_nvidia_egpu</code>.</li>
</ol>

<p>After selecting the <code class="language-plaintext highlighter-rouge">gpu</code> environment (kernel) for the <a href="https://github.com/mxagar/linux_nvidia_egpu/blob/main/test_gpu.ipynb">test_gpu.ipynb</a> notebook, we can start executing its cells.</p>

<p>Among other things, the notebook trains a simple CNN on the MNIST dataset (~45MB). In my tests, the NVIDIA RTX 3060 completed training in 37 seconds, while the MacBook Pro M1 took about 62 seconds; that is, the eGPU was roughly 1.7x faster.</p>
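<p>The same notebook code can run unchanged on both machines if the compute device is picked dynamically. Here is a minimal sketch (assuming PyTorch is installed, e.g., in the <code class="language-plaintext highlighter-rouge">gpu</code> environment; the function name is illustrative):</p>

```python
import torch

def pick_device() -> torch.device:
    """Return the best available device: CUDA (eGPU), MPS (Apple Silicon), or CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
# A dummy MNIST-shaped batch, moved to whichever device is available
model_input = torch.randn(8, 1, 28, 28).to(device)
print(device, model_input.shape)
```

<p>On the Ubuntu machine this resolves to <code class="language-plaintext highlighter-rouge">cuda</code>, on the MacBook to <code class="language-plaintext highlighter-rouge">mps</code>, so the same notebook benchmarks both setups.</p>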

<p>In terms of memory, my MacBook has 16GB of <em>unified memory</em>, vs. the 12GB of VRAM on the RTX 3060.
At first glance, the Mac’s chip seems superior to the NVIDIA one.
However, in practice, the NVIDIA GPU performs better for large models, because its VRAM is fully dedicated to GPU workloads, whereas the Mac’s unified memory is shared between CPU and GPU, which can lead to bottlenecks.</p>

<p align="center">
<!-- 80% of the viewport width, centered -->
<!-- Enable larger resolution images for Retina displays -->
<img src="/assets/linux_nvidia_egpu/mac_ubuntu_egpu_vscode@2x.png" srcset="/assets/linux_nvidia_egpu/mac_ubuntu_egpu_vscode@2x.png 2x, /assets/linux_nvidia_egpu/mac_ubuntu_egpu_vscode.png 1x" class="img-breakout" style="--w: 80vw" />
<small style="color:grey">Snapshot of the remote VS Code instance: the repository is on the Ubuntu machine leveraging the eGPU (hostname: <code>urgull</code>), but interfaced from my MacBook (hostname: <code>kasiopeia</code>).
</small>
</p>

<div style="height: 20px;"></div>
<p align="center">── ◆ ──</p>
<div style="height: 20px;"></div>

<p>Another application I use quite extensively with the eGPU is <a href="https://ollama.com/"><strong>Ollama</strong></a>, which enables <em>local</em> Large Language Models (LLMs) for a plethora of tasks.</p>

<p>To run Ollama on the eGPU but interfaced from the MacBook, first, we need to <a href="https://github.com/mxagar/linux_nvidia_egpu?tab=readme-ov-file#ollama-server-use-ollama-llms-running-on-the-gpu-ubuntu-but-from-another-machine">install it</a> properly on both machines. Then, we can follow these easy commands to start a chatbot via the CLI:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># -- Ubuntu (...or via ssh from MacBook)</span>
<span class="c"># Make sure Ollama uses GPU and is accessible in our LAN</span>
<span class="nb">export </span><span class="nv">OLLAMA_USE_GPU</span><span class="o">=</span>1
<span class="nb">export </span><span class="nv">OLLAMA_HOST</span><span class="o">=</span>0.0.0.0:11434
<span class="c"># Download a 9.1GB model (takes some minutes) and start the Ollama server</span>
ollama pull gemma3:12b
ollama serve &amp;

<span class="c"># -- MacBook</span>
<span class="c"># Change the Ollama host to the Ubuntu machine</span>
<span class="nb">export </span><span class="nv">OLLAMA_HOST</span><span class="o">=</span>urgull.local:11434
ollama run gemma3:12b
<span class="c"># ... now we can chat :)</span>

<span class="c"># To revert to use the local MacBook Ollama service</span>
<span class="nb">export </span><span class="nv">OLLAMA_HOST</span><span class="o">=</span>127.0.0.1:11434
</code></pre></div></div>

<p>The result is summarized in the following snapshot:</p>

<p align="center">
<img src="/assets/linux_nvidia_egpu/mac_ubuntu_egpu_ollama.png" alt="Ollama on eGPU" width="1000" />
<small style="color:grey"><a href="https://deepmind.google/models/gemma/gemma-3/">Gemma 3 12B</a> running on the Ubuntu eGPU via Ollama, but operated from the MacBook.
</small>
</p>

<p>The Ollama server can also be reached from our LAN using <code class="language-plaintext highlighter-rouge">cURL</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl http://urgull.local:11434/api/generate <span class="nt">-d</span> <span class="s1">'{
  "model": "gemma3:12b",
  "prompt": "Write a haiku about machine learning."
}'</span>
</code></pre></div></div>
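<p>The same endpoint can be called from Python using only the standard library; the sketch below builds a non-streaming request (the <code class="language-plaintext highlighter-rouge">stream: false</code> field asks the server for a single JSON response instead of a token stream; the network call is left commented out, since it requires reaching the Ollama host):</p>

```python
import json
from urllib import request

OLLAMA_URL = "http://urgull.local:11434/api/generate"  # our Ubuntu host

def build_payload(model: str, prompt: str) -> bytes:
    """Serialize a non-streaming /api/generate request body."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

payload = build_payload("gemma3:12b", "Write a haiku about machine learning.")

# Uncomment on a machine that can reach the Ollama server:
# req = request.Request(OLLAMA_URL, data=payload,
#                       headers={"Content-Type": "application/json"})
# with request.urlopen(req) as resp:
#     print(json.loads(resp.read())["response"])
```
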

<p>Of course, there are many other, more interesting downstream applications:</p>

<ul>
  <li>Private chatbot with proper GUI, history, document upload and internet access</li>
  <li>Copilot-style code completion</li>
  <li>Agents: Autonomous CLI Agents, Local Operator, Ollama MCP Agent, etc.</li>
  <li>…</li>
</ul>

<p>However, those applications are beyond the scope of this post; maybe I will introduce them in another one ;)</p>

<h2 id="wrap-up">Wrap Up</h2>

<p>In this post, I’ve shared the motivation and architecture behind my personal eGPU server setup — a compact yet powerful alternative to commercial “AI workstations”. The combination of a Linux GPU server (NVIDIA RTX 3060) and a MacBook (Pro M1) client creates a seamless environment for experimentation: you can train models efficiently, run large LLMs locally via Ollama, and enjoy the responsive UI and ecosystem of macOS for daily work.</p>

<p>Running models locally is not only cost-effective but also empowers you to work autonomously, privately, and creatively, without depending on cloud services.</p>

<p><br /></p>

<blockquote>
  <p>Would you prefer to build your own local AI workstation, or do you trust cloud services enough to rely on them entirely? What would be your ideal balance between local and cloud compute?</p>
</blockquote>

<p><br /></p>

<p>If you’re interested in a step-by-step guide, check <a href="https://github.com/mxagar/linux_nvidia_egpu"><strong>my Github repository of the project</strong></a>.</p>

<p><br /></p>

<div id="disqus_thread"></div>
<script>
    /**
    *  RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS.
    *  LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables    */
    
    var disqus_config = function () {
    this.page.url = 'https://mikelsagardia.io/blog/mac-os-ubuntu-nvidia-egpu.html';  // Replace PAGE_URL with your page's canonical URL variable
    this.page.identifier = 'https://mikelsagardia.io/blog/mac-os-ubuntu-nvidia-egpu.html'; // Replace PAGE_IDENTIFIER with your page's unique identifier variable
    };
    
    (function() { // DON'T EDIT BELOW THIS LINE
    var d = document, s = d.createElement('script');
    s.src = 'https://mikelsagardia.disqus.com/embed.js';
    s.setAttribute('data-timestamp', +new Date());
    (d.head || d.body).appendChild(s);
    })();
</script>

<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>]]></content><author><name></name></author><category term="eGPU," /><category term="AI" /><category term="engineering," /><category term="LLM," /><category term="machine" /><category term="learning," /><category term="Ollama," /><category term="Remote" /><category term="VS" /><category term="Code," /><category term="Linux," /><category term="deep" /><category term="model" /><category term="training," /><category term="inference" /><summary type="html"><![CDATA[How to Run and Train LLMs Locally with NVIDIA Chips from a Mac &amp; Linux Setup]]></summary></entry><entry><title type="html">An Infinite Text Generator</title><link href="https://mikelsagardia.io/blog/text-generation-rnn.html" rel="alternate" type="text/html" title="An Infinite Text Generator" /><published>2022-10-08T07:30:00+00:00</published><updated>2022-10-08T07:30:00+00:00</updated><id>https://mikelsagardia.io/blog/text-generation-rnn-lstm</id><content type="html" xml:base="https://mikelsagardia.io/blog/text-generation-rnn.html"><![CDATA[<p style="color: #777; font-style: italic; font-size: 1.5em; margin-top: 0.5em;">
  A Toy Recurrent Neural Network Based on LSTM Cells Which Generates TV Scripts
</p>

<!--
<div style="line-height:150%;">
    <br>
</div>
-->

<p align="center">
<img src="/assets/text_generation_rnn/Chimpanzee_seated_at_typewriter.jpg" alt="A chimpanzee seated at a typewriter" width="1000" />
<small style="color:grey">"If you give me an infinite number of bananas I'll type <em>banana</em> for you." Photo from <a href="https://commons.wikimedia.org/wiki/File:Chimpanzee_seated_at_typewriter.jpg">Wikimedia</a>.</small>
</p>

<p>The <a href="https://en.wikipedia.org/wiki/Infinite_monkey_theorem">infinite monkey theorem</a> states that a monkey writing random letters on a keyboard long enough can reproduce the complete works of Shakespeare. There is even a straightforward proof when <em>long enough</em> tends to infinity.</p>

<p>Now, I don’t plan to have monkeys in my cellar and I surely don’t have infinite time. But could neural networks perhaps aid in that enterprise? It turns out they can, and they are astonishingly effective even with little tweaking effort.</p>

<p><br /></p>

<blockquote>
  <p>Deep neural networks are amazingly good at learning patterns and one can take advantage of that to generate new and structurally coherent data.</p>
</blockquote>

<p><br /></p>

<p>Inspired by the <a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">great post</a> from <a href="https://karpathy.ai/">Andrej Karpathy</a> in which he describes how <a href="https://github.com/karpathy/char-rnn">text can be generated character-wise</a>, I implemented a <em>word-wise</em> text generator which works with Recurrent Neural Networks (RNNs). My code can be found in <a href="https://github.com/mxagar/text_generator"><strong>this Github repository</strong></a>.</p>

<p>Are you interested in how this is possible? Let’s dive in!</p>

<h2 id="recursive-neural-networks-and-their-application-to-language-modeling">Recurrent Neural Networks and Their Application to Language Modeling</h2>

<p>While <a href="https://en.wikipedia.org/wiki/Convolutional_neural_network">Convolutional Neural Networks (CNNs)</a> are particularly good at capturing spatial relationships, <a href="https://en.wikipedia.org/wiki/Recurrent_neural_network">Recurrent Neural Networks (RNNs)</a> model sequential structures very efficiently. Also, in recent years, the <a href="https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)">Transformer</a> architecture has been shown to work remarkably well with language data – but let’s keep it aside for this small toy project.</p>

<p>In many language modeling applications, and in the particular text generation case explained here, we need to undertake the following general steps:</p>

<ul>
  <li>The text needs to be <strong>processed</strong> as sequences of numerical vectors.</li>
  <li>We define <strong>recurrent layers</strong> which take those sequences of vectors and yield sequences of outputs.</li>
  <li>We take the complete or partial output sequence and we <strong>map it to the target space</strong>, e.g., words.</li>
</ul>

<p>Let’s analyze in more detail what happens in each step.</p>

<h3 id="text-preprocessing">Text Preprocessing</h3>

<p>Computers are able to work only with numbers. In the same way that an image is represented as a matrix of pixels containing <code class="language-plaintext highlighter-rouge">R-G-B</code> values, sentences need to be transformed into numerical values. One common recipe to achieve that is the following:</p>

<ol>
  <li>The text is <a href="https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization"><strong>tokenized</strong></a>: it is converted into a list of elements or tokens that have an identifiable unique meaning; these elements are usually words and related symbols, such as question marks or other punctuation elements.</li>
  <li>A <strong>vocabulary</strong> is created: we generate a dictionary with all the <code class="language-plaintext highlighter-rouge">n</code> unique tokens in the dataset which maps from the token string to an <code class="language-plaintext highlighter-rouge">id</code> and vice versa.</li>
  <li>Tokens are <strong>vectorized</strong>: tokens can be represented as <strong>one-hot encoded</strong> vectors, i.e., each of them becomes a vector of size <code class="language-plaintext highlighter-rouge">n</code> which contains all <code class="language-plaintext highlighter-rouge">0</code>-s except in the index/cell which corresponds to the token <code class="language-plaintext highlighter-rouge">id</code> in the vocabulary, where the value <code class="language-plaintext highlighter-rouge">1</code> is assigned. Then, those one-hot encoded vectors can be compressed to an <a href="https://en.wikipedia.org/wiki/Embedding"><strong>embedding space</strong></a> consisting of vectors of size <code class="language-plaintext highlighter-rouge">m</code>, with <code class="language-plaintext highlighter-rouge">m &lt;&lt; n</code>. Those embedded vectors contain floating point numbers, i.e., unlike their one-hot encoded versions, they are not <em>sparse</em>. That mapping is achieved with an embedding layer, which is akin to a linear layer, and it considerably improves the model efficiency. Typical reference sizes are <code class="language-plaintext highlighter-rouge">n = 70,000</code>, <code class="language-plaintext highlighter-rouge">m = 300</code>.</li>
</ol>

<p>Note that, in practice, one-hot encoding the tokens can be skipped. Instead, tokens are represented with their <code class="language-plaintext highlighter-rouge">id</code> or <code class="language-plaintext highlighter-rouge">index</code> values in the vocabulary and the embedding layer handles everything with that information. That is possible because each token has a unique <code class="language-plaintext highlighter-rouge">id</code> value which triggers <code class="language-plaintext highlighter-rouge">m</code> unique weights only. The following figure illustrates that idea and the overall vectorization of the tokens:</p>
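<p>The token-to-id-to-vector path can be sketched in a few lines of PyTorch (the sizes here are illustrative, not the <code class="language-plaintext highlighter-rouge">n = 70,000</code> / <code class="language-plaintext highlighter-rouge">m = 300</code> reference values):</p>

```python
import torch
import torch.nn as nn

# 1. Tokenize and build a vocabulary (token -> id)
tokens = "the dog is eating a bone".split()
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}

# 2. Represent tokens by their ids; no explicit one-hot vectors are needed
ids = torch.tensor([vocab[t] for t in tokens])

# 3. The embedding layer maps each id to a dense vector of size m
m = 4
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=m)
vectors = embedding(ids)  # shape: (sequence length, m)
print(vectors.shape)
```

<p>Note that the embedding weights start random; they are learned jointly with the rest of the network during training.</p>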

<div style="line-height:150%;">
    <br />
</div>

<p align="center">
<img src="/assets/text_generation_rnn/Embeddings.png" alt="Text vectorization: the word 'dog' converted into an embedding vector" width="600" />
<br />
<small style="color:grey">Text vectorization: the word "dog" converted into an embedding vector. Image by the author.</small>
</p>

<div style="line-height:150%;">
    <br />
</div>

<h3 id="recurrent-neural-networks">Recurrent Neural Networks</h3>

<p>Once we have sequences of vectorized tokens, we can feed them to recurrent layers that learn patterns from them. For instance, in our word-wise text generator, we might input a sequence like</p>

<p><code class="language-plaintext highlighter-rouge">The</code>, <code class="language-plaintext highlighter-rouge">dog</code>, <code class="language-plaintext highlighter-rouge">is</code>, <code class="language-plaintext highlighter-rouge">eating</code>, <code class="language-plaintext highlighter-rouge">a</code></p>

<p>and make the model learn to output the target token <code class="language-plaintext highlighter-rouge">bone</code>. In other words, the network is trained to predict the likeliest vector(s) given the sequence of vectors we have shown it.</p>

<p>Recurrent layers are characterized by the following properties:</p>

<ul>
  <li>Vectors of each sequence are fed one by one to them.</li>
  <li>Neurons that compose those layers keep a <em>memory state</em>, also known as <em>hidden state</em>.</li>
  <li>The memory state from the previous step, i.e., the one produced by the previous vector in the sequence, is used in the current step to produce a new output and a new memory state.</li>
</ul>

<p>The most basic recurrent layer is the <a href="https://en.wikipedia.org/wiki/Recurrent_neural_network">Simple RNN or Elman Network</a>, depicted in the following figure:</p>

<div style="line-height:150%;">
    <br />
</div>

<p align="center">
<img src="/assets/text_generation_rnn/SimpleRNN.png" alt="The model of a Simple Recurrent Neural Network or Elman Network" width="600" />
<br />
<small style="color:grey">The model of a Simple Recurrent Neural Network or Elman Network. Image by the author.</small>
</p>

<div style="line-height:150%;">
    <br />
</div>

<p>In the picture, we can see that we have 3 vectors for each time step \(t\): the input \(x\), the output \(y\) and the memory state \(s\). Additionally, the previous memory state is used together with the current input to generate the new memory state, and that new memory state is mapped to be the output. In that process, 3 weight matrices are used (\(W_x\), \(W_y\) and \(W_s\)), which are learned during training.</p>
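<p>Written directly from the figure, one recurrence step combines the previous memory state with the current input. A minimal sketch with illustrative sizes (biases and an output activation are omitted for brevity):</p>

```python
import torch

input_size, state_size, output_size = 3, 4, 2
W_x = torch.randn(state_size, input_size)   # input -> state
W_s = torch.randn(state_size, state_size)   # previous state -> state
W_y = torch.randn(output_size, state_size)  # state -> output

def elman_step(x_t, s_prev):
    """One recurrence step: new state from input + previous state, output from new state."""
    s_t = torch.tanh(W_x @ x_t + W_s @ s_prev)
    y_t = W_y @ s_t
    return y_t, s_t

s = torch.zeros(state_size)
for x in torch.randn(5, input_size):  # a sequence of 5 input vectors
    y, s = elman_step(x, s)
print(y.shape, s.shape)
```
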

<p>Unfortunately, simple RNNs or Elman networks suffer from the <a href="https://en.wikipedia.org/wiki/Vanishing_gradient_problem"><strong>vanishing gradient</strong></a> problem; due to that, in practice, they can reuse only 8-10 previous steps. Luckily, <a href="https://en.wikipedia.org/wiki/Long_short-term_memory"><strong>Long Short-Term Memory (LSTM) units</strong></a> were introduced by Hochreiter and Schmidhuber in 1997. LSTMs efficiently alleviate the vanishing gradient issue and are able to handle more than 1,000 previous steps.</p>

<p>LSTM cells are differentiable units that perform several operations every step; those operations decide which information is removed from memory, which is kept in it, and which is used to form an output. They split the memory into two types, as shown in the next figure:</p>

<ul>
  <li>short-term memory, which captures recent inputs and outputs,</li>
  <li>and long-term memory, which captures the context.</li>
</ul>

<div style="line-height:150%;">
    <br />
</div>

<p align="center">
<img src="/assets/text_generation_rnn/LSTMs.png" alt="The abstract model of Long Short-Term Memory (LSTM) unit" width="600" />
<br />
<small style="color:grey">The abstract model of Long Short-Term Memory (LSTM) unit. Image by the author.</small>
</p>

<div style="line-height:150%;">
    <br />
</div>

<p>Therefore, we have:</p>

<ul>
  <li>Three inputs:
    <ul>
      <li>signal/event: \(x_t\)</li>
      <li>previous short-term memory: \(h_{t-1}\)</li>
      <li>previous long-term memory : \(C_{t-1}\)</li>
    </ul>
  </li>
  <li>Three outputs:
    <ul>
      <li>transformed signal or output: \(y_t = h_t\)</li>
      <li>current/updated short-term memory: \(h_t\)</li>
      <li>current/updated long-term memory: \(C_t\)</li>
    </ul>
  </li>
</ul>

<p>Note that the updated short-term memory is the signal output, too!</p>

<p>All 3 inputs are used in the cell in <strong>4 different and interconnected gates</strong> to generate the 3 outputs; these internal gates are:</p>

<ul>
  <li>The <strong>forget</strong> gate, where useless parts of the previous long-term memory are forgotten, creating a <em>lighter</em> long-term memory.</li>
  <li>The <strong>learn</strong> gate, where the previous short-term memory and the current event are learned.</li>
  <li>The <strong>remember</strong> gate, in which we mix the <em>light</em> long-term memory with forgotten parts and the learned information to form the new long-term memory.</li>
  <li>The <strong>use</strong> gate, in which, similarly, we mix the <em>light</em> long-term memory with forgotten parts and the learned information to form the new short-term memory.</li>
</ul>

<p>If you are interested in more detailed information, <a href="https://colah.github.io/">Christopher Olah</a> has a great post which explains what’s exactly happening inside an LSTM unit: <a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">Understanding LSTM Networks</a>. Also, note that a simpler but similarly efficient alternative to LSTM cells are <a href="https://en.wikipedia.org/wiki/Gated_recurrent_unit"><strong>Gated Recurrent Units (GRUs)</strong></a>.</p>

<p>From a pragmatic point of view, it suffices to know that LSTM units have short- and long-term memory vectors which are automatically passed from the previous to the current step. Additionally, the output of the cell is the short-term memory or hidden state, and since we input a <em>sequence</em> of embedded vectors to the unit, we obtain a <em>sequence</em> of hidden vectors.</p>
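<p>In PyTorch, this pragmatic view maps directly onto <code class="language-plaintext highlighter-rouge">nn.LSTM</code>: a sequence of embedded vectors goes in, and a sequence of hidden states comes out, together with the final short- and long-term memories. A sketch with illustrative sizes:</p>

```python
import torch
import torch.nn as nn

m, hidden_size, seq_len = 3, 4, 5
lstm = nn.LSTM(input_size=m, hidden_size=hidden_size, batch_first=True)

x = torch.randn(1, seq_len, m)   # one batch of 5 embedded vectors
out, (h_n, c_n) = lstm(x)        # out: the hidden state at every step

print(out.shape)  # (1, 5, 4): one hidden vector per input vector
print(h_n.shape)  # (1, 1, 4): final short-term memory (hidden state)
print(c_n.shape)  # (1, 1, 4): final long-term memory (cell state)
```
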

<h3 id="final-mapping-and-putting-it-all-together">Final Mapping and Putting It All Together</h3>

<p>Usually, 2-3 RNN layers are stacked one after the other and the final output vector sequence can be mapped to the desired target space. For instance, in the case of the text generation example, I have used a fully connected layer which transforms the <em>last vector from the output sequence</em> to <em>one vector of the size of the vocabulary</em>; thus, given a sequence of words/tokens, the model is fit to predict the next most likely one.</p>

<div style="line-height:150%;">
    <br />
</div>

<p align="center">
<img src="/assets/text_generation_rnn/TextGeneration.png" alt="The complete text generation pipeline" width="800" />
<br />
<small style="color:grey">A complete text generation pipeline. In the example, the vocabulary size is n = 10 and we pass a sequence of 5 tokens to the network. The embedding size is m = 3 and the hidden states have a size of 4. Image by the author.</small>
</p>

<div style="line-height:150%;">
    <br />
</div>

<p>As already mentioned, the output of an LSTM cell is a sequence of hidden states; the length of that sequence is the same as the length of the input sequence and each vector has the size of a hidden state, which can be different than the embedding dimension <code class="language-plaintext highlighter-rouge">m</code> (that hidden dimension is a <a href="https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)"><em>hyperparameter</em></a> we can modify). Since in our application we only take the last hidden state from that sequence, the defined RNN architecture is of the type <a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/"><em>many-to-one</em></a>. However, other types of architectures can be designed thanks to the sequential nature of the RNNs; for instance, we can implement a <em>many-to-many</em> mapping, which is used to perform language translation, or <em>one-to-many</em>, employed in <a href="https://github.com/mxagar/image_captioning">image captioning</a>.</p>
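<p>The many-to-one pipeline from the figure fits in a compact module; the following is a sketch using the figure’s sizes (n = 10, m = 3, hidden size 4), not the exact implementation from the repository:</p>

```python
import torch
import torch.nn as nn

class WordGenerator(nn.Module):
    """Embedding -> LSTM -> fully connected layer over the last hidden state."""
    def __init__(self, vocab_size=10, embed_dim=3, hidden_size=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)  # (batch, seq, embed_dim)
        out, _ = self.lstm(embedded)          # (batch, seq, hidden_size)
        return self.fc(out[:, -1, :])         # last hidden state -> vocab scores

model = WordGenerator()
scores = model(torch.randint(0, 10, (2, 5)))  # 2 sequences of 5 token ids
print(scores.shape)  # (2, 10): one score per vocabulary token
```

<p>Training then amounts to minimizing a cross-entropy loss between these scores and the id of the true next token.</p>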

<p>At the end of the day, we need to gather the dataset we’d like to fit, apply the matrix mappings that relate the input features to the target values, and learn the weights within those matrices by optimization. With RNNs, we additionally need to consider that we are working with sequences.</p>

<p>After seeing a sequence of tokens, the trained model is able to infer the likelihood of each token in the vocabulary to be the next one. That functionality is wrapped in a text generation application that works as follows:</p>

<ol>
  <li>We define an initial sequence filled with the padding token and allocate in its last cell a priming word/token. The padding token is a placeholder or <em>empty</em> symbol, whereas the priming token is the seed with which the model will start to generate text.</li>
  <li>The sequence is fed to the network and it produces the probabilities for all possible tokens. We take a random token from the 5 most likely ones: that is the first generated token/word.</li>
  <li>The previous input sequence is rolled one element to the front and the last generated token is inserted in the last position.</li>
  <li>We repeat steps 2 and 3 until we generate the number of words we want.</li>
</ol>
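<p>The four steps above can be sketched as a loop; here a stub “model” stands in for the trained network, and the names <code class="language-plaintext highlighter-rouge">pad_id</code> and <code class="language-plaintext highlighter-rouge">top_k</code> are illustrative:</p>

```python
import random

def generate(model, prime_id, vocab_size, seq_len=5, n_words=10, pad_id=0, top_k=5):
    """Roll a fixed-length window over generated ids, sampling from the top-k tokens."""
    window = [pad_id] * (seq_len - 1) + [prime_id]   # step 1: padded seed sequence
    generated = [prime_id]
    for _ in range(n_words):
        probs = model(window)                        # step 2: per-token probabilities
        top = sorted(range(vocab_size), key=lambda i: probs[i], reverse=True)[:top_k]
        next_id = random.choice(top)                 # pick among the k likeliest
        generated.append(next_id)
        window = window[1:] + [next_id]              # step 3: roll the window
    return generated                                 # step 4: repeat until done

# A stub model that simply prefers higher token ids, for illustration:
stub = lambda window: [i / 10 for i in range(10)]
print(generate(stub, prime_id=3, vocab_size=10))
```

<p>Sampling among the top 5 tokens instead of always taking the likeliest one keeps the generated text from looping over the same phrases.</p>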

<h2 id="results">Results</h2>

<p>To train the network, I used the <a href="https://www.kaggle.com/datasets/thec03u5/seinfeld-chronicles">Seinfeld Chronicles Dataset from Kaggle</a>, which contains the complete scripts from the <a href="https://en.wikipedia.org/wiki/Seinfeld">Seinfeld TV Show</a>. To be honest, I’ve never watched Seinfeld, but the conversations do seem structurally fine :sweat_smile:</p>

<p>You can judge it by yourself:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>jerry: you know, it's the way i can do. i don't know what the hell happened.

jerry: what?

george: what about it?

elaine: i think you could be able to get out of here.

jerry: oh, i can't do anything about the guy.

jerry: what?

george:(smiling) yeah..........

george: you know, you should do the same thing.

jerry: i think i can.

jerry: oh, no, no! no. no.

jerry: i don't know.(to the phone) what do you think?

george: what?

jerry: oh, i think you're not a good friend.

jerry: yeah.

jerry: oh, you can't.

jerry:(to the phone) hey, hey, hey!

jerry:(to jerry) hey hey hey, hey!

george: hey, i can't believe i was gonna have to do that.

george: i don't know how much this is.

kramer:(smiling to jerry) i don't know, i'm not gonna get it.

kramer:(pointing) oh!(starts maniacally pleased to himself, and exits) oh, my god, i don't know!

elaine:(pause) i can't believe i can't. i don't know how much i mean, i was just thinking about this thing! i mean, i'm gonna take it.

george: you know what you want?

elaine: oh yeah, well, i'm gonna go see the way to get it.

elaine: oh yeah, well, i am not gonna get a little uncomfortable for the.

george: what?

george: oh. i don't know what the problem is.

george:(smiling, to himself, he looks in his head.

george: i can't believe you said it was an accident.

elaine: yeah, but you should take some more
</code></pre></div></div>

<h2 id="conclusions">Conclusions</h2>

<p>In this blog post I explain how the <a href="https://github.com/mxagar/text_generator">toy word-wise text generator I implemented</a> works. The application uses Recurrent Neural Networks (RNNs) consisting of Long Short-Term Memory (LSTM) units; the parts and steps developed for it are common to many Natural Language Processing (NLP) applications, such as <a href="https://github.com/mxagar/text_sentiment">sentiment analysis</a> or <a href="https://github.com/mxagar/image_captioning">image captioning</a>, and I try to answer the central questions around them:</p>

<ul>
  <li>Text processing: what tokenization and vocabulary generation are, and why we need to vectorize words in embedding spaces.</li>
  <li>RNNs and LSTM units: what these recurrent layers do and the shape of their inputs and outputs.</li>
  <li>Final sequence mapping: how the outputs from recurrent layers can be transformed into the target space.</li>
</ul>

<p>I trained the model with the <a href="https://www.kaggle.com/datasets/thec03u5/seinfeld-chronicles">Seinfeld Chronicles Dataset from Kaggle</a> and, although the generated text doesn’t make complete sense, the dialogues seem structurally similar to the ones in the dataset; in some cases, I read 1-3 sentences and I can almost hear the sitcom laugh track in the background :joy:</p>

<p><br /></p>

<blockquote>
  <p>Which text would you like to capture and regenerate?</p>
</blockquote>

<p><br /></p>

<p>If you’re interested in more technical details related to the topic, you can have a look at <a href="https://github.com/mxagar/text_generator"><strong>Github repository of the project</strong></a>. Also, if you’d like to see how a very similar architecture as the one used here can be employed to generate text descriptions of image contents, you can have a look at my <a href="https://github.com/mxagar/image_captioning">image captioning project</a>.</p>

<p><br /></p>

<div id="disqus_thread"></div>
<script>
    /**
    *  RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS.
    *  LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables    */
    
    var disqus_config = function () {
    this.page.url = 'https://mikelsagardia.io/blog/text-generation-rnn.html';  // Replace PAGE_URL with your page's canonical URL variable
    this.page.identifier = 'https://mikelsagardia.io/blog/text-generation-rnn.html'; // Replace PAGE_IDENTIFIER with your page's unique identifier variable
    };
    
    (function() { // DON'T EDIT BELOW THIS LINE
    var d = document, s = d.createElement('script');
    s.src = 'https://mikelsagardia.disqus.com/embed.js';
    s.setAttribute('data-timestamp', +new Date());
    (d.head || d.body).appendChild(s);
    })();
</script>

<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>]]></content><author><name></name></author><category term="data" /><category term="science," /><category term="machine" /><category term="learning," /><category term="feature" /><category term="engineering," /><category term="modelling," /><category term="neural" /><category term="networks," /><category term="deep" /><category term="natural" /><category term="language" /><category term="processing," /><category term="recurrent" /><category term="generative" /><category term="model," /><category term="RNN," /><category term="LSTM," /><category term="TV" /><category term="script" /><summary type="html"><![CDATA[A Toy Recurrent Neural Network Based on LSTM Cells Which Generates TV Scripts]]></summary></entry><entry><title type="html">From Jupyter Notebooks to Production-Level Code</title><link href="https://mikelsagardia.io/blog/machine-learning-production-level.html" rel="alternate" type="text/html" title="From Jupyter Notebooks to Production-Level Code" /><published>2022-09-23T07:30:00+00:00</published><updated>2022-09-23T07:30:00+00:00</updated><id>https://mikelsagardia.io/blog/machine-learning-production-level</id><content type="html" xml:base="https://mikelsagardia.io/blog/machine-learning-production-level.html"><![CDATA[<p style="color: #777; font-style: italic; font-size: 1.5em; margin-top: 0.5em;">
  A Boilerplate Package to Transform Machine Learning Research Notebooks into Deployable Pipelines
</p>

<!--
<div style="line-height:150%;">
    <br>
</div>
-->

<p align="center">
<img src="/assets/machine_learning_production/notebook.jpg" alt="A notebook" width="1000" />
<small style="color:grey">A glimpse to my current notebook. Photo by the author.</small>
</p>

<p>I love writing and drawing on my notebook. In there, you’ll find not only formulas or flow charts, but also funny cartoons, interminable lists of things I’d like to do, shopping lists, or important scribbles my kids leave me every now and then. Therefore, one could say it is a unique window to what’s going on in my mind and life.</p>

<p>I think something similar happens with the <a href="https://jupyter.org">Jupyter notebooks</a> commonly used in data science: they are great because it’s very easy to try new ideas with code in them, you jot down notes beside the features you engineered or the models you tried, and everything is visually great – but the produced content often grows chaotically and it ends up being unusable in real life without proper modifications.</p>

<div style="line-height:150%;">
    <br />
</div>

<p align="center">
<img src="/assets/machine_learning_production/sj-YDvfndOs4IQ-unsplash.jpg" alt="Chocolate cookies: Photo by @sjcbrn on Unsplash" width="1000" />
<small style="color:grey">Photo by <a href="https://unsplash.com/@sjcbrn">SJ</a> on <a href="https://unsplash.com/photos/YDvfndOs4IQ">Unsplash</a>.</small>
</p>

<p>I have also noticed that I become sloppier and lazier when I spend too long around notebooks; it’s like leaving your vegetables unfinished and indulging in cookies for dessert. And then, you try to fit into that wedding suit and realize it somehow shrunk.</p>

<p><br /></p>

<blockquote>
  <p>Jupyter Notebooks are like chocolate cookies: You know you should eat them in moderation, but you can’t help sneaking the last one again.</p>
</blockquote>

<p><br /></p>

<h2 id="applying-software-engineering-and-devops-to-research-code">Applying Software Engineering and DevOps to Research Code</h2>

<p>Food metaphors aside, and using the jargon of the Software Engineering world, Jupyter notebooks belong to <strong>research and development environments</strong>, whereas deployed code belongs to <strong>production environments</strong>. Most data science projects never leave the research environment, because their goal is to provide useful insights. However, when the created models need to be used for online predictions with new data, we need to raise the code and infrastructure quality to production standards, characterized by a guarantee of reliability.</p>

<p>Machine learning systems have particular properties that present new challenges in production, as <a href="https://proceedings.neurips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf">Sculley et al.</a> pointed out in the work that laid the motivational foundations of what is becoming the field of <a href="https://en.wikipedia.org/wiki/MLOps">MLOps</a>. Many tools which target those specific needs have appeared in recent years; those tools and the applications which use them are often categorized into maturity levels:</p>

<ul>
  <li>Level 0 (research and development): data analysis and modeling is performed to answer business questions, but the models are not used to perform online inferences.</li>
  <li>Level 1 (production): the inference pipeline is deployed manually and the artifact versions are tracked (models, data, code, etc.) and pipeline outputs monitored.</li>
  <li>Level 2 (<em>very serious</em> production): deployments of training and inference pipelines are done automatically and frequently, enabling large-scale continuously updated applications.</li>
</ul>

<p>Small/medium-sized projects (teams of 1-20 people) typically require level 1 maturity, and the companies where they are implemented often don’t have the resources to go for level 2.</p>

<p>In this article, <strong>I present a standardized way of transforming research notebooks into production-level code</strong>; in MLOps maturity levels that represents the journey from level 0 to 1. To that end, I have implemented <strong>a boilerplate project with production-ready quality that can be cloned from this <a href="https://github.com/mxagar/customer_churn_production">Github repository</a></strong>.</p>

<p>The selected business case consists of analyzing <strong>customer churn</strong> using the <a href="https://www.kaggle.com/datasets/sakshigoyal7/credit-card-customers/code">Credit Card Customers</a> dataset from <a href="https://www.kaggle.com/">Kaggle</a>. Data analysis, modeling and inference pipelines are implemented in the project to end up with an interpretable model-pipeline that is also able to perform reliable predictions. However, the package is designed so that the business case and the data analysis can be easily replaced, and the focus lies on providing a template with the following properties:</p>

<ul>
  <li>Structure which reflects the typical steps in a small/medium-sized data science project</li>
  <li>Readable, simple, concise code</li>
  <li>PEP8 conventions applied, checked with <a href="https://pypi.org/project/pylint/">pylint</a> and <a href="https://pypi.org/project/autopep8/">autopep8</a></li>
  <li>Modular and efficient code, with Object-Oriented patterns</li>
  <li>Documentation provided at different stages: code, <code class="language-plaintext highlighter-rouge">README</code> files, etc.</li>
  <li>Error/exception handling</li>
  <li>Execution and data testing with <a href="https://docs.pytest.org/en/7.1.x/">pytest</a></li>
  <li>Logging implemented during production execution and testing</li>
  <li>Dependencies controlled for custom environments</li>
  <li>Installable python package</li>
  <li>Basic containerization with <a href="https://www.docker.com/">Docker</a></li>
</ul>

<p>However, a few properties are still missing to reach full level 1:</p>

<ul>
  <li>Deployment of the pipeline</li>
  <li>Tracking of the generated artifacts (model-pipelines, data, etc.)</li>
  <li>Monitoring of the model (drift)</li>
</ul>

<p>Those are fundamental attributes, but I consider them out of scope for this article/project, because they often rely on additional 3rd-party tools. My goal is to provide a template to transform notebook code into professional software using as few additional tools as possible; after that, we have a solid base on which to add more layers that take care of tracking and monitoring the different elements.</p>

<h3 id="the-boilerplate">The Boilerplate</h3>

<p>The boilerplate project from the <a href="https://github.com/mxagar/customer_churn_production">Github repository</a> has the following basic file structure:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.
├── README.md                         # Package description, usage, etc.
├── churn_notebook.ipynb              # Research notebook
├── config.yaml                       # Configuration file for production
├── customer_churn/                   # Production library, package
│   ├── __init__.py                   # Python package file         
│   ├── churn_library.py              # Production library
│   └── transformations.py            # Utilities for the library
├── data/                             # Dataset folder
│   ├── README.md                     # Dataset details
│   └── bank_data.csv                 # Dataset file
├── main.py                           # Executable of production code
├── requirements.txt                  # Dependencies
├── setup.py                          # Python package file
└── tests/                            # Pytest testing scripts
    ├── __init__.py                   # Python package file
    ├── conftest.py                   # Pytest fixtures
    └── test_churn_library.py         # Tests for churn_library.py
</code></pre></div></div>

<p>All the research work of the project is contained in the notebook <code class="language-plaintext highlighter-rouge">churn_notebook.ipynb</code>; in particular, simplified implementations of the typical data processing and modeling tasks are performed:</p>

<ol>
  <li>Data Acquisition/Import</li>
  <li>Exploratory Data Analysis (EDA)</li>
  <li>Data Processing: Data Cleaning, Feature Engineering (FE)</li>
  <li>Data Modelling: Training, Evaluation, Interpretation</li>
  <li>Model Scoring: Inference</li>
</ol>

<p>The code from <code class="language-plaintext highlighter-rouge">churn_notebook.ipynb</code> has been transformed to create the package <code class="language-plaintext highlighter-rouge">customer_churn</code>, which contains two files:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">churn_library.py</code>: this file contains most of the refactored and modified code from the notebook.</li>
  <li><code class="language-plaintext highlighter-rouge">transformations.py</code>: definition of auxiliary transformations used in the data processing; complex operations on the data are implemented in Object-Oriented style so that they can be cleanly applied as with the <a href="https://scikit-learn.org/stable/modules/preprocessing.html"><code class="language-plaintext highlighter-rouge">sklearn.preprocessing</code></a> package.</li>
</ul>

<p>Additionally, a <code class="language-plaintext highlighter-rouge">tests</code> folder is provided, which contains <code class="language-plaintext highlighter-rouge">test_churn_library.py</code>. This script performs unit tests on the different functions of <code class="language-plaintext highlighter-rouge">churn_library.py</code> using <a href="https://docs.pytest.org/">pytest</a>.</p>
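<p>To illustrate that testing setup, here is a minimal sketch of a <a href="https://docs.pytest.org/">pytest</a> unit test in the style of <code class="language-plaintext highlighter-rouge">test_churn_library.py</code>; the function and fixture below are simplified stand-ins, not the exact API of the package:</p>

```python
# Hypothetical sketch: import_data() is a simplified stand-in for the
# real function in churn_library.py; the fixture plays the role of the
# ones defined in conftest.py.
import io

import pandas as pd
import pytest


def import_data(csv_source):
    """Load the dataset and fail early if it is empty."""
    df = pd.read_csv(csv_source)
    if df.empty:
        raise ValueError("The imported dataset is empty.")
    return df


@pytest.fixture
def sample_csv():
    # In the real project this would be the path to data/bank_data.csv
    return io.StringIO("Customer_Age,Credit_Limit\n45,12691\n49,8256\n")


def test_import_data(sample_csv):
    df = import_data(sample_csv)
    # Basic sanity checks on shape and expected columns
    assert df.shape[0] > 0
    assert "Customer_Age" in df.columns
```

<p>Such tests are run with <code class="language-plaintext highlighter-rouge">pytest tests/</code> from the repository root; in the actual project, logging calls inside each test record successes and failures.</p>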

<p>The executable or <code class="language-plaintext highlighter-rouge">main</code> function is provided in <code class="language-plaintext highlighter-rouge">main.py</code>; this script imports the package <code class="language-plaintext highlighter-rouge">customer_churn</code> and runs three functions from <code class="language-plaintext highlighter-rouge">churn_library.py</code>:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">run_setup()</code>: the configuration file <code class="language-plaintext highlighter-rouge">config.yaml</code> is loaded and auxiliary folders are created, if not there yet:
    <ul>
      <li><code class="language-plaintext highlighter-rouge">images</code>: it will contain the images of the EDA and the model evaluation.</li>
      <li><code class="language-plaintext highlighter-rouge">models</code>: it will contain the inference models/pipelines as serialized objects (pickles).</li>
      <li><code class="language-plaintext highlighter-rouge">artifacts</code>: it will contain the data processing parameters created during the training and required for the inference, serialized as pickles.</li>
    </ul>
  </li>
  <li><code class="language-plaintext highlighter-rouge">run_training()</code>: it performs the EDA, the data checks, the data processing and modeling, and it generates the inference artifacts (the model/pipeline), which are persisted as serialized objects (pickles). In the provided example, logistic regression, support vector machines and random forests are optimized in a grid search to find the best set of hyperparameters.</li>
  <li><code class="language-plaintext highlighter-rouge">run_inference()</code>: it shows how the inference artifacts need to be used to perform a prediction; an exemplary dataset sample created during the training is used.</li>
</ul>
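<p>The flow of <code class="language-plaintext highlighter-rouge">main.py</code> can be sketched as follows; the real functions live in <code class="language-plaintext highlighter-rouge">churn_library.py</code> and read their parameters from <code class="language-plaintext highlighter-rouge">config.yaml</code>, so the bodies below are simplified stubs:</p>

```python
# Self-contained sketch of the main.py flow; function names mirror the
# package, but the implementations are illustrative stubs.
from pathlib import Path


def run_setup(config):
    # Create the auxiliary folders if they are not there yet
    for key in ("images", "models", "artifacts"):
        Path(config[key]).mkdir(parents=True, exist_ok=True)


def run_training(config):
    # EDA, data checks, data processing and modeling; persists the
    # inference artifacts (model/pipeline, processing parameters) as pickles
    pass


def run_inference(config):
    # Load the persisted artifacts and score new data
    pass


if __name__ == "__main__":
    import tempfile
    with tempfile.TemporaryDirectory() as tmp:
        config = {key: f"{tmp}/{key}" for key in ("images", "models", "artifacts")}
        run_setup(config)
        run_training(config)
        run_inference(config)
```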

<p>The following diagram shows the workflow:</p>

<div style="line-height:150%;">
    <br />
</div>

<p align="center">
<img src="/assets/machine_learning_production/pipeline_diagram.png" alt="Diagram of the boilerplate package functions" width="600" />
<!--
<small style="color:grey">Diagram of the boilerplate package functions. Image by the author.</small>
-->
</p>

<div style="line-height:150%;">
    <br />
</div>
<p>The training and inference pipelines represented by <code class="language-plaintext highlighter-rouge">run_training()</code> and <code class="language-plaintext highlighter-rouge">run_inference()</code> run one after the other, but their symmetry is clear; in fact, a central property of the package is that <code class="language-plaintext highlighter-rouge">run_training()</code> and <code class="language-plaintext highlighter-rouge">run_inference()</code> share the function <code class="language-plaintext highlighter-rouge">perform_data_processing()</code>. When <code class="language-plaintext highlighter-rouge">perform_data_processing()</code> is executed in <code class="language-plaintext highlighter-rouge">run_training()</code>, it generates the processing parameters and stores them to disk. In contrast, when it is executed in <code class="language-plaintext highlighter-rouge">run_inference()</code>, it loads those stored parameters to perform the data processing for the inference.</p>
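<p>That shared-function idea can be sketched as follows; the signature and the fitted parameters are illustrative, not the exact API of <code class="language-plaintext highlighter-rouge">churn_library.py</code>:</p>

```python
# Hedged sketch: in training mode the processing parameters (here, just
# imputation means) are fitted and pickled; in inference mode they are
# loaded from disk and applied to the new data.
import pickle
from pathlib import Path

import pandas as pd


def perform_data_processing(df, artifact_path, train):
    artifact = Path(artifact_path)
    if train:
        # Fit the processing parameters on the training data and persist them
        params = {"means": df.mean(numeric_only=True).to_dict()}
        with artifact.open("wb") as f:
            pickle.dump(params, f)
    else:
        # Inference: re-use the parameters generated during training
        with artifact.open("rb") as f:
            params = pickle.load(f)
    return df.fillna(value=params["means"])
```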

<p>Note that the implemented <code class="language-plaintext highlighter-rouge">run_inference()</code> is an example and needs to be adapted:</p>

<ul>
  <li>Currently, it is triggered manually and it scores a sample dataset from a <code class="language-plaintext highlighter-rouge">CSV</code> file offline; instead, we should wait for external requests that feed new data to be scored.</li>
  <li>The data processing parameters and the model should be loaded once in the beginning (hence, the dashed box) and used every time new data is scored.</li>
</ul>

<p>Those intentional loose ends are to be tied when deciding how to deploy the model, which is not in the scope of this repository, as mentioned.</p>

<p>Finally, note that this boilerplate is designed for small/medium datasets, which are not that uncommon in small/medium enterprises; in my experience, its structure is easy to understand, implement and adapt. However, as the complexity increases (e.g., when we need to apply extensive feature engineering), it is recommended to apply these changes to the architecture:</p>

<ul>
  <li>All data processing steps should be written in an Object Oriented style and packed into a <a href="https://scikit-learn.org/stable/">Scikit-Learn</a> <code class="language-plaintext highlighter-rouge">Pipeline</code> (or similar), as done in <code class="language-plaintext highlighter-rouge">transformations.py</code>.</li>
  <li>Any data processing that must be applied to new data should be integrated in the inference pipeline generated in <code class="language-plaintext highlighter-rouge">train_models()</code>; that means that we should integrate most of the content in <code class="language-plaintext highlighter-rouge">perform_data_processing()</code> as a <code class="language-plaintext highlighter-rouge">Pipeline</code> in <code class="language-plaintext highlighter-rouge">train_models()</code>. Thus, <code class="language-plaintext highlighter-rouge">perform_data_processing()</code> would be reduced to basic tasks related to cleaning (e.g., duplicate removal) and checking.</li>
</ul>
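<p>As an illustration of that recommendation, here is a minimal sketch of a custom transformer packed into a Scikit-Learn <code class="language-plaintext highlighter-rouge">Pipeline</code> together with the model; the transformer is a simplified example, not the actual content of <code class="language-plaintext highlighter-rouge">transformations.py</code>:</p>

```python
# A custom transformer in the sklearn.preprocessing style: it fits column
# means on the training data and re-uses them at inference, so processing
# and model travel together as one serializable artifact.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline


class MeanImputer(BaseEstimator, TransformerMixin):
    """Impute missing values with the column means fitted during training."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.means_ = np.nanmean(X, axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float).copy()
        rows, cols = np.where(np.isnan(X))
        X[rows, cols] = np.take(self.means_, cols)
        return X


pipeline = Pipeline([
    ("imputer", MeanImputer()),
    ("model", RandomForestClassifier(n_estimators=10, random_state=42)),
])
```

<p>Calling <code class="language-plaintext highlighter-rouge">pipeline.fit(X, y)</code> fits the imputer and the model in one go, and the whole pipeline can be pickled as a single inference artifact.</p>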

<p>More details on the package can be found on the source <a href="https://github.com/mxagar/customer_churn_production">Github repository</a>.</p>

<h2 id="conclusions">Conclusions</h2>

<p>In this article I introduced my personal boilerplate to transform small/medium-sized data science projects into production-ready packages without relying on too many 3rd-party tools. The template works on the customer churn prediction problem using the <a href="https://www.kaggle.com/datasets/sakshigoyal7/credit-card-customers/code">Credit Card Customers</a> dataset from <a href="https://www.kaggle.com/">Kaggle</a>, but you are free to clone the boilerplate from its <a href="https://github.com/mxagar/customer_churn_production">Github repository</a> and modify it for your business case. Important software engineering aspects are covered, such as clean code conventions, modularity, reproducibility, logging, error and exception handling, testing, dependency handling with environments, and more.</p>

<p>Topics such as data processing techniques, pipeline deployment, artifact tracking and model monitoring are out of scope; for them, have a look at the following links:</p>

<ul>
  <li><a href="https://mikelsagardia.io/blog/data-processing-guide.html">An 80/20 Guide for Exploratory Data Analysis, Data Cleaning and Feature Engineering</a>.</li>
  <li><a href="https://github.com/mxagar/music_genre_classification">A Boilerplate for Reproducible and Tracked Machine Learning Pipelines with MLflow and Weights &amp; Biases and Its Application to Song Genre Classification</a>.</li>
  <li><a href="https://github.com/mxagar/census_model_deployment_fastapi">Deployment of a Census Salary Classification Model Using FastAPI</a>.</li>
  <li>If you are interested in more MLOps-related content, you can visit my notes on the <a href="https://www.udacity.com/course/machine-learning-dev-ops-engineer-nanodegree--nd0821">Udacity Machine Learning DevOps Engineering Nanodegree</a>: <a href="https://github.com/mxagar/mlops_udacity">mlops_udacity</a>.</li>
</ul>

<p><br /></p>

<blockquote>
  <p>Do you find the boilerplate helpful? What would you add or modify? Do you know similar templates to learn from?</p>
</blockquote>

<p><br /></p>

<div id="disqus_thread"></div>
<script>
    /**
    *  RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS.
    *  LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables    */
    
    var disqus_config = function () {
    this.page.url = 'https://mikelsagardia.io/blog/machine-learning-production-level.html';  // Replace PAGE_URL with your page's canonical URL variable
    this.page.identifier = 'https://mikelsagardia.io/blog/machine-learning-production-level.html'; // Replace PAGE_IDENTIFIER with your page's unique identifier variable
    };
    
    (function() { // DON'T EDIT BELOW THIS LINE
    var d = document, s = d.createElement('script');
    s.src = 'https://mikelsagardia.disqus.com/embed.js';
    s.setAttribute('data-timestamp', +new Date());
    (d.head || d.body).appendChild(s);
    })();
</script>

<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>]]></content><author><name></name></author><category term="data" /><category term="science," /><category term="machine" /><category term="learning," /><category term="analysis," /><category term="exploratory" /><category term="feature" /><category term="engineering," /><category term="modelling," /><category term="regression," /><category term="classification," /><category term="random" /><category term="forests," /><category term="logistic" /><category term="support" /><category term="vector" /><category term="machine," /><category term="python" /><category term="packages," /><category term="production," /><category term="logging," /><category term="PEP8," /><category term="linting," /><category term="testing," /><category term="pytest," /><category term="docker," /><category term="MLOps," /><category term="deployment" /><summary type="html"><![CDATA[A Boilerplate Package to Transform Machine Learning Research Notebooks into Deployable Pipelines]]></summary></entry><entry><title type="html">Practical Recipes for Your Data Processing</title><link href="https://mikelsagardia.io/blog/data-processing-guide.html" rel="alternate" type="text/html" title="Practical Recipes for Your Data Processing" /><published>2022-06-28T07:30:00+00:00</published><updated>2022-06-28T07:30:00+00:00</updated><id>https://mikelsagardia.io/blog/data-processing-guide</id><content type="html" xml:base="https://mikelsagardia.io/blog/data-processing-guide.html"><![CDATA[<p style="color: #777; font-style: italic; font-size: 1.5em; margin-top: 0.5em;">
  The 80/20 Guide that Solves Your Data Cleaning, Exploratory Data Analysis and Feature Engineering with Tabular Datasets
</p>

<!--
<div style="line-height:150%;">
    <br>
</div>
-->

<p align="center">
<img src="/assets/data_processing_guide/tim-gouw-1K9T5YiZ2WU-unsplash.jpg" alt="Donostia-San Sebastian: Photo by @ultrashricco on Unsplash" width="1000" />
<small style="color:grey">Don't worry, working hard often pays off. Photo by <a href="https://unsplash.com/@punttim">Tim Gouw</a> on <a href="https://unsplash.com/photos/1K9T5YiZ2WU">Unsplash</a>.</small>
</p>

<p>Thanks to the powerful packages we have available nowadays, training machine learning models is often a very tiny step in the pipeline of a regular data science project. Altogether, we need to address the following tasks:</p>

<ol>
  <li>Data Understanding &amp; Formulation of the Questions</li>
  <li>Data Cleaning</li>
  <li>Exploratory Data Analysis</li>
  <li>Feature Engineering</li>
  <li>Feature Selection</li>
  <li>Data Modelling</li>
</ol>

<p>Additionally, if online inferences are planned, several parts of steps 2-5 need to be prepared for production environments, i.e., they need to be transferred into scripts in which reproducibility and maintainability can be guaranteed for robust and trustworthy deployments.</p>

<p>Independently of that, and remaining in the research and development environment, steps 2-5 consume a large percentage of the effort. We need to apply some kind of methodical creativity to often messy datasets that almost never behave as we initially expect.</p>

<p>So, is there an easy way out? Unfortunately, I’d say there is not; at least I don’t know one yet. However, <strong>I have collected a series of guidelines and code snippets you can use systematically to ease your data processing journey in a <a href="https://github.com/mxagar/eda_fe_summary">Github repository</a></strong>. It summarizes the map I have sketched over the years.</p>

<p>In the repository, you will find two important files:</p>

<ul>
  <li>A large python script <code class="language-plaintext highlighter-rouge">data_processing.py</code> which contains many code examples; these cover 80% of the processing techniques I usually apply to <em>tabular</em> datasets.</li>
  <li>The <code class="language-plaintext highlighter-rouge">README.md</code> itself, which sums up the steps and <em>dos &amp; don’ts</em> in the standard order for data processing described above.</li>
</ul>

<p>Some caveats:</p>

<ul>
  <li>The script <code class="language-plaintext highlighter-rouge">data_processing.py</code> does not run! Instead, it’s a compilation of useful commands with comments.</li>
  <li>I assume the reader knows the topic, i.e., the repository is not for complete beginners.</li>
  <li>The guide does not cover advanced cases either: it’s a set of tools that follow the 80/20 <a href="https://en.wikipedia.org/wiki/Pareto_principle">Pareto principle</a>.</li>
  <li>The guide focuses on <em>tabular</em> data; images and text have their own particular pipelines, not covered here.</li>
  <li>This is my personal guide, made for me; no guarantees are given, and it will probably change organically.</li>
</ul>
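<p>To give a flavor of the kind of snippets collected in <code class="language-plaintext highlighter-rouge">data_processing.py</code> (the actual recipes there are more extensive), here are a few typical pandas commands for cleaning a tabular dataset; the data is synthetic:</p>

```python
# Illustrative cleaning recipe: quantify missing values, impute them,
# and drop duplicates; a tiny synthetic dataset stands in for real data.
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 40, 25],
    "city": ["Bilbao", "Donostia", None, "Bilbao"],
})

# Quantify missing values per column before deciding on a strategy
missing_ratio = df.isnull().mean()

# Impute: median for numerical, mode for categorical features
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Drop duplicated rows, keeping the first occurrence
df = df.drop_duplicates().reset_index(drop=True)
```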

<p><br /></p>

<blockquote>
  <p>Do you find the repository helpful? What would you add? Do you know similar summaries to learn from?</p>
</blockquote>

<p><br /></p>

<div id="disqus_thread"></div>
<script>
    /**
    *  RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS.
    *  LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables    */
    
    var disqus_config = function () {
    this.page.url = 'https://mikelsagardia.io/blog/data-processing-guide.html';  // Replace PAGE_URL with your page's canonical URL variable
    this.page.identifier = 'https://mikelsagardia.io/blog/data-processing-guide.html'; // Replace PAGE_IDENTIFIER with your page's unique identifier variable
    };
    
    (function() { // DON'T EDIT BELOW THIS LINE
    var d = document, s = d.createElement('script');
    s.src = 'https://mikelsagardia.disqus.com/embed.js';
    s.setAttribute('data-timestamp', +new Date());
    (d.head || d.body).appendChild(s);
    })();
</script>

<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>]]></content><author><name></name></author><category term="data" /><category term="science," /><category term="analysis," /><category term="exploratory" /><category term="feature" /><category term="engineering," /><category term="modelling," /><category term="hypothesis" /><category term="testing," /><category term="regression," /><category term="classification," /><category term="random" /><category term="forests," /><category term="summary" /><summary type="html"><![CDATA[The 80/20 Guide that Solves Your Data Cleaning, Exploratory Data Analysis and Feature Engineering with Tabular Datasets]]></summary></entry><entry><title type="html">Planning Your Next Vacation in Spain</title><link href="https://mikelsagardia.io/blog/airbnb-spain-basque-data-analysis.html" rel="alternate" type="text/html" title="Planning Your Next Vacation in Spain" /><published>2022-06-23T10:30:00+00:00</published><updated>2022-06-23T10:30:00+00:00</updated><id>https://mikelsagardia.io/blog/airbnb-spain-basque-country-analysis</id><content type="html" xml:base="https://mikelsagardia.io/blog/airbnb-spain-basque-data-analysis.html"><![CDATA[<p style="color: #777; font-style: italic; font-size: 1.5em; margin-top: 0.5em;">
  Analysis and Modelling of the AirBnB Dataset from the Basque Country
</p>

<!--
<div style="line-height:150%;">
    <br>
</div>
-->

<p align="center">
<img src="/assets/airbnb_analysis/san_sebastian_ultrash-ricco-8KCquMrFEPg-unsplash.jpg" alt="Donostia-San Sebastian: Photo by @ultrashricco from Unsplash" width="1000" />
<small style="color:grey">Donostia-San Sebastian. Photo by <a href="https://unsplash.com/photos/8KCquMrFEPg">@ultrashricco from Unsplash</a>.</small>
</p>

<p>In 2020 I decided to move back to my birthplace in the <a href="https://en.wikipedia.org/wiki/Basque_Country_(autonomous_community)">Basque Country</a> (Spain) after almost 15 years in Munich (Germany). The Basque region in Spain is a popular touristic destination, as it has a beautiful seaside with a plethora of surfing spots and alluring hills that call for hiking and climbing adventures. Culture and gastronomy are also important features, both embedded in a friendly and developed society with modern infrastructure.</p>

<p>When the pandemic seemed to start fading away in spring 2022, friends and acquaintances from Europe began asking me about the best areas and trips in the region, hotels and hostels to stay at in case there was no room at my place, etc. The truth is, after so many years abroad I was not the best person to guide them with updated information; however, the <a href="http://insideairbnb.com/get-the-data/">AirBnB dataset from <em>Euskadi</em></a> (i.e., Basque Country in <a href="https://en.wikipedia.org/wiki/Basque_language">Basque language</a>) has clarified some of my questions. The dataset contains, among other files, a list of 5228 accommodations, each of them described with 74 variables.</p>

<p>Following the standard <a href="https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining">CRISP-DM process</a> for data analysis, I have cleaned, processed and modelled the dataset to answer three major <em>business</em> questions:</p>

<ol>
  <li><strong>Prices</strong>. Is it possible to build a model that predicts the price from the variables? If so, which are the most important variables that determine the price? Can we detect accommodations that, having a good review score, are a bargain?</li>
  <li><strong>Differences between accommodations with and without beach access</strong>. Surfing or simply enjoying the seaside are probably some of the most important attractions visitors seek on their vacations. However, not all accommodations are within walking distance of a beach. How does that influence the features of the housings?</li>
  <li><strong>Differences between the two most important cities: <a href="https://en.wikipedia.org/wiki/San_Sebastián">Donostia-San Sebastian</a> and <a href="https://en.wikipedia.org/wiki/Bilbao">Bilbao</a></strong>. These province capitals are the biggest and most visited cities in the Basque Country; in fact, their listings account for 50% of all offered accommodations. However, both cities are said to have a different character: Bilbao is a bigger, modern city, without beach access but probably with richer cultural offerings and nightlife; meanwhile, Donostia-San Sebastian is more aesthetic, it has three beaches and it’s perfect for day-strolling. How are those popular differences reflected on the features of the accommodations?</li>
</ol>

<h2 id="the-dataset">The Dataset</h2>

<p>AirBnB provides several CSV files for each world region: (1) a listing of properties that offer accommodation, (2) reviews related to the listings, (3) a calendar and (4) geographical data. A detailed description of the features in each file can be found in the official <a href="https://docs.google.com/spreadsheets/d/1iWCNJcSutYqpULSQHlNyGInUvHg2BoUGoNRIGa6Szc4/edit#gid=982310896">dataset dictionary</a>.</p>

<p>My analysis has concentrated on the listings file, which consists of a table of 5228 rows/entries (i.e., the accommodation places) and 74 columns/features (their attributes). Among the features, we find <strong>continuous variables</strong>, such as:</p>

<ul>
  <li>the price of the complete accommodation,</li>
  <li>accommodates: maximum number of persons that can be accommodated,</li>
  <li>review scores for different dimensions,</li>
  <li>reviews per month,</li>
  <li>longitude and latitude,</li>
  <li>etc.</li>
</ul>

<p>… <strong>categorical variables</strong>:</p>

<ul>
  <li>neighbourhood name,</li>
  <li>property type (apartment, room, hotel, etc.)</li>
  <li>licenses owned by the host,</li>
  <li>amenities offered in the accommodation,</li>
  <li>etc.</li>
</ul>

<p>… <strong>date-related data</strong>:</p>

<ul>
  <li>first and last review dates,</li>
  <li>date when the host joined the platform,</li>
</ul>

<p>… and <strong>image and text data</strong>:</p>

<ul>
  <li>URL of the listing,</li>
  <li>URL of the pictures,</li>
  <li>description of the listing,</li>
  <li>etc.</li>
</ul>

<p>Of course, not all features are meaningful to answer the posed questions. The explanations given on my <a href="https://github.com/mxagar/airbnb_data_analysis">Github repository</a> describe in detail how I dealt with noisy and missing values, and how some features were dropped and others engineered. After that processing, we get a new table with 3931 entries and 353 features.</p>

<p>So… Would you like to have a look at what I have learned from the data? Let’s dive in!</p>

<h2 id="question-1-prices">Question 1: Prices</h2>

<p>In order to check whether we can predict the price, I have trained several models with 90% of the processed dataset (i.e., the training split) using <a href="https://scikit-learn.org/stable/">Scikit-Learn</a>: (1) linear regression as baseline, (2) <a href="https://en.wikipedia.org/wiki/Ridge_regression">Ridge regression</a> (L2 regularized regression), (3) <a href="https://en.wikipedia.org/wiki/Lasso_(statistics)">Lasso regression</a> (L1 regularized regression) and (4) <a href="https://en.wikipedia.org/wiki/Random_forest">random forests</a>. <a href="https://en.wikipedia.org/wiki/Cross-validation_(statistics)">Cross-validation</a> was performed with all of them and their hyperparameters were tuned; additionally, the effect of polynomial features on the model performances was also studied, as thoroughly summarized on the <a href="https://github.com/mxagar/airbnb_data_analysis">Github repository</a>.</p>

<p>The modelling experiments show that the random forests model scores the best R2 value on the test split: 69% of the variance can be explained with the random decision trees. Moreover, adding polynomial terms does not improve the predictions for the present dataset. The following diagram shows the performance of the Ridge regression model and the random forests model on the test split using only the 353 linear features.</p>

<p align="center">
<img src="/assets/airbnb_analysis/regression_evaluation.png" alt="Performance of regression models" width="400" />
</p>
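<p>The cross-validated grid search described above can be sketched as follows; the parameter grid and, of course, the data are illustrative, not the ones used in the project:</p>

```python
# Hedged sketch: grid search with cross-validation over a random forest
# regressor, evaluated with R2 on a held-out test split (synthetic data).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=200)

# 90% of the data for training, as in the analysis
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42
)

grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
    cv=3,
    scoring="r2",
)
grid.fit(X_train, y_train)
r2_test = grid.score(X_test, y_test)  # R2 on the held-out split
```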

<p>The models tend to under-predict accommodation prices; that bias clearly increases as prices grow beyond 50 USD. Such a moderate R2 does not make the model the best candidate for price predictions. However, we can deduce the most important features that determine the listing prices if we compute the <a href="https://medium.com/the-artificial-impostor/feature-importance-measures-for-tree-models-part-i-47f187c1a2c3">Gini importances</a>, as done in the following diagram. The top-5 variables that determine the price of a listing are:</p>

<ul>
  <li>whether an accommodation is an entire home or apartment,</li>
  <li>the number of bathrooms in it,</li>
  <li>the number of accommodates,</li>
  <li>whether the bathroom(s) is/are shared,</li>
  <li>and whether the housing is located in Donostia-San Sebastian.</li>
</ul>

<p align="center">
<img src="/assets/airbnb_analysis/regression_feature_importance_rf.png" alt="Feature importance: Gini importance values of the random forests model" width="600" />
</p>

<p>Note that only the top-30 features are shown; these account for almost 89% of the accumulated Gini importance (all 353 variables would account for 100%).</p>
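<p>Extracting and ranking the Gini importances from a fitted random forest is straightforward with Scikit-Learn; the following sketch uses synthetic data and illustrative feature names:</p>

```python
# Illustrative extraction of Gini importances: feature_importances_ sums
# to 1, so sorting it yields the ranking shown in the diagram.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
features = ["entire_home", "bathrooms", "accommodates", "shared_bath", "donostia"]
X = rng.normal(size=(300, len(features)))
# Synthetic target dominated by the first feature
y = 5.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Sort the normalized importances to obtain the feature ranking
importances = (
    pd.Series(model.feature_importances_, index=features)
    .sort_values(ascending=False)
)
```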

<p>But how does increasing the value of each feature affect the price: does it drive the price up or down? That can be observed in the following diagram, similar to the previous one. In contrast to the former, here the 30 regression coefficients with the largest magnitudes are plotted; red bars are associated with features that decrease the price when they increase, i.e., negative coefficients.</p>

<p align="center">
<img src="/assets/airbnb_analysis/regression_feature_importance_lm.png" alt="Feature importance according to the coefficient value in ridge regression" width="600" />
</p>

<p>Since they are different models, different features appear in each ranking; in any case, both lists are consistent and provide valuable insights. For instance, we deduce that the price decreases the most when</p>

<ul>
  <li>the number of reviews per month increases (note that review positivity is not measured),</li>
  <li>the host is estimated to have shared rooms,</li>
  <li>the accommodation is a shared room,</li>
  <li>and when the bathroom(s) is/are shared.</li>
</ul>
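<p>A minimal sketch of how such signed coefficients can be obtained and ranked with ridge regression; again, the data and the <code>feature_i</code> names are synthetic placeholders:</p>

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the listing features
X, y = make_regression(n_samples=200, n_features=10, random_state=0)
X = StandardScaler().fit_transform(X)  # scaling makes coefficient magnitudes comparable

ridge = Ridge(alpha=1.0).fit(X, y)

# Rank coefficients by absolute magnitude; the sign gives the direction:
# a negative coefficient decreases the predicted price as the feature grows
order = np.argsort(np.abs(ridge.coef_))[::-1]
top = [(f"feature_{i}", ridge.coef_[i]) for i in order[:5]]
negative = [name for name, coef in top if coef < 0]
```

<p>Note that scaling the features first is what makes comparing coefficient magnitudes across features meaningful.</p>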

<p>Finally, a very practical insight to close the pricing question: we can easily select the accommodations which have a very good average review (above the 90th percentile) and a predicted price larger than the real one, as shown in the following figure. These are the likely bargains!</p>
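<p>The bargain filter itself is a couple of lines of pandas; this is a sketch on a made-up frame, with hypothetical column names (<code>price</code>, <code>review_score</code>, <code>predicted_price</code>) standing in for the real dataset columns:</p>

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the listings with model predictions attached
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": rng.uniform(30, 200, 500),
    "review_score": rng.uniform(3.0, 5.0, 500),
})
df["predicted_price"] = df["price"] * rng.normal(1.0, 0.2, 500)

# Likely bargains: top-decile reviews AND a model price above the listed price
threshold = df["review_score"].quantile(0.90)
bargains = df[(df["review_score"] >= threshold)
              & (df["predicted_price"] > df["price"])]
```
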

<p align="center">
<img src="/assets/airbnb_analysis/economical_listings_geo.jpg" alt="Economical listings with high quality" width="800" />
</p>

<p>I prefer not to post the URLs of the detected listings, but it is straightforward to obtain them with the notebooks in the linked repository :wink:.</p>


<h2 id="question-2-to-beach-or-not-to-beach">Question 2: To Beach or not to Beach</h2>

<p>Of course, you can always go to the beach to catch some waves in the Basque Country, but being able to walk there in less than 15 minutes comes at an additional cost, on average. That is one of the insights distilled from the next diagram.</p>

<p>This difference or significance plot shows the <a href="https://en.wikipedia.org/wiki/Student%27s_t-test">T and Z statistics</a> computed for each feature considering two independent groups: accommodations with and without beach access. These statistics are related to the difference of means (T statistic, for continuous variables) or proportions (Z statistic, for discrete variables or proportions). If we take the usual significance level of 5%, the critical Z or T value is roughly 2. Thus, if a value in the diagram is greater than 2, the mean or proportion of that feature differs significantly between the two groups; the probability of wrongly claiming a difference when there is none is 5%.</p>

<p>The sign of the statistic is color-coded: blue bars denote positive statistics, which are associated with larger values for accommodations that have beach access.</p>
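<p>As an illustration, here is a minimal sketch of how such T and Z statistics can be computed with SciPy; the group sizes, means, and counts below are made up for the example, not the actual dataset values:</p>

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Continuous feature (e.g., price): Welch's t-test on the two groups
price_beach = rng.normal(90, 30, 400)     # listings with beach access
price_no_beach = rng.normal(75, 30, 600)  # listings without
t_stat, t_p = stats.ttest_ind(price_beach, price_no_beach, equal_var=False)

# Binary feature (e.g., waterfront): two-proportion z-test with pooled variance
x1, n1, x2, n2 = 120, 400, 90, 600  # successes / group sizes
p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)
z_stat = (p1 - p2) / np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

# Critical value for a two-sided test at the 5% significance level (~1.96)
z_crit = stats.norm.ppf(0.975)
```

<p>Any statistic above <code>z_crit</code> is significant at the 5% level, which is exactly the &ldquo;roughly 2&rdquo; threshold used to read the diagram.</p>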

<p align="center">
<img src="/assets/airbnb_analysis/beach_comparison.png" alt="Feature differences between accommodations with and without beach access" width="600" />
</p>

<p>Long story short, here’s the interpretation: the accommodations that have a beach within 2 km have significantly larger</p>

<ul>
  <li>proportions of accommodations located in the province of Gipuzkoa, compared to Bizkaia,</li>
  <li>proportions of accommodations with a waterfront,</li>
  <li>and prices.</li>
</ul>

<p>Note that larger statistics don’t necessarily mean larger differences; instead, they mean that the probability of wrongly stating a difference between groups is lower.</p>

<p>Instead of reading the ranking top-down, it is more interesting to compose a <em>profile</em> of listings with beach access and without by selecting features manually; for instance, the accommodations on the seaside:</p>

<ul>
  <li>have larger prices,</li>
  <li>are more often entire homes or apartments,</li>
  <li>usually have fewer shared bathrooms,</li>
  <li>more often have a description in English rather than Spanish (i.e., they target foreign tourists more),</li>
  <li>more often have a beachfront, patio, or balcony,</li>
  <li>have more bedrooms,</li>
  <li>accommodate more guests but require longer minimum stays,</li>
  <li>their hosts more often live nearby,</li>
  <li>…</li>
</ul>

<p>Going back to the price, the following figure shows the price distributions for accommodations within 2 km of a beach and for those farther away. Keep in mind that behind each of the Z/T statistics in the previous diagram there is such a distribution or a contingency table.</p>

<p align="center">
<img src="/assets/airbnb_analysis/price_distribution_beach.png" alt="Price distribution for accommodations with and without beach access in less than 2km" width="600" />
</p>

<h2 id="question-3-athletic-de-bilbao-vs-real-sociedad">Question 3: Athletic de Bilbao vs. Real Sociedad</h2>

<p>If you’re a soccer fan, maybe you’ve heard about the Basque derby: <a href="https://en.wikipedia.org/wiki/Athletic_Bilbao">Athletic de Bilbao</a> vs. <a href="https://en.wikipedia.org/wiki/Real_Sociedad">Real Sociedad</a>. Both football teams are originally from the two major cities in the Basque Country, Bilbao and Donostia-San Sebastian, and they represent the healthy rivalry between the two province capitals.</p>

<p>In order to determine the differences between the two cities in terms of listing features, I have computed the same difference or significance plot as before, shown below.</p>

<p align="center">
<img src="/assets/airbnb_analysis/donostia_bilbao_comparison.png" alt="Feature differences between accommodations in Donostia-San Sebastian and Bilbao" width="600" />
</p>

<p>Donostia-San Sebastian seems to have</p>

<ul>
  <li>larger prices,</li>
  <li>more accommodations with waterfronts,</li>
  <li>more descriptions in English,</li>
  <li>hosts who joined AirBnB longer ago and who manage more accommodations,</li>
  <li>more often patios or balconies,</li>
  <li>more often entire homes or apartments,</li>
  <li>space for more guests,</li>
  <li>…</li>
</ul>

<p>On the other hand, Bilbao has</p>

<ul>
  <li>more accommodations that consist of a single room,</li>
  <li>more shared bathrooms,</li>
  <li>more amenities, such as shampoo, hangers, first aid kits, extra pillows, or breakfast,</li>
  <li>…</li>
</ul>

<p>Finally, as before, I leave the price distributions for both cities, since price is the feature in which the difference is most significant. We can see that the distribution from Bilbao has more listings in the lowest price region and lacks listings with prices above 150 USD, compared to Donostia-San Sebastian. That is in line with several facts explained already, such as Bilbao having more shared rooms and Donostia more entire homes, two characteristics with opposite effects on the price.</p>

<p align="center">
<img src="/assets/airbnb_analysis/price_distribution_city.png" alt="Price distribution for accommodations in Donostia-San Sebastian and Bilbao" width="600" />
</p>

<h2 id="conclusions">Conclusions</h2>

<p>In this blog post, we took a look at the AirBnB accommodation properties from the Basque Country, narrowing down to these insights:</p>

<ol>
  <li>Even though the price regression models have a moderate R2, we have shown how to detect listings that are likely bargains: accommodations with high review scores and a predicted price above the true one. Additionally, we have discovered the features with the largest impact on the price: type of accommodation, bathrooms, location, etc.</li>
  <li>Listings with a beach within 2 km are significantly more often entire homes and have more balconies, waterfronts, and space for more guests; this is in line with their larger prices.</li>
  <li>The two major cities, Donostia-San Sebastian and Bilbao, nicely align with the previous synthesis, Donostia being a beach city and Bilbao a city without one. Additionally, Bilbao seems to favor other practical domestic amenities.</li>
</ol>

<p>These conclusions are quite informal, but I hope they can guide my data-savvy friends; in any case, I’m sure you can have a great vacation anywhere you go in the Basque Country :)</p>

<blockquote>
  <p>Are you planning a trip to the Basque Country? Has this blog post helped you?</p>
</blockquote>

<p>To learn more about this analysis, see my <a href="https://github.com/mxagar/airbnb_data_analysis">GitHub repository</a>. You can download the pre-processed dataset and ask the data your own specific questions!</p>

<p><br /></p>

<div id="disqus_thread"></div>
<script>
    /**
    *  RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS.
    *  LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables    */
    
    var disqus_config = function () {
    this.page.url = 'https://mikelsagardia.io/blog/airbnb-spain-basque-data-analysis.html';  // Replace PAGE_URL with your page's canonical URL variable
    this.page.identifier = 'https://mikelsagardia.io/blog/airbnb-spain-basque-data-analysis.html'; // Replace PAGE_IDENTIFIER with your page's unique identifier variable
    };
    
    (function() { // DON'T EDIT BELOW THIS LINE
    var d = document, s = d.createElement('script');
    s.src = 'https://mikelsagardia.disqus.com/embed.js';
    s.setAttribute('data-timestamp', +new Date());
    (d.head || d.body).appendChild(s);
    })();
</script>

<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>]]></content><author><name></name></author><category term="data" /><category term="science," /><category term="analysis," /><category term="exploratory" /><category term="feature" /><category term="engineering," /><category term="modelling," /><category term="hypothesis" /><category term="testing," /><category term="regression," /><category term="random" /><category term="forests," /><category term="AirBnB," /><category term="Basque" /><category term="Country," /><category term="price" /><category term="prediction" /><summary type="html"><![CDATA[Analysis and Modelling of the AirBnB Dataset from the Basque Country]]></summary></entry></feed>