Design your own fantastic recipe search engine!

What are we doing in this project?

When the lockdown of covid-19 still continues, we want to make our daily menus more colorful and interesting and here is the case we need such an intelligent recipe search engine. Actually, we have a lot of recipe websites to search our desired recipes, and we can find detailed descriptions and ingredients in it.

So let’s get started and know more technical details about building your own recipe search engine! All code snippets in this blog are in Python. Readers are expected to have a basic understanding of learning to rank(LTR) application in search engine with python.


The purpose of this project is to build a recipe search engine to provide information about recipes given keywords such as input ingredients, general dish description. Most related work corresponds to recipe recommendation systems where similar recipes can be retrieved given a sample.

Our recipe search system could offer different doable recipe choices if given the keywords including the available ingredients at hand, required cooking time, or related dish type. For example, user can just enter “Asian egg fried rice” and then the system will return most relevant recipes to them.

By recommending recipe ideas by search results, the rising demand of home cooking under the quarantine of COVID-19 can be helped or satisfied. Additionally, it could also be helpful with online grocery shopping with our recipe system embedded. For the case of Whole Foods, users can enter a general keyword and our system will output potential recipe options.

What is expected from this system is that, when it is given a piece of query text, it will automatically search all the existed documents to return the top 10 relevant recipes for this text. We want to make sure the returning output as relevant as possible.


Like any machine learning project, search engine definitely need to absorb great amount of annotated data to learn from. Traditionally, classification problems in machine learning should have categorical labels for each sample. In search engine, the annotation is a little bit different. Since the search engine takes a query as input and get a recipe as output, we need to annotate the relevance score for each query-recipe pair. Each recipe record is a json object containing the following fields: a recipe title, a list of ingredients and measurements, instructions for preparation, and a picture of the corresponding dish.


Our data is scraped from the website:

We first define our own query set with 100 queries and then use api to scrape the top 100 urls of search results for this query. Each url corresponds to a recipe document and we crawl the content of them.


For the 100 relevant recipes, we rank them manually into 3, 2, 1, 0 and -1, where rank 3 is most relevant and rank -1 is irrelevant. We manually control the distribution of rank 3:2:1:0 to be 1:1:1:1. As a result, our annotated dataset has about 10000 relevant recipes and 10000 irrelevant recipes.


Our baseline model ignores the structure of our recipe data and only takes the words into account. So for the preprocessing part, we filter out the numerical data and only keep the textual data. We first store all our textual data in a json file, which is a dictionary of dictionary:


{“title”:”bread pudding”,




Then we preprocess it by filtering the undesirable characters:


Traditional Statistical Model: BM25

Traditional information retrieval ranking models are just statistical models, like BM25, Pivoted, Dirichlet, etc. They are not related to the “Training” process like machine learning and they just compute the ranking score based on the query and document terms information. However, these seemingly simple methods can achieve decent performance.

Here is a snippet of using metapy, a useful IR python library to implement BM25:

In the current folder, we should create a subfolder called “baseline-ir” and put these 4 files in it:

  • baseline_ir.dat: file of all docs;
  • baseline_ir-qrels: files of all query-doc pairs ground truth
  • baseline_ir-queries: files of all queries
  • lemur-stopwords: used in bm25 preprocessing function.

Learning to rank

As the theory of machine learning and deep learning(ML and DL) is making more progress, people comes up with the idea that ML/DL is also capable of learning the ranking relationship between samples. And the ranking of “query-doc” pairs is the key problem of information retrieval. Here we will encounter a method called Learning to Rank (LTR).

LTR, which is for training the model in a ranking task is one of the most popular models that have been widely used in the fields of Information Retrieval. LTR proves to be a good choice for our case because the searching process can be viewed as predicting the ranking score of each query-recipe pair according to their relevance. We don’t care about the ranking score produced by the model but the correctness of results order.

Pointwise RankNet

We make our first step of LTR with an existing liabrary: elastic search.

Elastic Search (ES) is one of the most popular technical stacks for distributed searching. Apartfrom RESTful api that makes implementation of search engines much easier, it also providesLearning To Rank api to train and use ranking models with machine learning methods. Withthe help of ES LTR api, we can build index, extract features and train models in a faster and reliable way.

ES LTR pipeline

As the flow chart shows, we take each recipe with their original fields and build index accordingly. We use Query-Document pairs to extract features for each field including BM25 score, Fuzzy score (Damerau-Levenshtein optimal string alignment), as well as phrase-based BM25 score. We then get lists of features and pair them with list of ranks and reduce the original problem to a classification problem. We train the Random Forest in RankLib as our model. For each new query, we calculate the features of all documents with regard to the new query and let the model predict document ranks and sort them accordingly to return the most relevant documents.

Pairwise RankNet

Then we make a more bold attempt, to make use of the neutral network as our LTR model.

Pairwise training

The most popular methods in LTR is pairwise learning. The “pairwise” means when training the LTR model, for each query we will construct multiple pairs of features from its original query-doc rating data. And note that we always put the query-doc pair with higher relevance score in the first place. This is the illustration of pair construction:

To make it more clear to demonstrate the pair data construction, we also prepare a snippet:

The first function get_pair_doc_data construct pairs of features from original feature matrix x_train, relevance score y_train and corresponding query_id .

The TwoPairDataset (for training) and SinglePairDataset (for predicting) help construct torch dataset in our desired format. They read from local json files which is a dictionary with

{'X': feature matrix, 'y':relevance scores, 'query_ids': X’s query_id}

We split train and val data in two different json files.

The get_loader function returns the dataloader for training and evaluating respectively.

Note that the training process needs pairs of features as input but predict/evaluation process only takes in one piece of feature at a time. Finally, we construct 1125127 doc pairs out of about 20000 piece of query-doc samples.

Feature extraction

Here feature extraction we use bert model as text encoder. Bert is a very popular NLP tool and we use the pytorch version from HuggingFace here. It is useful because it is very convenient to get a numeric vector out of a text input and BERT encoder has been proved powerful in lots of application scenes.

The sentence embedding is actually the hidden states of BERT’s transformers layers. We sum up the last 4 layers’ hidden states and average over all tokens in that sentence as embedding features and finally obtain a 768-dim vector for each input sentence.

BERT’s working mechanism is not the key of our problems here so we will not expand it here so we just show how to use it to encode our document and query features:

Learning to Rank pipeline

The pipeline of LTR is quite simple.

As the figure shows, after encoding the query and docs, constructing the query-doc pair then we input the pairs of features into RankNet model. Note that we use binary cross entropy as loss function.

The RankNet Model mentioned here is a neural network with 3 sets of fully connected+ReLU layers. We also add a dropout layer to generate an ensemble model and prevent over-fitting. Finally we add a sigmoid function to make a prediction from 0 to 1 indicating the probability that one pair’s relevance is higher than another pair. That is also why we compute the difference between two pairs outputed scores and then pass it into a sigmoid module.

Training process

Now, let’s code the training process!

In the training process, we wrap up all the steps mentioned above:

  • Define hyperparameters of training
  • Get dataloader from json file
  • Define loss function and optimizer for training
  • Define a RankNet Model
  • Training for data in train_loader
  • Evaluating for data in train_loader and val_loader with NDCG@10

We collects all the NDCG eval results and then show it in a plot.

Here is the NDCG calculation function:

The reference for model and training part codes:

Predict process

The predict process is rather simpler than training. Given a new query, we will extract its bert feature and concatenate it with all the existing document features. Then input it into the model to obtain the ranking scores. Finally, rank the documents according to their scores and select the top 10 recipes to return. This is the codes to show how this works:

Results and discussion

How to judge the success of our model? Here comes to the most important part: evaluation.

This part is not so complicated as the previous data preparation and model training but is of much importance because it directly reflect whether your models is doing well on its job. Also, it is necessary to make sure the evaluation results is reliable and there is no cheating on it. For example, any part of testing set should not be involved in the training process.

We plot the training history NDCG for to make the result more intuitive to understand.

From the above 2 figure, we can see that NDCG for training set is always better for testing set and testing set’s NDCG is fluctuating around 0.7 and the highest one is 0.8.

Also, after I add one more dropout layer in the RankNet, the fluctuation of test NDCG become smaller and training become more stable after easing the effect of overfitting.

Since we also try to use BM25 as baseline, we set a performance comparison between different models here:

From the result we can observe that, compared to baseline model(BM25), the search engine performance is largely improved by the introduction of learning to rank algorithm. Since BM25 doesn’t involve any of training process and the features used for computing ranking scores are simply statistical characteristics.

However, BERT is a more delicate and capable model to encode the text information, no matter the semantics or syntactic ones. What’s more, RankNet, a pairwise LTR neural network, is doing better job in capturing the ranking relationship between a pair of query-doc inputs.

All these reasons contributes to the final success!

What’s next?

But there are still some flaws to think about and we can do it better in the future:

The retrieval speed

If the volumn of potential documents increases to a scale of million or even billion, when there is a new query, it will take a long time to encode all the information with BERT and then process all the possible pairs of query with a ranknet. Speed is essential to the search engine. Therefore, we may consider use a baseline model like BM25 to recall only a small part of total documents to cut down the amount of the irrelevance documents. Then perform a more precise ranking with our LTR methods to filter the most relevant documents.

Methods to encode features

Now we just roughly concatenate together the BERT features of query and documents. However, since queries are usually a short sentence and documents are much longer text, we should consider a more appropriate encoding method for query and doc seperately. Then try some other methods to reduce the dimensions of the input vectors to accelerate the calculation.

Parameter tuning

RankNet has a lot of hyper paramters which need more tuning including the learning rate, hidden sizes of fully connected layer, dropout rate, number of layers in RankNet. Neural Network is a powerful technique but it is also very dependent on the parameter setting. The optimal set of parameter may lead to much better results! However, the time to train the model for one time is too long so we don’t have time to do more tuning. Maybe you can do much better after tuning the parameter!

Always exploring