# Building a Custom Search Relevance Training Set from Open Source Bing Queries
There is a notable lack of large-scale, easy-to-use, labeled datasets for information retrieval in most specific domains. We propose a method for generating them.
This repo includes code to generate search queries and passages related to a specific domain or field of knowledge. We achieve this by selecting a subset of the popular MS MARCO dataset. The full MS MARCO training set released by Microsoft is much too large to use in its entirety as a training set.
To build the dataset, clone the repo and then run:

```bash
pip install -r requirements.txt
```
Examples of queries in the subset generated for health/biology:
- what normal blood pressure by age?
- what is your mandible?
- what part is the sigmoid colon?
## Labelling 10k Passages with the Google Natural Language API
Google's Natural Language API offers up to 30k document classifications per month for free. It can classify passages into 700+ categories and reports a confidence score for each.
You need to sign up for Google Cloud and authenticate your client first; see https://cloud.google.com/natural-language/docs/reference/libraries
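One common way to authenticate is to point the client library at a service-account key via an environment variable (the key path below is just a placeholder):

```bash
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"
```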
Then run:
```python
from google.cloud import language
from google.cloud.language import enums
from google.cloud.language import types

SUBSET_SIZE = 10000  # the number of passages to classify

client = language.LanguageServiceClient()

with open('./categories.tsv', 'w+') as outfile:
    with open('./collectionandqueries/collection.tsv') as collection:
        for i, line in enumerate(collection):
            if i > SUBSET_SIZE:
                break
            try:
                doc_id, doc_text = line.split('\t')
                document = types.Document(
                    content=doc_text,
                    type=enums.Document.Type.PLAIN_TEXT)
                response = client.classify_text(document)
                for cat in response.categories:
                    outfile.write(doc_id + '\t' + cat.name + '\t' + str(cat.confidence) + '\n')
            except Exception:  # sometimes the document is too short and the API will err, ignore
                pass
```
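Each line of `categories.tsv` is `passage_id<TAB>category<TAB>confidence`. A hypothetical output line (the category string and values are illustrative, not taken from the actual run) looks like:

```
7	/Health/Medical Facilities & Services	0.87
```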
## Creating a Text Classifier for the Rest of the Set
We use Vowpal Wabbit (VW) to build a binary text classifier that can classify the rest of the set very fast and for free. Make sure it is installed (run `vw --help` in your shell).
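If `vw` is not on your path yet, it is usually available from a package manager (exact package names may vary by platform):

```bash
sudo apt-get install vowpal-wabbit   # Debian/Ubuntu
brew install vowpal-wabbit           # macOS
```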
Define a function to extract a binary label from the Google NLP category. In our case we use health/science:
```python
def label_from_category(category, confidence):
    return (1 if 'Health' in category
            or 'Science' in category else 0, confidence)
```
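For example, calling it on illustrative category strings and confidences:

```python
label_from_category('/Science/Biological Sciences', 0.87)  # -> (1, 0.87): positive example, weight 0.87
label_from_category('/Autos & Vehicles', 0.55)             # -> (0, 0.55): negative example
```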
Then use it to build a VW training set:
```python
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import re

ps = PorterStemmer()
collection_file = './collectionandqueries/collection.tsv'
categories_file = './categories.tsv'

# Map each labeled passage id to its (binary label, confidence) pair
with open(categories_file) as categories:
    categories_dict = dict()
    for line in categories:
        doc_id, category, confidence = line.split('\t')
        categories_dict[doc_id] = label_from_category(category, confidence)

# input.vw has format <label> <weight> |n <lowercased, stemmed text>
with open('input.vw', 'w') as output, open(collection_file) as collection:
    for line in collection:
        doc_id, text = line.split('\t')
        if doc_id in categories_dict:
            label, confidence = categories_dict[doc_id]
            tokens = [ps.stem(word.lower()) for word in word_tokenize(text)]
            cleaned = re.sub(r'\:', ' ', ' '.join(tokens))  # strip colons because ':' is a special VW character
            output.write(str(label) + ' ' + str(confidence).strip() + ' |n ' + cleaned + ' \n')
```
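A resulting line of `input.vw` then looks something like this (an illustrative, made-up passage with stemmed tokens, not a real line from the dataset):

```
1 0.87 |n the sigmoid colon is the part of the larg intestin closest to the rectum
```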
Then train a classifier with this data and save it as `bio_model`:

```bash
vw input.vw -f bio_model
```
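You can sanity-check the saved model by scoring the training file itself (`-t` disables learning, `-i` loads the model, `-p` writes one score per line; the output filename is arbitrary):

```bash
vw -t -i bio_model -p sanity_preds.txt input.vw
```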
## Classifying MS MARCO
- Download the MS MARCO collection + queries:

  ```bash
  wget https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz
  ```

- Extract it:

  ```bash
  tar -xvzf collectionandqueries.tar.gz
  ```

- Build the `stmr` Porter stemmer CLI from this repo: https://github.com/wooorm/stmr and make sure it's in your path.

- Run

  ```bash
  ./classify_msmarco ./collectionandqueries
  ```

  to classify the passages from MS MARCO, producing a file `preds` of format `{passage_id} {score}`. The higher the score, the more likely the passage is related to health/bio (see the sketch after this list for one way to use it).
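As a rough sketch of how `preds` can be consumed, the snippet below keeps the ids of passages whose score exceeds a threshold; the 0.5 cutoff is an arbitrary assumption for illustration, not necessarily the value used to build the released subset:

```python
# Keep passage ids that the VW model scored above a threshold.
THRESHOLD = 0.5  # arbitrary illustrative cutoff

bio_passage_ids = set()
with open('preds') as preds:
    for line in preds:
        passage_id, score = line.split()
        if float(score) > THRESHOLD:
            bio_passage_ids.add(passage_id)

print(len(bio_passage_ids), 'passages scored above the threshold')
```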
## Building Collection and Queries for the Subset
The code for loading the collection and queries from MS MARCO and then classifying them is a little more involved, so I won't go over it here.
If you want to produce the set, clone this repo and run the Python script:

```bash
python3 build_dataset.py --data_dir <path to collectionandqueries dir> --out_dir <bio-collectionandqueries>
```
The output folder should contain:

- `collection.tsv`
- `qrels.dev.small.tsv`
- `qrels.train.tsv`
- `queries.dev.small.tsv`
- `queries.train.tsv`
Look here for more details about the format of these.
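These follow the standard MS MARCO layout: queries files are `query_id \t query_text` and qrels files are `query_id \t 0 \t passage_id \t relevance`. A minimal loader sketch (the paths assume the output folder above):

```python
def load_queries(path):
    """Map query_id -> query text from a queries TSV."""
    queries = {}
    with open(path) as f:
        for line in f:
            qid, text = line.rstrip('\n').split('\t')
            queries[qid] = text
    return queries

def load_qrels(path):
    """Map query_id -> set of relevant passage_ids from a qrels TSV."""
    qrels = {}
    with open(path) as f:
        for line in f:
            qid, _, pid, rel = line.rstrip('\n').split('\t')
            if int(rel) > 0:
                qrels.setdefault(qid, set()).add(pid)
    return qrels

queries = load_queries('bio-collectionandqueries/queries.dev.small.tsv')
qrels = load_qrels('bio-collectionandqueries/qrels.dev.small.tsv')
```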
## Evaluation Results for the Bio Subset
| Pretrained Model | Fine-tuning Dataset | BioMARCO Dev MRR@10 [1] |
|---|---|---|
| bert-base-uncased-msmarco | MS MARCO | 0.17281 |
| biobert-pubmed-v1.1 | MS MARCO | 0.17070 |
| BM25 | - | 0.10366 |
Download the dataset here, or follow the guide above to build it. Look here for more details about the format of the files.
Check out our main project NBoost to download and deploy the models with Elasticsearch.
[1] Reranking top 50 results from BM25
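For reference, MRR@10 is the mean, over the dev queries, of the reciprocal rank of the first relevant passage within the top 10 reranked results (0 if none appears in the top 10). A minimal sketch, assuming `run` maps each query id to its ranked list of passage ids and `qrels` maps each query id to its set of relevant passage ids:

```python
def mrr_at_10(run, qrels):
    """Mean reciprocal rank of the first relevant passage in the top 10."""
    total = 0.0
    for qid, ranked_pids in run.items():
        relevant = qrels.get(qid, set())
        for rank, pid in enumerate(ranked_pids[:10], start=1):
            if pid in relevant:
                total += 1.0 / rank
                break
    return total / len(run)
```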