On Information Retrieval Systems
A beginner-friendly introduction to Information Retrieval systems, this article highlights their key features and how they work.
The post outlines the objective and functionality of IR systems: how they differ from regular databases, and how they serve specific user information needs.
What is an Information Retrieval system?
An Information Retrieval (IR) system is designed to retrieve information from a collection of documents, often unstructured, to satisfy a particular user's information need. The defining characteristic of an IR system, which distinguishes it from a database, for example, is that its primary objective is to retrieve information that is relevant to the user's information need, rather than to retrieve exact matches to queries.
How does it work?
A typical IR system works in several steps to provide useful information in response to a user’s query. Here’s a high-level overview of the process:
- Collection and Storage. The first step is gathering the documents that will be searched. The IR system collects documents from various sources, which could be web pages, database entries, files on a computer, etc. These documents are then stored in a database.
- Preprocessing. This stage involves cleaning and formatting the collected data. It may include removing unnecessary data (such as stop words), performing stemming (reducing words to their root form), tokenization (breaking text down into its constituent terms or words), and encoding as vectors.
- Indexing. In this phase, the system creates an index that links each unique term with the documents it appears in. This index makes it possible for the system to quickly find all documents containing a particular term.
- Query Processing. When a user enters a search query, the system processes this query in much the same way as the documents (tokenization, stemming, encoding, etc.).
- Document Ranking. The system uses the processed query to search the index and find matching documents. It then ranks these documents based on their relevance to the query. There are various algorithms to determine relevance, like term frequency-inverse document frequency (TF-IDF), and more sophisticated ones like PageRank (which was originally used by Google).
- Result Retrieval. Finally, the system returns the highest-ranking documents to the user.
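The steps above can be sketched in a few lines of Python. This is a minimal, illustrative toy, not a production design: the stop-word list and documents are invented, and "ranking" here is just counting matched query terms.

```python
# Minimal sketch of the pipeline above: collect, preprocess, index,
# process a query, rank by term overlap, return the top results.
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "is", "of", "to", "in"}  # tiny illustrative list

def preprocess(text):
    """Tokenize, lowercase, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def build_index(docs):
    """Inverted index: term -> set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in preprocess(text):
            index[term].add(doc_id)
    return index

def search(query, index, top_k=3):
    """Rank documents by the number of query terms they contain."""
    scores = defaultdict(int)
    for term in preprocess(query):
        for doc_id in index.get(term, set()):
            scores[doc_id] += 1
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [doc_id for doc_id, _ in ranked[:top_k]]

docs = {
    1: "Information retrieval systems search document collections",
    2: "A database stores structured records",
    3: "Search engines rank documents by relevance to a query",
}
index = build_index(docs)
print(search("search documents by relevance", index))  # doc 3 matches most terms
```

Note that without stemming, "document" and "documents" do not match, which is exactly the kind of gap the preprocessing step is meant to close.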
In the last couple of years, a lot of attention has been given to vector space models for document retrieval. The vector space model, at its core, represents text as vectors: sparse or dense.
A sparse vector is a vector in which most of the elements are zero. Common algorithms for producing sparse representations include TF-IDF and BM25.
A dense vector is a vector in which most of the elements are non-zero. In contrast to sparse vectors, where the majority of values are zeroes, dense vectors have meaningful values in most of their positions. In the context of NLP, a dense vector representation captures more information in fewer dimensions compared to a sparse vector. Dense vectors are commonly referred to as embeddings (including word embeddings and document embeddings). Algorithms that can be used for dense representations of text include word embedding models such as word2vec and transformer-based sentence encoders such as Sentence-BERT.
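The contrast can be made concrete with a toy sketch. The vocabulary is invented, and the "dense" vector is random, standing in for a learned embedding produced by a trained model:

```python
# Toy contrast between a sparse and a dense representation of the same sentence.
import random

vocabulary = ["cat", "dog", "sat", "mat", "ran", "park", "fast", "slow"]
sentence = "cat sat mat"

# Sparse: one dimension per vocabulary term, mostly zeros.
sparse = [1.0 if term in sentence.split() else 0.0 for term in vocabulary]
print(sparse)  # [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]

# Dense: a fixed, smaller number of dimensions, mostly non-zero.
# (Random values here; real embeddings come from a trained encoder.)
random.seed(0)
dense = [round(random.uniform(-1, 1), 2) for _ in range(4)]
print(len(sparse), len(dense))  # 8 4
```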
There is no single option that is best for all tasks, and usually IR systems combine different approaches to provide more accurate and relevant results. In essence, it’s a way of leveraging the strengths of various approaches to mitigate the weaknesses of individual methods.
A common hybrid search strategy might involve the combination of text-based and semantic-based search methods: Text-based Methods that are based on sparse vector representations and Semantic-based Methods that are based on dense vector representations. A hybrid search system may first employ a text-based approach to quickly narrow down the potential documents from a large database. It may then use semantic analysis methods on this smaller set to identify the documents most relevant to the user’s query based on contextual meaning.
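A hedged sketch of such hybrid scoring follows: a keyword-overlap score stands in for the text-based (sparse) method, and cosine similarity of toy vectors stands in for the semantic (dense) method. The documents, vectors, and the 0.5/0.5 blend weights are all illustrative assumptions.

```python
# Hybrid scoring sketch: blend a keyword-overlap score with cosine
# similarity of (pretend) embedding vectors.
import math

def keyword_score(query, doc):
    """Fraction of query terms that appear in the document."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Pretend embeddings produced by some encoder (illustrative values).
doc_vecs = {"doc_a": [0.9, 0.1], "doc_b": [0.2, 0.8]}
docs = {"doc_a": "cheap flights to Paris", "doc_b": "hotel booking in Paris"}
query, query_vec = "flights to Paris", [0.8, 0.3]

scores = {
    d: 0.5 * keyword_score(query, text) + 0.5 * cosine(query_vec, doc_vecs[d])
    for d, text in docs.items()
}
best = max(scores, key=scores.get)
print(best)  # doc_a: higher on both keyword overlap and vector similarity
```

In a real system the text-based pass would typically run first over the full corpus, with the more expensive semantic scoring applied only to the surviving candidates.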
However, search methodologies are not limited to text inputs only. To explore the alternative options, let’s proceed to the following section.
In IR systems, a query is usually the input given by the user to find matching documents. The system analyses the query and returns the most relevant documents or results.
Queries are diverse in nature. Here are some examples:
- Keyword Input. A query that contains specific words or phrases, aka keywords, to find documents or content relevant to that keyword.
- Semantic Input. In some cases, the query contains a high-level concept that the search engine is tasked with understanding. This could include natural language questions or other forms of semantic search.
- Image. The system will attempt to find similar or related images based on the content in the query image. This is often used in reverse image search systems.
- Structured Input. In certain scenarios a structured query that includes specific fields and values can be formed. For example, the exact column and value you’re searching for: language, country, category, etc.
- Temporal Input. Some queries can include a time component, such as querying for all documents that have been created in the last 24 hours.
- Location Input. Queries can also be based on location. Users can search for restaurants, shopping centers, or other amenities within a certain radius of a specific location.
A query can combine several of the input types mentioned above.
Once a query comes into play, we can talk about the numerical feature vectors that represent the query-document relationship:
- Query-independent or static features. These are features of a document that remain the same no matter what the query is. They are not affected by the search query and can be used to evaluate the basic relevancy or authority of a document irrespective of the query. These include features like:
- Length of the document
- Number of words in the document
- Publishing date
- Query-dependent or dynamic features. These are features of a document that may vary with each query. They depend on the search query and are used to evaluate the relevance of a document given a specific query:
- Frequency of the query terms in the document
- Proximity of query terms in the document
- Matched sequence length, and position of the first occurrence of a query term in the document
- Query-level features or query features. These are features related to the query itself rather than the documents. If you're scoring multiple documents against a single query, query features can help capture the context, the difficulty, or the nature of the query. They help the system better understand what the user is actually searching for, and can therefore improve the relevancy of the results.
Examples of query-level features are:
- Number of words in the query
- Type of the query (is it a question, a general phrase, etc.)
- Time of the search (query session)
- Whether the query belongs to a specific category, inferred from past searches or user information
- Location: longitude and latitude, or geo-position
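A few of the query-dependent features listed above can be computed directly from the query and document text. The function below is an illustrative sketch (names invented): it returns the total frequency of query terms, the position of the first occurrence of any query term, and the minimum distance between occurrences of two different query terms.

```python
# Sketch: extract simple query-dependent features from a (query, document) pair.
def query_dependent_features(query, doc):
    q_terms = query.lower().split()
    d_terms = doc.lower().split()
    # Positions of each query term in the document.
    positions = {t: [i for i, w in enumerate(d_terms) if w == t] for t in q_terms}

    term_frequency = sum(len(p) for p in positions.values())
    occupied = sorted(i for p in positions.values() for i in p)
    first_occurrence = occupied[0] if occupied else -1
    # Minimum distance between occurrences of two *different* query terms.
    proximity = min(
        (abs(i - j) for a in q_terms for b in q_terms if a != b
         for i in positions[a] for j in positions[b]),
        default=-1,
    )
    return {"tf": term_frequency, "first": first_occurrence, "proximity": proximity}

print(query_dependent_features(
    "quick fox", "the quick brown fox jumps over the quick dog"))
# {'tf': 3, 'first': 1, 'proximity': 2}
```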
One common initial step in the process of information retrieval is candidate generation. At this step, the system identifies a subset of items from a large corpus that could potentially be relevant to a query. These items are known as candidates.
The goal of candidate generation is to significantly reduce the number of candidates that need to be scored or ranked. For instance, an e-commerce site might have millions of products available, but only a few thousand are relevant to the query. By producing a candidate set, we can avoid the computational cost of retrieving and ranking all items in the corpus for every user query.
Candidate selection can be performed through various methods. One common approach is the use of inverted indices. In this method, an index is created that maps each unique token (word) in the corpus to the list of documents in which that token appears. For a given query, we can use the inverted index to quickly find a set of documents that contains all or some of the query keywords.
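A minimal sketch of inverted-index candidate generation follows; the tiny corpus is invented. It shows the two usual retrieval modes: documents containing any query term (broad candidate set) versus all query terms (strict candidate set).

```python
# Candidate generation with an inverted index: token -> set of doc ids.
from collections import defaultdict

corpus = {
    1: "red running shoes",
    2: "blue running jacket",
    3: "red winter jacket",
}
inverted = defaultdict(set)
for doc_id, text in corpus.items():
    for token in text.split():
        inverted[token].add(doc_id)

def candidates(query, mode="any"):
    """Union of posting lists ('any') or their intersection ('all')."""
    posting_lists = [inverted[t] for t in query.split() if t in inverted]
    if not posting_lists:
        return set()
    op = set.intersection if mode == "all" else set.union
    return op(*posting_lists)

print(candidates("red jacket", mode="any"))  # {1, 2, 3}
print(candidates("red jacket", mode="all"))  # {3}
```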
Another popular approach is Approximate Nearest Neighbor Search (ANNS). ANNS can be used when the query and documents are represented as vectors. ANNS algorithms fall into three broad categories: hashing-based, tree-based, and graph-based:
- Locality-Sensitive Hashing (LSH). LSH is a method of performing probabilistic dimension reduction of high-dimensional data. The basic idea is to hash the input items in such a way as to maximize the probability of collision for similar items; that is, similar items are more likely to end up in the same buckets.
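One classic LSH family for cosine similarity uses random hyperplanes: each hyperplane contributes one bit of the hash (which side of the plane the vector falls on), so similar vectors tend to agree on most bits and land in the same bucket. The sketch below is illustrative; the dimensionality and number of planes are arbitrary choices.

```python
# Random-hyperplane LSH sketch for cosine similarity.
import random

random.seed(42)
DIM, N_PLANES = 5, 8
# Each plane is a random normal vector.
planes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_PLANES)]

def lsh_hash(vec):
    """One bit per hyperplane: the sign of the dot product with its normal."""
    bits = ["1" if sum(p * x for p, x in zip(plane, vec)) >= 0 else "0"
            for plane in planes]
    return "".join(bits)

a = [1.0, 0.9, 0.1, 0.0, 0.2]
b = [0.9, 1.0, 0.2, 0.1, 0.2]     # close to a
c = [-1.0, 0.1, -0.9, 0.8, -0.5]  # far from a

# Nearby vectors tend to agree on more bits than distant ones.
same_ab = sum(x == y for x, y in zip(lsh_hash(a), lsh_hash(b)))
same_ac = sum(x == y for x, y in zip(lsh_hash(a), lsh_hash(c)))
print(same_ab, same_ac)
```

In practice, several such hash tables are used together, and only items sharing a bucket with the query in at least one table are scored exactly.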
- k-d Trees. A k-d tree is a binary tree where every node is a k-dimensional point. All points in the left subtree of a node have smaller values along the node's splitting dimension, and all points in the right subtree have greater values. k-d trees are extremely efficient for low-dimensional data, but their performance degrades as the number of dimensions increases.
- Ball Trees. Ball Trees are similar to k-d trees, but instead of splitting the data along a dimensional line, they partition data into nested hyper-spheres aka balls. This makes them more adaptable to data of varying densities, and they perform better on high-dimensional data than k-d trees.
- R-trees or Rectangle Trees. R-trees are tree data structures used for spatial access methods, i.e., for indexing multidimensional information such as geographical coordinates. They handle spatial data better than k-d trees and are used extensively in geographic databases and systems.
- VP-trees or Vantage-Point Trees. A VP-tree is a metric tree that partitions data in a binary tree. At each level of the tree, it selects a vantage point and divides the remaining points into two groups: points that lie within some distance of the vantage point, and those that lie beyond it. They are useful for spaces where the triangle inequality holds, making them a good option for taking advantage of intrinsic dimensionality in data.
- Cover Trees. Cover Tree is a tree-based data structure designed for fast exact and approximate nearest neighbor search in high dimensions. It is designed specifically to exploit the properties of metric spaces, spaces where a notion of distance (satisfying the triangle inequality) is defined.
- M-trees or Metric Trees. An M-tree is a balanced tree data structure that partitions and indexes multidimensional data using a set of splits. The structure adapts well to insertions and removals, making it a good fit for database systems where the data changes dynamically.
- Random Projection Trees. Random projection trees are an efficient probabilistic dimension reduction technique used for nearest neighbor search in high-dimensional spaces. They work by reducing the dimensionality of the input space through a series of random projections and partitioning.
- Hierarchical Navigable Small World (HNSW) is the most notable example of a graph-based method. It was developed to address the computational and memory limitations of the classical k-Nearest Neighbors algorithm, especially in high-dimensional spaces with large datasets. HNSW constructs a hierarchical graph where each level corresponds to a subset of the data and forms a small-world network over the level below. The bottom-level graph contains the entire dataset, and is a small-world network in itself. This design lets the algorithm navigate to nearest neighbors extremely quickly. HNSW has been shown to outperform many other nearest neighbor methods on benchmark tests. It is implemented in several libraries, including Facebook's Faiss and the standalone hnswlib library.
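The core navigation idea can be illustrated with a toy single-layer greedy search: from a starting node, repeatedly move to whichever neighbor is closest to the query, stopping when no neighbor improves. Real HNSW adds the hierarchy of layers and more careful neighbor selection; this sketch uses invented 1-D points and a hand-built graph.

```python
# Toy greedy graph search, the building block of HNSW-style methods
# (single layer only; real HNSW searches a hierarchy of such graphs).
def greedy_search(graph, points, query, start):
    """Walk to the neighbor closest to the query until no improvement."""
    current = start
    while True:
        best = min(graph[current], key=lambda n: abs(points[n] - query))
        if abs(points[best] - query) < abs(points[current] - query):
            current = best
        else:
            return current

points = {0: 0.0, 1: 2.0, 2: 4.5, 3: 7.0, 4: 9.0}
# Each node links to nearby nodes, plus one long-range link (0 <-> 3).
graph = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 4, 0], 4: [3]}
print(greedy_search(graph, points, query=6.0, start=0))  # node 3 (value 7.0)
```

The long-range link is what makes the walk fast: the search jumps from 0.0 straight to 7.0 instead of stepping through every intermediate node.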
The goal at this stage is not to perfectly rank the results; instead, it’s about not overlooking potentially useful items. After generating the candidates, they are then ordered (or re-ranked) by a ranking model to deliver the most relevant results to the user.
Once candidates have been generated, the next step is to rank them based on their relevance to the query. This is where the retrieval model, or ranking model, comes into play. Learning to Rank (LtR) is the application of machine learning algorithms to create a predictive model that automatically ranks a list of items or documents so as to maximize the utility of the ranked list for a specific task or user interaction. In other words, LtR involves training a machine learning model with an objective function that optimizes the quality of the ranked list of items.
The task of LtR can be classified into three types according to the loss function [1]:
- Pointwise. This approach treats ranking as a regression or classification problem. The input is a single instance, and the output is a relevance score or label.
- Pairwise. This approach treats ranking as a problem of correctly ranking a pair of instances. The aim is to minimize the number of incorrectly ordered pairs.
- Listwise. This approach directly optimizes the quality of the ranked list by considering all the instances in a query at once.
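The pairwise view can be made concrete with a small sketch: given per-document scores from some model and graded relevance labels, count the incorrectly ordered pairs, which is the quantity pairwise LtR methods aim to minimize. The scores and labels below are invented.

```python
# Pairwise LtR objective, illustrated: count pairs the model orders wrongly.
def misordered_pairs(scores, labels):
    bad = 0
    n = len(scores)
    for i in range(n):
        for j in range(n):
            # Pair (i, j) where i is truly more relevant than j...
            if labels[i] > labels[j] and scores[i] <= scores[j]:
                bad += 1  # ...but the model scored it no higher.
    return bad

labels = [3, 1, 0, 2]            # graded relevance judgments
good_model = [0.9, 0.4, 0.1, 0.7]  # agrees with the labels' ordering
weak_model = [0.2, 0.8, 0.5, 0.3]  # mostly disagrees

print(misordered_pairs(good_model, labels))  # 0
print(misordered_pairs(weak_model, labels))  # 5
```

A pairwise loss such as a hinge or logistic loss is simply a differentiable surrogate for this count, so it can be minimized by gradient descent.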
Libraries frequently employed to construct LtR models include, for example, XGBoost and LightGBM (via their ranking objectives) and TensorFlow Ranking.
Another approach, suitable for text queries and documents, is to use Cross-Encoders [2]. A Cross-Encoder-based re-ranker [3] has the potential to significantly enhance the final outputs for the end-user. This method feeds the query and a candidate document into a transformer network simultaneously, which then produces a single score between 0 and 1 signifying the relevance of the document to the provided query.
There are two major ways to evaluate information retrieval systems: online and offline metrics.
Online metrics are based on controlled experiments, which make it possible to identify changes in user behaviour that can be attributed to changes in the Information Retrieval system. You can read about controlled experiments in the article about A/B tests.
Offline metrics are measured against an existing dataset or historical data. Measurement that doesn't involve controlled experiments is generally easier to implement and can offer a valuable evaluation of an algorithm's performance. However, the representativeness of the dataset influences the reliability of the calculated offline measure. Logged data from the system's current deployment usually only provides labels reflecting users' reactions to that system. To assess a new version of the system, experimenters must predict how users would have responded had the alternative system been employed. This can introduce significant bias: a new ranking algorithm's top-placed items may never have been clicked simply because users never saw them. As a result, this unsophisticated approach could underestimate the new algorithm's online performance.
This survey provides an overview of online evaluation techniques for information retrieval, and shows how online evaluation is used for controlled experiments, segmenting them into experiment designs that allow absolute or relative quality assessments.
It's important to distinguish between how data is gathered and how results are evaluated. Why we collect data is key: are we using it to see how well a system for searching information works, or are we looking to use that data to make the system better in the future? Are we trying to measure how well an Information Retrieval system performs (absolute evaluation), or are we trying to compare two different systems (relative evaluation)?
The following table provides information that can help answer the above questions. It’s a condensed summary of the overview discussed earlier.
| Granularity | Absolute evaluation questions | Relative evaluation questions |
|---|---|---|
| Document | Are documents returned by this system relevant? | Which documents returned by this system are best? |
| Ranking | How good is this system for an average query? | Which of these IR systems is better on average? |
| Session | How good is this system for an average task? | Which of these IR systems leads to better sessions on average? |
| User | How good is the engagement level? | Which of these IR systems is better for user engagement on average? |
Examples of online metrics are:
- Action rate. This refers to the percentage of interactions where a user performs a desired action compared to the total number of interactions.
- Click rank. A measure of user engagement that looks at the order in which users click on items during their interactions. Variants include Mean Reciprocal Rank (MRR), which averages the multiplicative inverse of the rank of the first correct answer; Click-Through Rate at rank K (CTR@K) and its special case CTR@1, which only considers the first item; and pSkip, which evaluates the propensity of users to skip certain items.
- Session success rate. This metric shows the percentage of sessions (a session being the period of time a user is active without leaving or being inactive for a prolonged period) that result in a successful action.
- Session abandonment rate. This is the percentage of sessions where users leave before completing a specific process or task.
- Zero result rate. This is a metric commonly used in search engine optimization, measuring the percentage of search queries that return no results. If this rate is high, it might suggest that the search function needs improvement or that users are searching for documents that simply don't exist.
- Time to click. This metric evaluates the duration it takes for a user to make their first click after an interaction begins. It serves as a key measure of user engagement and responsiveness, with shorter times typically indicating higher engagement or user interest.
- Learned metrics. These are derived metrics that indicate deeper levels of user engagement and satisfaction. Examples include satisfaction score, a measure of how happy users are with their experience, or Net Promoter Score (NPS), a measure of the loyalty of a firm's customer relationships.
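Two of the click-based metrics above are simple enough to sketch directly. The session data below is invented; click positions are 1-based ranks, with 0 meaning the user clicked nothing.

```python
# Sketch of MRR and CTR@K over logged click data.
def mean_reciprocal_rank(first_click_ranks):
    """Average of 1/rank of the first clicked result per session (0 = no click)."""
    return sum(1.0 / r if r else 0.0 for r in first_click_ranks) / len(first_click_ranks)

def ctr_at_k(clicked_ranks_per_session, k):
    """Share of sessions with at least one click within the top K results."""
    hits = sum(1 for ranks in clicked_ranks_per_session if any(r <= k for r in ranks))
    return hits / len(clicked_ranks_per_session)

first_clicks = [1, 3, 0, 2]  # rank of the first click in four sessions
print(mean_reciprocal_rank(first_clicks))  # (1 + 1/3 + 0 + 1/2) / 4

sessions = [[1, 4], [3], [], [5]]  # all click ranks per session
print(ctr_at_k(sessions, k=3))  # 2 of 4 sessions clicked within top-3 -> 0.5
```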
Many offline metrics have the @K suffix, where K stands for the top-K documents or items retrieved by an IR system. Here is a list of some commonly used in the field:
- Precision@K. The number of relevant documents among the top K retrieved, divided by K. This metric is often used in situations where we are interested in the top results retrieved by the system.
- Recall@K. The number of relevant items among the top K retrieved, divided by the total number of relevant items. This metric matters when it is important that all relevant items are included in the result set. This can be particularly important in situations where missing a relevant result could be costly or problematic, such as in medical or legal searches.
- F-score@K or F-measure@K. This metric combines Precision@K and Recall@K as a harmonic mean. The reason for using the harmonic mean instead of a simple average is that it punishes extreme values. A system with perfect precision but terrible recall, or vice versa, will have a low F-score. This makes the F-score@K an ideal metric when you want to balance precision and recall.
- Fall-out or False Positive Rate (FPR). It measures the proportion of negative instances (i.e., non-relevant documents) that are incorrectly classified as positive (i.e., relevant). In other words, it represents the probability of falsely identifying a non-relevant item as being relevant.
- Discounted Cumulative Gain (DCG). In information retrieval, it is often important not only to recommend relevant items, but also to recommend the most relevant items first. The metric places higher emphasis on items that are ranked highly. Additionally, DCG is often normalized by Ideal Discounted Cumulative Gain (IDCG), which is the DCG score of the ideal (best possible) ranking of items. This normalized measure is called Normalized Discounted Cumulative Gain (NDCG). The normalization helps to compare the scores across different queries, which might have different numbers of relevant items and different levels of relevance. NDCG takes a value between 0 and 1, where 1 means that the search results are perfectly ranked.
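The offline metrics above can be computed over a single ranked list in a few lines. This sketch uses binary relevance labels and the standard log2 discount for DCG; the document ids are invented.

```python
# Precision@K, Recall@K, and NDCG@K over one ranked result list.
import math

def precision_at_k(ranked, relevant, k):
    """Relevant documents in the top K, divided by K."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Relevant documents in the top K, divided by all relevant documents."""
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """DCG of the list divided by the DCG of the ideal ranking (IDCG)."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, d in enumerate(ranked[:k]) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}
print(precision_at_k(ranked, relevant, 3))  # 1 relevant in top-3 -> 0.333...
print(recall_at_k(ranked, relevant, 3))     # 1 of 3 relevant found -> 0.333...
print(round(ndcg_at_k(ranked, relevant, 5), 3))
```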
Returning to the question of metric reliability and how the representativeness of a data set impacts the dependability of the calculated offline measure, there are several methods to tackle this issue:
- Direct Outcome Models. These are models that define the connection between the context, the action, and the result. Once a sound model is created, it can be used effectively to simulate virtually all online experiments, including those assessing policies that are reliant on historical data.
- Inverse Propensity Score Methods. Instead of just operating an existing system, its behavior can be modified to collect data beneficial for future evaluations of other ranking mechanisms. This type of data collection, often referred to as exploration data, may purposely incorporate random changes into the system's output and monitor the associated user feedback.
Build or Buy
- Search engines
- Meilisearch. Meilisearch helps you shape a delightful search experience in a snap, offering features that work out-of-the-box to speed up your workflow.
- Typesense. Lightning-fast, Open Source Search.
- OpenSearch. OpenSearch is a scalable, flexible, and extensible open-source software suite for search, analytics, and observability applications licensed under Apache 2.0. Powered by Apache Lucene and driven by the OpenSearch Project community, OpenSearch offers a vendor-agnostic toolset you can use to build secure, high-performance, cost-efficient applications.
- Elasticsearch. Elasticsearch is a distributed, RESTful search and analytics engine capable of addressing a growing number of use cases.
- LtR systems
- Metarank. An open-source ranking service. It can help you to build a personalized semantic/neural search and recommendations.
- Amazon Personalize. Amazon Personalize allows developers to quickly build and deploy curated recommendations and intelligent user segmentation at scale using machine learning (ML).
- Google Recommendations AI. Recommendations AI uses Google’s latest machine learning architectures, which dynamically adapt to real-time customer behavior and changes in variables.
- Learning to Rank for Information Retrieval by Tie-Yan Liu
- Online Evaluation for Information Retrieval by Katja Hofmann, Lihong Li, and Filip Radlinski