Most Elegant Data Structures

By AKM
June 16, 2025
Every undergraduate course in computer science covers a range of data structures: arrays, linked lists, trees, graphs, and hash tables. However, there are many more data structures, some quite well known and some less so. Here are a few that I think are especially noteworthy: they are clever, minimal, and quite useful.

Skip List

Let us say you have to store a sorted list of numbers. You can store them in an array: searching will then be fast, O(log n), using binary search, but inserting a new element will be slow, O(n), because existing elements must be shifted. If you store them in a linked list instead, inserting at a known position is fast, O(1), but finding that position requires scanning the list, so searching is O(n). A skip list is a clever data structure that achieves the best of both worlds: on average it takes O(log n) for both searching and insertion.

A skip list consists of a hierarchy of linked lists. At the lowest level is a regular linked list, which stores all the elements. Each higher-level list contains only some of the elements, i.e. it skips over elements in the level below. An even higher level skips over still more elements. The higher levels are built probabilistically: when an element is inserted, a coin flip decides how many levels it will appear in.

To search for an element, start at the highest level. If you are lucky, the element will be there. Otherwise, advance along the level until the next element would be greater than or equal to the target, then move down a level and continue the search there. Keep repeating until you reach the lowest level, where the element is either present or you know for certain it is not in the list. Insertion follows the same search path, except that the new element may also be linked into some of the higher levels.
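The search-then-link procedure above can be sketched in a few dozen lines of Python. The cap on levels (MAX_LEVEL) and the promotion probability (P = 0.5) are arbitrary choices for illustration:

```python
import random

class Node:
    def __init__(self, value, level):
        self.value = value
        self.forward = [None] * (level + 1)  # one forward pointer per level

class SkipList:
    MAX_LEVEL = 16
    P = 0.5  # probability of promoting a node to the next level up

    def __init__(self):
        self.head = Node(None, self.MAX_LEVEL)  # sentinel head node
        self.level = 0  # highest level currently in use

    def _random_level(self):
        # Flip coins: each success promotes the new node one level higher.
        lvl = 0
        while random.random() < self.P and lvl < self.MAX_LEVEL:
            lvl += 1
        return lvl

    def search(self, value):
        node = self.head
        # Start at the top level; drop down a level whenever we would overshoot.
        for i in range(self.level, -1, -1):
            while node.forward[i] and node.forward[i].value < value:
                node = node.forward[i]
        node = node.forward[0]
        return node is not None and node.value == value

    def insert(self, value):
        # Remember, per level, the last node before the insertion point.
        update = [self.head] * (self.MAX_LEVEL + 1)
        node = self.head
        for i in range(self.level, -1, -1):
            while node.forward[i] and node.forward[i].value < value:
                node = node.forward[i]
            update[i] = node
        lvl = self._random_level()
        self.level = max(self.level, lvl)
        new = Node(value, lvl)
        for i in range(lvl + 1):  # splice the new node into levels 0..lvl
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new
```

Note that correctness does not depend on the coin flips; the randomness only affects how balanced the levels are, and hence the expected running time.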

Skip lists are simpler to implement than balanced trees and often perform better in practice, although balanced trees have better worst-case time complexity. Another major advantage is that skip lists lend themselves to concurrent algorithms: different processes can modify different parts of the list independently, whereas balanced trees require rebalancing operations that are hard to parallelize. Skip lists are used in the indexing structures of several databases.

Trie or Prefix Trees

Let us say you need to store a lot of words and search them by prefix. This kind of need arises when building an autocomplete or a spell checker over a custom list of words. If you store the words in a sorted array, searching will take O(log n) time with binary search. That may be acceptable, but the time taken still grows with the number of words.

Prefix trees, also called tries, provide a very nice solution to this problem. In a prefix tree, words are stored as a tree. Each node (except the root) represents a character, and each path from the root to a node represents a word or a prefix of a word. A node that completes a full word carries a flag saying so. For example, a tree containing the words where, what, and when looks like this:

            (root)
             /  \
            w    ...
           / \
          h  ...
         / \
        e   a
       / \   \
      r   n   t
     /
    e

Now, the time complexity to search for a prefix in such a tree is independent of the number of words stored. It depends only on the length of the word: it is O(l), where l is the length of the word or prefix being searched. Prefix trees are ubiquitous; they are also used in the Internet's IP routing and in DNA sequence searches.
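A minimal trie supporting exactly the two operations described above, insertion and prefix lookup, can be sketched like this (method names are my own choices):

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> child TrieNode
        self.is_word = False  # True if the path to this node spells a full word

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            # Create the child for this character if it doesn't exist yet.
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True  # mark the final node as completing a word

    def _find(self, prefix):
        # Walk down one character at a time; O(l) in the prefix length.
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return None
            node = node.children[ch]
        return node

    def starts_with(self, prefix):
        return self._find(prefix) is not None

    def contains(self, word):
        node = self._find(word)
        return node is not None and node.is_word
```

An autocomplete would additionally walk the subtree under the node returned by `_find` to enumerate all completions of the prefix.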

Binary Heap

Let us say you need to store a list of numbers such that you can retrieve the maximum (or minimum) at any time very quickly. A binary heap is the solution: it lets you find the maximum value in O(1), i.e. constant time.

A binary heap is conceptually a tree, but it is stored in an array. In a (max) binary heap, each node is greater than or equal to its children, but siblings need not be in any particular order relative to each other. Inserting or deleting an element requires shuffling elements to restore this property, which takes O(log n) time.
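The array layout works because parent and child positions can be computed from an index alone, with no pointers. A quick sketch (the sample array is made up):

```python
# In the 0-based array layout, a node at index i has these relatives:
def parent(i): return (i - 1) // 2
def left(i):   return 2 * i + 1
def right(i):  return 2 * i + 2

heap = [9, 7, 8, 3, 5]  # a valid max-heap, stored level by level
for i in range(1, len(heap)):
    assert heap[parent(i)] >= heap[i]  # the heap property holds everywhere
```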

Binary heaps are often used in priority queues, where we always want to know the item with the highest priority but don't care about keeping the rest of the items sorted. Note that although insertion and extraction of the root value (the maximum) are fast, O(log n), arbitrary search and deletion are not efficient in a binary heap.
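Python's standard library exposes a binary min-heap through the heapq module, which makes a priority queue a few lines of code. The task names here are made up for illustration; the smallest number is treated as the highest priority, since heapq is a min-heap:

```python
import heapq

tasks = []  # heapq operates on a plain Python list
heapq.heappush(tasks, (2, "write report"))
heapq.heappush(tasks, (1, "fix outage"))   # lowest number = highest priority
heapq.heappush(tasks, (3, "read email"))

# O(log n) extraction of the minimum element (tuples compare element-wise).
priority, task = heapq.heappop(tasks)
```

For a max-heap, a common trick is to negate the priorities on the way in and negate them again on the way out.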

Also worth looking into are the Fibonacci heap, where insertion takes constant time, and the binomial heap, which supports efficiently merging two heaps.

Fenwick Tree or Binary Indexed tree (BIT)

Let us say, given a list of numbers, you have to maintain the running total of values up to each index while the numbers are constantly being updated. Pre-computing all the prefix sums works, but they have to be re-calculated after every change to the list. So, if updates are frequent, maintaining the sums becomes expensive.

A Binary Indexed Tree (also called a Fenwick tree) provides a solution. It requires preprocessing the list, which takes O(n log n) time, and additional space of the order of O(n). But after that, each update or prefix-sum query takes only O(log n) time. The numbers are stored as a conceptual tree laid out over a plain array.

The trick to achieving this is a crucial mathematical fact: every positive integer can be represented as a sum of powers of 2, as shown below:

1 = 2^0
2 = 2^1
3 = 2^1 + 2^0
4 = 2^2
5 = 2^2 + 2^0
6 = 2^2 + 2^1
7 = 2^2 + 2^1 + 2^0
8 = 2^3

So, instead of precomputing the sum at every index, each index is made responsible for a range whose length is a power of 2: 1 (2^0), 2 (2^1), 4 (2^2), 8 (2^3), and so on, determined by the index's lowest set bit. Then, whenever the sum up to a particular index is required, we only need to add up the already computed sums of its constituent power-of-2 ranges. Similarly, whenever an element is modified, we only need to update the handful of power-of-2 ranges that cover it, instead of all the prefix sums. It's a neat trick!
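The lowest-set-bit arithmetic makes the implementation remarkably short. A sketch of the two O(log n) operations, using the standard `i & (-i)` idiom to isolate the lowest set bit:

```python
class FenwickTree:
    def __init__(self, size):
        self.tree = [0] * (size + 1)  # 1-indexed internally

    def update(self, i, delta):
        """Add delta to element i (1-indexed). O(log n)."""
        while i < len(self.tree):
            self.tree[i] += delta
            i += i & (-i)  # jump to the next range that also covers index i

    def prefix_sum(self, i):
        """Sum of elements 1..i. O(log n)."""
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & (-i)  # strip the lowest set bit to reach the next range
        return total
```

For example, `prefix_sum(7)` adds the stored sums for ranges of length 4, 2, and 1, mirroring the decomposition 7 = 2^2 + 2^1 + 2^0 from the table above.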

Bloom filters

Let us say you are implementing a recommendation engine. You want to recommend new and interesting articles to a user, but at the same time you do not want to recommend articles the user has already read. Formally, the problem is equivalent to testing whether an element is a member of a set. Here the set is the list of articles already read by the user, and it is unique to each user. So we cannot simply store every user's full read list explicitly: it would occupy a lot of space and keep growing (users read more articles but never unread one).

A popular and fairly elegant solution is to store the set in a Bloom filter. It is a probabilistic data structure, so the accuracy is not 100%: with low probability, a Bloom filter may return a false positive, i.e. it may say an article is in the set when it is not. False negatives are never returned.

Besides recommendation engines, Bloom filters are also used in caches, to cheaply check whether an item might be present before performing an expensive lookup. In a standard Bloom filter, elements can be added to the set but not removed.

A Bloom filter stores the information in an array of m bits, all initially set to 0. The length m is fixed in advance and does not depend on the size of the set. There are also k different hash functions, each of which maps an element to one of the m bits.

When adding an element, calculate its hash value under each of the k hash functions and set the corresponding k bits to 1. Keep adding elements the same way. Note that there may be hash collisions, i.e. a single bit may end up set to 1 by two different elements.

Checking whether an element is present follows the same procedure: calculate the k hash values. If all the corresponding bits are 1, the element is (most likely) present. If any of the bits is 0, the element is definitely not present.

Since only a fixed-length array is used, the space requirement is very small, so it is feasible to store one set per user in a recommendation system. Adding and querying each involve calculating k hash values, which is independent of the size of the set being stored, so both are near-constant-time operations.
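The add and check steps above can be sketched compactly. Real implementations use fast non-cryptographic hashes; here, purely for illustration, the k hash functions are simulated by salting SHA-256, and m = 1024 and k = 3 are arbitrary choices:

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m = m      # number of bits in the filter
        self.k = k      # number of hash functions
        self.bits = 0   # a Python int used as an m-bit array

    def _positions(self, item):
        # Simulate k independent hash functions by salting a single hash.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos  # set each of the k bits to 1

    def might_contain(self, item):
        # All k bits set -> probably present; any bit 0 -> definitely absent.
        return all(self.bits >> pos & 1 for pos in self._positions(item))
```

In practice m and k are chosen from the expected number of elements and the false-positive rate one is willing to tolerate.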
