MongoDB Performance Fundamentals: Indexing & Indexes


9 minute read


I originally wrote this post to appear on the FloQast Engineering Blog.

I know how frustrating slow apps are. I also really care about my users’ time. One of the best ways to respect peoples’ time when using your applications is by ensuring your database is humming smoothly. I’ve learned how to work with MongoDB over the years in order to make sure my users’ time isn’t wasted and wanted to share some of those lessons here. In this post, we’ll start by learning about one of the most important concepts & tools available to you in MongoDB: indexes.

Database Performance

Database design and performance is a complex field. You have to think about consistency, availability, how well it handles network partitions… and to cap it all off the database has to be easy to use and run. So let’s boil it down to a few basic questions:

  1. “How fast can I read something?”
  2. “How fast can I update something?”
  3. “How reliable is my read/write?”
  4. “How durable is my update?” In this post, we’ll focus on question #1: how fast can I read data?

How Fast Can I Read Something?

You’ll probably encounter slowdowns on database reads before you encounter slowdowns in inserts or updates. That’s because many common database use-cases are proportionally read-heavy as opposed to write-heavy. Many databases will “just work” at small to even medium-scale without any additional tuning or effort. Most databases have been developed for 10+ years now and modern hardware is usually more than sufficient for many needs.

This really changes when you add more X. X can be volume, data size, read frequency, write frequency — whatever. When you start to scale your database, you’re going to need to do some work to make sure it performs as expected. When a product is smaller and getting less traffic or usage, we can use MongoDB out-of-the-box without doing any tuning. Mongo has great internal caching mechanisms and is pretty speedy in general. But as things grow and you add more features, users, and increas in overall volume, you have to take steps to ensure our users can close the books quickly and efficiently.

The first thing to do when scaling your database is to understand your data model and data access patterns. If you don’t have those down, scaling will be frustratingly-difficult or impossible. We’ve had to index and reindex our databases many times over the years as workloads have shifted and our needs have evolved.

In this post, we’ll take a look at some basic techniques you can use to improve the performance of your MongoDB database. To illustrate, let’s imagine a fictional service focusing on driverless-cars and ride-sharing: SomeCars. SomeCars is a very read-heavy application and has started to slow down, so we need to figure out how to improve performance. Below is an example schema for a riderin the SomeCars system. A few things to note:

  • the schema employs denormalization; there are rides embedded in the rider model as an array of subdocuments
  • user geolocation info is stored as lat/lng coordinates
  • some key info (email, ID, name) is also stored in the document
{
    "_id": "5cf0029caff5056591b0ce7d",
    "name": "A. Person",
    "email": "[email protected]",
    "avatarURL": "https://static.image.com/1234",
    "password": "$2a$14$ajq8Q7fbtFRQvXpdCq7Jcuy.Rx1h/L4J60Otx.gyNLbAYctGMJ9tK",
    "rides": [
        {
            "_id": "5cf0029caff5056591b0ce2f",
            "date": "01/01/2006",
            "driverID": "5cf0029caff5056591b0ce20",
            "from": {
                "lat": -70.55044,
                "lng": 165.39229
            },
            "to": {
                "lat": -29.9244,
                "lng": 88.2593
            }
        },
        {
            "_id": "5cf0029caff5056591b0ce2a",
            "date": "01/02/2006",
            "from": {
                "lat": -70.55044,
                "lng": 165.39229
            },
            "to": {
                "lat": -29.9244,
                "lng": 88.2593
            }
        }
    ],
    "location": {
        "current": {
            "lat": -70.55044,
            "lng": 165.39229
        }
    }
}

For this series, we’ve spun up a demo MongoDB cluster and populated with around 1.2 million rider documents, all randomly-generated. We’ll use this schema and data in the examples below.

Indexes to the rescue!

SomeCars users have reported slow in-app experiences to our awesome support team and HTTP response times are looking a lot higher in our Grafana metrics dashboard. So we definitely need to make some changes to our database and quickly! What to do? We already know our data model and data access patterns. What’s next? Indexes to the rescue!

live look at indexes coming in to help us destroy slowness

What’s an index? In database terms, an index is a data structure that improves the speed of data access, but at a small cost. In our data-heavy world, most every database has an indexing strategy and it’s usually an important part of the database’s overall feature-set. If you haven’t heard of indexes before, you’ve probably seen them in real life. Flip to the back of a book and it’ll probably have an index of where terms or topics appear in the book. Indexes used by databases are actually very similar. They provide a faster way for the database to look for items in a query, much the same as you might quickly scan the index to see where a term appears before exhaustively searching the book.

An index in action. Without an index, MongoDB must examine every document in order to fulfill a query. With an index, however, it can very quickly narrow down what it’s looking for.

Common MongoDB index types include:

  • Single: indexes on a single field (e.g. “email”)
  • Multikey: indexes for fields that hold an array value
  • Compound: indexes on multiple fields in a document

Single-field

There’s a SomeCars feature that does lookups on users using their email address, and we’ve noticed it’s slow (from our users and our dashboards). Let’s do a sample query to find an email we already know exists:

db.riders.findOne({ email: "Odessa [email protected]" });

This is a direct lookup and should be fast, especially since we actually know what we’re looking for in the database. However, the execution statistics show otherwise:

  • Documents Returned: 1
  • Index Keys Examined: 0
  • Documents Examined: 1188100
  • Actual Query Execution Time: 667ms

🤔🤔🤔Notice that we looked at every item in the database, leading to the 667ms query time. This might not seem like a long time, but it’s an eternity in database terms. Our users will definitely be frustrated by services that have to use this query. Let’s fix it with a single-field index:

db.riders.createIndex({ email: 1 }, { background: true });

We’re creating an index on the email field and we’ve done it in the background so we don’t disrupt running production systems (remember how important our users’ time is?). Creating indexes in the background takes longer but is worth the wait. Did you also notice how we set the index sort direction as ascending (going up):1? This won’t matter as much for single-field indexes but will be crucial for compound indexes.

After adding the index, let’s run the same query again:

db.riders.findOne({ email: "Odessa [email protected]" });
  • Documents Returned: 1
  • Index Keys Examined: 1
  • Documents Examined: 1
  • Actual Query Execution Time: 2ms

Much better!

Unique Indexes

The index we just created isn’t just going to help with performance. If we drop and recreate the index with the unique option, MongoDB will prevent inserts of documents with duplicate fields. Most index types in MongoDB can be unique. For example:

db.riders.createIndex({ email: 1 }, { background: true, unique: true });

This is a great way to do uniqueness checks really quickly and adds additional safeguards against duplicate data that can cause hard-to-resolve bugs or security vulnerabilities.

Multikey

Now that we’ve improved the performance of that query, we get another report from our users: queries that show rides taken with a particular self-driving car ID are taking a long time to load. The query looks something like the following:

db.riders.findOne({ "rides.driverID": ObjectId("5dee0e1467adb71ef2362050") });

The query performance is even worse than our first example due to having to also having to search through an array for a value:

  • Documents Returned: 1
  • Index Keys Examined: 0
  • Documents Examined: 1188100
  • Actual Query Execution Time: 1854ms

We are once again examining every document in the collection (a “collection scan”), which is a worst-case scenario. Let’s add an index to the driverID field in the rides array:

db.riders.createIndex({ "rides.driverID": 1 }, { background: true });

After creating a single-field multikey index, let’s see what our performance is like:

  • Documents Returned: 1
  • Index Keys Examined: 1
  • Documents Examined: 1
  • Actual Query Execution Time: 1ms

Much faster! Because we created a single-field index on an array item, MongoDB created an index entry for each item in the array. This is, in part, why we see nearly-identical performance between this and the email query. Multikey indexes have a few nuances to them, however, and the MongoDB documentation does a great job of covering any quirks.

Compound

Having solved our email and riderID queries, we find - yet again - that we have a poorly-performing query: an admin portion of the SomeCars app shows support users a view of rides taken within a time-range, sorted alphabetically by email so it’s easier to view in-app. The query looks something like this:

db.riders
    .find({
        "rides.date": {
            $lte: ISODate("12/01/2019"),
            $gte: ISODate("11/01/2019"),
        },
    })
    .sort({ email: -1 });

As expected, our query is examining a huge number of documents in order to return

  • Documents Returned: 1068298
  • Index Keys Examined: 1188100
  • Documents Examined: 1188100
  • Actual Query Execution Time: 3901ms

This is where compound indexes really come in handy. You may have been thinking to yourself earlier “But what if I’m not doing something as simple as looking up by email?” Compound indexes are often what you’d be looking for. They allow the creation of indexes using multiple keys in a particular order to improve both lookup and sort ordering. Whereas single-field indexes are traversable in either direction by MongoDB, you have to carefully think about index sort-order for compound indexes.

MongoDB will create a compound index using the order and sort-direction of fields you specify. This means you’ll want to think about not only the data you’re accessing (e.g., “email”, but the “when” of the data (how far will MongoDB have to scan to get to a document?) and how you are sorting that data. For our case, we want to as quickly as possible and narrow down the possible users’ email addresses to avoid doing a global sort of all user emails. We can use the rides.date field for this and set a descending sort direction. We will also specify an ascending sort order for the email field since our support users will be going forward alphabetically.

db.riders.createIndex({ "rides.date": -1, email: 1 }, { background: true });

Huzzah! The fields of our documents are working together and things are much faster. We’re still examining a good number of documents, but they are all within the time range specified, so we might not be able to do much better given our range-based query.

  • Documents Returned: 6703
  • Index Keys Examined: 6795
  • Documents Examined: 6771
  • Actual Query Execution Time: 143ms

What’s next?

Now you have a handful of tools to improve your queries in MongoDB using several kinds of indexes. Hopefully these cover most of your typical use-cases - they have for me. MongoDB also has index types to handle textual data for full-text-search applications and geographic queries for location-based data types.

comments powered by Disqus