Easy Development Experience: VS Code CosmosDB / Mongo explorer

It’s nice when you can develop code and explore your database in the same development environment. It’s even nicer when it’s VS Code, where I spend a lot of my time these days.

Since I’ve been working with both MongoDB and CosmosDB on various projects, I found that using one tool based on the MongoDB driver and pointing it at whichever database I need (CosmosDB or MongoDB) is a big productivity boost: it eliminates the constant flipping between shells, apps, and mindsets.

The trick is to install the Cosmos DB Support for VS Code extension. Once installed, it adds another icon to the VS Code activity bar.

Connecting

Azure sidebar

Expanding that sidebar gives you a tree view of the databases (yes, plural: you can have several connected at once).

To connect to a database using connection credentials, you invoke the CosmosDB: Attach Database Account command. Next, VSCode will ask you which interface you would like to use.

DB Type

This is the great part: if you want to connect to a native MongoDB server, choose MongoDB as the interface, then paste in any* valid MongoDB URL!

DB Type

Although the extension is labeled CosmosDB all over the place, it uses the MongoDB driver for making MongoDB API connections to CosmosDB instances that have MongoDB API enabled. This means it works just as well for native MongoDB databases!

If connecting to a CosmosDB instance, the easy connection option is to attach by signing in to Azure. Hit the “sign in to Azure” link. This pops up a toast notification in VS Code, telling you to open a browser and enter a nonce code into the web page prompt. The web page then lets you log in to Azure, and once that’s done you can close the browser. If everything went well, a list of the Azure subscriptions under that account appears in the tree view. While this option is nice and true to Azure account-provisioning form, it is not always a good fit for development shops that provision connection-string-based credentialed access but don’t allow developers direct portal or Azure resource access beyond that.

Data Interaction

Once connected, you can see a listing of databases and collections in the tree view. To see some documents, double click a collection name. A listing of documents will appear in the tree view.

Double-click a document, and you will see its JSON representation in a docked tab, letting you visualize the document content with full JSON, outlining, and extended type support. Extended JSON represents the underlying BSON native types with full round-trip fidelity.

{
    "_id": {
        "$oid": "5acbc21497c5242f8aba10b5"
    },
    "name": "bob",
    "born": {
        "$date": "2018-04-09T19:42:12.497Z"
    },
    "balance": {
        "$numberDecimal": "3.14159265"
    }
}

In the document sample above, the field _id contains an ObjectID. Extended JSON writes it as an encapsulated value using the $oid field name. Similarly, the field born is assigned a BSON date type, represented with a $date value field containing a canonical string date representation. The balance field is stored as $numberDecimal - a high-precision decimal - rather than a double. This matters when you deal with money, for example, so that decimal resolution and math operations on the field remain correct.

Document Editing

When you open a document to view its content, you actually create a Document Tab. These tabs let you modify a single document using the full power of the IDE. JSON linting, outlining, formatting, syntax highlighting - all of those help you visualize and understand your document better. But you can also make modifications to the data! You can change fields and their content to shape the document to your liking. When done, you can update the source database in 2 ways:

  • Using the Update to Cloud button
  • Using the File-> Save command ([CTRL] + S)

If you use the Update to Cloud button, the update is carried out, and an output log line is written to the Output tab under the CosmosDB category, showing the DB, collection name, and document id in a URI-like format.

update document button

10:49:12 AM: Updated entity "127.0.0.1:27017 (MongoDB)/db1/peeps/5b1eb2127f0f232ac43e0f42"

If you use the File->Save method, a toast will appear first, to make sure you actually meant to update the database.

save warning toast

The subtlety is that you may have inadvertently requested saving all open documents - including the edited one - without actually meaning to update the database. The prompt prevents a class of accidental modifications. You do have the choice to dismiss these warnings by choosing the Always upload button on the toast.

Beware that when you “Update to Cloud”, the content of the document in the tab completely replaces any and all values in the target database. Fields that exist in the database but not in your tab will be removed. If anyone else made changes to that document since you loaded it into VS Code, those changes will be gone. MongoDB provides “patch”-like surgical updates that touch individual fields, but those are not available in the single-document update.
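
For comparison, a field-level update issued from the shell (or from a scrapbook, covered next) only touches the named fields. This is just a sketch - the new balance value below is a placeholder, and the _id is the one from the sample document shown earlier:

// only balance changes; name, born, and any other fields are left untouched
db.peeps.updateOne(
    { _id: ObjectId("5acbc21497c5242f8aba10b5") },
    { $set: { balance: NumberDecimal("4.20") } }
)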

Scrapbook Support

Unlike the single-document tab, scrapbooks let you perform arbitrary Mongo shell commands against a connected database. This is where you have more control over updates, and can perform several operations rather than just slam-updating a single document. Contrary to the ephemeral connotation of the name “Scrapbook”, they can actually be persisted. Saving a scrapbook creates a new file in the root folder of your project, with the extension .mongo.

To get a scrapbook started, either choose the “plus-document” icon from the tree view banner, or invoke the VS Code command “Cosmos DB: New Mongo Scrapbook”.

New scrapbook

This will prompt you to connect anew, or choose an attached instance.

Select attached

Choosing to select an already attached instance will list the currently connected instances.

already attached

Choose one of those, and you’re presented with a listing of the databases.

choose db

Choose a database (or create a new one), and you’re in business!

The scrapbook surface is a persisted file containing MongoDB shell commands. Because it is persisted in the file tree, you can save it, edit it, and most importantly: version it! Source control is a great way to ensure your scripting of data access tasks is durable, repeatable, shareable and maintainable.

With the scrapbook, you are not limited to touching one document. Updates can span multiple documents using the update() or updateMany() collection syntax, and bring you the full power of advanced update operators like $addToSet, $inc, etc. Any other shell command is available too, so you can also remove documents, insert some, or write some light JavaScript code to fetch and manipulate data - anything you need.
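
For instance, a scrapbook snippet like the following (the collection and field names are made up for illustration) updates every matching document in one shot:

db.peeps.updateMany(
    { seen: 'beach' },                 // every document matching this filter...
    {
        $inc: { sightings: 1 },        // ...gets a counter bumped
        $addToSet: { tags: 'summer' }  // ...and a tag added only if it's not already there
    }
)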

Below, I scripted an aggregation query to find the number of orders per month in my “orders” collection. Hitting [CTRL] + [SHIFT] + ; executes the scrapbook in its entirety. (You can also execute only a portion of the code in the scrapbook: select the code to run, and hit [CTRL] + [SHIFT] + '.)
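
The scrapbook content was along these lines - a sketch, which assumes each order document carries an orderDate date field:

db.orders.aggregate([
    { $group: {
        _id: { year: { $year: '$orderDate' }, month: { $month: '$orderDate' } },
        orders: { $sum: 1 }
    } },
    { $sort: { '_id.year': 1, '_id.month': 1 } }
])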

Run scrapbook

Overall, the tight integration into the IDE is a great convenience. I’m not ready to ditch my MongoDB shell yet, but not having to switch back and forth as much for daily simple tasks is awesome.

If you are working with CosmosDB’s graph interface, this extension lets you visualize graphs as well. If you are working with the SQL document interface, it lets you connect in that mode too. This extension is quite powerful, with a much richer UI and more options than its simple name suggests.


* The current extension version requires that the MongoDB URL begin with mongodb://. While this works for most, it prevents using the perfectly valid mongodb+srv:// prefix used by some hosts. I submitted a pull request for this issue. Once merged, mongodb+srv:// should be usable as well.

Query Cost Estimate for CosmosDB using MongoDB Aggregation

CosmosDB offers MongoDB API access to data. This means that you should be able to take any application you wrote using the Mongo 3.4 wire protocol and point it to CosmosDB rather than a MongoDB instance. Well, kind of… There are some caveats: not all operators are supported, the consistency models differ, and the indexing strategy is completely different (exact details here). But other than that - yay!

So I set out to kick the tires on the MongoDB API against a cheap, fixed-size CosmosDB instance with the minimum allowed throughput (400 RU).

Good news everyone!

It Just Works ™!

The decision to use CosmosDB as a managed data back end vs. MongoDB is not one you should take lightly. The offerings are not the same at all, even if the MongoDB API provides sufficient parity. One concern that always seems to come up is cost. Rightfully so: the CosmosDB pricing model is based on internal metering of Request Units per second. And while the explanation of what RUs are can be clearly understood, mapping such a low-level measurement to your actual workload is not as straightforward.

So how can you calculate your workload’s cost? For write operations, the cost is easier to estimate. After all, when you insert a document, you can figure out the average byte size of the documents you are inserting, then multiply by the write rate you anticipate.
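
As a back-of-the-envelope sketch (the per-insert charge and write rate below are made-up numbers; measure your own, for instance with the getLastRequestStatistics technique shown a bit further down):

// hypothetical figures for illustration only
const avgChargePerInsert = 10;   // RU charged per insert, as measured on your documents
const insertsPerSecond = 30;     // anticipated sustained write rate
const writeThroughputNeeded = avgChargePerInsert * insertsPerSecond;
console.log(`${writeThroughputNeeded} RU/s needed just for writes`); // 300 RU/s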

For queries though, things are a bit more challenging. For one, the number of reads that CosmosDB will perform in order to satisfy your query is not necessarily known ahead of time. Even non-ad-hoc queries can net a highly variable number of documents; it depends on the indexing and the data variety your collection contains. Further, CosmosDB may internally optimize data access such that some queries seem to have a 1-to-1 cost relation to the number of documents addressed, whereas other queries show lower cost. (I am guessing this is because internal block reads may cover multiple documents when documents are small and adjacent, whereas larger ones come from different or partial I/O blocks. Just guessing.)

My applications typically produce queries using the MongoDB aggregation syntax. Some drivers also produce aggregation expressions, such as the C# LINQ provider. So I figured the best way to estimate the cost of such queries would be to run them and inquire as to their cost.

To get the cost of an operation, you can chase the operation with a Mongo command, like so:

db.adminCommand({ getLastRequestStatistics: 1 })

The admin command getLastRequestStatistics returns an object response that looks something like:

{
    "_t" : "GetRequestStatisticsResponse",
    "ok" : 1,
    "CommandName" : "OP_QUERY",
    "RequestCharge" : 2.27,
    "RequestDurationInMilliSeconds" : NumberLong(3)
}

The field of interest - and the focus of this post - is RequestCharge. Most importantly: CosmosDB limits your throughput to whatever RU/s you provisioned for that collection. Therefore, summing up the costs of all operations in a time window gives you a decent estimate of the throughput you need. Conversely, if you know your individual operation costs, you can provision appropriate throughput to match your live workload.

Great! I can issue an aggregation query to MongoDB from my Mongo shell, and then chase it with a “tell me how much was that” inquiry. But that gets tedious quickly… So I wrote this little Node app. It’s a single-page app that lets you type in various aggregation expressions, run them, and estimate the cost at the click of a button.
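
The heart of that tool boils down to something like the following sketch. This is not the actual app code: it assumes a 3.x-era Node MongoDB driver, and the collection name and pipeline are placeholders. The cost command mirrors the shell command shown above.

const { MongoClient } = require('mongodb');

async function estimateCharge(connectionString, pipeline) {
    const client = await MongoClient.connect(connectionString);
    const db = client.db(); // uses the database name embedded in the connection string
    // run the aggregation whose cost we want to estimate
    const docs = await db.collection('orders').aggregate(pipeline).toArray();
    // then ask CosmosDB what the last request cost
    const stats = await db.command({ getLastRequestStatistics: 1 });
    console.log(`RequestCharge: ${stats.RequestCharge} RU, documents returned: ${docs.length}`);
    await client.close();
}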

The UI lets you put in an aggregation expression (shown) or a “plain” find() query.

Enter Query or Aggregation

Then hit check to get the cost in terms of RUs, plus a display of the result of the query. Something like this:

Cost and result display

To run the app, you can just clone the source, then run:

node index.js "cosmos-connection-string"

You will want to do 2 things to the URL you get from the Azure portal as “MongoDB connection string”:

  1. The Azure provided connection string has “==” as part of the password in the URL. Replace these with “%3D%3D” - the proper URL encoding for those characters in the password portion of the URL. (A before/after sketch follows this list.)

  2. The Azure provided connection string doesn’t contain the destination database name. You should add the database name into the URL just before the query string, between the “/” and the “?” characters. For example ...documents.azure.com:10255/mydbname?ssl=true&replicaSet=globaldb specifies using the database name “mydbname”.
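
Putting both adjustments together, the transformation looks roughly like this (the account name and key are placeholders):

// before - as copied from the portal
mongodb://myaccount:somebase64key==@myaccount.documents.azure.com:10255/?ssl=true&replicaSet=globaldb
// after - key suffix encoded, database name added
mongodb://myaccount:somebase64key%3D%3D@myaccount.documents.azure.com:10255/mydbname?ssl=true&replicaSet=globaldb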

More can be found here on how to get the connection string from the Azure portal.

Another thing to keep in mind is that this is only an estimate. It will vary from your actual runtime load (in either direction!) for various reasons, some of which are:

  • The number of documents in your collection changes over time.
  • The indexing state and type may differ.
  • The partitioning provided by CosmosDB spreads documents over actual storage, and that too can change over time.
  • Your queries - especially ad hoc ones composed on arbitrary fields with arbitrary values - can create vastly different queries than the ones you tested or estimated.

As always, reality checks using the actual metrics will beat any estimates produced in the comfort of your own lab. But since this little tool works using a connection string that a developer can get, it can be used even when Azure portal access is not directly available to a developer.

Push Notification Made Easy with MongoDB and GlingJS

GlingJS is a small component that lets you pump data events into the browser from MongoDB.

Imagine telling a customer lingering on a product catalog page “Hey! Someone just bought this very product!” or the animal shelter page popping up a picture every time a kitten is saved. Let your imagination run wild, you do you.

Cat Meme

The project was inspired by the recent introduction of MongoDB Change streams. Change streams are an efficient way to track changes occurring at the data level. It notifies you in real time each time a document is written or changed. You just need to specify which collection and under what data change conditions you want to get notified.

Ok, sounds lovely, but why GlingJS? Why another library on top of the node driver syntax?

  1. There’s a huge number of MEAN stack and node developers.
  2. They mostly build web applications.
  3. Real-time notification with WebSocket is better than polling or waiting for page transition.
  4. MongoDB is powering a staggering number of websites.

With GlingJS, you get real-time messages in the browser, with very little code.

Theory & Practice

First, you’d want to grab the package from NPM

npm install glingjs --save

Server Side

On the Node side, GlingJS consists largely of 2 working components: the ClientManager and Gling itself. In addition, the WebSocketServerManager glues the WebSocket transport to the ClientManager.

Client Manager

The client manager is the component that manages the WebSockets. It handles browser connection requests and topics. A topic is a subject that the browser might be interested in. Things like “when a customer registers” could be broadcast to the “newUser” topic. These are made up - they depend on your requirements so there’s no built-in set of hard-coded topics.

The client manager exposes a callback function broadcast(topic, message). This hook is what Gling uses to emit the database event upstream to the browser.

Your app would need one instance of clientManager, created thusly:

const clientManager = new ClientManager(config);

Web Socket Server Manager

The WebSocket Server manager is really just glue: WebSocketServerManager is instantiated to latch on to your Node http server, and attach WebSocket functionality to the Client Manager.

var webSocketServerManager = new WebSocketServerManager(httpServer, clientManager);

If your Node app already has a constructed http server, you’d hand it to the component. If you are running standalone Node instance(s) just for Gling, then you need to set up an http server before reaching this point.

Gling

Gling is the component that deals with MongoDB change streams. It is also configuration based, so getting an instance of Gling is straightforward:

var gling = new Gling(config);

Gling exposes a function start(hook) which takes a callback matching the signature of ClientManager.broadcast.

Starting Gling looks something like:

// passing the ClientManager instance broadcast hook to Gling
gling.start(clientManager.broadcast)
    .then(() => console.log('started gling..'))
    .catch(reason => {
        console.error(reason);
    });

Gling handles all the business of subscribing to the MongoDB using the driver API, so you don’t have to.
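
To put the server-side pieces together, a rough end-to-end sketch might look like the following. The require path and export names are assumptions on my part - consult the package README and the demo project for the authoritative setup.

const http = require('http');
// assumption: the package exposes these three constructors
const { Gling, ClientManager, WebSocketServerManager } = require('glingjs');
const config = require('./config');

// any http server will do; Gling only needs something to attach WebSockets to
const httpServer = http.createServer((req, res) => res.end('gling host'));

const clientManager = new ClientManager(config);
new WebSocketServerManager(httpServer, clientManager);

const gling = new Gling(config);
gling.start(clientManager.broadcast)
    .then(() => httpServer.listen(80, () => console.log('gling listening on port 80')))
    .catch(err => console.error(err));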

Client Side

Browser

GlingJS uses WebSockets to deliver messages from the server to the browser.

Most browsers nowadays support the WebSocket standard, so no special library required.

The browser needs to open a WebSocket connection to the server, and provide the topic it is interested in.

<html>
<head>
    <script type="text/javascript">
        (function () {
            // subscribing to 'meme-cat' topic, on host 'myservername'
            var ws = new WebSocket("ws://myservername:80/gling", "meme-cat");
            ws.onmessage = function (evt) {
                const data = JSON.parse(evt.data);
                //TODO: show data using your UI
            };
            window.onbeforeunload = function (event) {
                ws.close();
            };
        })();
    </script>
</head>
<body></body>
</html>

In the example above the web page subscribes to the topic meme-cat coming from the server myservername. Another page may subscribe to a different topic.

You might also notice the window.onbeforeunload browser event, which disconnects the socket when the page closes. It is good citizenship that helps the server clean up its resources and keeps overhead low.

We mentioned topics a few times and alluded to a configuration. Let’s dig deeper!

Configuration

Listeners

For notifications to be useful, a client (web browser) would expect a few things:

  • That it gets messages it is interested in
  • That it doesn’t get messages it is not interested in
  • That the messages it receives have the field(s) it expects

The first 2 requirements really talk about the topic. The topic is pre-defined, and acts as a filter here: the browser says it wants cat memes, we’re going to send cat memes, but no dog memes.

The client also expects that certain fields are present. As a good convention, we’ll aim to have a single topic imply a set schema: all messages for a topic should have the same set of fields. Gling doesn’t enforce a schema, but by nature it subscribes a topic to the same underlying documents, so their uniformity helps the browser get uniformly shaped messages. Bottom line: your browser should expect that a single topic will receive the same type of message. It is possible to manufacture 2 different events and pump them under the same topic, but that can get really silly, and topics are free, so just come up with a new topic name.

Here is an example configuration:

var ChangeType = require('../src/changeType');

const Config = {
    connection: 'mongodb://localhost:27017/gling?replSet=r1',
    allowedOrigins: ['https://example.com/gling'],
    listeners: [
        {
            collection: 'memes',
            when: [ChangeType.create],
            filter: { about: 'cat' },
            fields: ['url', 'caption'],
            topic: 'meme-cat'
        },
        //...
    ]
}

module.exports = Config;

Topics get created inside the listeners array. The fields control the delivery of the notifications:

| Field | Description |
| --- | --- |
| collection | The MongoDB collection name |
| when | Which types of changes we care about. This can be any of ‘create’, ‘update’, ‘remove’ |
| filter | A match condition. Only documents that match this condition will be returned. |
| fields | Which fields from the document will be returned |
| topic | Topic name |

A multitude of listeners can be defined and supported by a single Node server, as long as it can handle the volume of notifications from Mongo and number of simultaneous browsers connected.

Filters can be compound expressions and include any operators that the aggregation framework supports. They are static though. Since change stream subscription is done once, you can’t keep changing the value in a match expression. That is to say, if you pass in the value {created: {$gt: Date()}}, the date compared will be the time the Node server started the process, not evaluated each time a document changes.

Field trimming is optional but allows you to shrink down the amount of data you send to the browser. This can help performance and also security. Given a Mongo document

{_id:12, name: 'bob', city:'La Paz', password: 'password123'}

You can configure fields to include only ['city','name'] and suppress data that should not be sent to the browser.
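
For instance, with a listener configured to pass only those two fields (the collection and topic names below are illustrative), the password never leaves the server; conceptually the browser only ever sees the trimmed shape:

// listener entry (illustrative)
{
    collection: 'users',
    when: [ChangeType.create],
    fields: ['city', 'name'],
    topic: 'newUser'
}
// what a subscriber to 'newUser' receives, roughly:
// { "name": "bob", "city": "La Paz" }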

WebSocket Safety

To help protect browsers against scripting attacks, it’s best if you explicitly allow your domain(s) only. Pages on different domains would be rejected.

The allowedOrigins array is there for the ClientManager and your http server setup. ClientManager exposes a convenience method isOriginAllowed(thisOrigin, listOfAllowedOrigins) which does the math to figure out whether the current request origin should be allowed.

WebSocketServerManager takes care to reject origins not specified in the configuration.

An asterisk ‘*’ in the list of allowed origins will allow any origin! Use it in local development only, never in production!

Mongo Connection

The MongoDB server connection is specified using the connection field. Any legal MongoDB URI string would work. But since MongoDB Change Streams are built on the Oplog, the MongoDB server should be part of a replica set (even a replica-set of one server would do). For that reason, you should include the URL parameter replSet= to set the replica set name of your MongoDB cluster.

Conclusion

GlingJS is a simple and easy way to push notifications from MongoDB document change events directly into a browser using WebSockets. A demo project can be found here. The first part sets up an http server. The code to actually set up Gling and the ClientManager is minimal and simple. Check it out, see what you think!

MongoDB Change Streams - More event processing, less queue infrastructure.

What are Change Streams?

Change streams are a new way to tap into all of the data being written (or deleted) in mongo. Using change streams, you can do nifty things like triggering any reaction you want in response to very specific document changes.

For example, you have a user that registers to your website. You want to send them an email, or maybe put them on an on-boarding campaign. Maybe both? Using change streams you can hook into the live event itself - insert of a document to the Users collection - and react to that by immediately pinging your remarketing system with the details of the new user.

The flow looks something like this:

Simple Integration

Your website code only worries about registering the new user. Want to also send the new user a gift? No problem: have the listener also ping the warehouse. Want to add the user to your CRM system? You guessed it: just add a hookup.

Now you may wonder: “Why is this news? Didn’t mongo always have tailable cursors? Can’t we just tail the oplog and do the same?” Sure we can. But when you tail the oplog, every mutation is going to be returned to you. Every. Single. Write.

Consider a moderate website, say 1000 user sign-ups a day. That’s roughly one every minute or two. Not too bad. But what about the social media documents? Every tweet? Every page view you logged? Every catalog item visited? That can get pretty big. Each of these will be sent to you from the oplog, because each one gets written into the oplog, regardless of whether you need it or not. Extra work, extra load. Extra data you’re not interested in. Filtering through this is like putting a butterfly net over a firehose nozzle.

Change streams allow you to do this more efficiently. With change stream you get to customize several aspects of the stream that is returned to you:

  1. The specific collection you are interested in.
  2. The type of change you are interested in (inserts and updates only, for example).
  3. The specific parts of the changed document (if any) you want back.

These options give you great control over the source, size, and nature of the changes your listener app wants to handle. Rather than returning all writes to you and making you filter them out in application memory, the filtering is done at the server level. This reduces wire traffic and reduces memory and CPU usage on the client. Further, it can save another round trip per document: if the mutation touched only one field - say, last login date - tailing the oplog would have only yielded you the content of the command mutation - the date value itself. If you needed the user’s email and name - which were not part of the command - they are not in the oplog. Tailing the oplog would then not be enough, and you’d have to turn around and shoot another query (joy! more I/O!) to get the extra details. With change streams, you get to include the extra fields in the returned event data by just configuring the stream to return them.

How to Consume Change Streams

Subscribing to change events depends on your particular driver support. Early beta had Node and Java driver support. I’ll focus here on Node.

There are 2 modes of consuming change streams: event driven, and aggregation pipeline. The aggregation pipeline way is to just issue an aggregate on a collection, with the new, mandatory $changeStream pipeline operator as the first pipeline stage. You can introduce other pipeline stages after that stage, but it must be the first one. The resulting cursor then returns change items as they occur.

var coll = client.db('demo').collection('peeps');
var cursor = coll.aggregate([
{ $changeStream: {} }
]);
// .. now consume cursor as you would any

In the example above, we’re just saying that any write to the peeps collection in the demo database will be returned back in the cursor we opened. The cursor remains open (barring errors, network issues, or collection disappearance), so we can just process the events coming back sequentially.

Speaking of sequentially: the event stream guarantees that events are returned in the order they were executed by MongoDB. The change stream system uses logical ordering that ensures you get the events in the same order mongo would have serialized them internally.

To include the full document affected, you can add the fullDocument option with a value of updateLookup. The other value is default, which indicates you don’t need any extra data. Omitting the fullDocument option is equivalent to specifying the value default.

With the updateLookup value set, an event will include the document as queried at some point after applying the operation that prompted the event. The subtlety here is that under a high rate of change, your event may reflect a later state of the document. For example, if you update a pageview.clickCount field twice in quick succession, the event resulting from the first update may reflect the result of the second update. You will get 2 change notifications because 2 writes happened, and they will be reflected in order. But the lookup that brings in the current state of the document is not guaranteed to contain the first change, because it is, in fact, a lookup performed after the event is queued for discovery by your change stream, not a point-in-time capture of the document.

var cursor = coll.aggregate([
{ $changeStream: { fullDocument: 'updateLookup' } }
]);

In the example above, the event returned from the cursor will have a field named fullDocument containing the document looked up at the server, without an additional round trip.

Speaking of event data, what else is returned?

{
    "_id": { "clusterTime": { "ts": "..." }, "uuid": "...", "documentKey": { "_id": "..." } },
    "operationType": "update",
    "fullDocument": { "_id": "waldo", "at": "2017-10-01", "seen": "beach" }, // this is my document!
    "ns": { "db": "changeStreamDemo", "coll": "peeps" },
    "documentKey": { "_id": "waldo" },
    "updateDescription": {
        "updatedFields": { "at": "2017-10-17" },
        "removedFields": []
    }
}

The change event includes a bunch of useful information:

  • An _id for the event. In case you wanted to stop and start again, save this one and ask only for changes that occurred since that id.
  • The operationType - discussed shortly.
  • The full namespace ns, including the database name and collection names.
  • documentKey pointing to the document _id subject of this event
  • fullDocument containing the document as it was just after the write occurred. The word full may be a misnomer though: it depends on any further projection you may add to retrieve a subset or a modified document. More about that in a bit. This field will not be present unless the fullDocument: 'updateLookup' option was specified.
  • An update description, which contains the exact field changes: the values assigned in updatedFields, and the fields removed (think $unset) in the removedFields array.

Thus far, we just asked for any change. You can narrow down the change nature, using the various operation types during the stream setup. The options are:

  • insert
  • update
  • delete
  • replace
  • invalidate

The first 4 are self-explanatory. invalidate occurs when a collection-level change - other than a delete - makes the documents unavailable. Think collection.drop().

To get notified only on a subset of the change types, you can request the change stream, and add a $match stage, like so:

var coll = client.db('demo').collection('peeps');
var cursor = coll.aggregate([
{ $changeStream: { fullDocument: 'updateLookup' } },
{ $match: { operationType: { $in: ['update', 'replace'] } } }
]);

Now only updates and document-replacements would be emitted back, filtered at the source.

This pipeline can also include transformations, so the aforementioned fullDocument field is really what you want to make of it.

var coll = client.db('demo').collection('peeps');
var cursor = coll.aggregate([
{ $changeStream: { fullDocument: 'updateLookup' } },
{ $match: { operationType: { $in: ['update', 'replace'] } } },
{ $project: {secret:0}}
]);

The code above will omit the secret field from the emitted event.

The $match stage can also apply other arbitrary conditions. When the updateLookup option is set, you can pretty much filter based on any document field or set of fields you want, using familiar aggregation syntax.

That was a quick rundown of consuming change streams using aggregation. But I actually like another option in the node driver: subscribing to change stream events as asynchronous events!

Here’s how that plays out:

// a filter describing the things i'm interested in
var filter = {
    $match: {
        $or: [{ operationType: 'update' }, { operationType: 'replace' }],
        'fullDocument._id': 'waldo'
    }
};
// the option to return the full document
var options = { fullDocument: 'updateLookup' };
// creating a change stream definition
var changeStream = coll.watch([filter], options);
// subscribing to events
changeStream.on('change', c => {
    console.log('Waldo alert!', c)
});
changeStream.on('error', err => console.log('Oh snap!', err));

Using the code above, any update or replace to the document with _id ‘waldo’ in the collection coll will emit a change event.

My preference for this syntax is that it fits better into the modular async coding model. It also makes it easy to hook up the change function but define the handler elsewhere. This makes testing and reuse easy too. But that’s me - you do you.

Resume / Retry

In the face of a network error, the driver will automatically try to re-connect once. If it fails again, you can use the last _id of the change event (which will also be in the exception thrown) after you reconnect to your cluster. The responsibility for remembering where you were is on you (after the one-shot recovery the driver does for you). But both syntaxes allow you to add a resumeAfter option when requesting the stream, setting the value to the last change stream _id you successfully processed.

var lastProcessedChangeId = ...;
var options = { resumeAfter: lastProcessedChangeId, fullDocument: 'updateLookup' };

Due to the nature of large interconnected systems, it is useful to design the change stream events in your application as ‘at least once’, and to make any operation you trigger based on a notification idempotent. You should be able to run the same operation more than once with the same event data without adverse effect on your system.

Even when you limit the number, nature, and collection you listen to, you may still end up with a large amount of changes to process. Or you may have subordinate systems that can’t handle the calls emanating from the events (think slow legacy systems). One thing you can do is deploy multiple listeners, each one taking on the same change definition but adding a partition filter on the change _id. A partition filter can be employed on any field actually, but the _id is present even when you don’t ask for updateLookup. The idea is to just run several listeners, each one with an extra field match on the modulus of the timestamp or something. If you need more capacity, you spin up more listeners and re-partition according to the number of listeners you need.

The change stream feature is currently in RC0 as of this writing, and should hit GA soon. I look forward to writing less code and incurring less I/O in the event processing component of our systems. It’s kind of nice to be able to use the same database infrastructure for all kinds of workloads, and this operational integration feature is a very welcome addition.

MongoDB - Don't be so (case) sensitive!

I have a collection of people, named peeps:

db.peeps.insert({UserName: 'BOB'})
db.peeps.insert({UserName: 'bob'})
db.peeps.insert({UserName: 'Bob'})
db.peeps.insert({UserName: 'Sally'})

And I want to be able to find the user named “Bob”. Except I don’t know if the user name is lower, upper or mixed case:

db.peeps.find({UserName:'bob'}).count() // 1 result
db.peeps.find({UserName:'BOB'}).count() // 1 result
db.peeps.find({UserName:'Bob'}).count() // 1 result

But I don’t really want to require my users to type the name the same case as was entered into the database…

db.peeps.find({UserName:/bob/i}).count() // 3 results

Me: Yey, Regex!!!

MongoDB: Ahem! Not so fast… Look at the query plan.

db.peeps.find({UserName:/bob/}).explain()
// ugg, collection scan:
// "winningPlan" : {
// "stage" : "COLLSCAN", ...

Me: Oh, I’ll create an index!

db.peeps.createIndex({UserName:1});
// query again...
db.peeps.find({UserName:/bob/}).explain()
// Ah! IXSCAN!
// "winningPlan" : {
// "stage" : "FETCH",
// "inputStage" : {
// "stage" : "IXSCAN", ...
// ignore case, still uses index
db.peeps.find({UserName:/bob/i}).explain()
// Ah! IXSCAN!
// "winningPlan" : {
// "stage" : "FETCH",
// "inputStage" : {
// "stage" : "IXSCAN", ...

Me: Yey!!!

MongoDB: Dude, dig deeper… and don’t forget to left-anchor your query.

// Run explain(true) to get full blown details:
db.peeps.find({UserName:/^bob/i}).explain(true)
// "executionStats" : {
// "executionSuccess" : true,
// "nReturned" : 3,
// "executionTimeMillis" : 0,
// "totalKeysExamined" : 4, // <<<=== Oy! Each key examined.
// "totalDocsExamined" : 3,

Me: Yey?

MongoDB: Each key in the index was examined! That’s not scalable… for a million documents, mongo will have to evaluate a million keys.

Me: But, but, but…

db.peeps.find({UserName:/^bob/}).explain(true)
//"executionStats" : {
// "executionSuccess" : true,
// "nReturned" : 1,
// "executionTimeMillis" : 0,
// "totalKeysExamined" : 1, // <<<=== Ok,
// "totalDocsExamined" : 1, // <<<=== Not Ok...

Me: This is back to exact match :-) Only one document returned. I want case insensitive match!

Old MongoDB: ¯\_(ツ)_/¯ … Normalize string case for that field, or add another field where you store a lowercase version just for this comparison, then do an exact match?

Me:

New MongoDB: Dude: Collation!

Me: Oh?

Me: (Googles MongoDB Collation frantically…)

Me: Ahh!

db.peeps.createIndex({UserName:-1}, { collation: { locale: 'en', strength: 2 } })
db.peeps.find({UserName:'bob'}).collation({locale:'en',strength:2}) // 3 results!
// "executionStats" : {
// "executionSuccess" : true,
// "nReturned" : 3,
// "executionTimeMillis" : 0,
// "totalKeysExamined" : 3, // <<<=== Good! Only matching keys examined
// "totalDocsExamined" : 3,

Me: Squee!

MongoDB: Indeed.


Collation is a very welcome addition to MongoDB.

You can set Collation on a whole collection, or use it in specific indexing strategies.
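
For example, a collection-wide default collation can be set at creation time; indexes built on that collection and queries that don’t specify their own collation then inherit it. A quick sketch (on a fresh collection):

// create the collection with a case-insensitive default collation
db.createCollection('peeps', { collation: { locale: 'en', strength: 2 } })
// this index inherits the collection's collation
db.peeps.createIndex({ UserName: 1 })
// so does this query - it matches 'bob', 'BOB', and 'Bob'
db.peeps.find({ UserName: 'bob' })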

The main pain point it solves for me is the case-insensitive string match, which previously required either changing the schema just for that (ick!), or using regex (index supported, but not nearly as efficient as exact match).

Beyond case-sensitivity, collation also addresses character variants, diacritics, and sorting concerns. This is a very important addition to the engine, and critical for wide adoption in many languages.

Check out the docs: Collation

MongoDB Sharding & Chunks

If you want to scale your reads or writes using MongoDB, you are going to use sharding.

The nice thing about sharding, is auto-sharding. Mongo doesn’t only distribute the data across nodes for you, it automatically splits collection data and keeps the nodes fairly equally loaded with data. The core mechanism supporting this arrangement is the chunk.

So, what are chunks anyway?

When you want to split data in a collection horizontally, you need a way to track which documents live where. When you define a sharded cluster you pick a shard key. That shard key is the value taken from each document when deciding which physical shard that document will be located on. To keep track of all the documents, the config servers need some way to persist the shard association.

If we kept an entry pointing each shard key value to a shard, it wouldn’t scale very well. A billion documents would require a billion entries, making the config servers work hard and consume lots of memory and I/O. If we instead defined a hash function that takes the key and distributes the key-value space across some nodes, we wouldn’t need to save each key value, and could predict the shard a document lives on by applying the function alone. The trouble with that is that if we added or removed shards, we’d have a hard time re-balancing documents across the available nodes, as the original function wouldn’t have accounted for a different number of nodes. Further, a formulaic key-to-shard association would not necessarily result in even load on the shards. Although a key space (like an integer) may be evenly round-robin’ed across shards (think Mod(x) where x is the number of shards), the documents in reality may occur in clumps that end up on one shard, regardless of the theoretical key-space distribution. No, these methods have severe challenges. What Mongo does instead is define chunks.

A chunk is simply a small record that describes a key range, and the shard that key range is associated to. By having a chunk describe a key range rather than each discrete key, we get around the issue of storing every shard key value found. Although chunks are stored on the config servers, the storage size is much reduced compared to storing each key, since ranges can/would encompass a large number of keys in one small descriptor.

Initially, a chunk is created which encompasses all possible keys. For a sharded collection with a shard key “score” this may look like:

{ "score" : { "$minKey" : 1 } } -->> { "score" : {"$maxKey":1} } on : shard0000

This descriptor states that any document with any score value (from theoretical minimum to maximum) will be located on the shard named “shard0000”.
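
If you’re curious what the chunk map looks like on your own cluster, you can peek at it from a mongos; the namespace below is a placeholder:

// summary of chunk distribution per sharded collection
sh.status()
// or query the chunk descriptors directly from the config database
db.getSiblingDB('config').chunks.find({ ns: 'mydb.scores' }).pretty()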

Once you start populating the collection with documents, Mongo will start splitting chunks and migrating them to other shards in an attempt to keep data evenly spread across the shards. This action takes place when Mongo sees a chunk containing 64MB worth of data. Once a chunk is split, Mongo will move one of the two chunks to another shard. It will copy all the documents which fall into the chunk’s range over to the new shard, update the config servers to point the chunk to the new shard, and finally clean up the documents from the old shard. All this work is automatic and baked into the sharding mechanism of Mongo.

Chunks are split based on actual seen key values. Mongo computes a splitting point on the shard key based on the actual keys in the shard. The nice thing about this is that given an uneven distribution along a key-space, the splitting will follow the density and reduce it. Let’s say you have a shard key on the credit score of your customers. The theoretical range is 0 - 850. But in reality, your seen scores - the actual scores - are such that the majority are between 600 and 750, with a few outliers below 600, and a few above 750. The chunk-split progression may look something like:

// one chunk, covering all
{ "score" : { "$minKey" : 1 } } -->> { "score" : {"$maxKey":1} }
// then 2 chunks,
{ "score" : { "$minKey" : 1 } } -->> { "score" : 0 }
{ "score" : 0 } -->> { "score" : {"$maxKey":1} }
// then the upper chunk will split
{ "score" : { "$minKey" : 1 } } -->> { "score" : 0 }
{ "score" : 0 } -->> { "score" : 419 }
{ "score" : 419 } -->> { "score" : {"$maxKey":1} }
// since most scores fall above 419, that chunk will again split given enough data
{ "score" : { "$minKey" : 1 } } -->> { "score" : 0 }
{ "score" : 0 } -->> { "score" : 419 }
{ "score" : 419 } -->> 639 }
{ "score" : 639 } -->> { "score" : {"$maxKey":1} }

This is all data and insertion-order related. Based on the addition of documents at a point in time, a chunk may grow and need to split. Chunks remaining small enough are left alone. Over time, and given a good enough shard key, the data will be roughly evenly split across the shards. This mechanism does require active management on Mongo’s part, as opposed to a fixed hash function over the number of shards. But it has the advantage of allowing you to grow the cluster with ease. If you add another shard to the cluster, Mongo can migrate chunks (without splitting), draining some load off existing shards. Another issue with static hash functions is that they rely on theoretical values, not actual ones. Imagine all scores are even for some reason, and that your hash function is simply odd/even to place documents on 2 shards. All the even documents will end up on one shard, and the other shard, for odd numbers, will be empty. Not so with Mongo’s chunk mechanism: Mongo will split chunks as they become full, and will move chunks to the least full shard at that time. It’s a dynamic thing. It’s easy to figure out the data load because you can just count chunks. Sure, there might be some pretty empty chunks, but the imbalance over a large enough dataset is negligible.

Mongo takes care of leveling the data size load on each shard. But what it can’t do is level the query load or write load on the shards. Given 3 shards - call them A, B, and C - Mongo doesn’t know if your next 3 documents will end up all on A, or go to B and C. It’s up to you. You must pick a shard key that has such distribution properties. This is one of the tougher tasks in designing a sharded cluster. The effects of picking a good key can make or break your cluster. (And no, the “score” field is not a good shard key by any means.) The principal considerations for picking the shard key are (a setup sketch follows the list):

  1. We want writes to be spread across shards as evenly as possible
  2. We want reads to be as local as possible
  3. We want chunks to always be splittable
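
For reference, here is what turning on sharding with a chosen key looks like; the database, collection, and compound key below are illustrative only:

// enable sharding for the database
sh.enableSharding('store')
// the shard key needs a supporting index
db.getSiblingDB('store').orders.createIndex({ customerId: 1, orderDate: 1 })
// shard the collection on the chosen key
sh.shardCollection('store.orders', { customerId: 1, orderDate: 1 })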

We’re sharding because we want to realize more I/O and processing bandwidth using more servers. Writes are I/O intensive, and therefore spreading the writes across shards evenly will give us the best write-performance. A bad shard key will cause consecutive document write load to go to one of the shards, leaving others idle.

mongos will route queries to the shards it thinks the documents are on. It is capable of clubbing together query results from multiple shards and handing them to you as one result set. But if your queries can be served by one shard only, the rest of the shards are free to answer more/other queries in parallel. Query locality has to do with picking a shard key that is likely present in your queries, and for which the documents largely or wholly live on one or a few shards.

A source of imbalance can arise from a giant chunk. A chunk will grow beyond 64MB if Mongo can’t split it. This happens when the shard key is such that a chunk fills up and all documents in it have the same shard key value. The score value in our previous example is susceptible to this. Over time, many customers may have a score of, say, 720. But once a chunk contains only those, Mongo won’t be able to split it, because it won’t find a splitting point between 720 and 720… That’s bad. It creates a giant chunk, and the situation only worsens as more and more documents have that key value, because they all end up on the same shard. More load will go to that shard relative to others, because Mongo can’t split that chunk and won’t move it either (balancing the load is based on chunk count). So it really pays to pick a key that has ample range in reality, with your expected documents. An integer such as score has a huge theoretical range, but when used for this problem domain of credit scores it doesn’t.

Chunks are the basis of the sharding mechanism in Mongo. They offer some clear benefits over other distribution mechanisms. There are even more advanced scenarios supported by chunks, such as tag-aware sharding. As an administrator, you can do a few things to affect chunk splitting and migration, taking over some control when needed. But sharding is largely automatic and self managing, letting you focus on writing and deploying your awesome apps.

For a video tutorial on configuring sharding and much more, see this Pluralsight.com course.

MongoDB 3.2 Goodies Coming Your Way: More ways to $unwind

The aggregation framework has a slew of new operators and pipeline improvements. One notable improvement is the more robust $unwind pipeline stage.

One of the main motivations for using Mongo is its flexible document model. Any document in a collection can have arbitrarily different fields and data types in fields. This is great for many reasons. But when it came to aggregating array elements it posed a few problems. Unwinding an array is done with the $unwind pipeline stage, but you could only specify the field to unwind.

Problem one was the variety across documents: some documents don’t have the array field at all. Consider a cake collection, containing documents about cakes.

> db.cakes.find()
{ "_id" : "pound cake", "recipe" : [ "butter", "flour", "eggs", "sugar" ] }
{ "_id" : "brownies", "makeup" : "brownie" }

Pound cake has a recipe field, containing ingredients. Brownies are defined in a different document schema, listing the main constituents of the brownie in a makeup field.

If we wanted to create a listing of possible cakes by ingredient, we’d write something like:

db.cakes.aggregate([
{$unwind:'$recipe'},
{$group:{_id: '$recipe', found_in: {$addToSet:'$_id'}}}
])

But this will skip brownies because brownies don’t have field recipe at all. We’d have to synthetically add a recipe field in order to fix that like so:

db.cakes.aggregate([
{$project:{ fixed: {$ifNull: ['$recipe', ['-']]}}},
{$unwind:'$fixed'},
{$group:{_id: '$fixed', found_in: {$addToSet:'$_id'}}}
])

Adding an array with the single element ‘-‘ (or whatever token you want) was necessary because $unwind insisted that the field being unwound would both be an array type and contain an element. No element, not an array? $unwind didn’t emit the document at all.

But with 3.2, $unwind would allow you to emit a document placeholder even if there is no field or the field contains null. You activate this option by adding a second argument to $unwind named preserveNullAndEmptyArrays with a value of true or false. If you don’t specify the extra argument, then no document is emitted for null or empty array fields. This allows us the more concise expression

db.cakes.aggregate([
{$unwind: {path: '$recipe', preserveNullAndEmptyArrays: true}},
{$group: {_id: '$recipe', found_in: {$addToSet: '$_id'}}}
])
// result of this aggregation:
{ "_id" : "sugar", "found_in" : [ "pound cake" ] }
{ "_id" : "eggs", "found_in" : [ "pound cake" ] }
{ "_id" : "flour", "found_in" : [ "pound cake" ] }
{ "_id" : null, "found_in" : [ "brownies" ] }
{ "_id" : "butter", "found_in" : [ "pound cake" ] }

Using preserveNullAndEmptyArrays will emit documents that have either a null value, an empty array, or no such field at all. The aggregation example above produces a result with _id null for all cakes that don’t have a recipe field.

Nice.

But variety doesn’t stop at existence or non-existence of fields. What about a field that contains an array in some documents, but is a straight out string in another?

Consider these 3 cakes:

{ "_id" : "princess", "makeup" : [ "sponge", "jam", "sponge", "custard", "sponge", "whipped-cream", "marzipan" ] }
{ "_id" : "angel cake", "makeup" : [ "sponge", "whipped-cream", "sponge", "icing" ] }
{ "_id" : "brownies", "makeup" : "brownie" }

The first two contain an array for their multi-part “cakiness” in the makeup field. But the brownie, well.. it’s made out of brownie! The value of the makeup field there is just a string.

$unwind used to error out on this condition. If it encountered any document whose field was not an underlying BSON array type, it would halt the pipeline and throw an error. But not anymore!

Running a straightforward aggregation:

db.cakes.aggregate([
{$unwind: {path: '$makeup'}},
{$group: {_id: '$makeup', makes: {$addToSet: '$_id'}}}
])
// results in the following:
{ "_id" : "marzipan", "makes" : [ "princess" ] }
{ "_id" : "sponge", "makes" : [ "princess", "angel cake" ] }
{ "_id" : "icing", "makes" : [ "angel cake" ] }
{ "_id" : "jam", "makes" : [ "princess" ] }
{ "_id" : "custard", "makes" : [ "princess" ] }
{ "_id" : "brownie", "makes" : [ "brownies" ] }
{ "_id" : "whipped-cream", "makes" : [ "princess", "angel cake" ] }

Is perfectly acceptable in the new unwinding world. When $unwind sees a field with a non-array value, it treats it as an array with that single value, and emits a single document containing that value. No more error. Since in the past such a data condition would have produced an error, your legacy code should largely work OK after upgrade. If it didn’t error, then this condition wasn’t present - presumably clean data, or you took the time to fashion fancy $match clauses or expressions that prevented single-valued fields from entering the $unwind stage.

Nice.

This feature adds nuances to a common scenario. Initially, a document schema contains only a single value because it is version 0.9 of the software, and the use case was simple. Later on, we discover that a single value doesn’t cover the future feature and convert the field to an array. But then we may have some limbo time when some documents are saved with one schema, and others with another. Either way, we can now aggregate across that field and handle single-valued fields as an array with a single value, without the need for an elaborate $project or $match.

Oh, and one more thing:

We may have a need to figure out how cake composition varies across cakes. For that, I want to aggregate around the ordinal position of a component across cakes. I’d like to get an idea, across the layer 1’s, layer 2’s, etc., of how much variation there is. To assist with that, $unwind has a new option to emit the array offset alongside the value.

> db.cakes.aggregate([
{$unwind: {path: '$makeup', includeArrayIndex: 'offset'}},
{$group: {_id: '$offset', options: {$addToSet: '$makeup'}}}
])
// returns these results:
{ "_id" : NumberLong(6), "options" : [ "marzipan" ] }
{ "_id" : NumberLong(5), "options" : [ "whipped-cream" ] }
{ "_id" : NumberLong(4), "options" : [ "sponge" ] }
{ "_id" : NumberLong(3), "options" : [ "custard", "icing" ] }
{ "_id" : NumberLong(2), "options" : [ "sponge" ] }
{ "_id" : NumberLong(1), "options" : [ "jam", "whipped-cream" ] }
{ "_id" : NumberLong(0), "options" : [ "sponge" ] }
{ "_id" : null, "options" : [ "brownie" ] }

Which lets us see that our cakes thus far have sponge layers and a variety of fillings in between for the first 5 components (offsets 0 - 4).

When we evaluate our customer schema and use cases, we also spend time ensuring that we can produce requisite (and sometimes intricate) queries using available Mongo server syntax. The 3.2 release includes an improved $unwind and a bevy of other operators which helps us cover more use cases in a more compact and efficient way.

MongoDB 3.2 Goodies coming your way: Schema Validator

Mongo has a flexible document model, allowing you to pretty much save any data you want, any way you want, without any ceremony around declaring your document structure ahead of time. This makes some folks uneasy. After all, if there’s no schema structure enforcement, what prevents humans from making silly programming errors like naming a field “wrong” or storing a number as a string and not an integer? The data integrity or coherency surely can kick you in the rear binary area.

On the other hand, documents are really cool. The inherent flexibility lets us ingest arbitrary variety of data and deal with fields and sub-documents as first class citizens with rich query and update operators. This is not something we want to give up.

So what to do?

How about a compromise: For a given highly important collection, Mongo will let you declare a mini-spec for acceptable documents. A schema(ish) declaration which describes specific fields which must exist and which should be of a certain type. This is called a validator. A validator is a rule that helps you ensure that documents in a collection all adhere to that rule.

Let’s start with an example. Your company has customers. You do business on the internet. You figure every customer must have an email address. To ensure that, you bust open a Mongo shell, and create a validator like so.

db.createCollection("customer", {
validator: {
email: { $exists: true }
}
})

The collection named “customer“ is now created, and a validator has been assigned to it. The validator simply states that a field named email must exist on documents stored in the customer collection.

Let’s try and add some documents:

> db.customer.insert({_id: 1, email:'bob@example.gov'})
WriteResult({ "nInserted" : 1 })
>
> db.customer.insert({_id: 2, name: 'iggy'})
WriteResult({
"nInserted" : 0,
"writeError" : {
"code" : 121,
"errmsg" : "Document failed validation"
}
})
>
> db.customer.insert({_id: 3, emailAddress: 'bob@example.gov'})
WriteResult({
"nInserted" : 0,
"writeError" : {
"code" : 121,
"errmsg" : "Document failed validation"
}
})
>
> db.customer.insert({_id: 4, email: 42})
WriteResult({ "nInserted" : 1 })
>
> db.customer.insert({_id: 5, email: null})
WriteResult({ "nInserted" : 1 })
>

Line 1 shows that adding a document with an email field works. No problem, no surprise.

Line 4 shows that attempting to add a document with no email field fails. We expected that. It’s what we validated for: an email field must exist.

Line 13 shows that attempting to add a document with a different field that happens to contain an email is not allowed. Ok, makes sense. Mongo is not fishing around to find some other field. We just told it that the field email must exist!

Line 22 shows that our rule is a bit weak: we can have a document that has a field email, but with a numeric value. Will fix this in a bit.

Line 25 points out another weakness: we can add a document that has the field, but with a null value. Clearly, not our intent.

Well, with what we’ve learned from these tests, we want to improve our rule. What we want is an email field, that has a value, which is an email address. So let’s craft a validator for that:

var newRule = { email: { $exists: true, $type: "string", $regex: /^\w+@\w+\.gov$/ } }
db.runCommand( { collMod: "customer", validator: newRule} )

The new rule we created states that an email field must exist. It should contain a string data type (remember, BSON has precise data type knowledge, so we can ensure the exact type we want). The new rule also says that the email address must follow the syntax stated in the given regular expression. The regex ensures that our email address indeed looks like a valid one (here, one ending in .gov). It also - by nature of requiring some characters - ensures that the field is not empty. Null doesn’t match that regex. Neither will a number.

This time, I applied the rule to an existing collection. The collection already exists, so I can modify it and assign the new validator to it. This new validator replaces the old validator.

So let’s try a few inserts then:

> db.customer.insert({_id: 6, email:'nancy@agency.gov'})
WriteResult({ "nInserted" : 1 })
>
> db.customer.insert({_id: 7, email: 'bob@gmail.com'})
WriteResult({
    "nInserted" : 0,
    "writeError" : {
        "code" : 121,
        "errmsg" : "Document failed validation"
    }
})
>
> db.customer.insert({_id: 8, email: 42})
WriteResult({
    "nInserted" : 0,
    "writeError" : {
        "code" : 121,
        "errmsg" : "Document failed validation"
    }
})
>
> db.customer.insert({_id: 9, email: null})
WriteResult({
    "nInserted" : 0,
    "writeError" : {
        "code" : 121,
        "errmsg" : "Document failed validation"
    }
})
>
> db.customer.insert({_id: 10 })
WriteResult({
    "nInserted" : 0,
    "writeError" : {
        "code" : 121,
        "errmsg" : "Document failed validation"
    }
})
>
>

Only the document with the email address “nancy@agency.gov” was inserted. The document with the email “bob@gmail.com” doesn’t match the pattern because it doesn’t end with “.gov”. A document with an email field containing a numeric or null value is rejected. And of course, a document with no email field at all is rejected. So our new rule works! It is a bit overly verbose: we don’t really need the $exists and $type criteria, because the $regex criterion subsumes them. But here I wanted to show how you can apply multiple rules to one field.
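
In fact, a leaner version of the same rule could rely on the $regex criterion alone; a quick sketch of that variation (mine, not from the original walkthrough):

// the $regex criterion by itself implies the field exists and holds a matching string;
// a missing, null, or numeric email still fails to match
var leanRule = { email: { $regex: /^\w+@\w+\.gov$/ } }
db.runCommand( { collMod: "customer", validator: leanRule } )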

The matching criteria in a validator can be any of the query operators you use with the find() command, with the exception of the $geoNear, $near, $nearSphere, $text, and $where operators. We can also combine validation rules with the logical $or or $and operators to create more complex scenarios.
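
For example, a hypothetical $and combination (my own illustration, not part of the original walkthrough) could require both a well-formed email and a name of string type:

// hypothetical combination: a well-formed .gov email AND a name of string type
var andRule = { $and: [
    { email: { $regex: /^\w+@\w+\.gov$/ } },
    { name: { $type: "string" } }
] }
// it would be applied the same way: db.runCommand( { collMod: "customer", validator: andRule } )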

Here’s a scenario that aims to make sure customer documents adhere to some credit score and status rules:

var newRule = { $or: [
    {
        $or: [
            {creditScore: {$lt: 700}},
            {creditScore: {$exists: false}}
        ],
        status: 'rejected'
    },
    {
        creditScore: {$gte: 700},
        status: {$in: ['pending','rejected','approved']}
    }
]}
db.runCommand( { collMod: "customer", validator: newRule} )

We’re trying to ensure that

  • A customer with creditScore less than 700 or no creditScore at all is in the ‘rejected’ status.
  • A customer with a creditScore at or above 700 is in one of the 3 statuses: rejected, pending, or approved.

The validator itself is a bit more involved, including a nested $or combination which lets us validate several alternative acceptable states of a document.

To test this, let’s try a few inserts:

> db.customer.insert({_id:11, status: 'rejected'})
WriteResult({ "nInserted" : 1 })
>
> db.customer.insert({_id:12, status: 'approved'})
WriteResult({
    "nInserted" : 0,
    "writeError" : {
        "code" : 121,
        "errmsg" : "Document failed validation"
    }
})
>
> db.customer.insert({_id:13, creditScore: 700, status: 'approved'})
WriteResult({ "nInserted" : 1 })
>
> db.customer.insert({_id:14, creditScore: 300, status: 'approved'})
WriteResult({
    "nInserted" : 0,
    "writeError" : {
        "code" : 121,
        "errmsg" : "Document failed validation"
    }
})
>
> db.customer.insert({_id:15, creditScore: 300})
WriteResult({
    "nInserted" : 0,
    "writeError" : {
        "code" : 121,
        "errmsg" : "Document failed validation"
    }
})
>
>

A customer with no creditScore field at all is possible, as long as the status is ‘rejected’. One with a status ‘approved’ but no creditScore field is invalid.

A customer with a creditScore of 700 and a status ‘approved’ is valid and Mongo accepts it.

A customer with a creditScore of 300 and a status ‘approved’ is not valid, since those with this low score should have a status ‘rejected’.

A customer with a low creditScore alone, without a status at all, is not accepted either.

By now, I have updated the validator definition on the collection several times. Which begs the question: what about existing documents in the collection? The answer is: nothing. Mongo will not reject a new validator because of existing documents. It will not delete, fix, or otherwise flinch about documents already in the collection. Mongo applies the validator to mutations only: updates and inserts.

What happens when you update a document that already existed? That depends on the validationLevel you mark your validator with. When you apply a validator to a collection, you can add a validationLevel field with a value of ‘off’, ‘strict’, or ‘moderate’. The default validation level is ‘strict’. Under strict validation, every insert or update must pass validation. Under the moderate level, validation is applied when inserting a document and when updating an existing, valid document; an update to an existing but invalid document does not undergo validation.
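
As an aside (my own sketch, not from the original post), the validator and its validationLevel don’t have to be bolted on after the fact with collMod; they can also be supplied up front when a collection is created:

// hypothetical "lead" collection created with its validator and level in one shot
db.createCollection("lead", {
    validator: { email: { $exists: true } },
    validationLevel: "moderate"
})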

Earlier on, we inserted documents that have no creditScore and no status fields. Under our latest rule with strict validation, we can’t update them unless the update makes them adhere to the current rule. But if we reassign the validator with a validationLevel of ‘moderate’, we can. For example:

> db.runCommand( { collMod: "customer", validator: newRule, validationLevel: 'moderate'} )
{ "ok" : 1 }
// Existing document is
// {_id: 4, email: 42}
> db.customer.update({_id:4},{$set: {email: 'bob@example.com'}})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
// Now the modified document is
// {_id: 4, email: 'bob@example.com'}

The new validator assigned to the collection is the same as our latest credit score one. We just added the validationLevel option to exempt existing, invalid documents from validation when they are updated. This lets me change the existing document’s email field to an email string. Under the strict validation level this would not be allowed, because that document has no creditScore field and no ‘rejected’ status either.
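
For contrast, here is a quick sketch (not part of the original transcript) of what the strict level does with that kind of update; the expected outcome is noted in comments:

// temporarily switch back to the strict default
db.runCommand( { collMod: "customer", validator: newRule, validationLevel: 'strict'} )
// another old, invalid document: {_id: 5, email: null}
db.customer.update({_id: 5}, {$set: {email: 'iggy@example.com'}})
// expected: writeError code 121, "Document failed validation", because the document
// still has no creditScore and no 'rejected' status, and strict re-validates it
// restore the moderate level before moving on
db.runCommand( { collMod: "customer", validator: newRule, validationLevel: 'moderate'} )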

It’s useful to know what the current validator is (or any other collection-specific setting). To do that, use the shell function db.getCollectionInfos({name: 'collection_name'}) like so:

> db.getCollectionInfos({name: 'customer'})
[
    {
        "name" : "customer",
        "options" : {
            "validator" : {
                "$or" : [
                    {
                        "$or" : [
                            {
                                "creditScore" : {
                                    "$lt" : 700
                                }
                            },
                            {
                                "creditScore" : {
                                    "$exists" : false
                                }
                            }
                        ],
                        "status" : "rejected"
                    },
                    {
                        "creditScore" : {
                            "$gte" : 700
                        },
                        "status" : {
                            "$in" : [
                                "pending",
                                "rejected",
                                "approved"
                            ]
                        }
                    }
                ]
            },
            "validationLevel" : "moderate",
            "validationAction" : "error"
        }
    }
]

Which gives you a nicely formatted account of the current validator expression and the validation level.

Notice the extra field validationAction set to “error”? The validation action controls what the Mongo server does when a client attempts to write something that is not valid. The choices are ‘error’ or ‘warn’, with ‘error’ being the default. As the name implies, with ‘error’ the server returns an error to the client and the write is rejected. To roll in a new validator gradually, you might want to start with ‘warn’. With ‘warn’, a failed validation does not produce an error: the write goes through regardless of validation, and the validation failure is only logged to Mongo’s log file.

Applying a validation action is a matter of adding that field:

> db.runCommand( {
    collMod: "customer",
    validator: newRule,
    validationLevel: 'moderate',
    validationAction: 'warn'
})
{ "ok" : 1 }
>
> db.customer.insert({_id: 16, status: 'pending'})
WriteResult({ "nInserted" : 1 })
>

Here we apply our latest rule with a moderate validation level, and a validation action of ‘warn’. Under ‘moderate’ validation level, an insert will always be validated. But with a validationAction of ‘warn’, the write will be accepted and only a log entry will tell us that something went awry. The insert itself went fine as far as a connected client is concerned.
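
If you want to confirm that the warning actually landed in the server log without leaving the shell, the getLog admin command is one way to peek at recent log lines (a sketch; the exact wording of the message varies by server version):

// pull recent log lines from the server and keep only validation-related ones
var lines = db.adminCommand({ getLog: "global" }).log
lines.filter(function (l) { return /fail(ed)? validation/i.test(l) })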

Before you declare complete victory on the side of corporate auditing and the proponents of schema lock-down, there’s something else you should know: a client performing an update or insert can easily ignore all this validation goodness. All you have to do to circumvent validation is say so. When a client adds a bypassDocumentValidation option with the value true, the validator will not kick in and won’t prevent any writes, under either the strict or the moderate level. This goes something like:

> db.customer.insert({_id: 17, status: 'arbitrary_junk'}, { bypassDocumentValidation: true})
WriteResult({ "nInserted" : 1 })
/* log entry: ...Document would fail validation collection: validators.customer doc: { _id: 17.0, status: "arbitrary_junk" }
*/

Which will produce a warning in the log, but not prevent the operation. So this whole feature should be deemed a collaborative tool to improve data quality, not an enforcement hammer in the hands of the person controlling the collection.

This is a pretty interesting new feature. It straddles the line between a declared, strict schema and a flexible, software-defined document model. It can help protect against some of the dangers of a flexible document model, yet leaves open the ability to operate without the harsh restrictions that a pre-defined schema engine would impose.

If you are interested in learning more, let us know!