The aggregate command (and its predecessor, find) takes some time to get used to. This is seen as a barrier to entry both for those coming from a relational background and for those who are early in their journey of MongoDB adoption.
With all the new AI and awesome natural language tools, can we get some help for generating queries? “Yes!” says the Compass team.
Announced at the .Local event in London, Compass is rolling out a natural language text prompt which translates your natural language input into a properly stated query using either the find syntax or an aggregation pipeline (MQL).
I ran some experiments to kick the tires on this feature.
The following are prompts and results I ran against a collection with documents like the one below. Documents have different field values. The fields in the contact sub-document are sparse: a document may have one, two, or all of the fields cell, email, and fax. If missing, these fields are absent (not assigned null).
{ |
How many Elmers are there?
[ |
Success. The query generated answers the prompt as intended. Note that the name of the field is nowhere in the prompt. The tool understood that we're looking for Elmers (a proper name, plural) and figured it should look in the field name. Nice!
What are the 3 most common names of prospects with the status “customer” who have an email in the system?
[ |
Correct. The AI understood the AND filter for documents with the status set to "customer" and an email existing. It used the common trope for "top N by V", which groups by the value of the field (name) and produces a count per grouping, then $sorts by the count descending, then $limits to the intended number.
I noted that the generated queries use the old-school {$sum: 1} trick to produce the count. Understandable, since Compass may be used against older MongoDB versions that lack the $count group accumulator operator, but this is a bit more cryptic for the untrained reader of the resultant pipeline.
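The two spellings are interchangeable on MongoDB 5.0 and later. Here's a plain-JavaScript sketch (not mongosh; the sample names are made up) of what either group stage computes:

```javascript
// Two equivalent spellings of a per-group count in an aggregation:
const oldStyle = { $group: { _id: "$name", count: { $sum: 1 } } };
const newStyle = { $group: { _id: "$name", count: { $count: {} } } }; // 5.0+

// A tiny simulation of what either stage computes over sample documents:
function groupCount(docs) {
  const counts = {};
  for (const d of docs) counts[d.name] = (counts[d.name] || 0) + 1;
  return counts;
}

const docs = [{ name: "Elmer" }, { name: "Dana" }, { name: "Elmer" }];
console.log(groupCount(docs)); // { Elmer: 2, Dana: 1 }
```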
Let’s try some more:
Which customers have an odd id number, and a name that starts with D?
[ |
Nicely interpreted. The $mod operator is a bit obscure, but fits the bill. The name is compared with a regex match on the expression /^D/. Left-anchored regex matches are the lesser evil compared to non-anchored matches, and can use indexes. But maybe a range expression such as {name: {$gte: "D", $lt: "E"}} would perform just as well?
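For capitalized names the two filters select the same documents. A quick plain-JavaScript sketch (the sample names are made up) shows why the range form can stand in for the anchored regex:

```javascript
// Left-anchored regex vs. a range filter for "name starts with D".
// The range form maps to index bounds [ "D", "E" ), much like the
// anchored regex does for capitalized names.
const names = ["Dana", "Dave", "Elmer", "Carl", "Derek"];

const byRegex = names.filter(n => /^D/.test(n));
const byRange = names.filter(n => n >= "D" && n < "E");

console.log(byRegex); // [ 'Dana', 'Dave', 'Derek' ]
console.log(byRegex.join() === byRange.join()); // true
```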
I decided to change one little thing in the query: switch from an upper-case D (which, by the way, is how my data is - names are capitalized in the dataset) to a lower-case d.
Which customers have an odd id number, and a name that starts with d?
[ |
Interestingly, the tool decided to add a $project stage to the pipeline. Re-reading the prompt, I can see that I under-specified which fields to return. But what caused the previous query to not $project at all and this one to do so is a mystery to me. Not strictly wrong, but a mystery nonetheless.
A fix to this is to explicitly specify what to show. The prompt Which customers have an odd id number, and a name that starts with d? Give me their name labeled "Customer Name" and their id labeled "Customer Id" only. indeed generated the expected pipeline.
[ |
The projection renames the fields as the prompt described.
My standing conclusion is to be as explicit and concise as possible with the prompt. I guess this should be expected.
How likely are people to have both a cell phone and a fax number?
[ |
Chalk this one up to "not correct". The prompt's intent was to generate a ratio or number representing the chance or likelihood of a document containing both fields. The generator understood the AND condition correctly in the $match stage. But it failed to produce an aggregation that compares the count of documents matching the criteria vs. those which do not.
Many people speak English as a second language. But many do not. How about typing in your native tongue?
I gave it a whirl with Hebrew. It is not remotely a popular language, so I can't imagine the team built specific support for it.
This query asks to find those who are named "יוסי".
מצא לי מי ששמם יוסי
{"name": "יוסי"} |
The generator correctly figured out the pluralized word for "name" in my query, and determined the field to search against is name. It understood the value sought was "יוסי", and kept the text as is - again, correct. It did not try to translate the value into English (Yossi), but that's a good thing. I would expect that if I knew the name was in English, I'd maybe type a hybrid question such as
מצא לי מי ששמם Elmer
{"name": "Elmer"} |
Compass did the needful and produced my expected query. A note here is that BiDi editing isn't perfect... Hebrew is written right to left, and the textbox moved the word "Elmer" to the very "end" of the sentence from an English perspective, but a native Hebrew speaker would expect it on the very left - the end of the sentence from a right-to-left perspective. But the results were correct, and I won't argue with "it worked."
How about some aggregation? The Hebrew query asks how many people have an email but not a cell phone.
לכמה אנשים יש דואל אבל אין סלולרי
[ |
Correct, again. The Hebrew prompt used the acronym for "email" (דואל) and the common word for "mobile phone" or "cell phone", which is "cellular" (סלולרי). Not only did it match the terms to the fields cell and email, it also correctly figured out that the fields live in the contact sub-document and generated the dot-path to them.
I have reservations about using natural language to describe queries. For one, spoken language was not designed to be concise, and the range of misinterpretation that can be caused by imprecise language, as well as ambiguities within the language itself, can yield wild results. I was surprised how well this new feature performed. Color me impressed. Not only did the generator create my intended queries, it picked up on nuances I did not imagine it would.
Can there be test cases where it comes up with the wrong answer? Yeah, sure. But it performed remarkably well on what I gave it, and in doing so proved it is suitable for a wide enough audience with a wide enough set of reporting and analysis needs.
Am I going to use this daily? Time will tell. For simple queries? Maybe. For more complex ones? Maybe also - I can see it creating a skeleton query for me to approve or refine. Does this replace learning and knowing the breadth and depth of MQL? No. But for newcomers and occasional analysis this may just do the trick.
Give it a try, see what you think!
Wildcard indexes have improved in several ways in MongoDB 7.0.
Notably, wildcard indexes can now be compound, including additional non-wildcard fields.
Previously, a wildcard index acted only as a single-field index on each individual target field, with those fields stored together in a single index.
Newly supported is the ability to include both wildcard fields and named individual fields in the same index.
This capability supports common use cases such as the attribute-pattern document design, where a sub-document contains a set of keys but the presence of the keys varies from document to document.
Consider the case of the following documents describing a prospect record in a CRM system:
{ _id: 35, name: 'Francis', status: 'customer', contact: {} } |
One document may have both a cell and a fax, while another has only a cell phone number. A wildcard index on contact.$** would allow a single-field match against one of the contact fields.
But a query filtering on the status field and one of the wildcard fields would not be supported without a compound wildcard index.
First, create an index by supplying both a wildcard expression and one or more additional plain (non-wildcard) fields.
db.prospect.createIndex({ "contact.$**": 1, status:1 }, {name: "idx_compound_wild"}) |
As with other compound indexes, the key structure supports key prefixes that include the first, or first and second fields.
This query leverages the compound index:
db.prospect.find({'contact.cell':'07408 926850', status:'customer'}) |
The query includes a full prefix of the index: a single wildcard field, contact.cell, and the non-wildcard field status.
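My mental model of the keys such a compound wildcard index stores - a hypothetical sketch for illustration, not the actual on-disk format - is a composite of the wildcard path, the value at that path, and the trailing non-wildcard field. A full-prefix query pins all three:

```javascript
// Hypothetical composite keys for { "contact.$**": 1, status: 1 }:
// [ $_path, value-at-path, status ]. Illustrative only.
const keys = [
  ["contact.cell", "07408 926850", "customer"],
  ["contact.cell", "07700 900123", "lead"],
  ["contact.fax", "0121 062 9173", "customer"],
];

// The query { 'contact.cell': '07408 926850', status: 'customer' }
// pins the full key prefix, so the scan is maximally narrow:
const hits = keys.filter(
  ([path, value, status]) =>
    path === "contact.cell" && value === "07408 926850" && status === "customer"
);
console.log(hits.length); // 1
```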
Running .explain(true) on this query shows a winning plan stating that our index named idx_compound_wild was used. The indexBounds entry shows that both the contact.cell and status fields were considered when processing the query against the index.
"winningPlan": { |
Index intersection is a strategy where more than one indexed field is stated in a filter, and the result is computed by performing a set intersection on the keys of two indexes.
Intersection seems to make sense when two separate simple indexes are at play on two separate fields. What about wildcard indexes? Will the MongoDB query planner choose to "intersect" the same wildcard index with itself somehow? Can we expect it to be smart about scanning for two values, one found under the path contact.cell, the second under a different path, contact.fax - both of which are contained in the same index? The short answer: no!
Consider the following query:
db.prospect.find({ 'contact.email': 'madtiwvab@what.not', 'contact.fax': '0121 062 9173'}) |
The query filters on two separate wildcard fields, email and fax.
The query planner shows the following output:
"winningPlan": { |
The above is a bit awkward. The filter term mentions the contact.email field alone. The indexBounds field mentions the contact.fax field only. Though one might intuit that the strategy is to use the index to scan for contact.fax first, then filter the entries by contact.email, it is neither clear that this is the case nor expected. One would expect that the index contains the list of documents under either of the keys, and therefore an AND query would hit the index key structure twice and do some nested-loop join or something similar.
Let’s try specifying three of the wildcard fields in the index, and see if things become any clearer:
db.prospect.find({ |
The plan shown below seems to imply the same general strategy: the index bounds mention only one of the three query terms, and the filter mentions the other two.
"winningPlan": { |
Well, the plan is not lying. MongoDB does indeed choose only one of the query terms to serve as the value sought in the index. It evaluates the selectivity of the terms in the query, and picks one it deems most suitable.
This still doesn't answer the question of why it doesn't compare the other terms against the index as well. Which brings up the topic of index intersection. MongoDB introduced an implementation of index intersection at some point, then walked back its default behavior to a degree that pretty much prevents intersection from ever being exercised. It turns out the performance gains were rather low for intersection candidates compared to other strategies. Worse: the planner would often pick intersection in cases where other strategies would have performed far better. In other words: index intersection turned out to be the wrong answer too often, and would actually hurt performance.
But this is a single index, you say - wouldn't intersection make sense in this particular wildcard scenario? The way I like to think about it, a wildcard index is actually several indexes in a trench coat pretending to be a single index. Each wildcard field's values are processed into their own index structure, but all of those structures are stored together as one. A hint to this notion is that the planner outputs the $_path term, stating which wildcard field within the index is targeted.
The addition of non-wildcard fields to the index would then shape the key prefixes to contain the additional non-wildcard value so that the index is more selective and the (theoretical) need for intersection is reduced.
Compound wildcard indexes are a welcome addition to MongoDB 7.0. Having no intersection mechanism doomed many queries to only partially use indexes and resort to loading documents based on a single field, then filtering them further. With this feature, the number of documents loaded can be reduced. Reducing I/O should help many workloads. Whether it does for your workload is subject to - as always - experimentation and monitoring. Give it a try, hope it helps!
For limitations and further details, see Compound Wildcard Indexes.
This is often the case when it comes to network access security: we are quick to open up network access "just for now", but fail to close things down later. This increases the exposure surface of the protected resource (a MongoDB Atlas cluster, for example) indefinitely. But it doesn't have to be that way.
TL;DR: Use expiring network access permissions for IPs instead of permanent ones.
For a while now, Atlas has had a simple measure that helps clean up temporary access grants. Instead of a network access permission lasting until you explicitly delete it, you can set it to expire and be removed automatically.
Flip the toggle at the bottom left to enforce a time limit on the access grant. Pick a time frame: 6 hours, one day, or a week - the shorter the better. Atlas will automatically clean up the entry and close the IP access after that duration elapses.
Some might ask “But what if I want to keep it open longer?”
Here’s my cheat-sheet:
IP Address Space | Access Type | Auto-Expire? |
---|---|---|
A. Private (10. …, 192.168. …) | Production Application | No |
B. Private (10. …, 192.168. …) | Interactive | Yes! |
C. Public IP / Block | Interactive | Yes! |
D. Public IP / Block | Production Application | Yes! |
This cheat sheet takes the paranoid position "Don't allow public IP access to Atlas." But paranoia is good for security.
Here are some supporting thoughts. They all share the following core belief:
Sacrifice convenience to gain security
With that in mind, let's talk about interactive vs. application access.
An interactive user is a human running MongoDB Compass, or the shell, or a program during development.
This is the danger zone. "I just need to do my job now" motivates allowing such traffic. But tomorrow is not now anymore... better to expire it. If you are the grantor (you have an Atlas role allowing you to add the network entry), then you can just as well grant yourself permissions again after expiration. Sacrifice convenience.
If the interactive user does not have access-control privileges, the nuisance of provisioning access is higher. But then the question arises: why is access to Atlas done over the public IP?
Why expire an interactive IP anyway? Because it provides another surface for would-be attackers. Interactive users' machines are exposed to compromise via channels such as email, malware, random downloads, and just plain old neglect.
An application is a program or process running unattended.
Seek to route traffic via either peering or a private link. This would turn the traffic into non-public traffic, and therefore fall under use case #1.
Alternatives to automatic expiration have existed for a while. Shun the "I'm personally responsible, so it's OK for me to have permanent access" argument.
If the DBaaS or database doesn't auto-expire, a good sysadmin or DevOps practice would automate tearing down old entries. Still valid: the Atlas management API allows creating and removing network access entries, so that's a viable option.
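As a sketch of what such automation might submit, here is a payload for an access-list entry that self-expires. The shape follows the Atlas project IP access list API's deleteAfterDate convention, but treat the exact field names as assumptions to verify against the current Atlas documentation:

```javascript
// Build an access-list entry that expires in 6 hours.
// (Field names follow the Atlas project IP access list API as I
// understand it - verify against the Atlas docs before relying on them.)
const expireAt = new Date(Date.now() + 6 * 60 * 60 * 1000);
const entry = {
  ipAddress: "203.0.113.7", // example public IP from the TEST-NET-3 range
  comment: "temporary interactive access",
  deleteAfterDate: expireAt.toISOString(),
};
console.log(entry.deleteAfterDate > new Date().toISOString()); // true
```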
Manual "expiration" entails manual auditing and manual removal of such entries. This is both error-prone and less reliable in general. But if that is all your organization is willing to do for security - so be it. Chalk that one up as "acceptable within parameters".
As you can surmise by now, the approach in this post strongly shuns permanence, especially for public IP space.
What about permanence within private space? Those entries should also be expired! But in this case, automatic expiration is a bit trickier. A production application losing connectivity is a risk not many would take. There can be arrangements that either audit and expire entries with no observed traffic, or renew/re-provision access on a cadence as long as some sign of life is detected. But such automation is a chance for error, with higher risk to the availability of the production application. So let practicality win here: revert to manual auditing of such entries. Yes, they are internal traffic, but still another open vector.
Bottom line: Your data is priceless, consequences are high - keep it as safe as possible. Sacrifice convenience to gain security.
Schedule a periodic rollup and aggregation of historic data into a collection using $merge, then run ad-hoc reports that query cumulative live data for the current period and $unionWith the historic data.
We often need to produce reports summarizing sales over time which include the current partial period, such as "sales this year to date". While it's perfectly fine to run ad hoc queries on the data set, this can become a performance issue when report frequency increases, data sets become extremely large, or the periods are very long (e.g., all sales to date since inception).
It's been common practice to pre-aggregate historic data and persist it to some collection. This lets us query only the most recent, yet-to-be-rolled-up period and add it to the historic data queried from the cumulative collection. A trivial concept, nothing terribly new here. So where's the catch? Details (it's always the details…):
sequenceDiagram
    autonumber
    participant a as Query
    participant ms as [monthlySales]
    participant s as [Sales]
    a->>+ms: January - July?
    ms-->>-a: <>
    a->>+s: Current Month?
    s-->>-a: < >
    a-->>a: Union
The implementation of this technique relies on two high-level operations: the $merge and $unionWith pipeline stages.
The technique MongoDB used to offer for persisting historic data was mapReduce, in the old days. Very old days. Using mapReduce allowed you to persist the results into a collection entirely server-side. Before (or without) that, you'd need client code to compute a result set, then turn around and explicitly write it back into MongoDB.
With the introduction of the Aggregation Framework, more efficient query operators arrived, as well as the $out pipeline operator. The issue with $out, though, was that it did not play very well with sharded clusters and did not allow incremental data insertion. Instead, $out wiped the output collection clean and replaced its content with the new data. This implied that for a long history you would have to compute again over a growing number of documents - not very efficient, and it exacerbates the problem we were trying to avoid in the first place. You could of course run a query across just one year, say, and store the result into a yearly collection. But then summing across years would require more and more collections to be queried independently by the client, and you'd end up with many more collections just to handle this case. Not elegant.
Now, with $merge, the story becomes manageable again. $merge lets you perform incremental updates to an existing collection, so your historic data can largely remain untouched, save for the newest addition or update to the latest period.
In the example below, the sales collection contains a document per sale, with total and date fields:
{ "date" : ISODate("2021-08-10T23:54:01.028Z"), "total" : NumberDecimal("178.4349") } |
To produce the sum of sales and the count of sales for August, we run an aggregation, directing the result into the monthlySales collection:
db.sales.aggregate([ |
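The pipeline body is elided above; as a stand-in, here is a plain-JavaScript simulation of the rollup's grouping step (the year-month key format and output field names are my assumptions for illustration):

```javascript
// Group sale documents by year-month, summing totals and counting sales -
// roughly what a $group on a month-truncated date would compute.
function rollupMonthly(sales) {
  const out = {};
  for (const s of sales) {
    const key = s.date.toISOString().slice(0, 7); // e.g. "2021-08"
    out[key] = out[key] || { totalSales: 0, saleCount: 0 };
    out[key].totalSales += s.total;
    out[key].saleCount += 1;
  }
  return out;
}

const sales = [
  { date: new Date("2021-08-10T23:54:01Z"), total: 178 },
  { date: new Date("2021-08-11T10:01:00Z"), total: 22 },
];
console.log(rollupMonthly(sales)); // { '2021-08': { totalSales: 200, saleCount: 2 } }
```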
The shape of the document in monthlySales is verbatim the document(s) produced by the pipeline stage preceding $merge:
// db.monthlySales.findOne() |
What happens, though, if it is still August? A previous value for August may already exist. We have to address what should be done when a document with the same _id is not present, as well as what to do if it is.
When a document is not present, $merge defaults to inserting it. But you can also tell it to discard the new document, or to produce an error. For example, you may want to prevent accidental creation of documents in the target collection, so that nothing is written unless an approved, well-known document already exists. The full universe of choices is expressed via the whenNotMatched field, with a value of insert, discard, or fail. Note that in the case of fail, other documents may already have been inserted, so the state is somewhat non-deterministic - the operation is not wrapped in a transaction.
This example shows silently discarding any documents for which $merge found no match in the target collection. Someone would have had to create a document for August 2021; otherwise, the result for August is effectively ignored by $merge and nothing will be written.
{$merge: { |
When a document is matched, you have a few more choices. The default is to replace the existing document. The full list of options: replace, keepExisting, merge, fail, and pipeline. You can choose to keep the existing document to preserve things as they were. Replace and fail are pretty straightforward as well.
Specifying whenMatched: 'merge' will slam the fields from the new document into the existing document.
Document | Value |
---|---|
Existing | {_id:1, name: 'bob', age: 33} |
New Document | {_id:1, age: 34, height: 1.74 } |
Resultant Document | {_id:1, name: 'bob', age: 34, height: 1.74 } |
An existing field’s value is replaced and previously non-existing fields will be created. Fields not present in the new document which existed before are left as is.
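The table above maps directly onto JavaScript's object spread. A sketch of the whenMatched: 'merge' semantics on plain objects:

```javascript
// whenMatched: 'merge' behaves like a shallow field-by-field overlay:
// incoming fields win, new fields are added, untouched fields survive.
const existing = { _id: 1, name: "bob", age: 33 };
const incoming = { _id: 1, age: 34, height: 1.74 };

const merged = { ...existing, ...incoming };
console.log(merged); // { _id: 1, name: 'bob', age: 34, height: 1.74 }
```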
That's good for having multiple separate calculations affect some destination fields. But for an incremental update, we want to combine the value of the existing field in the target collection with a value from the new calculation on the origin collection.
Given the document in an origin collection, and a corresponding document already in the destination collection:
db.origin.insert({ _id:1, x:2}) |
We can now set the destination document's field x to the sum of its existing value and the value of field x from the preceding pipeline stage ($match, for simplicity, here):
db.origin.aggregate([{$match: {}}, {$merge: {into: 'destination', whenMatched: [{$set: {x: {$add: ["$x","$$new.x"]}}} ]}}] ); |
Using the special pipeline variable $$new, we can distinguish the incoming field x from the existing field x in the destination document.
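A plain-JavaScript sketch of what that whenMatched pipeline computes for one matched pair of documents (the sample values are made up):

```javascript
// "$x" resolves against the matched destination document;
// "$$new.x" against the incoming document from the pipeline.
function mergeWithPipeline(destinationDoc, newDoc) {
  return { ...destinationDoc, x: destinationDoc.x + newDoc.x };
}

console.log(mergeWithPipeline({ _id: 1, x: 3 }, { _id: 1, x: 2 })); // { _id: 1, x: 5 }
```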
The pipeline used to merge into existing documents can only contain the stages $addFields, $set, $project, $unset, $replaceRoot, or $replaceWith. This mini-pipeline only ever sees a single input document from the destination collection - the one that matched - so other pipeline operators really don't make a lot of sense. Only single-document modifiers are allowed - or at least that's how I make sense of it.
With all this syntax background, let's turn back to our original problem. We want cheaper queries over historical periods, yet combined with recent on-the-fly values. Storing the past data can be done at some frequency - monthly, weekly, or daily. Given that we can incrementally update the destination collection, we're free to pick a frequency that makes sense.
To combine documents for a partial month, though, we need to query two collections: the historic sales in the monthlySales collection, and the results of a month-to-date query computed on the fly from sales. It would be nice to do it in one command to MongoDB. Historically, we would fire two queries and combine the results client-side. But that's extra code someone had to write, and it makes populating generic graphs more involved.
With the introduction of $unionWith, a single aggregation command can now return results from more than one collection! Harnessing this for our scenario, we query the "main" sales collection for the current month-to-date, then union the results with a pipeline over monthlySales for the already-computed documents since the beginning of the year:
db.sales.aggregate([ |
Mic drop. A single query processed all on the server side, giving us everything we need in one shot.
Which leaves us with the small matter of granularity. How much of the work do we want to do on the fly vs. just digging up the computed data?
If we populate monthlySales monthly, after the end of each month, we'll need to query up to 31 days' worth of data on the ad-hoc side. If we update it daily, we'll have less than a day's worth to sum on the fly, with the onus of determining exactly what cutoff window separates the current data from the already-merged data. Whatever we choose, we just need to make sure the boundaries of the time windows don't overlap, so we don't count the same sale too many or too few times.
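One simple way to keep the boundaries safe is to treat every window as half-open, [start, end). A sketch (the dates are made up):

```javascript
// Half-open windows [start, end) count each sale exactly once:
// the rollup covers [monthStart, cutoff) and the ad-hoc query [cutoff, now).
const inWindow = (ts, start, end) => ts >= start && ts < end;

const monthStart = new Date("2021-08-01T00:00:00Z");
const cutoff = new Date("2021-08-25T00:00:00Z");
const now = new Date("2021-08-25T13:30:00Z");

// A sale stamped exactly at the cutoff lands in exactly one window:
const sale = { date: new Date("2021-08-25T00:00:00Z") };
console.log(inWindow(sale.date, monthStart, cutoff)); // false
console.log(inWindow(sale.date, cutoff, now)); // true
```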
To round things off: you might want to start accumulating pre-calculated data, but you already have a slew of individual documents in your live collection. You could write a client-side script to iterate over the months and years of the past. Or you could run a query to populate monthlySales, like so:
db.sales.aggregate([ |
There are more nuances, bells, and whistles that come along with $unionWith and $merge, but with the above techniques we can already cover a good portion of the historic+current scenarios. You will of course need a recent MongoDB version - $merge was introduced in 4.2, and $unionWith in 4.4.
Time series data is a fairly common use case and has been “solved” before. Indeed, in blog posts around the net as well as in my course on schema design and data modeling, a common approach to storing such data was discussed at length.
But not everyone knows that bit of common wisdom. Further, it’s easy to get caught up in the storage details and make silly little mistakes that wreck the exercise.
Someone at MongoDB had the brilliant idea to simplify this further (I should buy them a drink of their choice if I find out who it was). Instead of you painstakingly bucketing each "reading" or time series item, then un-bucketing it or peppering your queries with time-bucket-boundary match clauses - why not provide a new collection type for time series that does the tricky bits for you?
As far as the client is concerned, every document is a stand-alone point-in-time item which has a timestamp and some measurement - a reading - composed of one or more fields recording your data. If you insert a document, you insert one per reading. You don't have to push into arrays, or compute any time windows, or anything like that.
When reading a document from a time series collection, each point-in-time measurement returns as a single document - not a bucket with an array of readings within a window.
It appears largely un-magical, until and unless you compare it to what you had to do manually before this existed.
The "before" document was a bucket with multiple readings; for a 12-hour bucket with, say, 4 temperature readings in the temperature field per day, you would see something like:
|
Using a time series collection, the readings are presented to you individually:
|
You will note that the bucket granularity is nowhere in the results! You may have chosen buckets spanning seconds (the default), minutes, or hours. But that is a storage choice, and your queries and users don't need to care. This awesome arrangement lets users just be users: insert readings individually as they occur, read and summarize over time as if they were stored individually. On the surface this matches the business expectation and mental model of periodic measurements written and read as they are.
The internals are what's nice here. The "collection" presented to the user (sensings in the example above) is actually more of a view over internal collections that physically exist but that you do not directly interact with. The backing storage collections organize the individual readings into buckets of some fixed time-window size. If you choose hours granularity, there can be up to one bucket per hour. Multiple readings within the same hour all fall into the same bucket, and the next hour's readings fall into the next bucket. Except now you don't worry about structuring your document in any special way - Mongo does it for you!
At the time of writing, the internal physical backing collection is created when you declare your special time series collection. For my example collection named sensings, Mongo has created a backing collection system.buckets.sensings and stores physical documents under that namespace, in buckets.
show collections |
Oh, yes: you have to declare your special collection. If you let Mongo implicitly create a collection by inserting data into a yet-to-be-created collection, Mongo will assume a normal collection. The syntax requires one thing only - the timeField. I like to also include the metaField and granularity parameters, for reasons expanded on below.
To create my sensings collection, storing data at an ingestion rate that fits hours-long buckets, issue the following command:
db.createCollection("sensings", {timeseries: {timeField: "timestamp", metaField: "info", granularity: "hours"}}) |
The collection named sensings will be created (as well as a corresponding system.buckets.sensings behind the scenes). It treats the field named timestamp on your inserted documents as the time for bucketing purposes. It also lets you add extra non-measurement data, or "meta" details, in the field named info. The collection will optimize buckets at a granularity consistent with multiple readings arriving over a span of hours. For higher ingestion rates you should choose minutes or seconds.
What is the business with metaField and granularity about? They let Mongo optimize the bucket sizes internally and organize storage efficiently. The theory behind creating time buckets is that:
When doing things manually, we could shove metadata extras such as a sensor id or patient id into any field. But for Mongo to understand that the metadata is a label and not a metric, and to enable it to create buckets specific to groups of readings, we have to tell it where we store the metadata. In our case, I used the info field. In that field I can store a single value or several different fields. This way, when I query for the average temperature of a particular sensorId, since it is written inside the info field, Mongo knows that and can load buckets for that meta value only. Mongo will try to create a bucket for each time window and for each meta value or combination of values. That way, a query for sensor 123 doesn't need to read buckets for sensor 456.
But the magic ends here, at least for now: Mongo does not automatically create indexes on the meta fields. It leaves that up to you. Mongo does create an index on the bucket-boundary timestamps on the backing collection, but that is usually not enough for prime time. You will no doubt have some narrower interest in slicing the time series by some of the meta fields. I'll add a compound index on my sensorId like so:
db.sensings.createIndex({"info.sensorId": 1, "timestamp": 1}) |
My query can now look to get the average temperature across the devices on a certain day, leveraging the index I just created:
db.sensings.aggregate([ |
A query chased with an explain shows the index is used:
db.sensings.find({"timestamp" : ISODate("2021-09-01T00:18:00Z"), "info.sensorId" : 123}).explain() |
The inputStage section shows an IXSCAN leveraging my index named info.sensorId_1_timestamp_1.
It is interesting to compare the index definitions on the nominal collection sensings versus the internal collection system.buckets.sensings.
The nominal collection shows what you expect: an index on the direct structural fields timestamp and the embedded info.sensorId:
> db.sensings.getIndexes() |
But the backing collection shows a different picture:
> db.system.buckets.sensings.getIndexes() |
The 2-field index I specified to create was actually created as a 3-key field behind the scenes. The timestamp - a single field in my nominal documents - was translated into a bucket boundary defined by the control.min.timestamp
and control.max.timestamp
fields. Makes sense. The raison d'être of this whole thing is exactly that: your nominal collection facade deals with individual reading documents, but the physical storage of the documents is something entirely different and more efficient than the facade lets on. For a lesson on how the more efficient document is actually structured, I inserted a few documents:
db.sensings.insertMany([ |
Then I chased it with a query on the backing collection system.buckets.sensings
:
> db.system.buckets.sensings.findOne() |
The control
fields describe the bucket itself. They provide boundary info, minimum and maximum timestamps for the bucket. The meta
field contains my fields that I told Mongo I’d store under the original info
field in my inserted documents. In my case that's a single field, but there could be more. It would make little logical sense for Mongo to have a bucket spanning multiple sensors, so Mongo will find or create a bucket for every distinct combination of meta values for a given time period. Finally, the data. Fields not originally contained in the info
field as a sub-document and fields which are not the timestamp itself are stored under the data
field in backing storage. This includes the temperature
- as expected. It also contains the corresponding timestamp that each measurement was taken at. Since every document in Mongo has an _id, the _id of the reading is also stored here. The correlation among the triplet of values is positional, so the first reading in the bucket is spread across data._id[0]
, data.temperature[0]
and data.timestamp[0]
, the next reading in the bucket is ...[1]
and so forth.
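The positional correlation can be illustrated with a small JavaScript sketch that rebuilds nominal reading documents from a bucket's columnar data (illustrative only; the shapes loosely follow the findOne() output above):

```javascript
// Illustrative: rebuild individual reading documents from a bucket's
// columnar data fields, correlating values by array position.
function readingsFromBucket(bucket) {
  return bucket.data._id.map((id, i) => ({
    _id: id,
    timestamp: bucket.data.timestamp[i],
    temperature: bucket.data.temperature[i],
    info: bucket.meta, // the meta value is shared by every reading in the bucket
  }));
}

const bucket = {
  meta: { sensorId: 123 },
  data: {
    _id: [1, 2],
    timestamp: ["2021-09-01T00:05:00Z", "2021-09-01T00:10:00Z"],
    temperature: [71, 72],
  },
};
const docs = readingsFromBucket(bucket);
console.log(docs[1].temperature); // 72
```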
Back to my aggregation from before: db.sensings.aggregate([{$match: {timestamp: {$gte: ISODate("2021-09-01"), $lt: ISODate("2021-09-02")}}}, {$group: {_id: "$info.sensorId", avgTemperature: {$avg: "$temperature"}}}] )
. Nowhere did I include any positional magic, unwinds, or extra filtering logic. MongoDB did the work for me, taking my straightforward mental model of the nominal collection - one reading per document - and doing all the heavy lifting. Life is good.
In the course of writing this post and poking around, I found this bit interesting: when you run collection stats on the nominal collection you get a new section named timeseries
:
> db.sensings.stats().timeseries |
The new section gives some clues that the optimization, and the salient metrics to watch, have to do with buckets: their management in terms of "opening" and "closing" them with respect to the metadata, the number of items, the total size of the bucket, the time boundaries, and memory pressure. Pretty interesting, but I have no further insight into how to interpret these numbers. Time will tell (rim-shot, exit stage).
What's next for Time-Series? I don't know. Now that the abstraction and use case have been introduced, the team can presumably focus on even more efficient storage, or on ways to process time-series queries more efficiently. Judging from the past, this is how things unfold: a feature is introduced, which "makes it a thing", then future iterations make it even better and more useful. I, for one, see this as a great new feature.
Fine then: the document structure - the schema if you will - is still sound.
Where am I going with this? If the document is not broken, then you should not “touch” the whole document. You should use a surgical update, one that only sends a command to modify a field inside the document.
In MongoDB, the update
command takes 2 forms:
In the Document Replacement form, you supply the update
command with a full document value which will replace the whole document in the database. Doing so will logically target a single document (by _id
or some alternate key) and supply the future document in its entirety.
db.peeps.update({_id: 'superbob'}, {name: 'bob', likes: ['coffee','bacon']}}) |
The update
above slams in a new document with only the _id, name, and likes fields remaining. Beyond losing any previous document field values, this update also assumes you intended to replace both the name and the likes. This may be true. But usually it's just that in order to update one thing - such as adding 'bacon' to the list of things 'superbob' likes - you had to include previous unchanged values such as the name. Two problems here: one, you need to read the value from the database ahead of the update; two, the value you just read may be stale - updated by someone else - by the time you write back into the database.
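The stale read-modify-write hazard can be sketched in a few lines of JavaScript (an in-memory model, not driver code): two clients read the same document, each sends back a full replacement, and the last write silently discards the other's change:

```javascript
// Sketch of the stale read-modify-write hazard with full-document replacement.
let stored = { _id: "superbob", name: "bob", likes: ["coffee"] };

const clientA = { ...stored, likes: [...stored.likes, "bacon"] }; // read + modify
const clientB = { ...stored, likes: [...stored.likes, "eggs"] };  // concurrent read + modify

stored = clientA; // A replaces the whole document
stored = clientB; // B replaces it again, so A's 'bacon' is lost

console.log(stored.likes); // [ 'coffee', 'eggs' ]
```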
The second - and my preferred - way to update a document is using surgical updates. If all I want to do is add ‘bacon’ to the list of likes, I can issue the command:
db.peeps.update({_id: 'superbob'}, {$addToSet: {likes: 'bacon'}}) |
This form will add 'bacon' to the likes array only if it isn't already there, and will leave every other field in the document untouched.
Which brings me to the point of this post: What about REST?
REST APIs use HTTP methods to represent the action taken. The usual suspects are GET, POST, PUT, and DELETE. DELETE is probably the least controversial one. GET gains some bad-boy street cred over whether queries should be specified solely using the URI path or with query variables. POST and PUT generate lots of lengthy discussions about whether PUT should or can create original objects, and whether, in the context of databases, it is permissible to return the value of the created or modified object in full, or at least its id or URI. Lots of discussions. But my focus here is on update: I'd like to avoid having to query the current document value, and I'd like to issue a change for only a subset of the fields on my entity - umm, document.
HTTP’s PATCH method aims to do exactly that. It lets a caller supply a future state of some of the fields, against a background of the existing, current entity. The docs state:
“The difference between the PUT and PATCH requests is reflected in the way the server processes the enclosed entity to modify the resource identified by the Request-URI. In a PUT request, the enclosed entity is considered to be a modified version of the resource stored on the origin server, and the client is requesting that the stored version be replaced. With PATCH, however, the enclosed entity contains a set of instructions describing how a resource currently residing on the origin server should be modified to produce a new version.”
This has “Surgical Update” written all over it, and maps naturally to the intent of the surgical update. And with MongoDB, some operators are idempotent such as $addToSet
, $set
, $unset
.
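As a sketch of how a flat PATCH body might map onto a surgical update, here is a small JavaScript helper. The "null means remove this field" convention is my own illustrative assumption, not a standard:

```javascript
// Sketch: translate a flat PATCH body into a surgical MongoDB update
// document, so only the supplied fields are touched.
function patchToUpdate(patchBody) {
  const set = {};
  const unset = {};
  for (const [field, value] of Object.entries(patchBody)) {
    if (value === null) unset[field] = ""; // assumed convention: null removes the field
    else set[field] = value;
  }
  const update = {};
  if (Object.keys(set).length) update.$set = set;
  if (Object.keys(unset).length) update.$unset = unset;
  return update;
}

console.log(patchToUpdate({ name: "bob", fax: null }));
// { $set: { name: 'bob' }, $unset: { fax: '' } }
```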
Where’s the rub?
PATCH is not always implemented as surgical update. Some API frameworks don’t explicitly support PATCH, and others implement PATCH as a PUT (read: “slam” semantics) instead of the intended surgical semantics.
Some, like Mongoose, support the surgical update semantics explicitly with the Document.prototype.update()
function, or by collecting single-field assignments to an existing document and issuing a concise update. Careful here!* Some frameworks will just load the document first, even when PATCH is supposed to handle things more efficiently, and other blogs* might advise you to build it yourself the same way. This is because Mongoose by default tries to entity-track the state of the document in memory, then do the math to update or create the document backing the entity.
The culprit here is not really the frameworks - it's the consumer. Uninformed or under-curious consumers of the frameworks will cause extra back-end round trips, concurrency issues, or even data loss. There are applications that would not be horribly impacted by these nuanced issues, but I don't like to risk it.
Resolving this requires that you build an update outside of the document instance itself. Create an update command and issue it directly to the underlying database. If using something like mongoose, this means calling updateOne()
on the Model level, something along the lines of the code below. It does not have to find()
nor save()
anything ahead of updating the document, and therefore saves that round trip. Consequently, it also makes no assumptions about the current values of other fields, and leaves any unmentioned field values intact.
// ... in your handler code: |
This handles the simple cases. To build more elaborate manipulation such as adding/removing array items or nested fields, you would want to explore more expressive PATCH parsing so that nuanced intents are clearer. You may want to take a look at jsonpatch.js for inspiration.
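For inspiration, here is a tiny, intentionally partial JavaScript sketch mapping a few JSON Patch (RFC 6902) operations onto MongoDB update operators; the path handling and operator coverage are simplified for illustration:

```javascript
// Sketch inspired by JSON Patch (RFC 6902): map a few patch operations to
// MongoDB update operators. Coverage is deliberately partial.
function jsonPatchToMongo(ops) {
  const update = { $set: {}, $unset: {}, $push: {} };
  for (const op of ops) {
    const path = op.path.replace(/^\//, "").replace(/\//g, "."); // /a/b -> a.b
    if (op.op === "replace" || op.op === "add") {
      if (path.endsWith(".-")) {
        // In JSON Patch, "-" means append to the end of an array.
        update.$push[path.slice(0, -2)] = op.value;
      } else {
        update.$set[path] = op.value;
      }
    } else if (op.op === "remove") {
      update.$unset[path] = "";
    }
  }
  return update;
}

const update = jsonPatchToMongo([
  { op: "replace", path: "/name", value: "bob" },
  { op: "add", path: "/likes/-", value: "bacon" },
  { op: "remove", path: "/contact/fax" },
]);
console.log(update.$push.likes); // 'bacon'
```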
Why doesn’t
update
clarify the semantics? AFAIK: For historic reasons. To remedy this, official MongoDB drivers expose a replaceOne command (exact naming depends on language) which more precisely describes the operation when you intend to slam the existing document, leavingupdate
andupdateOne
to convey surgical updates. This makes reading programs better, but does not prevent you from issuing a slam using update, so you will need to take care of it manually, especially since theupdate
signature still allows for “slam” values to be provided.
Hopefully, this post revealed nothing new. But if you are unsure how your REST API maps to backend DB mutation commands, it’s worth taking a look.
* Examples provided to illustrate the topic of this post only. They are not in any way meant to offend or diminish the contribution of respective referenced codebase authors.
Did you know that MongoDB lets you index a specific positional element in an array? Well, it does! I stumbled across this feature while reading the mongod source code during an unrelated exploration. It can help you create more efficient indexes for specific scenarios.
What is that good for you ask? Consider the following scenario:
Your company makes widgets as kits. Each widget is made of parts - each of which can be also sold separately. When a customer buys a “product” they are actually buying a box of items - a list of parts where the first one is by convention the “widget” and the rest are the parts of that widget. The schema for the things collection might look like this:
{ |
The schema above has a field kit_sku
which contains a unique selling id that folks can order. The parts
field lists all the items in the kit. The company sells other kits featuring the same widget, just with differing extra parts. For example, there might be an obscure manual in Klingon for one kit, and a different kit with instructions in ancient Phoenician (you know, because somebody called and complained that they need one).
The catalog application needs to call up the widget "widget a". Except it doesn't, really. It knows that the widget is sold under the sku abc in a variety of kits, and that the widget is by convention the first item in the array. So the kit_sku
is not useful here.
The query could look something like this:
db.things.find({ 'parts.0.sku': 'abc' }) |
This is a frequent enough query, so we’d want to support it with an index. But while indexing on 'parts.sku'
will produce a usable index, it is not necessarily optimal. Consider that you have a long list of parts for each kit across a large collection. The index would have to contain a key for every part, pointing to all documents. This is both wasteful and misses the point. Wasteful, because memory and disk will contain entries that are never useful to the user: we only ever look for the first element in the parts list. And it misses the point for the same reason - we create this index only to satisfy queries interested in the first element of the array, not any part in the list.
You can create an index that includes a positional path though. Here:
db.things.createIndex({'parts.0.sku':1}) |
The index created above includes the positional marker 0
, telling mongo only to index the sku
of the first element of the parts
array!
Given the index above, running our query will make use of this index, as long as we use the path to the first element.
db.things.find({ 'parts.0.sku': 'abc' }).explain() |
The query plan looks something like this:
"queryPlanner" : { |
The plan above shows a few reassuring features. First and foremost, an index is used for our query, as evident by the winning plan's "IXSCAN"
indicator. This is good.
Second, note the "isMultiKey"
reports false
. When you index an array field, MongoDB creates a multi-key index. A multi-key index includes a pointer to the document for each element in its array, hence the name. A non-multi-key index contains only one pointer to a given document from a single key, since the indexed field holds only one value. This affects query processing: with a multi-key index, Mongo needs to work a bit harder to eliminate duplicates. TL;DR: when you use the positional index on an array element, it is not a multi-key index.
Third thing to note is the path. The index will be useful for queries that include the specific path "parts.0.sku"
. Other queries that do not include the positional path part will NOT be using this index:
db.things.find({ 'parts.sku': 'abc' }).explain() |
The path used in the query above did not target the first positional element in the array, and therefore the query planner chose a full collection scan "COLLSCAN"
. It would not be correct to use the positional array index, because the general query above asks "are there any elements in the array with sku 'abc'?". The positional index we created doesn't include keys for all elements, so it would be "missing" documents if Mongo were to use it for this query.
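The difference can be modeled in a few lines of JavaScript (illustrative only, not how the server actually builds keys): a multi-key index extracts one key per array element, while the positional index extracts a key from the first element only:

```javascript
// Illustrative: contrast the keys a multi-key index on 'parts.sku' would
// hold versus a positional index on 'parts.0.sku', for one document.
function multiKeyEntries(doc) {
  return doc.parts.map((p) => p.sku); // one key per array element
}
function positionalEntry(doc) {
  return doc.parts.length ? [doc.parts[0].sku] : []; // first element only
}

const doc = { kit_sku: "kit-1", parts: [{ sku: "abc" }, { sku: "manual-klingon" }] };
console.log(multiKeyEntries(doc));  // [ 'abc', 'manual-klingon' ]
console.log(positionalEntry(doc)); // [ 'abc' ]
```

A query for any part with sku 'manual-klingon' would find this document through the multi-key entries, but the positional index simply has no key for it, which is why the planner falls back to a collection scan for the general-path query.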
If your query targets a specific known positional array element, this indexing strategy can buy you a performance boost. This index will be smaller than one on a whole array and will be a non-multiKey one, reducing the work the server has to do to process your query.
MongoDB’s support for this and other interesting indexing strategies is pretty phenomenal. This gives DBAs and developers finer surgical indexing capabilities to support a wide range of schema and query scenarios.
$unwind
is the aggregation pipeline stage that you most likely need. $unwind
lets you un-pivot array elements, so that you can perform specific operations on each (or some) array elements independently. In relational databases, we used to need to pivot rows into columns. Say we had an IoT readings
table like this one:
| Device | Measure | Reading | Time |
|---|---|---|---|
| 1 | ‘Temp’ | 42 | 1:00 |
| 1 | ‘Speed’ | 91 | 1:00 |
| 2 | ‘Speed’ | 94 | 1:00 |
| 2 | ‘Temp’ | 33 | 1:00 |
| 2 | ‘Temp’ | 37 | 2:00 |
And we want a query to return results such as:
| Device | Speed | Temp | Time |
|---|---|---|---|
| 1 | 91 | 42 | 1:00 |
| 2 | 94 | 33 | 1:00 |
| 2 | | 37 | 2:00 |
We can achieve that with a SQL pivoting function. The need to pivot stems from the model in which each "soft property" is stored as a row in the data table. We didn't want to create a column for each measure, and we didn't want to add a column to the table each time a new measure is needed.
In MongoDB (and any other document oriented database), the model for a single reading
document might look more like the one below. Each document is a point in time that contains a bunch of measures. If ever a new measure was added - no problem! An extra element can be added to the array with no schema change. If ever a reading had more or fewer readings at a point in time - no problem either. The array contains a set of readings which can grow or shrink.
{ |
With this document model, we would not need to do anything special for the report of readings by hour. The document already contains relevant measurements for each time period.
But if we wanted to calculate the average temperature across the devices, we’d need to do some digging. We don’t want to average every readings.value
field - only the ones labeled 'temp'
.
Using the aggregation framework and $unwind
we can do this in 3 steps:
1. Unwind the readings array.
2. Match only the elements whose name is 'temp'.
3. Group and compute $avg on the values.

Let's do this step by step. First, just unwind the array:
db.readings.aggregate([ |
The result of $unwind
is that for each original document’s readings
array element, a new document will be produces that contains all the fields from the original document, but only one element assigned to the value of readings
. As you can see above, the data type of the field readings
after $unwind
became a single-valued field rather than the original array. Device 1
had 2 elements in the original document, so in the $unwind
output we have 2 documents, each attached to one element from the original array. One document contains a single reading for the measurement named speed
, another document contains a reading for temperature measurement named temp
.
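What $unwind does can be modeled with a short JavaScript function (a sketch of the output shape, not the server implementation; the sample field names are illustrative):

```javascript
// A minimal sketch of $unwind's core behavior in plain JavaScript.
function unwind(docs, field) {
  const out = [];
  for (const doc of docs) {
    const value = doc[field];
    // Missing, null, or empty-array fields emit nothing (the default behavior).
    if (value == null || (Array.isArray(value) && value.length === 0)) continue;
    // Since MongoDB 3.2, a scalar is treated as a single-element array.
    const elements = Array.isArray(value) ? value : [value];
    for (const el of elements) {
      out.push({ ...doc, [field]: el }); // one output document per element
    }
  }
  return out;
}

const docs = unwind(
  [{ device: 1, readings: [{ name: "temp", value: 42 }, { name: "speed", value: 91 }] }],
  "readings"
);
console.log(docs.length); // 2
console.log(docs[0].readings.name); // 'temp'
```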
Next step - we want only the temperature readings, getting rid of any speed
or other measurements that may exist. We can do this with the $match
pipeline stage. $match
filters documents and only emits ones that match the criteria supplied - just like find()
does.
db.readings.aggregate([ |
This got rid of any measurements that were not labeled temp
. With temperature-only values in hand, we can now calculate the average temperature. The $group
pipeline operator will let us calculate an average for each device:
db.readings.aggregate([ |
The behavior of $unwind
evolved over time, matching common usage patterns and expectations the community found useful. Several scenarios and special cases arise when unwinding arrays: the target field may hold a scalar rather than an array, it may be missing altogether, it may be null, or it may be an empty array.
All these need to be handled in some way, and understanding the behavior will help you avoid fluky results.
Since MongoDB version 3.2, $unwind
treats a target field containing a scalar or non-array value as if it were a single element array containing that one non-array value. These two documents are treated the same way by $unwind
:
{ |
Device 3 above has a single reading in the readings
field - not an array. Device 4 has a single-element array for readings
. Since $unwind
treats them the same, we don’t neat fancy $project
and $cond
checks to normalize input documents (Squee!). If we added these documents to the readings collection, and ran the same 3 stage pipeline as before, we’d get the results we expect. Prior to mongo 3.2, a non-array field handed to $unwind
would have produced an error.
db.readings.aggregate([ |
Now what about a missing field, an empty array, or a field containing null? Here are examples of these documents:
{ device: 10, time: ISODate('2019-01-31T06:00:00Z') } |
If we ran $unwind
on those documents, none of them would be emitted past $unwind. The example below passes documents with device
== 10
to $unwind
, but since none of them had a “useful” readings
value to attach to an output document - none of them got emitted:
db.readings.aggregate([ |
This is just a behavior, though. The word "useful" reflects our choice: in our case, we don't want an empty or missing reading to pollute our averaging math. Other use cases may exist where you actually want every document entering $unwind to be emitted forward, even if it doesn't have an array element to contribute. If you want that behavior, you can use the option preserveNullAndEmptyArrays
set to true
. To supply this option though, you would need to use a slightly different argument to $unwind
. Until now, we gave $unwind
a bare value-of-field argument $<field name>
. The more elaborate syntax is {$unwind: { path: '$<field name>'}}
. The following two forms are identical in meaning:
{ $unwind: '$readings' } |
The latter expression is where we can add our preserveNullAndEmptyArrays
option. Putting it all together looks like this:
db.readings.aggregate([ |
This time, a single document was emitted corresponding to each of the incoming device 10 documents. For each of those 3 documents, MongoDB considered that the incoming document has to be emitted out at least once. But what value should the output document have? In the case of the document containing readings: null
, mongo emitted a null in the readings
field. For the ones with missing field and with the array containing no value, $unwind
emitted a document with no readings
field at all. The notion is that a missing field should be preserved - no invention of a non-existing field. As far as an existing but empty array - think of it as “a document with no value there”. Emitting a field would have required a value to be assigned to the field. The value null
is still a value, and is not the same as having no value at all. If Mongo had emitted a {readings: null}
field in response to an empty array, that would be like saying the original document had an array element null
- which is not the case. Bottom line: the behavior is defined in a certain way. We can leverage it when we need to, and we can also manipulate the fields further using other pipeline operators.
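That decision table can be modeled in plain JavaScript (a sketch, not server code) to show both the default behavior and preserveNullAndEmptyArrays:

```javascript
// Sketch of $unwind's handling of missing fields, nulls, and empty arrays,
// with and without preserveNullAndEmptyArrays.
function unwind(docs, field, { preserveNullAndEmptyArrays = false } = {}) {
  const out = [];
  for (const doc of docs) {
    const value = doc[field];
    const isEmpty =
      value === undefined || value === null ||
      (Array.isArray(value) && value.length === 0);
    if (isEmpty) {
      if (!preserveNullAndEmptyArrays) continue; // default: drop the document
      const copy = { ...doc };
      // null is preserved; a missing field or empty array yields no field at all
      if (value !== null) delete copy[field];
      out.push(copy);
      continue;
    }
    for (const el of Array.isArray(value) ? value : [value]) {
      out.push({ ...doc, [field]: el });
    }
  }
  return out;
}

const input = [
  { device: 10 },                 // missing field
  { device: 10, readings: [] },   // empty array
  { device: 10, readings: null }, // null
];
console.log(unwind(input, "readings").length); // 0
console.log(unwind(input, "readings", { preserveNullAndEmptyArrays: true }).length); // 3
```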
The three-step pipeline to calculate average is not very difficult to read, understand, or produce. But the real need is expressed in the $group
stage. The stages for filtering and unwinding are cruft that adds verbiage - but not for nothing.
Tempted to shorten the syntax, one might think to use some of the projection operators, such as $filter
or $arrayToObject
. The $filter
example would go something along the lines of:
db.readings.aggregate([ |
Where we can pluck only the 'temp'
elements out of the readings
array. This works, but must be preceded by something to ensure that readings
is actually an array. Unlike the more forgiving $unwind
, $filter
insists that the input field is an array or it produces an error. That’s why I included a $match
that only passes along documents where that condition holds true (see the BSON spec for type information; 0x04 is Array). This seems almost palatable, but it would still require more syntax to transform the one-element array into something we can use directly for the math. Since the amount of text required is not reduced, and the overall pipeline does not become more readable, I see little reason to go to those lengths.
$unwind
is an important pipeline stage in the document-oriented world. Dealing with arrays and aggregation will undoubtedly get you using $unwind
sooner or later. In a previous post we focused on data scenarios where this all made sense. In this post, we focused more on the balance between syntax and the meaning of the stage and related operators. Either way, it’s well worth learning the nuanced behavior of this key pipeline stage.
But for applications spanning multiple entities with broader usage patterns this could prove cost prohibitive: each document type or entity would naturally be designed to go into its own collection. But with the minimum price for the smallest collection around $30 a month for 400 RU, an enterprise would think twice before designing an application that used Cosmos as a full back end, because low-usage collection data would start adding up in the solution cost. If nothing else, this pricing strategy flew in the face of the code-first benefits which document-oriented databases promise.
Worse, this pricing led some to use a single collection for all entities of all types. While that move saved money by not over-provisioning collections for small, infrequent documents, it fostered no separation of concerns, and did not promote separation of concerns in code the way microservices would have.
This is all in the past!
Cosmos DB now lets you provision database throughput at the database level. Collections created under that database will then share the cost based on that provisioned RU level. You can still provision specific collections with specific RU to meet your needs. But if you do not, then all the collections in that database will share the provisioned throughput.
To do this, you will start with creating a Cosmos DB account. Once an account is created, you can go to the data explorer and then hit the “New Database” link at the top.
In the dialog, type the name of your database and check the "Provision throughput" checkbox.
This will reveal a secondary input where you can enter the desired throughput. The minimum provisioning is 400RU, but the default number popping in that box is 1000RU. Choose the throughput level your application demands - that’s something you have to determine on your own.
That’s it! Collections created under the shared-throughput database will share the cost and throughput provisioned at the database level. No more kludge code to force all documents of all types into one collection. No more skimping on collection creation for low-volume / low-frequency data to save a buck. Sharing the throughput lets you architect and code your application without weirdness stemming from the pricing model.
Kudos to the team that made this possible.
TL;DR - MongoDB 4.0 supports multi-document transactions!
When you have 2 mutations (write operations), each affecting 1 document, Mongo used to apply each write as an independent commit. Consequently, there was a point in time in which mutation 1 was applied but 2 was not. If 2 failed for whatever reason, 1 was still applied.
This behavior caused some pain in attempting to manage all-or-nothing style operations which affect multiple documents. For example, if you had to lend a book to a person, you might have wanted the book marked as lent out, and the library visitor to have the book appended to their lent-out list.
If marking the book succeeded but marking the visitor failed, then the book is lent to no one. If marking the visitor first and failing to mark the book as lent, it can be double-lent. Further, there is a point in time in which the visitor “has” the book, but the book is not yet “held”.
Now, there are potential structural solutions to this problem in a document-oriented world, but let’s go with transactions. The newly available transaction support lets you wrap a transaction around the 2 operations. I like thinking of a Mongo transaction in terms of visibility: mutations made inside the transaction are not visible outside it until the transaction commits, and if the transaction aborts, none of them ever become visible.
Speaking of visibility, one of the core changes that needed to occur in the Mongo engine is marking OpLog entries with a global logical cluster time. Why is that important? Because transactions are really about controlling the visibility of written items across the Replica Set. The implementation of point-in-time reads is another key piece of the puzzle. This feature provides a view for reading such that the document states visible to the reader are of the same version they were when the read operation started. Modifications occurring during a long-running operation would not be exposed to the reader, so a consistent view is ensured.
To scope several operations into a transaction, Mongo relies on the already available session implementation. Sessions existed in 3.6, so the leap is smaller. A session groups a sequence of commands and is tracked throughout the cluster, so it is already suitable for the notion of a transaction. All that was added is syntax for a developer to pass a session into the commands themselves: all method signatures that mutate data now accept a session handle in the latest official drivers. From a client perspective (using a supported driver version), creating a transaction looks like this:
Mongo will control the visibility of the constituent operations according to the session settings.
Consider this session initiation C# code:
Demo code based on .Net driver mongodb.driver version 2.7.0
using (var session = client.StartSession()) |
A few things are readily apparent from this small code structure.
A session is a disposable object, so proper disposal is guaranteed by a using block.
A session by itself is not a transaction. We explicitly start a transaction by calling the StartTransaction()
method. Within a session, only one transaction may be “live” at a given time. Since we are within a using
scope, this code has a low risk of breaking that rule.
TransactionOptions
describe 2 key parts of the transaction treatment: read and write concerns. The write-concern describes the durability expectation of the mutations. Just like any Replica Set write, it lets us control the risk of roll-back of individual writes in case of primary transitions or other adverse events.
The read-concern describes the visibility mode of the mutation during the transaction- the time between the start and the would-be commit or abort commands. As mentioned earlier, what happens during the transaction lifetime before it is finished - successful or not - is really what transactions are about.
The particular setting of ReadConcern.Snapshot
, when paired with a write-concern WriteConcern.WMajority
guarantees that any reads occurring as part of the transaction view data that is majority committed. Those reads are “stable” and should not roll back, since a majority of nodes have already applied that data. You might be tempted to use a weaker read-concern such as ReadConcern.Local
or ReadConcern.Majority
for the sake of speed. That choice may not be treated as you expect. For one, Mongo might “upgrade” the concern to a higher one such as snapshot. Further, Mongo does not guarantee that the writes won’t be rolled back in the face of adverse cluster events. In case of a rollback, your whole transaction might be rolled back - so what’s the point, really?…
Snapshot is a read-concern relating to read-your-own-writes and causal consistency. Causal consistency describes a relationship between operations where one causes the other: a read operation returning the value of the field count right after a write operation setting count = 123 expects count to be 123. The write preceding the read caused the mutation, and the reader expects the read value to be “the one caused by the preceding operation”. An implied order is what this is about. As mentioned before, one of the underpinnings of transactions is a global timestamp, allowing a strict order of operations. Within a causally consistent session, pure read operations following a write are guaranteed to see the results of that write. It may seem trivial - desirable, certainly - but keep in mind that other concurrent writes may occur during your sequence and affect the state of a document. Causal consistency assures that the state of a read document following a write is seen as that writer’s result.
In the diagram below, a session with causal consistency ensures the reader sees the results of its preceding write. A session without causal consistency does not ensure that and, depending on timing, may result in Client One reading a document modified by Client Two rather than the result of Client One’s own write.
The default for creating a session is causal consistency. The code below creates a session with either the default value or the explicit option; both result in the same behavior.
// setting CausalConsistency explicitly |
Now we can state this: a transaction in a causally consistent session, with a read-concern of “snapshot” and a write-concern of “majority”, will view documents committed to a majority of the nodes. This guarantee extends to reads within the transaction: not only will the transaction writes succeed if majority-acknowledged, but the reads within the transaction will rely only on majority-committed documents as of the snapshot’s time. This shuts down the concern of a transaction relying on document state that might be rolled back: once majority-committed, it won’t be.
The theory above gives us the background necessary to understand what’s going on. The code below implements a multi-document transaction touching 3 documents across 3 different collections.
The scenario is that we have some Tool which can be borrowed by some Person, and the loan is then logged in the LendingLedger. We start by creating a new session. We then perform the sequence of operations inside the transaction:
Under the cover of a transaction, performing this sequence gives us assurance that all three entities would be manipulated to satisfaction, or rolled back completely. Further - other concurrent operations in parallel would not interfere with the operations happening inside this transaction.
For a more complete demonstration please see my GitHub repo.
using (var session = client.StartSession(new ClientSessionOptions { CausalConsistency = true })) |
Transactions have been long awaited by some. Others see transactions as a performance and scalability hindrance, placing an undue burden on the core engine. There are performance implications to transactions. Measuring them is tricky because the effect depends on concurrency, velocity, and size of data. Transactions also introduce more controls on timing, with defaults favoring quick transactions and opting to abort rather than consume precious resources. How much overhead will a transaction introduce? I don’t know - better measure it. The documentation currently offers only a vague warning:
…In most cases, multi-document transaction incurs a greater performance cost over single document writes, and the availability of multi-document transaction should not be a replacement for effective schema design …
Certainly something to consider, and I for one definitely model entities with a mindset of embedding where appropriate. After all, if you want completely “independent” entities with cross-references, there’s an app for that… We chose a document-oriented database for its document-oriented nature - let’s leverage that. A good rule of thumb: if your RDBMS schema was translated into one-collection-per-table in MongoDB, try again.
Lastly, I should mention that the v4.0 release includes multi-document transactions on replica sets, but not on sharded collections. Support for transactions on sharded collections is slated for v4.2.
Happy transacting!
The Aggregation Framework in MongoDB gives us a convenient option: $lookup! But $lookup is not JOIN - at least not the one we are familiar with from SQL and relational databases. This distinction is important to understand, as it may affect your modeling and performance.
$lookup queries a second collection for each document entering the pipeline stage. Any documents from the secondary collection that pass the lookup condition are attached to an array field on the entering document.
The behavior differs from SQL’s INNER and LEFT JOIN in its geometry. In SQL, a row is emitted for each match in the right table, producing a wide, flat row with all columns from the left and all columns from the right. Rows in the left table with no match in the right table are not produced at all. Similarly, a LEFT JOIN produces flat rows, with a NULL-filled right side in case no match was found in the right table.
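The geometry difference is easy to see outside the database. The following plain-JavaScript simulation (illustrative only, not driver code; the airline field is made up for the example) contrasts the $lookup shape with the SQL INNER JOIN shape:

```javascript
// One city, two matching routes.
const cities = [{ _id: 'TOL', name: 'Toledo' }];
const routes = [
  { _id: 1, airline: 'AA', arrival: 'TOL' },
  { _id: 2, airline: 'UA', arrival: 'TOL' }
];

// $lookup-like geometry: one output document per INPUT document,
// with all matches nested in an array field.
const lookedUp = cities.map(c => ({
  ...c,
  inbound: routes.filter(r => r.arrival === c._id)
}));

// SQL-INNER-JOIN-like geometry: one flat row per (left, right) MATCH PAIR.
const joined = cities.flatMap(c =>
  routes.filter(r => r.arrival === c._id)
        .map(r => ({ city: c.name, airline: r.airline })));

console.log(lookedUp.length); // 1 document, with a 2-element inbound array
console.log(joined.length);   // 2 flat rows
```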
The behavior also differs in the DDL aspects and expectations. MongoDB does not represent or maintain any kind of foreign-key relationship between collections. While relational databases expose foreign keys in the DDL and extend them with things like cascade-delete and insert/update validation of key existence, MongoDB has no such constructs.
Apart from the geometric, DML-only nature of $lookup, it is also a bit more limited: you cannot look up into a collection that is sharded. This limits the scale of database where one could apply the syntax, and should give you pause to think through your modeling carefully before relying on it.
The closest analog of $lookup
to SQL JOIN varieties is the LEFT OUTER JOIN.
At times, we want to simulate INNER JOIN: emit only documents from the first collection which have one or more matches in the second collection.
There’s a way to do this using aggregation, by chaining $lookup and $unwind.
Full props to Asya Kamsky for this technique!
The subtle point is that $lookup appends an array of values from the second collection to each input document - an empty array if no match is found. That empty array is the key to this solution. $unwind has the ability to suppress documents whose array field is empty or missing. In fact, that was the default behavior of $unwind all along.
Consider the following data sets:
db.cities.insert([ |
Now query the cities
to discover the inbound flights arriving in for each city. For each city, a list of routes is appended from the routes collection. The match is on the cities _id field against the routes arrival field, both of which contain the FAA 3 letter airport code.
{ |
The result for Toledo (‘TOL’) will show 2 inbound flights:
{ |
The result for St. Petersburg (PIE) will contain no inbound flights - an empty array - because there is no route with an arrival into it.
{ |
This behavior simulates an SQL LEFT OUTER JOIN. But we want to simulate an INNER JOIN. This is where $unwind comes in. Since version 3.2, $unwind includes the preserveNullAndEmptyArrays option. When true, documents containing empty arrays in the $unwind field will be emitted from the stage. When false, documents entering $unwind with an empty array or missing field will be suppressed - won’t be emitted. The default value of preserveNullAndEmptyArrays is false, so we can use the simplified, backwards-compatible syntax:
db.cities.aggregate([ |
This time, St. Petersburg (PIE) will be suppressed, because it had an empty array. Simulated INNER JOIN done! In addition to effectively simulating the INNER JOIN, this technique produces a geometry that matches the SQL counterpart a bit more closely. The result documents after $unwind contain a “flat” one-to-one document with all fields from the first collection and all fields from one document from the second collection (albeit demoted one level into the inbound field).
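The whole technique can be traced step by step in a plain-JavaScript simulation (illustrative only, not shell or driver code; the departure field values are made up for the example):

```javascript
// Simulate $lookup followed by $unwind with preserveNullAndEmptyArrays: false.
const cities = [
  { _id: 'TOL', name: 'Toledo' },
  { _id: 'PIE', name: 'St. Petersburg' }
];
const routes = [
  { _id: 1, departure: 'DTW', arrival: 'TOL' },
  { _id: 2, departure: 'ORD', arrival: 'TOL' }
];

// Stage 1: $lookup - attach matching routes as an array (empty if none).
const afterLookup = cities.map(c => ({
  ...c,
  inbound: routes.filter(r => r.arrival === c._id)
}));

// Stage 2: $unwind - emit one document per array element. Documents with
// an empty array (PIE here) emit nothing, simulating an INNER JOIN.
const afterUnwind = afterLookup.flatMap(c =>
  c.inbound.map(r => ({ ...c, inbound: r })));

console.log(afterUnwind.map(d => d._id)); // only 'TOL' documents remain
```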
Relational databases expose 4 varieties of JOIN: INNER, LEFT OUTER, RIGHT OUTER, and FULL OUTER.
A RIGHT OUTER JOIN is just a LEFT OUTER JOIN on the opposite collection, right? Well, not exactly. First, the limitations of $lookup with regard to sharding might prevent you from enjoying full symmetry. Second, you might need to “join” in a late pipeline stage, after a few manipulations of the first collection. In that case - again - it may be either difficult or impossible to reverse the roles of the two collections. Consider, for example, collection A on which you use a $text match, and want to leverage the full-text index for filtering. Reversing and aggregating on collection B will prevent you from using that index effectively.
FULL OUTER JOIN is flatly not supported by $lookup. It would be difficult to implement at scale, and would obviously collide with the sharding limitation on one of the collections involved. Performance questions aside, one would also need to consider the meaning of a pipeline - the basis of the aggregate() command. The pipeline feeds documents from the previous stage or a collection into a given stage. Under no circumstance do all documents from two separate collections enter a stage. $lookup is designed to bring in a subset of matched documents from a different collection, not all documents from a second collection. It may be tempting to simulate a FULL OUTER JOIN using an uncorrelated sub-query within the $lookup syntax - a pipeline option that effectively drags in every document from the second collection. This would attach an array of every document from collection B onto every document from collection A. Ouch! Even if this worked, you’d surely need to chase it with a $group or similar operator to get distinct values somehow. But very likely the pipeline will abort: the memory impact on the stage can be arbitrarily large, and even with allowDiskUse, the 16MB document size limit applies. So if collection B holds more than 16MB, it will fail. If both collections are sufficiently small, an application-side join might be your best bet. An application-side join is not very efficient, but can be acceptable for small-ish collections, such as metadata-type documents.
In general though, taking SQL operators and hoping that MongoDB or the Aggregation syntax would provide full parity is perilous. Taking the time to analyze the actual use cases and modeling properly using Document Oriented techniques would lead to better performance and cleaner designs.
Since I’ve been working with both MongoDB and CosmosDB on various projects, I found using the MongoDB driver, and pointing it at whichever database I need (either CosmosDB or MongoDB), a big productivity boost - eliminating the constant flip between shells, apps, and mindsets.
The trick is to install the Cosmos DB Support for VS Code extension. Once installed, it enables another icon on the VS Code activity bar.
Expanding that side bar gives you a tree view for the databases (yes, plural: you can have several connected at once).
To connect to a database using connection credentials, you invoke the CosmosDB: Attach Database Account command. Next, VSCode will ask you which interface you would like to use.
This is the great part: If you want to connect to a native MongoDB server, choose MongoDB as the interface, then paste in any* valid MongoDB url!
Although the extension is labeled CosmosDB all over the place, it uses the MongoDB driver for making MongoDB API connections to CosmosDB instances that have MongoDB API enabled. This means it works just as well for native MongoDB databases!
If connecting to a CosmosDB instance, the easy connection option is to attach by signing in to Azure. Hit the “sign in to Azure” link. This pops up a toast notification in VS Code, telling you to open a browser and enter a nonce code into the web page prompt. The web page then lets you log in to Azure, and once that’s done you can close the browser. If everything went well, a list of the Azure subscriptions under that account will appear in the tree view. While this option is nice and true to Azure account-provisioning form, it is not always a good fit for development shops that provision connection-string based credentialed access, but don’t allow developers direct portal or Azure resource access beyond that.
Once connected, you can see a listing of databases and collections in the tree view. To see some documents, double click a collection name. A listing of documents will appear in the tree view.
Double-click a document, and you will see its JSON representation in a docked tab, letting you visualize the document content with full JSON, outlining, and extended-type support. Extended JSON represents the underlying BSON native types with full round-trip fidelity.
{ |
In the document sample above, the field _id contains an ObjectID type. Extended JSON writes it as an encapsulated value using the $oid field name. Similarly, the field born is assigned a BSON date type, represented with a $date value field containing a canonical string date representation. This matters when you deal with money, for example: to keep the decimal resolution and math operations on the field correct, you should use $numberDecimal - a high-precision decimal - rather than a double data type.
When you open a document to view its content, you actually create a Document Tab. These tabs let you modify the single document using the full power of the IDE: JSON linting, outlining, formatting, syntax highlighting - all those help you visualize and understand your document better. But you can also modify the data! You can change fields and their content to shape the document to your liking. When done, you can update the source database in one of two ways:
If you use the Update to Cloud button, the update is carried out, and an output log line in the Output tab (under the CosmosDB category) shows the DB, collection name, and document id in a URI-like format:
10:49:12 AM: Updated entity "127.0.0.1:27017 (MongoDB)/db1/peeps/5b1eb2127f0f232ac43e0f42" |
If you use the File->Save method, a toast will appear first, to make sure you actually meant to update the database.
The subtlety is that you may have inadvertently requested saving all documents - including the edit - but do not actually want to update the database. This can prevent a class of accidental modifications. You do have the choice to dismiss these warnings by choosing the Always upload button on the toast.
Beware that when you “Update to Cloud”, the content of the document in the tab completely replaces any and all values in the target database. Fields that exist in the database but not in your tab will be removed. If anyone else made changes to the document since you loaded it into VS Code, those changes will be gone. MongoDB provides surgical, “patch”-like updates that touch individual fields, but those are not available in this single-document update.
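The difference between the tab’s whole-document replace and a surgical $set-style patch can be simulated in a few lines of plain JavaScript (illustrative only; the fields and values are made up):

```javascript
// Document as it currently exists in the database - note `reviewed`,
// added by a colleague AFTER you loaded your copy into the tab.
const inDatabase = { _id: 1, name: 'Bob', status: 'lead', reviewed: true };

// Your edited copy from the document tab (loaded before `reviewed` existed).
const tabCopy = { _id: 1, name: 'Bob', status: 'customer' };

// "Update to Cloud" semantics: full replacement - `reviewed` is lost.
const afterReplace = { ...tabCopy };

// $set-style patch semantics: only the touched field changes.
const afterPatch = { ...inDatabase, status: 'customer' };

console.log('reviewed' in afterReplace); // false - colleague's change gone
console.log('reviewed' in afterPatch);   // true  - change preserved
```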
Unlike the single-document tab, Scrapbooks let you perform arbitrary Mongo shell commands against a connected database. This is where you have more control over updates, and can perform several operations rather than just slam-update a single document. Contrary to the ephemeral connotation of the name “Scrapbook”, they can actually be persisted. Saving a scrapbook creates a new file in the root folder of your project, with the extension .mongo.
To get a scrapbook started, either choose the “plus-document” icon from the tree view banner, or invoke the VS Code command “Cosmos DB: New Mongo Scrapbook”.
This will prompt you to connect anew, or choose an attached instance.
Choosing an already attached instance will list the currently connected instances.
Choose one of those, and you’re presented with a listing of the databases.
Choose a database (or create a new one), and you’re in business!
The scrapbook surface is a persisted file containing MongoDB shell commands. Because it is persisted in the file tree, you can save it, edit it, and most importantly: version it! Source control is a great way to ensure your scripting of data access tasks is durable, repeatable, shareable and maintainable.
With the scrapbook, you are not limited to touching one document. Updates can span multiple documents using the update() or updateMany() collection syntax, and bring you the full power of advanced update operators like $addToSet, $inc, etc. Any other shell command is available, so you can also remove documents, insert some, or write some light JavaScript code to fetch and manipulate - anything you need.
Below, I scripted an aggregation query to find the number of orders per month in my “orders” collection. Hitting [CTRL] + [SHIFT] + ; executes the scrapbook in its entirety. (You can also execute only a portion of the code in the scrapbook: select the code to run, and hit [CTRL] + [SHIFT] + '.)
Overall, the tight integration into the IDE is a great convenience. I’m not ready to ditch my MongoDB shell yet, but not having to switch back and forth as much for daily simple tasks is awesome.
If you are working with CosmosDB with the graph interface, this extension lets you visualize graphs graphically. If you are working with the SQL document interface, it lets you connect in that mode as well. This extension is quite powerful, and has a much richer UI and options - certainly exceeding what its simple name suggests.
* The current extension version requires that the MongoDB url begin with mongodb://. While this works for most, it prevents using the perfectly valid prefix mongodb+srv:// used by some hosts. I submitted a pull request for this issue; once merged, mongodb+srv:// should be usable as well.