Troubleshooting MongoDB Sources
Connector Limitations
MongoDB Oplog and Change Streams
MongoDB's Change Streams are based on the Replica Set Oplog. This has retention limitations. Syncs that run less frequently than the retention period of the Oplog may encounter issues with missing data.
We recommend adjusting the Oplog size for your MongoDB cluster to ensure it holds at least 24 hours of changes. For optimal results, we suggest expanding it to maintain a week's worth of data. To adjust your Oplog size, see the corresponding tutorials for MongoDB Atlas (fully-managed) and MongoDB shell (self-hosted).
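For self-hosted clusters, one way to grow the Oplog is the `replSetResizeOplog` admin command (MongoDB 3.6+ with WiredTiger; the minimum size is 990 MB). The sketch below builds the command document; the connection URI and target size in the usage comment are placeholders, not recommendations for your cluster:

```python
def resize_oplog_command(size_mb: int) -> dict:
    """Build the admin command that resizes the replica set Oplog.

    replSetResizeOplog takes the new size in megabytes; MongoDB rejects
    values below 990 MB.
    """
    return {"replSetResizeOplog": 1, "size": size_mb}

# Usage with pymongo, run against each replica set member's admin database:
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://localhost:27017")  # placeholder URI
#   client.admin.command(resize_oplog_command(16000))  # ~16 GB of Oplog
```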
If you are running into an issue similar to "invalid resume token", it may mean you need to:
- Increase the Oplog retention period.
- Increase the Oplog size.
- Increase the Airbyte sync frequency.
You can run the commands outlined in this tutorial to verify the current state of your Oplog. The expected output looks like this:
```
configured oplog size: 10.10546875MB
log length start to end: 94400 (26.22hrs)
oplog first event time: Mon Mar 19 2012 13:50:38 GMT-0400 (EDT)
oplog last event time: Wed Oct 03 2012 14:59:10 GMT-0400 (EDT)
now: Wed Oct 03 2012 15:00:21 GMT-0400 (EDT)
```
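The `log length start to end` figure is simply the span between the oldest and newest Oplog entries; an illustrative helper (not part of the connector) shows how the sample output's 94400-second span maps to hours:

```python
def oplog_window_hours(first_event_unix: float, last_event_unix: float) -> float:
    """Convert the span between the oldest and newest Oplog entries to hours."""
    return (last_event_unix - first_event_unix) / 3600.0

# The sample output above reports a 94400-second log length:
print(round(oplog_window_hours(0, 94400), 2))  # 26.22 hours
```

If this window is shorter than your sync interval, syncs can miss changes and fail with resume-token errors.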
When importing a large MongoDB collection for the first time, the import duration might exceed the Oplog retention period. The Oplog is crucial for incremental updates, and an invalid resume token will require the MongoDB collection to be re-imported to ensure no source updates were missed.
MongoDB CDC Limitations
MongoDB has a 16MB maximum document size limit for BSON documents. During CDC (Change Data Capture) syncs, change stream events can exceed this limit when documents are large, causing a BSONObjectTooLarge error. This typically occurs during incremental syncs when change stream events include the full document content.
If you encounter this error, you have several options to resolve it:
- Switch the affected stream to Full Refresh sync mode instead of Incremental mode. Full Refresh does not use change streams and is not subject to this limitation.
- If you are using Post Image update capture mode, switch to Lookup mode. Lookup mode retrieves the current document state separately, which can reduce the size of change stream events.
- Restructure large documents in your MongoDB collection to stay under the 16MB limit.
- Deselect streams containing documents that exceed the size limit.
For more information about MongoDB's document size limits, see the MongoDB documentation on limits.
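To locate documents at or near the limit before a sync fails, you can aggregate with the `$bsonSize` operator (MongoDB 4.4+). A sketch that builds such a pipeline; the 15 MB threshold and the collection name in the usage comment are illustrative choices, not connector settings:

```python
MAX_BSON_BYTES = 16 * 1024 * 1024  # MongoDB's hard BSON document size limit

def oversize_pipeline(threshold_bytes: int = 15 * 1024 * 1024) -> list:
    """Aggregation pipeline returning the _id and BSON size of large documents."""
    return [
        {"$match": {"$expr": {"$gte": [{"$bsonSize": "$$ROOT"}, threshold_bytes]}}},
        {"$project": {"size_bytes": {"$bsonSize": "$$ROOT"}}},
    ]

# Usage with pymongo (placeholder names):
#   db["my_collection"].aggregate(oversize_pipeline())
```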
Supported MongoDB Clusters
- Only replica set clusters are supported.
- TLS/SSL is required by this connector. TLS/SSL is enabled by default for MongoDB Atlas clusters. To enable a TLS/SSL connection for a self-hosted MongoDB instance, please refer to the MongoDB documentation.
- Views, capped collections and clustered collections are not supported.
- Empty collections are excluded from schema discovery.
- Collections whose documents use different data types for the `_id` field are not supported. All `_id` values within a collection must be the same data type.
- Atlas DB clusters are only supported on dedicated M10 tiers and above. Lower tiers may fail during connection setup.
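To check whether a collection mixes `_id` types, you can group by the aggregation `$type` operator. A sketch; the collection name in the usage comment is a placeholder:

```python
def id_type_pipeline() -> list:
    """Pipeline that counts documents per BSON type of the _id field."""
    return [
        {"$group": {"_id": {"$type": "$_id"}, "count": {"$sum": 1}}},
    ]

# Usage with pymongo; more than one result means the collection mixes _id types:
#   list(db["my_collection"].aggregate(id_type_pipeline()))
```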
Schema Discovery & Enforcement
- Schema discovery samples documents to collect all distinct top-level fields. The sample size is applied uniformly to all collections discovered in the target database. The approach is modelled after MongoDB Compass sampling and is used for efficiency. By default, 10,000 documents are sampled. This value can be increased up to 100,000 documents to increase the likelihood that all fields will be discovered, but a higher value makes sampling each collection take longer.
- When running with Schema Enforced set to `false`, there is no attempt to discover any schema. See more in Schema Enforcement.
Schema discovery performance impact
Because MongoDB collections are schemaless, documents in the same collection can have different fields and data types. The connector attempts to infer a schema by sampling documents, but no sample size can guarantee a complete or stable schema. New fields can be added to documents at any time, and a schema derived from today's sample may not represent tomorrow's data. Keep this inherent limitation in mind when choosing between schema-enforced and schemaless modes.
When schema enforcement is enabled, the Discover phase executes a $sample aggregation pipeline against every collection in each configured database. These pipelines run concurrently using parallel threads, one per collection. Each pipeline samples up to 10,000 documents by default, then processes them through $project, $unwind, and $group stages to extract field names and types.
On clusters with hundreds of collections, this means hundreds of simultaneous aggregation queries hitting the database at once. The $sample stage performs a random collection scan, which can be I/O-intensive on large collections. Combined with the downstream aggregation stages, this can exhaust available CPU and memory on your MongoDB nodes.
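For intuition, the per-collection sampling pipeline has roughly this shape. This is an illustrative reconstruction of a Compass-style pipeline, not the connector's exact code:

```python
def discovery_pipeline(sample_size: int = 10_000) -> list:
    """Illustrative Compass-style schema-sampling pipeline."""
    return [
        {"$sample": {"size": sample_size}},                  # random scan of the collection
        {"$project": {"kv": {"$objectToArray": "$$ROOT"}}},  # explode top-level fields
        {"$unwind": "$kv"},                                  # one row per field occurrence
        {"$group": {"_id": {"field": "$kv.k", "type": {"$type": "$kv.v"}}}},
    ]
```

Running one such pipeline per collection, concurrently, is what generates the load described above.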
Recommended approaches
These approaches address the root cause of the performance risk by reducing or eliminating the discovery workload.
- Disable schema enforcement. Set Schema Enforced to `false` to skip the sampling-based discovery entirely. In schemaless mode, the connector samples only one document per collection to confirm the `_id` field exists. This dramatically reduces the load on your cluster, but all data is returned as a single JSON object per document rather than individual typed fields. See Schema Enforcement for configuration details.
- Reduce the discovery sample size. If you need schema enforcement, lower the Discovery Sample Size setting to reduce the number of documents sampled per collection. The default is 10,000. A smaller value such as 1,000 reduces the load on your cluster but may miss fields in collections with highly variable document structures. See the Discovery Sample Size configuration parameter.
Other alternatives
These approaches do not reduce the discovery workload itself, but can help isolate it from your production traffic.
- Direct reads to a secondary node. Add `readPreference=secondary` or `readPreference=secondaryPreferred` to your MongoDB connection string. This routes the discovery queries to a secondary replica set member instead of the primary, protecting your primary node from the additional load.

  ```
  mongodb+srv://cluster0.abcd1.mongodb.net/?readPreference=secondaryPreferred
  ```

- Use MongoDB Atlas analytics nodes. If you use MongoDB Atlas (M10 tier or above), you can provision analytics nodes that are isolated from your operational workload. Direct Airbyte's reads to an analytics node by adding read preference tags to your connection string:

  ```
  mongodb+srv://cluster0.abcd1.mongodb.net/?readPreference=secondary&readPreferenceTags=nodeType:ANALYTICS
  ```

  This fully isolates the discovery workload from your production traffic.
- Schedule syncs during off-peak hours. If you cannot isolate the read workload, schedule your Airbyte syncs to run during periods of low production traffic. Schema discovery runs at the start of every sync, so timing matters.
- Reduce the number of configured databases. The connector discovers collections across all configured databases. If you only need data from specific databases, remove unnecessary ones from your source configuration to reduce the total number of collections discovered.
Vendor-Specific Connector Limitations
Not all implementations or deployments of a database will be the same. This section lists specific limitations and known issues with the connector based on how or where it is deployed.
Self-Hosted MongoDB
Airbyte does not support self-signed SSL certificates for SSH tunnels.
AWS DocumentDB
The Airbyte connector does not support custom SSL certificates, which DocumentDB requires.