It is crucial for my application to be able to randomly select multiple documents from a collection in Firebase.
Since there is no native function built into Firebase (that I know of) for a query that does this, my first thought was to use query cursors to pick a random start and end index, assuming I knew the number of documents in the collection.
This approach would work, but only in a limited way, since every document would be served in sequence with its neighboring documents each time. If I were able to select a document by its index in its parent collection, I could implement a random document query, but I can't find any documentation describing how to do this, or even whether it is possible.
Here is what I would like to do. Consider the following Firestore schema:
root/
  posts/
    docA
    docB
    docC
    docD
Then on my client side (I'm in a Swift environment) I want to write a query that does this:
db.collection("posts")[0, 1, 3] // would return: docA, docB, docD
Can I do something similar? Alternatively, is there any other way to select random documents in a similar way?
Please help.
Posting this to help anyone who encounters this problem in the future.
If you use auto IDs, you can generate a new auto ID and query for the closest auto ID, as described in Dan McGrath's answer.
I recently created a random quotes API that needed to get random quotes from a Firestore collection.
This is how I solved the problem.
The key to the query is a comparison of the document ID against the newly generated auto ID.
If no document is found, run the query again with the opposite comparison.
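As a rough illustration of that closest-ID lookup, here is a minimal in-memory sketch in Swift. It does not call Firestore: the `documentIDs` array stands in for the collection ordered by document ID, the short IDs are made up, and `closestDocumentID(to:in:)` is a hypothetical helper name. It also assumes the first comparison is "greater than", which the answer does not specify.

```swift
// In-memory stand-in for a collection, kept in ascending document-ID order.
// The IDs are invented and shortened for readability.
let documentIDs = ["2fKq", "9xTb", "Hc7d", "mP3z"].sorted()

/// Returns the first ID greater than `key`. If there is none, retries with the
/// comparison reversed and returns the closest ID below `key`.
func closestDocumentID(to key: String, in sortedIDs: [String]) -> String? {
    if let above = sortedIDs.first(where: { $0 > key }) {
        return above
    }
    // Opposite operation: nothing sorts after `key`, so look before it.
    return sortedIDs.last(where: { $0 < key })
}
```

With a real collection, each branch would correspond to one query limited to a single document.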
Hope this helps!
Using a randomly generated index and a simple query, you can randomly select documents from a collection or collection group in Cloud Firestore.
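This answer is divided into 4 parts, each with different options:
1. How to generate random index
2. How to query random index
3. Select multiple random documents
4. Reseed for consistent randomness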
How to generate random index
The basis of this answer is to create an index field that, when sorted in ascending or descending order, will cause all documents to be sorted randomly. There are a number of different ways to create this, so let's look at 2, starting with the most accessible method.
Auto-ID version
If you use the randomly generated automatic IDs provided in our client libraries, you can use the same system to randomly select a document. In this case, the randomly ordered index is the document ID.
Later in our query section, the random value you generate is a new auto-ID (iOS, Android, Web), the field you query is the __name__ field, and the "low value" mentioned later is an empty string. This is by far the simplest way to generate the random index, and it works regardless of language and platform.
By default, the document name (__name__) is only indexed in ascending order, and you cannot rename an existing document short of deleting and recreating it. If you need either of these, you can still use this method: just store an auto-ID as an actual field named random instead of overloading the document name for this purpose.
Random integer version
When you write a document, first generate a random integer in a bounded range and set it as a field named random. Depending on the number of documents you expect, you can use a different bounded range to save space or reduce the risk of collisions (which reduce the effectiveness of this technique).
You should consider which languages you need, as each has different considerations. While Swift is straightforward, JavaScript has a notable gotcha: its numbers are double-precision floats, so it cannot natively represent every 64-bit integer.
This will create an index with your documents randomly sorted. Later in our query section, the random value you generate will be another one of these values, and the "low value" mentioned later will be -1.
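To make the two index variants concrete, here is a small Swift sketch of generating each kind of value. Both helpers are hypothetical names. The string helper only imitates the 20-character alphanumeric style of client-generated auto-IDs; in a real app you would more likely take the ID from a new document reference created by the SDK. The 32-bit upper bound is just one possible choice of range.

```swift
/// Hypothetical helper: builds a 20-character alphanumeric key in the style of
/// a client-generated auto-ID. Generic over the generator so it can be seeded.
func makeAutoIDStyleKey<G: RandomNumberGenerator>(using generator: inout G) -> String {
    let alphabet = Array("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789")
    var characters: [Character] = []
    for _ in 0..<20 {
        characters.append(alphabet.randomElement(using: &generator)!)
    }
    return String(characters)
}

/// Hypothetical helper: picks a non-negative integer in a bounded range,
/// suitable for storing in the document's `random` field.
func makeRandomIndexValue<G: RandomNumberGenerator>(
    upperBound: Int64,
    using generator: inout G
) -> Int64 {
    return Int64.random(in: 0...upperBound, using: &generator)
}

var generator = SystemRandomNumberGenerator()
let autoIDStyleKey = makeAutoIDStyleKey(using: &generator)
// Assumed range: 32-bit, which the answer suggests for smaller datasets.
let randomFieldValue = makeRandomIndexValue(upperBound: Int64(Int32.max), using: &generator)
```

The same helpers can be reused at query time, since the value you search with must have the same form as the stored index values.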
How to query random index
Now that you have a random index, you will want to query it. Below we look at some simple variants for selecting 1 random document, as well as options for selecting more than 1.
For all of these options, you will want to generate a new random value in the same form as the indexed values you created when writing the document, denoted by the variable random below. We will use this value to find a random spot on the index.
Wrap-around
Now that you have a random value, you can query for a single document: filter for values >= random, order by the index field, and limit the result to 1.
Check that a document was returned. If not, query again, but use the "low value" for your random index. For example, if you used random integers, lowValue is 0.
As long as you have at least one document, you are guaranteed to get 1 document back.
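The wrap-around logic can be sketched in Swift against an in-memory array. This is not Firestore code: `Doc`, `firstAtOrAbove`, and `pickWrapAround` are hypothetical names, and the sorted array stands in for an ascending index on the random field. Against a real collection, each call to `firstAtOrAbove` would correspond to one query of the form `whereField("random", isGreaterThanOrEqualTo: value).order(by: "random").limit(to: 1)`.

```swift
struct Doc {
    let id: String
    let random: Int
}

// In-memory stand-in for the collection, ordered as an ascending index would be.
let docs = [
    Doc(id: "docA", random: 409_496),
    Doc(id: "docB", random: 436_496),
    Doc(id: "docC", random: 818_992),
].sorted { $0.random < $1.random }

/// Mirrors "random >= value, order by random ascending, limit 1".
func firstAtOrAbove(_ value: Int, in sortedDocs: [Doc]) -> Doc? {
    return sortedDocs.first(where: { $0.random >= value })
}

/// Tries the random value first, then wraps around to the low value.
func pickWrapAround(random: Int, lowValue: Int = 0, in sortedDocs: [Doc]) -> Doc? {
    return firstAtOrAbove(random, in: sortedDocs) ?? firstAtOrAbove(lowValue, in: sortedDocs)
}

// Typical call site: draw a fresh value in the same range used when writing.
let picked = pickWrapAround(random: Int.random(in: 0...Int(Int32.max)), in: docs)
```

The second lookup only runs when the random value lands above every stored value, so most selections cost a single query.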
Both directions
The wrap-around method is simple to implement and allows you to optimize storage with only an ascending index enabled. One downside is that some values can be unfairly shielded. For example, if the first 3 documents (A, B, C) out of 10K have random index values A:409496, B:436496, C:818992, then A and C each have just under a 1/10K chance of being selected, whereas B is effectively shielded by its proximity to A and has only about a 1/160K chance (assuming a 32-bit range, the gap of 27,000 between A and B is only about 1/160K of the index).
Instead of querying in one direction and wrapping around if a value is not found, you can randomly choose between >= and <=, which halves the probability of unfairly shielded values at the cost of doubling index storage.
If one direction returns no results, switch to the other direction.
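A minimal Swift sketch of the bi-directional variant, again using an in-memory array rather than Firestore. The `Direction` enum and both functions are hypothetical names. The descending branch stands in for a query with <= ordered in descending order, which is why a second (descending) index is needed in practice.

```swift
struct Doc {
    let id: String
    let random: Int
}

enum Direction {
    case ascending
    case descending
}

/// One lookup in a single direction, limited to one document.
func pick(random: Int, direction: Direction, in sortedDocs: [Doc]) -> Doc? {
    switch direction {
    case .ascending:
        // Mirrors "random >= value, ascending, limit 1".
        return sortedDocs.first(where: { $0.random >= random })
    case .descending:
        // Mirrors "random <= value, descending, limit 1".
        return sortedDocs.last(where: { $0.random <= random })
    }
}

/// Tries the chosen direction, then switches to the other if nothing came back.
func pickBidirectional(random: Int, direction: Direction, in sortedDocs: [Doc]) -> Doc? {
    let opposite: Direction = (direction == .ascending) ? .descending : .ascending
    return pick(random: random, direction: direction, in: sortedDocs)
        ?? pick(random: random, direction: opposite, in: sortedDocs)
}

let sampleDocs = [
    Doc(id: "docA", random: 100),
    Doc(id: "docB", random: 500),
    Doc(id: "docC", random: 900),
]
// At the call site the direction itself is chosen at random.
let chosenDirection: Direction = Bool.random() ? .ascending : .descending
let result = pickBidirectional(random: 300, direction: chosenDirection, in: sampleDocs)
```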
Select multiple random documents
Often, you will want to select more than 1 random document at a time. There are two different ways to adapt the above techniques depending on the trade-offs you want.
Rinse and repeat
This method is very simple. Just repeat the process, including choosing a new random integer each time.
This method will give you a random sequence of documents without having to worry about seeing the same pattern repeatedly.
The trade-off is that it will be slower than the next method since it requires a separate round trip to serve each document.
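As a sketch of "rinse and repeat", the Swift function below repeats a wrap-around lookup once per requested document, drawing a new random value each time. It works on an in-memory array, and `pickMany` and its `nextRandom` parameter are hypothetical names. Each loop iteration stands in for one round trip. Because the draws are independent, the same document can come back more than once, so deduplicate if that matters for your use case.

```swift
struct Doc {
    let id: String
    let random: Int
}

/// Repeats the single-document selection `count` times with a fresh random value.
func pickMany(count: Int, in sortedDocs: [Doc], nextRandom: () -> Int) -> [Doc] {
    var picked: [Doc] = []
    for _ in 0..<max(count, 0) {
        let value = nextRandom()
        // Wrap-around: fall back to the lowest document when nothing is >= value.
        let match = sortedDocs.first(where: { $0.random >= value }) ?? sortedDocs.first
        if let doc = match {
            picked.append(doc)
        }
    }
    return picked
}

let sortedDocs = [
    Doc(id: "docA", random: 100),
    Doc(id: "docB", random: 500),
    Doc(id: "docC", random: 900),
]
let threeDocs = pickMany(count: 3, in: sortedDocs) { Int.random(in: 0...1_000) }
```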
Keep it up
In this method, simply increase the limit to the number of documents you want. This is a bit more complex because the call may return 0..limit documents. You then need to get the missing documents in the same manner, but with the limit reduced to only the difference. If you know that there are more documents in total than the number you are asking for, you can optimize by ignoring the edge case of never getting back enough documents on the second call (but not the first).
The trade-off with this solution is repeated sequences. While the documents are randomly ordered, if you ever end up with overlapping ranges, you will see the same pattern you saw before. There are ways to mitigate this concern, discussed in the next section on reseeding.
This method is faster than "rinse and repeat" because you will request all documents in one call in the best case or two calls in the worst case.
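The batch variant can be sketched in Swift as one lookup with a larger limit, followed by a second lookup for the shortfall. This is an in-memory illustration with hypothetical names (`pickBatch`, `lowValue`). One assumption not stated in the answer: the second lookup stops below the original random value, so that a collection smaller than the limit does not return the same document twice.

```swift
struct Doc {
    let id: String
    let random: Int
}

/// Returns up to `limit` documents (limit is assumed to be non-negative),
/// using at most two lookups.
func pickBatch(random: Int, limit: Int, lowValue: Int = 0, in sortedDocs: [Doc]) -> [Doc] {
    // First call: "random >= value, ascending, limit N". May return 0..limit documents.
    var batch = Array(sortedDocs.filter { $0.random >= random }.prefix(limit))
    let missing = limit - batch.count
    if missing > 0 {
        // Second call: restart from the low value and ask only for the difference.
        // Assumption: cap at `random` to avoid re-fetching first-call results.
        let wrapped = sortedDocs
            .filter { $0.random >= lowValue && $0.random < random }
            .prefix(missing)
        batch.append(contentsOf: wrapped)
    }
    return batch
}

let indexedDocs = [
    Doc(id: "a", random: 100),
    Doc(id: "b", random: 300),
    Doc(id: "c", random: 500),
    Doc(id: "d", random: 700),
    Doc(id: "e", random: 900),
]
let batch = pickBatch(random: 600, limit: 3, in: indexedDocs)
```

In this sample the first lookup finds only two documents at or above 600, so the second lookup supplies the remaining one from the low end of the index.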
Reseed for consistent randomness
While this method gives you documents randomly, if the document set is static then the probability of each document being returned is static as well. This is a problem because some values may have unfairly low or high probabilities based on the initial random values they received. In many use cases this is fine, but in some you may want to increase the long-term randomness so that any 1 document has a more uniform chance of being returned.
Note that inserted documents end up interleaved with existing ones, gradually changing the probabilities, as does deleting documents. If the insert/delete rate is too small for the number of documents, there are a few strategies to address this.
Multi-random
Rather than worrying about reseeding, you can always create multiple random indexes per document and then randomly select one of those indexes each time. For example, let the field random be a map containing subfields 1 to 3.
You then query against random.1, random.2, or random.3 at random, creating a greater spread of randomness. This essentially trades increased storage for the increased compute (document writes) that reseeding would require.
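A possible shape for the multi-random idea in Swift, using an in-memory array. The `Doc` type, the `pick(random:slot:in:)` helper, and all sample values are made up for illustration. The `random` dictionary stands in for the map field, and the sort inside the helper stands in for a separate index on each subfield.

```swift
struct Doc {
    let id: String
    // Stand-in for the map field: subfields "1" to "3", each with its own value.
    let random: [String: Int]
}

/// Wrap-around selection against one subfield, e.g. random.2.
func pick(random value: Int, slot: String, in docs: [Doc]) -> Doc? {
    // Order by the chosen subfield, as an index on random.<slot> would.
    var ordered: [(doc: Doc, value: Int)] = []
    for doc in docs {
        if let slotValue = doc.random[slot] {
            ordered.append((doc: doc, value: slotValue))
        }
    }
    ordered.sort { $0.value < $1.value }
    let match = ordered.first(where: { $0.value >= value }) ?? ordered.first
    return match?.doc
}

let mapDocs = [
    Doc(id: "docA", random: ["1": 10, "2": 90, "3": 50]),
    Doc(id: "docB", random: ["1": 60, "2": 20, "3": 80]),
]
// Each selection picks the subfield at random as well as the value.
let slot = ["1", "2", "3"].randomElement()!
let selected = pick(random: Int.random(in: 0...100), slot: slot, in: mapDocs)
```

The same search value of 40 lands on docB through subfield 1 but on docA through subfields 2 and 3, which is the extra spread the answer describes.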
Reseed on writes
Any time you update a document, regenerate the random value(s) of the random field. This will move the document around in the random index.
Reseed on reads
If the random values generated are not uniformly distributed (they are random, so this is expected), the same document may be picked a disproportionate amount of the time. This is easily counteracted by updating the randomly selected document with new random values after it is read.
Since writes are more expensive and can hotspot, you can elect to update on only a subset of reads (e.g., if random(0,100) === 0) update;).
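Sampled reseeding on read could look roughly like this in Swift. It is an in-memory sketch with hypothetical names (`shouldReseed`, `readAndMaybeReseed`). The roll and the replacement value are passed in as parameters so the behavior is easy to check. In a real app the assignment would be a document update, and the 1-in-100 sampling rate simply follows the example in the answer.

```swift
struct Doc {
    let id: String
    var random: Int
}

/// Reseed on roughly 1 in 100 reads, matching "if random(0,100) === 0".
func shouldReseed(roll: Int) -> Bool {
    return roll == 0
}

/// Returns the document as it was read, then optionally moves it in the index.
func readAndMaybeReseed(index: Int, in docs: inout [Doc], roll: Int, newRandom: Int) -> Doc {
    let doc = docs[index]
    if shouldReseed(roll: roll) {
        // Stand-in for a write that updates the `random` field.
        docs[index].random = newRandom
    }
    return doc
}

var storedDocs = [Doc(id: "docA", random: 5)]
let readDoc = readAndMaybeReseed(
    index: 0,
    in: &storedDocs,
    roll: Int.random(in: 0..<100),
    newRandom: Int.random(in: 0...Int(Int32.max))
)
```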