Metadata Filtering Guide#
This guide explains how to use metadata filtering to exclude specific documents from similarity search results.
Overview#
When searching for similar documents, you may want to exclude results that share certain metadata values with your query. For example:
- Exclude documents from the same author when finding similar writing styles
- Filter out documents from the same source when finding related content
- Exclude documents with the same category when exploring diversity
embapi provides metadata filtering through query parameters that perform negative matching: they exclude documents whose metadata field matches the specified value.
Query Parameters#
Both similarity search endpoints support metadata filtering:
- metadata_path: The JSON path to the metadata field (e.g., author, source.id, tags[0])
- metadata_value: The value to exclude from results
Both parameters must be used together. If you specify one without the other, the API returns an error.
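A minimal Python sketch of assembling such a query string on the client, using only the standard library. The helper name `build_similars_query` is hypothetical and not part of embapi; it rejects a lone filter parameter before any request is sent and percent-encodes the value.

```python
from urllib.parse import quote, urlencode


def build_similars_query(count, metadata_path=None, metadata_value=None):
    """Return a query string for a similars request (hypothetical helper)."""
    # Mirror the API rule: the two filter parameters are only valid as a pair.
    if (metadata_path is None) != (metadata_value is None):
        raise ValueError("metadata_path and metadata_value must be used together")
    params = {"count": count}
    if metadata_path is not None:
        params["metadata_path"] = metadata_path
        params["metadata_value"] = metadata_value
    # quote_via=quote encodes spaces as %20 instead of '+'.
    return urlencode(params, quote_via=quote)


print(build_similars_query(10, "author", "William Shakespeare"))
# count=10&metadata_path=author&metadata_value=William%20Shakespeare
```

Failing early on the client avoids a round trip that would only end in an error response.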
Basic Filtering Examples#
Exclude Documents by Author#
Find similar documents but exclude those from the same author:
curl -X GET "https://api.example.com/v1/similars/alice/literary-corpus/hamlet-soliloquy?count=10&metadata_path=author&metadata_value=William%20Shakespeare" \
-H "Authorization: Bearer alice_api_key"
This returns similar documents, excluding any with metadata.author == "William Shakespeare".
Exclude Documents from Same Source#
Find similar content from different sources:
curl -X GET "https://api.example.com/v1/similars/alice/news-articles/article123?count=10&metadata_path=source&metadata_value=NYTimes" \
-H "Authorization: Bearer alice_api_key"
This excludes any documents with metadata.source == "NYTimes".
Exclude by Category#
Find documents in different categories:
curl -X GET "https://api.example.com/v1/similars/alice/products/product456?count=10&metadata_path=category&metadata_value=electronics" \
-H "Authorization: Bearer alice_api_key"
This excludes any documents with metadata.category == "electronics".
Filtering with Raw Embeddings#
Metadata filtering also works when searching with raw embedding vectors:
curl -X POST "https://api.example.com/v1/similars/alice/literary-corpus?count=10&metadata_path=author&metadata_value=William%20Shakespeare" \
-H "Authorization: Bearer alice_api_key" \
-H "Content-Type: application/json" \
-d '{
"vector": [0.032, -0.018, 0.056, ...]
}'
This searches using the provided vector but excludes documents where metadata.author == "William Shakespeare".
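The same POST request can be sketched in Python with the standard library. This is an illustration under assumptions, not an official client: `build_vector_search_request` is a hypothetical helper, the three-element vector is a placeholder, and the request is only constructed, not sent.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_vector_search_request(vector, metadata_path, metadata_value, count=10):
    """Build (but do not send) a POST similars request carrying a raw vector."""
    query = urlencode(
        {"count": count, "metadata_path": metadata_path, "metadata_value": metadata_value},
        quote_via=quote,  # spaces become %20
    )
    return Request(
        f"https://api.example.com/v1/similars/alice/literary-corpus?{query}",
        data=json.dumps({"vector": vector}).encode("utf-8"),
        headers={
            "Authorization": "Bearer alice_api_key",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder vector; a real one would carry every dimension of the embedding.
req = build_vector_search_request([0.032, -0.018, 0.056], "author", "William Shakespeare")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it; the filter stays in the query string while the vector travels in the JSON body.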
Nested Metadata Paths#
For nested metadata objects, use dot notation:
Example Metadata Structure#
{
  "author": {
    "name": "Jane Doe",
    "id": "author123",
    "affiliation": "University"
  },
  "publication": {
    "journal": "Science",
    "year": 2023
  }
}
Filter by Nested Field#
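As a rough illustration of which value a dot-notation path selects in the structure above, here is a minimal Python sketch. It is a client-side approximation, not embapi's implementation: `resolve_path` and `MISSING` are hypothetical names, and array indices such as tags[0] are not handled.

```python
MISSING = object()  # sentinel for "field not present"


def resolve_path(metadata, path):
    """Follow a dot-separated path through nested dicts (hypothetical helper)."""
    node = metadata
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return MISSING
        node = node[key]
    return node


metadata = {
    "author": {"name": "Jane Doe", "id": "author123", "affiliation": "University"},
    "publication": {"journal": "Science", "year": 2023},
}

print(resolve_path(metadata, "author.id"))            # author123
print(resolve_path(metadata, "publication.journal"))  # Science
print(resolve_path(metadata, "author.email") is MISSING)  # True
```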
# Exclude documents from same author ID
curl -X GET "https://api.example.com/v1/similars/alice/papers/paper001?count=10&metadata_path=author.id&metadata_value=author123" \
-H "Authorization: Bearer alice_api_key"
# Exclude documents from same journal
curl -X GET "https://api.example.com/v1/similars/alice/papers/paper001?count=10&metadata_path=publication.journal&metadata_value=Science" \
-H "Authorization: Bearer alice_api_key"
Combining with Other Parameters#
Metadata filtering can be combined with the other search parameters:
curl -X GET "https://api.example.com/v1/similars/alice/documents/doc123?count=20&threshold=0.8&limit=10&offset=0&metadata_path=source_id&metadata_value=src_456" \
-H "Authorization: Bearer alice_api_key"
Parameters:
- count=20: Consider the top 20 similar documents
- threshold=0.8: Only include documents with similarity ≥ 0.8
- limit=10: Return at most 10 results
- offset=0: Start from the first result (for pagination)
- metadata_path=source_id: Filter on this metadata field
- metadata_value=src_456: Exclude documents with this value
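To page through results with these parameters, one parameter set per page is needed. The Python sketch below generates them; `page_params` is a hypothetical helper, and it assumes that offset is counted in results and that at most count results exist in total.

```python
def page_params(count, limit, **fixed):
    """Yield one parameter dict per page of `limit` results (hypothetical helper)."""
    for offset in range(0, count, limit):
        yield {"count": count, "limit": limit, "offset": offset, **fixed}


pages = list(
    page_params(20, 10, threshold=0.8, metadata_path="source_id", metadata_value="src_456")
)
print([p["offset"] for p in pages])  # [0, 10]
```

Each dict can be passed as the query parameters of one request; the filter parameters stay identical across pages so every page applies the same exclusion.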
Use Cases#
1. Finding Similar Writing Styles Across Authors#
When analyzing writing styles, you want similar texts from different authors:
# Upload documents with author metadata
curl -X POST "https://api.example.com/v1/embeddings/alice/writing-styles" \
-H "Authorization: Bearer alice_api_key" \
-H "Content-Type: application/json" \
-d '{
"embeddings": [{
"text_id": "tolstoy-passage-1",
"instance_handle": "openai-large",
"vector": [0.1, 0.2, ...],
"vector_dim": 3072,
"text": "Happy families are all alike...",
"metadata": {
"author": "Leo Tolstoy",
"work": "Anna Karenina",
"language": "Russian"
}
}]
}'
# Find similar writing styles from other authors
curl -X GET "https://api.example.com/v1/similars/alice/writing-styles/tolstoy-passage-1?count=10&metadata_path=author&metadata_value=Leo%20Tolstoy" \
-H "Authorization: Bearer alice_api_key"
2. Cross-Source Content Discovery#
Find related news articles from different sources:
# Search for similar content, excluding same source
curl -X GET "https://api.example.com/v1/similars/alice/news-corpus/nyt-article-456?count=15&metadata_path=source&metadata_value=New%20York%20Times" \
-H "Authorization: Bearer alice_api_key"
This helps discover how different outlets cover similar topics.
3. Product Recommendations Across Categories#
Find similar products in different categories:
# User is viewing a laptop
curl -X GET "https://api.example.com/v1/similars/alice/product-catalog/laptop-001?count=10&threshold=0.7&metadata_path=category&metadata_value=electronics" \
-H "Authorization: Bearer alice_api_key"
This could recommend accessories, furniture (for a home office), or other complementary items.
4. Research Paper Discovery#
Find related papers from different research groups:
curl -X GET "https://api.example.com/v1/similars/alice/research-papers/paper123?count=20&metadata_path=lab_id&metadata_value=lab_abc_001" \
-H "Authorization: Bearer alice_api_key"
This helps researchers discover related work from other institutions.
5. Avoiding Duplicate Content#
When building a diverse content feed, exclude items from the same collection:
curl -X GET "https://api.example.com/v1/similars/alice/blog-posts/post789?count=5&metadata_path=collection_id&metadata_value=series_xyz" \
-H "Authorization: Bearer alice_api_key"
6. Cross-Language Document Discovery#
Find similar documents in other languages:
curl -X GET "https://api.example.com/v1/similars/alice/multilingual-docs/doc_en_123?count=10&metadata_path=language&metadata_value=en" \
-H "Authorization: Bearer alice_api_key"
This finds semantically similar documents in languages other than English.
Working with Multiple Values#
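Currently, each request accepts a single metadata_path and metadata_value pair, so you can exclude only one value of one field at a time. To exclude multiple values, you can:
- Make one request per excluded value and keep only the documents that appear in every response
- Use more specific metadata fields that combine multiple attributes
- Post-process results on the client side
Example: Excluding Multiple Authors#
import requests


def find_similar_excluding_authors(doc_id, exclude_authors, top_n=10):
    """Find similar docs written by none of the given authors.

    Each request can exclude only one author, so a document is kept only
    if it appears in every response (the intersection of the result sets).
    A plain union would re-admit documents by the other excluded authors.
    """
    authors = list(dict.fromkeys(exclude_authors))  # drop duplicate names
    if not authors:
        raise ValueError("exclude_authors must not be empty")

    by_id = {}  # result id -> result record
    hits = {}   # result id -> number of responses that contained it
    for author in authors:
        response = requests.get(
            f"https://api.example.com/v1/similars/alice/corpus/{doc_id}",
            headers={"Authorization": "Bearer alice_api_key"},
            params={
                "count": 20,
                "metadata_path": "author",
                "metadata_value": author,
            },
            timeout=30,
        )
        response.raise_for_status()
        for r in response.json()['results']:
            by_id[r['id']] = r
            hits[r['id']] = hits.get(r['id'], 0) + 1

    # Keep documents present in every response, then sort by similarity.
    # Note: each response is capped at `count`, so a document that falls
    # outside the top `count` of any single response is dropped as well.
    kept = [by_id[i] for i, n in hits.items() if n == len(authors)]
    kept.sort(key=lambda r: r['similarity'], reverse=True)
    return kept[:top_n]


# Usage
similar = find_similar_excluding_authors(
    "doc123",
    ["Author A", "Author B", "Author C"]
)
Combining with Metadata Validation#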
For reliable filtering, combine it with metadata schema validation so that the fields you filter on are required to exist:
# Step 1: Create project with metadata schema
curl -X POST "https://api.example.com/v1/projects/alice/validated-corpus" \
-H "Authorization: Bearer alice_api_key" \
-H "Content-Type: application/json" \
-d '{
"project_handle": "validated-corpus",
"description": "Corpus with validated metadata",
"instance_id": 123,
"metadataScheme": "{\"type\":\"object\",\"properties\":{\"author\":{\"type\":\"string\"},\"source_id\":{\"type\":\"string\"}},\"required\":[\"author\",\"source_id\"]}"
}'
# Step 2: Upload embeddings with metadata
curl -X POST "https://api.example.com/v1/embeddings/alice/validated-corpus" \
-H "Authorization: Bearer alice_api_key" \
-H "Content-Type: application/json" \
-d '{
"embeddings": [{
"text_id": "doc001",
"instance_handle": "openai-large",
"vector": [0.1, 0.2, ...],
"vector_dim": 3072,
"metadata": {
"author": "John Doe",
"source_id": "source_123"
}
}]
}'
# Step 3: Search with guaranteed metadata field existence
curl -X GET "https://api.example.com/v1/similars/alice/validated-corpus/doc001?count=10&metadata_path=author&metadata_value=John%20Doe" \
-H "Authorization: Bearer alice_api_key"
See the Metadata Validation Guide for more details.
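In Step 1, the metadataScheme value is a JSON Schema serialized as a string, which is why its quotes are escaped. A minimal Python sketch of producing that payload without escaping by hand, assuming only the field names shown in the request above:

```python
import json

# The schema as a regular Python structure.
schema = {
    "type": "object",
    "properties": {
        "author": {"type": "string"},
        "source_id": {"type": "string"},
    },
    "required": ["author", "source_id"],
}

payload = {
    "project_handle": "validated-corpus",
    "description": "Corpus with validated metadata",
    "instance_id": 123,
    # The schema is embedded as a string, so it is serialized on its own first.
    "metadataScheme": json.dumps(schema, separators=(",", ":")),
}

# The outer serialization escapes the quotes of the embedded schema string.
body = json.dumps(payload)
print(body)
```

Serializing twice keeps the schema editable as a normal data structure and leaves the escaping to the JSON library.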
Understanding the Filter Logic#
The metadata filter uses negative matching:
INCLUDE document IF:
  document.similarity >= threshold
  AND
  document.metadata[metadata_path] != metadata_value
Important: Documents without the specified metadata field are included (not filtered out).
Example#
Given this query:
?metadata_path=author&metadata_value=Alice
Included:
- Documents where metadata.author == "Bob"
- Documents where metadata.author == "Charlie"
- Documents without an author field in metadata
Excluded:
- Documents where metadata.author == "Alice"
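The rule can be restated as a small Python predicate and run against this example. This is a local re-statement for a top-level field, not the server's code; `is_included` and the sample documents are hypothetical.

```python
def is_included(doc, metadata_path, metadata_value, threshold=0.0):
    """Apply the negative-matching rule to one document (top-level field only)."""
    metadata = doc.get("metadata", {})
    # A document lacking the field can never match the excluded value.
    field_differs = (
        metadata_path not in metadata or metadata[metadata_path] != metadata_value
    )
    return doc["similarity"] >= threshold and field_differs


docs = [
    {"id": "d1", "similarity": 0.91, "metadata": {"author": "Alice"}},
    {"id": "d2", "similarity": 0.88, "metadata": {"author": "Bob"}},
    {"id": "d3", "similarity": 0.85, "metadata": {"author": "Charlie"}},
    {"id": "d4", "similarity": 0.80, "metadata": {}},  # no author field
]

kept = [d["id"] for d in docs if is_included(d, "author", "Alice")]
print(kept)  # ['d2', 'd3', 'd4']
```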
Performance Considerations#
Metadata filtering is performed at the database level using efficient indexing. A filtered query is processed in this order:
- Vector similarity is computed first
- Metadata filter is applied to the similarity results
- Results are sorted and limited
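The three steps can be imitated locally in Python to make the ordering concrete. This is an in-memory simulation, not the database query embapi runs; cosine similarity, the `search` helper, and the sample documents are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query, docs, metadata_path, metadata_value, limit):
    # Step 1: compute vector similarity for every candidate.
    scored = [(cosine(query, d["vector"]), d) for d in docs]
    # Step 2: apply the negative metadata filter to the scored candidates.
    kept = [(s, d) for s, d in scored if d["metadata"].get(metadata_path) != metadata_value]
    # Step 3: sort by similarity (highest first) and limit.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [d["id"] for _, d in kept[:limit]]


docs = [
    {"id": "a", "vector": [1.0, 0.0], "metadata": {"author": "X"}},
    {"id": "b", "vector": [0.9, 0.1], "metadata": {"author": "Y"}},
    {"id": "c", "vector": [0.0, 1.0], "metadata": {"author": "Y"}},
]

print(search([1.0, 0.0], docs, "author", "X", limit=2))  # ['b', 'c']
```

Document "a" is the closest match, but it is removed by the filter before the limit is applied, so the limit is filled from the remaining documents.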
For best performance:
- Use indexed metadata fields when possible
- Keep metadata values relatively small (under 1KB per document)
- Consider using IDs instead of full names for filtering
Error Handling#
Missing One Parameter#
# Missing metadata_value
curl -X GET "https://api.example.com/v1/similars/alice/corpus/doc123?metadata_path=author" \
-H "Authorization: Bearer alice_api_key"
Error:
{
"title": "Bad Request",
"status": 400,
"detail": "metadata_path and metadata_value must be used together"
}
Non-Existent Metadata Field#
# Filtering on field that doesn't exist in documents
curl -X GET "https://api.example.com/v1/similars/alice/corpus/doc123?count=10&metadata_path=nonexistent_field&metadata_value=some_value" \
-H "Authorization: Bearer alice_api_key"
Result: Returns all matching documents (since none have the field, none are excluded).
URL Encoding#
Remember to URL-encode metadata values with special characters:
# Correct: URL-encoded value
curl -X GET "https://api.example.com/v1/similars/alice/corpus/doc123?metadata_path=author&metadata_value=John%20Doe%20%26%20Jane%20Smith" \
-H "Authorization: Bearer alice_api_key"
# Incorrect: Unencoded special characters
curl -X GET "https://api.example.com/v1/similars/alice/corpus/doc123?metadata_path=author&metadata_value=John Doe & Jane Smith" \
-H "Authorization: Bearer alice_api_key"
Complete Example#
Here’s a complete workflow demonstrating metadata filtering:
# 1. Create project
curl -X POST "https://api.example.com/v1/projects/alice/literature" \
-H "Authorization: Bearer alice_api_key" \
-H "Content-Type: application/json" \
-d '{
"project_handle": "literature",
"instance_id": 123
}'
# 2. Upload documents with metadata
curl -X POST "https://api.example.com/v1/embeddings/alice/literature" \
-H "Authorization: Bearer alice_api_key" \
-H "Content-Type: application/json" \
-d '{
"embeddings": [
{
"text_id": "tolstoy_1",
"instance_handle": "openai-large",
"vector": [0.1, 0.2, ...],
"vector_dim": 3072,
"text": "All happy families...",
"metadata": {"author": "Tolstoy", "work": "Anna Karenina"}
},
{
"text_id": "tolstoy_2",
"instance_handle": "openai-large",
"vector": [0.11, 0.21, ...],
"vector_dim": 3072,
"text": "Well, Prince, so Genoa and Lucca...",
"metadata": {"author": "Tolstoy", "work": "War and Peace"}
},
{
"text_id": "dickens_1",
"instance_handle": "openai-large",
"vector": [0.12, 0.19, ...],
"vector_dim": 3072,
"text": "It was the age of wisdom...",
"metadata": {"author": "Dickens", "work": "A Tale of Two Cities"}
}
]
}'
# 3. Find similar to tolstoy_1, excluding Tolstoy's works
curl -X GET "https://api.example.com/v1/similars/alice/literature/tolstoy_1?count=10&metadata_path=author&metadata_value=Tolstoy" \
-H "Authorization: Bearer alice_api_key"
# Result: Returns dickens_1, excludes tolstoy_2
Related Documentation#
- RAG Workflow Guide - Complete RAG implementation
- Metadata Validation Guide - Schema validation
- Batch Operations Guide - Upload large datasets
Troubleshooting#
No Results Returned#
Problem: Filter excludes all results
Solution:
- Verify the metadata field exists in your documents
- Check that the metadata value matches exactly (case-sensitive)
- Try without the filter to ensure there are similar documents
Filter Not Working#
Problem: Still seeing documents you want to exclude
Solution:
- Check URL encoding of the metadata value
- Verify the metadata path is correct (use dot notation for nested fields)
- Ensure both metadata_path and metadata_value are specified
Want Positive Matching#
Problem: Want to include only specific values, not exclude them
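Solution: Currently, only negative matching (exclusion) is supported. For positive matching, retrieve the unfiltered results and filter them on the client side. Because each request accepts a single metadata_path and metadata_value pair, excluding every unwanted value on the server would take one request per value, keeping only the documents that appear in every response.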
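A minimal Python sketch of the client-side approach. It assumes each result carries an id, as in the Python example earlier in this guide, and that you can look up metadata by that id yourself; `keep_only` and `metadata_by_id` are hypothetical names, not part of embapi.

```python
def keep_only(results, metadata_by_id, path, allowed_values):
    """Keep results whose top-level metadata field is one of the allowed values."""
    allowed = set(allowed_values)
    return [
        r for r in results
        # Results with unknown metadata or a missing field are dropped.
        if metadata_by_id.get(r["id"], {}).get(path) in allowed
    ]


results = [
    {"id": "doc1", "similarity": 0.93},
    {"id": "doc2", "similarity": 0.90},
    {"id": "doc3", "similarity": 0.84},
]
metadata_by_id = {
    "doc1": {"author": "Alice"},
    "doc2": {"author": "Bob"},
    "doc3": {},  # no author field
}

print([r["id"] for r in keep_only(results, metadata_by_id, "author", ["Alice"])])
# ['doc1']
```

Unlike the server-side negative filter, this positive filter drops documents that lack the field, since they cannot match any allowed value.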