Metadata Filtering Guide#

This guide explains how to use metadata filtering to exclude specific documents from similarity search results.

Overview#

When searching for similar documents, you may want to exclude results that share certain metadata values with your query. For example:

Exclude documents from the same author when finding similar writing styles
Filter out documents from the same source when finding related content
Exclude documents with the same category when exploring diversity

embapi provides metadata filtering through query parameters that perform negative matching: they exclude documents whose metadata field matches the specified value.

Query Parameters#

Both similarity search endpoints support metadata filtering:

metadata_path: The JSON path to the metadata field (e.g., author, source.id, tags[0])
metadata_value: The value to exclude from results

Both parameters must be used together. If you specify one without the other, the API returns a 400 Bad Request error.
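A client can enforce the same pairing rule before a request is sent. The following is a minimal sketch; `build_filter_query` is a hypothetical helper name, not part of embapi, and the server still performs its own check.

```python
from urllib.parse import quote, urlencode


def build_filter_query(count, metadata_path=None, metadata_value=None):
    """Return a query string, refusing a half-specified metadata filter.

    Hypothetical client-side helper, shown for illustration only.
    """
    # Mirror the API rule: supply both filter parameters or neither.
    if (metadata_path is None) != (metadata_value is None):
        raise ValueError("metadata_path and metadata_value must be used together")

    params = {"count": count}
    if metadata_path is not None:
        params["metadata_path"] = metadata_path
        params["metadata_value"] = metadata_value
    # quote_via=quote encodes spaces as %20, as in the curl examples in this guide.
    return urlencode(params, quote_via=quote)


print(build_filter_query(10, "author", "William Shakespeare"))
```

Failing early on the client avoids a round trip that would only return an error.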

Basic Filtering Examples#

Exclude Documents by Author#

Find similar documents but exclude those from the same author:

curl -X GET "https://api.example.com/v1/similars/alice/literary-corpus/hamlet-soliloquy?count=10&metadata_path=author&metadata_value=William%20Shakespeare" \
  -H "Authorization: Bearer alice_api_key"

This returns similar documents, excluding any with metadata.author == "William Shakespeare".

Exclude Documents from Same Source#

Find similar content from different sources:

curl -X GET "https://api.example.com/v1/similars/alice/news-articles/article123?count=10&metadata_path=source&metadata_value=NYTimes" \
  -H "Authorization: Bearer alice_api_key"

This excludes any documents with metadata.source == "NYTimes".

Exclude by Category#

Find documents in different categories:

curl -X GET "https://api.example.com/v1/similars/alice/products/product456?count=10&metadata_path=category&metadata_value=electronics" \
  -H "Authorization: Bearer alice_api_key"

This excludes any documents with metadata.category == "electronics".

Filtering with Raw Embeddings#

Metadata filtering also works when searching with raw embedding vectors:

curl -X POST "https://api.example.com/v1/similars/alice/literary-corpus?count=10&metadata_path=author&metadata_value=William%20Shakespeare" \
  -H "Authorization: Bearer alice_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "vector": [0.032, -0.018, 0.056, ...]
  }'

This searches using the provided vector but excludes documents where metadata.author == "William Shakespeare".
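The same request can be prepared from Python with the standard library. This sketch only builds the request object and does not send it; `build_vector_search` is a hypothetical helper name, and the three-element vector is a placeholder rather than a real embedding.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_vector_search(base_url, user, project, vector, api_key,
                        metadata_path, metadata_value, count=10):
    """Prepare (but do not send) a filtered raw-vector search request."""
    query = urlencode(
        {"count": count,
         "metadata_path": metadata_path,
         "metadata_value": metadata_value},
        quote_via=quote,  # spaces become %20
    )
    url = f"{base_url}/v1/similars/{user}/{project}?{query}"
    body = json.dumps({"vector": vector}).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )


req = build_vector_search(
    "https://api.example.com", "alice", "literary-corpus",
    [0.032, -0.018, 0.056],  # placeholder values, not a full embedding
    "alice_api_key", "author", "William Shakespeare",
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it once the vector has the dimension the project expects.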

Nested Metadata Paths#

For nested metadata objects, use dot notation:

Example Metadata Structure#

{
  "author": {
    "name": "Jane Doe",
    "id": "author123",
    "affiliation": "University"
  },
  "publication": {
    "journal": "Science",
    "year": 2023
  }
}

Filter by Nested Field#

# Exclude documents from same author ID
curl -X GET "https://api.example.com/v1/similars/alice/papers/paper001?count=10&metadata_path=author.id&metadata_value=author123" \
  -H "Authorization: Bearer alice_api_key"

# Exclude documents from same journal
curl -X GET "https://api.example.com/v1/similars/alice/papers/paper001?count=10&metadata_path=publication.journal&metadata_value=Science" \
  -H "Authorization: Bearer alice_api_key"

Combining with Other Parameters#

Metadata filtering can be combined with the other search parameters:

curl -X GET "https://api.example.com/v1/similars/alice/documents/doc123?count=20&threshold=0.8&limit=10&offset=0&metadata_path=source_id&metadata_value=src_456" \
  -H "Authorization: Bearer alice_api_key"

Parameters:

count=20: Consider top 20 similar documents
threshold=0.8: Only include documents with similarity ≥ 0.8
limit=10: Return at most 10 results
offset=0: Start from first result (for pagination)
metadata_path=source_id: Filter on this metadata field
metadata_value=src_456: Exclude documents with this value
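When paging through results, only `offset` changes between requests. The sketch below generates one query string per page. It assumes that `limit` and `offset` page through the `count` candidates, as the parameter list suggests; `paged_queries` is a hypothetical helper name.

```python
from urllib.parse import quote, urlencode


def paged_queries(count, page_size, **fixed_params):
    """Yield one query string per page, advancing offset by page_size."""
    for offset in range(0, count, page_size):
        params = {"count": count, "limit": page_size, "offset": offset}
        params.update(fixed_params)  # threshold and filter stay constant
        yield urlencode(params, quote_via=quote)


for query in paged_queries(20, 10, threshold=0.8,
                           metadata_path="source_id",
                           metadata_value="src_456"):
    print(query)
```

Keeping the filter parameters identical across pages means every page is drawn from the same filtered result set.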

Use Cases#

1. Finding Similar Writing Styles Across Authors#

When analyzing writing styles, you want similar texts from different authors:

# Upload documents with author metadata
curl -X POST "https://api.example.com/v1/embeddings/alice/writing-styles" \
  -H "Authorization: Bearer alice_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "embeddings": [{
      "text_id": "tolstoy-passage-1",
      "instance_handle": "openai-large",
      "vector": [0.1, 0.2, ...],
      "vector_dim": 3072,
      "text": "Happy families are all alike...",
      "metadata": {
        "author": "Leo Tolstoy",
        "work": "Anna Karenina",
        "language": "Russian"
      }
    }]
  }'

# Find similar writing styles from other authors
curl -X GET "https://api.example.com/v1/similars/alice/writing-styles/tolstoy-passage-1?count=10&metadata_path=author&metadata_value=Leo%20Tolstoy" \
  -H "Authorization: Bearer alice_api_key"

2. Cross-Source Content Discovery#

Find related news articles from different sources:

# Search for similar content, excluding same source
curl -X GET "https://api.example.com/v1/similars/alice/news-corpus/nyt-article-456?count=15&metadata_path=source&metadata_value=New%20York%20Times" \
  -H "Authorization: Bearer alice_api_key"

This helps discover how different outlets cover similar topics.

3. Product Recommendations Across Categories#

Find similar products in different categories:

# User is viewing a laptop
curl -X GET "https://api.example.com/v1/similars/alice/product-catalog/laptop-001?count=10&threshold=0.7&metadata_path=category&metadata_value=electronics" \
  -H "Authorization: Bearer alice_api_key"

This could recommend accessories, furniture (for home office), or other complementary items.

4. Research Paper Discovery#

Find related papers from different research groups:

curl -X GET "https://api.example.com/v1/similars/alice/research-papers/paper123?count=20&metadata_path=lab_id&metadata_value=lab_abc_001" \
  -H "Authorization: Bearer alice_api_key"

This helps researchers discover related work from other research groups.

5. Avoiding Duplicate Content#

When building a diverse content feed, exclude items from the same collection:

curl -X GET "https://api.example.com/v1/similars/alice/blog-posts/post789?count=5&metadata_path=collection_id&metadata_value=series_xyz" \
  -H "Authorization: Bearer alice_api_key"

6. Cross-Language Document Discovery#

Find similar documents in other languages:

curl -X GET "https://api.example.com/v1/similars/alice/multilingual-docs/doc_en_123?count=10&metadata_path=language&metadata_value=en" \
  -H "Authorization: Bearer alice_api_key"

This finds semantically similar documents in languages other than English.

Working with Multiple Values#

Currently, each request can filter on only one metadata field and one value. To exclude multiple values, use one of these approaches:

Make multiple requests and merge results in your application
Use more specific metadata fields that combine multiple attributes
Post-process results on the client side

Example: Excluding Multiple Authors#

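import requests

def find_similar_excluding_authors(doc_id, exclude_authors, top_n=10):
    """Find similar docs excluding multiple authors.

    Each request excludes a single author, so a document passes every
    filter only if it appears in every response. The per-request results
    are therefore intersected; a union would reintroduce excluded authors.
    """
    if not exclude_authors:
        raise ValueError("exclude_authors must contain at least one author")

    common_ids = None  # IDs present in every response so far
    by_id = {}         # result record for each ID seen

    for author in exclude_authors:
        response = requests.get(
            f"https://api.example.com/v1/similars/alice/corpus/{doc_id}",
            headers={"Authorization": "Bearer alice_api_key"},
            params={
                "count": 20,
                "metadata_path": "author",
                "metadata_value": author
            },
            timeout=30,
        )
        response.raise_for_status()  # surface 4xx/5xx instead of a KeyError below
        results = response.json()['results']

        ids = {r['id'] for r in results}
        common_ids = ids if common_ids is None else common_ids & ids
        by_id.update((r['id'], r) for r in results)

    # Keep documents that no filter excluded, best matches first
    survivors = sorted(
        (by_id[i] for i in common_ids),
        key=lambda r: r['similarity'],
        reverse=True,
    )
    return survivors[:top_n]

# Usage
similar = find_similar_excluding_authors(
    "doc123",
    ["Author A", "Author B", "Author C"]
)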
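Client-side post-processing, the third approach listed above, needs only one unfiltered request. The sketch below assumes the client already holds each candidate's metadata, here in a hypothetical `metadata_by_id` lookup; this guide does not state whether search responses include metadata.

```python
def exclude_values(results, metadata_by_id, field, excluded):
    """Drop results whose metadata[field] is one of the excluded values.

    Documents with no metadata, or without the field, are kept, which
    matches the server-side rule that a missing field is never filtered.
    """
    excluded = tuple(excluded)
    kept = []
    for result in results:
        metadata = metadata_by_id.get(result["id"], {})
        if field in metadata and metadata[field] in excluded:
            continue  # matches an excluded value
        kept.append(result)
    return kept


# Illustrative data, not real API output
results = [
    {"id": "doc1", "similarity": 0.95},
    {"id": "doc2", "similarity": 0.91},
    {"id": "doc3", "similarity": 0.88},
]
metadata_by_id = {
    "doc1": {"author": "Author A"},
    "doc2": {"author": "Author D"},
    # doc3 has no stored metadata
}
print(exclude_values(results, metadata_by_id, "author", ["Author A", "Author B"]))
```

Because filtering happens after the search, request a larger `count` than you finally need so that enough candidates survive.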

Combining with Metadata Validation#

For reliable filtering, combine it with metadata schema validation, which guarantees that the field you filter on exists in every document:

# Step 1: Create project with metadata schema
curl -X POST "https://api.example.com/v1/projects/alice/validated-corpus" \
  -H "Authorization: Bearer alice_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "project_handle": "validated-corpus",
    "description": "Corpus with validated metadata",
    "instance_id": 123,
    "metadataScheme": "{\"type\":\"object\",\"properties\":{\"author\":{\"type\":\"string\"},\"source_id\":{\"type\":\"string\"}},\"required\":[\"author\",\"source_id\"]}"
  }'

# Step 2: Upload embeddings with metadata
curl -X POST "https://api.example.com/v1/embeddings/alice/validated-corpus" \
  -H "Authorization: Bearer alice_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "embeddings": [{
      "text_id": "doc001",
      "instance_handle": "openai-large",
      "vector": [0.1, 0.2, ...],
      "vector_dim": 3072,
      "metadata": {
        "author": "John Doe",
        "source_id": "source_123"
      }
    }]
  }'

# Step 3: Search with guaranteed metadata field existence
curl -X GET "https://api.example.com/v1/similars/alice/validated-corpus/doc001?count=10&metadata_path=author&metadata_value=John%20Doe" \
  -H "Authorization: Bearer alice_api_key"

See the Metadata Validation Guide for more details.
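The escaped schema string in Step 1 is easy to get wrong by hand. One way to produce it, sketched here in Python, is to write the schema as a dictionary and serialize it twice: once for the `metadataScheme` value and once for the request body.

```python
import json

schema = {
    "type": "object",
    "properties": {
        "author": {"type": "string"},
        "source_id": {"type": "string"},
    },
    "required": ["author", "source_id"],
}

project_payload = {
    "project_handle": "validated-corpus",
    "description": "Corpus with validated metadata",
    "instance_id": 123,
    # Step 1 passes the schema as a JSON-encoded string, not a nested object.
    "metadataScheme": json.dumps(schema),
}

request_body = json.dumps(project_payload)  # quotes in the schema are escaped here
print(request_body)
```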

Understanding the Filter Logic#

The metadata filter uses negative matching:

INCLUDE document IF:
  - document.similarity >= threshold
  AND
  - document.metadata[metadata_path] != metadata_value

Important: Documents without the specified metadata field are included (not filtered out).

Example#

Given this query:

?metadata_path=author&metadata_value=Alice

Included:

Documents where metadata.author == "Bob"
Documents where metadata.author == "Charlie"
Documents without an author field in metadata

Excluded:

Documents where metadata.author == "Alice"
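The rule can be written as a small predicate. This is a client-side illustration of the logic described above for a top-level field, not the server's implementation; `is_included` is a hypothetical name.

```python
_MISSING = object()  # sentinel for "field not present"


def is_included(similarity, metadata, threshold, metadata_path, metadata_value):
    """Return True if a document passes the threshold and the negative filter."""
    if similarity < threshold:
        return False
    actual = metadata.get(metadata_path, _MISSING)
    # A missing field never matches, so the document is kept.
    return actual is _MISSING or actual != metadata_value


for metadata in ({"author": "Bob"}, {"author": "Alice"}, {}):
    print(metadata, is_included(0.9, metadata, 0.8, "author", "Alice"))
```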

Performance Considerations#

Metadata filtering is performed at the database level using indexes. Each filtered search is processed in this order:

Vector similarity is computed first
Metadata filter is applied to the similarity results
Results are sorted and limited
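The order of these steps can be mimicked in memory. The sketch below is a stand-in for illustration, assuming cosine similarity and toy two-dimensional vectors; it says nothing about how the database executes the query.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def search(query_vector, documents, field, value, limit):
    # 1. Vector similarity is computed first.
    scored = [(cosine(query_vector, d["vector"]), d) for d in documents]
    # 2. The metadata filter is applied to the scored candidates.
    kept = [(s, d) for s, d in scored if d["metadata"].get(field) != value]
    # 3. Results are sorted and limited.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [d["text_id"] for _, d in kept[:limit]]


documents = [
    {"text_id": "tolstoy_2", "vector": [1.0, 0.0], "metadata": {"author": "Tolstoy"}},
    {"text_id": "dickens_1", "vector": [0.9, 0.1], "metadata": {"author": "Dickens"}},
    {"text_id": "anon_1", "vector": [0.0, 1.0], "metadata": {}},
]
print(search([1.0, 0.0], documents, "author", "Tolstoy", limit=2))
```

The closest document is removed by the filter, so the next-best candidates fill the result list.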

For best performance:

Use indexed metadata fields when possible
Keep metadata values relatively small (under 1KB per document)
Consider using IDs instead of full names for filtering

Error Handling#

Missing One Parameter#

# Missing metadata_value
curl -X GET "https://api.example.com/v1/similars/alice/corpus/doc123?metadata_path=author" \
  -H "Authorization: Bearer alice_api_key"

Error:

{
  "title": "Bad Request",
  "status": 400,
  "detail": "metadata_path and metadata_value must be used together"
}
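A client can turn this error body into an exception carrying the `detail` message. The sketch below assumes error responses always have the JSON shape shown above; `FilterRequestError` and `check_response` are hypothetical names.

```python
import json


class FilterRequestError(Exception):
    """Raised for an error response (hypothetical client-side class)."""


def check_response(status_code, body_text):
    """Return the parsed body, or raise with the server's detail message."""
    body = json.loads(body_text)
    if status_code >= 400:
        detail = body.get("detail", body.get("title", "unknown error"))
        raise FilterRequestError(f"{status_code}: {detail}")
    return body


error_body = json.dumps({
    "title": "Bad Request",
    "status": 400,
    "detail": "metadata_path and metadata_value must be used together",
})
try:
    check_response(400, error_body)
except FilterRequestError as exc:
    print(exc)
```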

Non-Existent Metadata Field#

# Filtering on field that doesn't exist in documents
curl -X GET "https://api.example.com/v1/similars/alice/corpus/doc123?count=10&metadata_path=nonexistent_field&metadata_value=some_value" \
  -H "Authorization: Bearer alice_api_key"

Result: Returns all matching documents (since none have the field, none are excluded).

URL Encoding#

Remember to URL-encode metadata values with special characters:

# Correct: URL-encoded value
curl -X GET "https://api.example.com/v1/similars/alice/corpus/doc123?metadata_path=author&metadata_value=John%20Doe%20%26%20Jane%20Smith" \
  -H "Authorization: Bearer alice_api_key"

# Incorrect: Unencoded special characters
curl -X GET "https://api.example.com/v1/similars/alice/corpus/doc123?metadata_path=author&metadata_value=John Doe & Jane Smith" \
  -H "Authorization: Bearer alice_api_key"
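In Python, the standard library can do the encoding. A short sketch using `urllib.parse`:

```python
from urllib.parse import quote, urlencode

value = "John Doe & Jane Smith"

# Encode a single value; safe="" also encodes "/" if it appears.
encoded_value = quote(value, safe="")
print(encoded_value)  # John%20Doe%20%26%20Jane%20Smith

# Encode a whole query; quote_via=quote yields %20 rather than "+".
query = urlencode(
    {"metadata_path": "author", "metadata_value": value},
    quote_via=quote,
)
print(query)  # metadata_path=author&metadata_value=John%20Doe%20%26%20Jane%20Smith
```

HTTP libraries that accept a parameter dictionary, such as the `params` argument of `requests.get`, perform this encoding for you.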

Complete Example#

Here’s a complete workflow demonstrating metadata filtering:

# 1. Create project
curl -X POST "https://api.example.com/v1/projects/alice/literature" \
  -H "Authorization: Bearer alice_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "project_handle": "literature",
    "instance_id": 123
  }'

# 2. Upload documents with metadata
curl -X POST "https://api.example.com/v1/embeddings/alice/literature" \
  -H "Authorization: Bearer alice_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "embeddings": [
      {
        "text_id": "tolstoy_1",
        "instance_handle": "openai-large",
        "vector": [0.1, 0.2, ...],
        "vector_dim": 3072,
        "text": "All happy families...",
        "metadata": {"author": "Tolstoy", "work": "Anna Karenina"}
      },
      {
        "text_id": "tolstoy_2",
        "instance_handle": "openai-large",
        "vector": [0.11, 0.21, ...],
        "vector_dim": 3072,
        "text": "Well, Prince, so Genoa and Lucca...",
        "metadata": {"author": "Tolstoy", "work": "War and Peace"}
      },
      {
        "text_id": "dickens_1",
        "instance_handle": "openai-large",
        "vector": [0.12, 0.19, ...],
        "vector_dim": 3072,
        "text": "It was the age of wisdom...",
        "metadata": {"author": "Dickens", "work": "Tale of Two Cities"}
      }
    ]
  }'

# 3. Find similar to tolstoy_1, excluding Tolstoy's works
curl -X GET "https://api.example.com/v1/similars/alice/literature/tolstoy_1?count=10&metadata_path=author&metadata_value=Tolstoy" \
  -H "Authorization: Bearer alice_api_key"

# Result: Returns dickens_1, excludes tolstoy_2

Related guides:

RAG Workflow Guide - Complete RAG implementation
Metadata Validation Guide - Schema validation
Batch Operations Guide - Upload large datasets

Troubleshooting#

No Results Returned#

Problem: Filter excludes all results

Solution:

Verify the metadata field exists in your documents
Check that the metadata value matches exactly (case-sensitive)
Try without the filter to ensure there are similar documents

Filter Not Working#

Problem: Still seeing documents you want to exclude