Metadata#
Structured JSON data attached to embeddings for organization, validation, and filtering.
Overview#
Metadata provides context and structure for your embeddings:
- Organization: Categorize and group documents
- Filtering: Exclude documents in similarity searches
- Validation: Ensure consistent structure (optional)
- Context: Store additional document information
Metadata Structure#
Format#
Metadata is JSON stored as JSONB in PostgreSQL:
{
"author": "William Shakespeare",
"title": "Hamlet",
"year": 1603,
"genre": "drama"
}
Types#
Supported JSON types:
- String: "author": "Shakespeare"
- Number: "year": 1603
- Boolean: "published": true
- Array: "tags": ["tragedy", "revenge"]
- Object: "author": {"name": "...", "id": "..."}
- Null: "notes": null
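As a rough illustration of how these types look from client code, the sketch below builds a metadata object in Python and round-trips it through the standard `json` module. The `source` field is a hypothetical name chosen for the example; the type mapping shown is Python's own and is not specific to this service.

```python
import json

# Hypothetical metadata object that uses each supported JSON type once.
metadata = {
    "author": "Shakespeare",                 # string
    "year": 1603,                            # number
    "published": True,                       # boolean (serialized as true)
    "tags": ["tragedy", "revenge"],          # array
    "source": {"name": "...", "id": "..."},  # object (hypothetical field)
    "notes": None,                           # null
}

# Serialize to JSON text and parse it back.
encoded = json.dumps(metadata)
decoded = json.loads(encoded)
```

The round trip preserves every value, so `decoded` compares equal to `metadata`.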
Nested Structure#
Complex hierarchies are supported:
{
"document": {
"id": "W0017",
"type": "manuscript"
},
"author": {
"name": "John Milton",
"birth_year": 1608,
"nationality": "English"
},
"publication": {
"year": 1667,
"publisher": "First Edition",
"location": "London"
},
"tags": ["poetry", "epic", "religious"]
}
Metadata Schemas#
Purpose#
JSON Schema validation ensures consistent metadata across all project embeddings.
Defining a Schema#
Include metadataScheme, as a JSON-encoded string, when creating or updating a project:
POST /v1/projects/alice
{
"project_handle": "research",
"metadataScheme": "{\"type\":\"object\",\"properties\":{\"author\":{\"type\":\"string\"},\"year\":{\"type\":\"integer\"}},\"required\":[\"author\"]}"
}
Schema Format#
Use JSON Schema (draft-07+):
{
"type": "object",
"properties": {
"author": {
"type": "string",
"minLength": 1
},
"year": {
"type": "integer",
"minimum": 1000,
"maximum": 2100
},
"genre": {
"type": "string",
"enum": ["poetry", "prose", "drama"]
}
},
"required": ["author", "year"]
}
Validation Behavior#
With schema defined:
- All embeddings validated on upload
- Invalid metadata rejected with detailed error
- Schema enforced consistently
Without schema:
- Any JSON metadata accepted
- No validation performed
- Maximum flexibility
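The accept/reject behavior described above can be sketched with a deliberately small validator. This is an illustration only: it understands just the `required` and top-level `type` keywords, the function names are hypothetical, and the service's actual validator and error wording may differ.

```python
def validate_metadata(metadata, schema):
    """Return a list of error strings; an empty list means the metadata passed.

    Only 'required' and per-property 'type' are checked. A full JSON Schema
    validator supports many more keywords.
    """
    # bool is a subclass of int in Python, so it is excluded explicitly.
    type_checks = {
        "string": lambda v: isinstance(v, str),
        "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
        "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
        "boolean": lambda v: isinstance(v, bool),
        "array": lambda v: isinstance(v, list),
        "object": lambda v: isinstance(v, dict),
        "null": lambda v: v is None,
    }
    errors = []
    for field in schema.get("required", []):
        if field not in metadata:
            errors.append(f"{field} is required")
    for field, rules in schema.get("properties", {}).items():
        expected = rules.get("type")
        if field in metadata and expected in type_checks:
            if not type_checks[expected](metadata[field]):
                errors.append(f"{field} must be {expected}")
    return errors


def accept_upload(metadata, schema=None):
    """Without a schema any metadata is accepted; with one, errors reject it."""
    if schema is None:
        return True, []
    errors = validate_metadata(metadata, schema)
    return not errors, errors
```

Calling `accept_upload(metadata)` with no schema always succeeds, which mirrors the "no validation performed" case.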
Common Patterns#
See Metadata Validation Guide for examples.
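One pattern worth spelling out: in the project request shown under Defining a Schema, `metadataScheme` is a string containing JSON, not a nested object. A minimal sketch of building such a body in Python follows. It assumes the request body is assembled client-side with the standard `json` module and does not send anything.

```python
import json

# The schema as an ordinary Python dict.
schema = {
    "type": "object",
    "properties": {
        "author": {"type": "string"},
        "year": {"type": "integer"},
    },
    "required": ["author"],
}

# Serialize the schema to a string first, then embed that string in the body.
project_body = {
    "project_handle": "research",
    "metadataScheme": json.dumps(schema, separators=(",", ":")),
}

# Serializing the body escapes the inner quotes, as in the example request.
request_json = json.dumps(project_body)
```

Encoding twice is what produces the backslash-escaped quotes seen in the example request.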
Using Metadata#
On Upload#
Include metadata with each embedding:
POST /v1/embeddings/alice/research
{
"embeddings": [
{
"text_id": "doc1",
"instance_handle": "my-embeddings",
"vector": [...],
"metadata": {
"author": "Shakespeare",
"title": "Hamlet",
"year": 1603
}
}
]
}
In Responses#
Metadata is returned when retrieving embeddings:
GET /v1/embeddings/alice/research/doc1
{
"text_id": "doc1",
"metadata": {
"author": "Shakespeare",
"title": "Hamlet",
"year": 1603
},
"vector": [...],
...
}
Metadata Filtering#
Exclusion Filter#
Exclude documents where metadata matches value:
GET /v1/similars/alice/research/doc1?metadata_path=author&metadata_value=Shakespeare
Result: Returns similar documents excluding those with metadata.author == "Shakespeare".
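To make the exclusion semantics concrete, here is a small in-memory sketch of the same idea. It is not the server's implementation: the candidate records are hypothetical, only top-level fields are handled, and keeping documents that lack the field is an assumption.

```python
def exclude_matching(candidates, metadata_path, metadata_value):
    """Drop candidates whose top-level metadata field equals metadata_value.

    Candidates without the field are kept (an assumption of this sketch).
    """
    return [
        candidate
        for candidate in candidates
        if candidate.get("metadata", {}).get(metadata_path) != metadata_value
    ]


# Hypothetical similarity results for doc1.
candidates = [
    {"text_id": "doc2", "metadata": {"author": "Shakespeare"}},
    {"text_id": "doc3", "metadata": {"author": "Milton"}},
    {"text_id": "doc4", "metadata": {}},
]
remaining = exclude_matching(candidates, "author", "Shakespeare")
```

Here `remaining` contains `doc3` and `doc4`; only the record whose `author` equals the filter value is removed.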
Path Syntax#
Use JSON path notation:
Simple field:
metadata_path=author
Nested field:
metadata_path=author.name
Array element (not currently supported):
metadata_path=tags[0]
URL Encoding#
Encode special characters:
# Space
metadata_value=John%20Doe
# Quotes (if needed)
metadata_value=%22quoted%20value%22
Use Cases#
Exclude same work:
?metadata_path=title&metadata_value=Hamlet
Exclude same author:
?metadata_path=author&metadata_value=Shakespeare
Exclude same source:
?metadata_path=source_id&metadata_value=corpus-a
Exclude same category:
?metadata_path=category&metadata_value=draft
See Metadata Filtering Guide for detailed examples.
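Rather than encoding values by hand, the query strings above can be built programmatically. The sketch below uses Python's standard `urllib.parse`. The helper name `similars_url` is hypothetical, and the path layout is copied from the examples in this section.

```python
from urllib.parse import quote, urlencode


def similars_url(owner, project, text_id, metadata_path, metadata_value):
    """Build a similars request path with a percent-encoded exclusion filter."""
    # quote_via=quote encodes spaces as %20 (the default, quote_plus, uses '+').
    query = urlencode(
        {"metadata_path": metadata_path, "metadata_value": metadata_value},
        quote_via=quote,
    )
    return f"/v1/similars/{owner}/{project}/{text_id}?{query}"


url = similars_url("alice", "research", "doc1", "author", "John Doe")
```

Passing `quote_via=quote` keeps the output consistent with the `%20` form shown under URL Encoding.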
Validation Examples#
Simple Schema#
{
"type": "object",
"properties": {
"author": {"type": "string"},
"year": {"type": "integer"}
},
"required": ["author"]
}
Valid metadata:
{"author": "Shakespeare", "year": 1603}
{"author": "Milton"}
Invalid metadata:
{"year": 1603} // Missing required 'author'
{"author": 123} // Wrong type (should be string)
Schema with Constraints#
{
"type": "object",
"properties": {
"title": {
"type": "string",
"minLength": 1,
"maxLength": 200
},
"rating": {
"type": "number",
"minimum": 0,
"maximum": 5
},
"tags": {
"type": "array",
"items": {"type": "string"},
"minItems": 1,
"maxItems": 10
}
}
}
Schema with Enums#
{
"type": "object",
"properties": {
"language": {
"type": "string",
"enum": ["en", "de", "fr", "es", "la"]
},
"status": {
"type": "string",
"enum": ["draft", "review", "published"]
}
}
}
Storage and Performance#
Storage#
Metadata is stored as JSONB in PostgreSQL:
- Efficient: Binary storage format
- Indexable: Can create indexes on fields
- Queryable: Use PostgreSQL JSON operators
Size Considerations#
Typical metadata sizes:
- Simple: 50-200 bytes
- Moderate: 200-1000 bytes
- Complex: 1-5KB
- Very large: >5KB (consider storing elsewhere)
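To see where a given metadata object falls in these ranges, its encoded size can be estimated client-side. The sketch below measures the compact UTF-8 JSON text, which is only a proxy: the size of the JSONB value as stored in PostgreSQL will differ somewhat. The function names are hypothetical and the thresholds simply restate the approximate ranges above.

```python
import json


def metadata_size_bytes(metadata):
    """Length in bytes of the compact UTF-8 JSON encoding of the metadata."""
    text = json.dumps(metadata, separators=(",", ":"), ensure_ascii=False)
    return len(text.encode("utf-8"))


def size_category(metadata):
    """Bucket the estimated size using the approximate ranges listed above."""
    size = metadata_size_bytes(metadata)
    if size <= 200:
        return "simple"
    if size <= 1000:
        return "moderate"
    if size <= 5000:  # assumes 5KB means 5000 bytes
        return "complex"
    return "very large"


small = {"author": "Shakespeare", "year": 1603}
category = size_category(small)
```

For the `small` example the estimate is 36 bytes, so it lands in the "simple" bucket.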
Performance#
Metadata filtering:
- JSONB queries are efficient
- Add indexes for frequently filtered fields
- Keep metadata reasonably sized
Example index (if needed):
CREATE INDEX idx_embeddings_author
ON embeddings ((metadata->>'author'));
Common Patterns#
Document Provenance#
Track document source and history:
{
"source": {
"corpus": "Shakespeare Works",
"collection": "Tragedies",
"document_id": "hamlet",
"version": 2
},
"imported_at": "2024-01-15T10:30:00Z",
"imported_by": "researcher1"
}
Hierarchical Documents#
Structure for nested documents:
{
"work": "Paradise Lost",
"book": 1,
"line": 1,
"chapter": null,
"section": "Invocation"
}
Multi-Language Content#
Track language and translation info:
{
"language": "en",
"original_language": "la",
"translated_by": "John Smith",
"translation_year": 1850
}
Research Metadata#
Academic paper metadata:
{
"doi": "10.1234/example.2024.001",
"authors": ["Alice Smith", "Bob Jones"],
"journal": "Digital Humanities Review",
"year": 2024,
"keywords": ["NLP", "embeddings", "RAG"]
}
Updating Metadata#
Current Limitation#
Metadata cannot be updated directly. To change:
- Delete embedding
- Re-upload with updated metadata
# Delete
DELETE /v1/embeddings/alice/project/doc1
# Re-upload with new metadata
POST /v1/embeddings/alice/project
{
"embeddings": [{
"text_id": "doc1",
"metadata": {...updated...},
...
}]
}
Schema Updates#
Updating Project Schema#
Use PATCH to update schema:
PATCH /v1/projects/alice/research
{
"metadataScheme": "{...new schema...}"
}
Effect on Existing Embeddings#
- Existing embeddings: Not revalidated
- New embeddings: Validated against new schema
- Updates: Validated against current schema
Migration Strategy#
When updating schema:
- Update project schema
- Verify new embeddings work
- Optionally re-upload existing embeddings
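Because existing embeddings are not revalidated, it can help to check locally which of them would fail the new schema before deciding what to re-upload. The sketch below covers only newly required fields. The record layout (`text_id` plus `metadata`) follows the upload examples, and the function name is hypothetical.

```python
def missing_required(records, new_schema):
    """Map text_id to the required fields absent from that record's metadata."""
    required = new_schema.get("required", [])
    report = {}
    for record in records:
        metadata = record.get("metadata", {})
        missing = [field for field in required if field not in metadata]
        if missing:
            report[record["text_id"]] = missing
    return report


# Hypothetical existing records and a stricter schema.
records = [
    {"text_id": "doc1", "metadata": {"author": "Shakespeare", "year": 1603}},
    {"text_id": "doc2", "metadata": {"author": "Milton"}},
]
new_schema = {"type": "object", "required": ["author", "year"]}
report = missing_required(records, new_schema)
```

In this example only `doc2` is reported, since it lacks `year`.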
Validation Errors#
Common Errors#
Missing required field:
{
"status": 400,
"detail": "metadata validation failed: author is required"
}
Wrong type:
{
"status": 400,
"detail": "metadata validation failed: year must be integer"
}
Enum violation:
{
"status": 400,
"detail": "metadata validation failed: genre must be one of [poetry, prose, drama]"
}
Debugging#
To debug validation errors:
- Check project schema: GET /v1/projects/owner/project
- Validate metadata with online tool: jsonschemavalidator.net
- Review error message for specific field
- Update metadata or schema as needed
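When handling these errors in client code, the field-level message can be pulled out of the response body. The sketch below assumes error bodies have exactly the shape shown under Common Errors (a `status` of 400 and a `detail` string starting with "metadata validation failed: "). Real responses may carry other fields or wording, and the helper name is hypothetical.

```python
import json

# Prefix taken from the example error bodies in this section.
PREFIX = "metadata validation failed: "


def validation_message(response_body):
    """Return the field-level message from a validation error body, else None."""
    error = json.loads(response_body)
    detail = error.get("detail", "")
    if error.get("status") == 400 and detail.startswith(PREFIX):
        return detail[len(PREFIX):]
    return None


body = '{"status": 400, "detail": "metadata validation failed: author is required"}'
message = validation_message(body)
```

For the example body, `message` is the text after the prefix, which names the offending field.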
Best Practices#
Schema Design#
- Start simple, add complexity as needed
- Use required fields for critical data
- Use enums for controlled vocabularies
- Document your schema
Metadata Content#
- Keep metadata focused and relevant
- Avoid redundant data
- Use consistent field names
- Consider future queries and filters
Performance#
- Keep metadata reasonably sized (<5KB)
- Index frequently queried fields
- Avoid deeply nested structures when possible
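As a quick way to spot deeply nested metadata before upload, the nesting depth can be computed recursively. This is a minimal sketch with a hypothetical function name. What counts as "too deep" is left to the reader, since no limit is stated here.

```python
def nesting_depth(value):
    """Depth of nested objects and arrays; scalars have depth 0."""
    if isinstance(value, dict):
        return 1 + max((nesting_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((nesting_depth(v) for v in value), default=0)
    return 0


flat = {"author": "Shakespeare", "year": 1603}
nested = {"source": {"corpus": {"sections": ["Tragedies"]}}}
depths = (nesting_depth(flat), nesting_depth(nested))
```

A flat object has depth 1, while the `nested` example reaches depth 4 (three objects plus one array).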
Troubleshooting#
Validation Fails#
Problem: Metadata doesn’t validate
Solutions:
- Check project schema
- Verify metadata structure
- Test with JSON Schema validator
- Review error message details
Filtering Not Working#
Problem: Metadata filter doesn’t exclude documents
Solutions:
- Verify field path is correct
- Check value matches exactly (case-sensitive)
- URL-encode special characters
- Confirm metadata field exists
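The checks above can be run locally against a document's metadata. The sketch below resolves a dotted path such as `author.name` one key at a time and reports why a filter value would not match. Both function names are hypothetical, array indexing is left out because it is not currently supported, and the comparison is an exact, case-sensitive string match.

```python
def resolve_path(metadata, path):
    """Follow a dotted path through nested objects; return (found, value)."""
    current = metadata
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return False, None
        current = current[key]
    return True, current


def diagnose_filter(metadata, path, value):
    """Explain how a string filter value relates to the metadata at path."""
    found, actual = resolve_path(metadata, path)
    if not found:
        return "field missing"
    if actual == value:
        return "match"
    if isinstance(actual, str) and actual.lower() == value.lower():
        return "case mismatch"
    return "different value"


metadata = {"author": {"name": "John Milton"}, "title": "Paradise Lost"}
result = diagnose_filter(metadata, "author.name", "john milton")
```

Here `result` is "case mismatch", which points to the exact-match requirement rather than a wrong path.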
Schema Too Restrictive#
Problem: Cannot upload valid documents
Solutions:
- Make fields optional (remove from required)
- Broaden type constraints
- Use oneOf for multiple valid formats
- Remove unnecessary validations