Metadata#

Structured JSON data attached to embeddings for organization, validation, and filtering.

Overview#

Metadata provides context and structure for your embeddings:

Organization: Categorize and group documents
Filtering: Exclude documents from similarity search results
Validation: Ensure consistent structure (optional)
Context: Store additional document information

Metadata Structure#

Format#

Metadata is JSON stored as JSONB in PostgreSQL:

{
  "author": "William Shakespeare",
  "title": "Hamlet",
  "year": 1603,
  "genre": "drama"
}

Types#

Supported JSON types:

String: "author": "Shakespeare"
Number: "year": 1603
Boolean: "published": true
Array: "tags": ["tragedy", "revenge"]
Object: "author": {"name": "...", "id": "..."}
Null: "notes": null

Nested Structure#

Complex hierarchies are supported:

{
  "document": {
    "id": "W0017",
    "type": "manuscript"
  },
  "author": {
    "name": "John Milton",
    "birth_year": 1608,
    "nationality": "English"
  },
  "publication": {
    "year": 1667,
    "publisher": "First Edition",
    "location": "London"
  },
  "tags": ["poetry", "epic", "religious"]
}

Metadata Schemas#

Purpose#

JSON Schema validation ensures consistent metadata across all project embeddings.

Defining a Schema#

Include metadataScheme in the request body when creating or updating a project. The schema is passed as a JSON-encoded string:

POST /v1/projects/alice

{
  "project_handle": "research",
  "metadataScheme": "{\"type\":\"object\",\"properties\":{\"author\":{\"type\":\"string\"},\"year\":{\"type\":\"integer\"}},\"required\":[\"author\"]}"
}
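Because the schema travels as an escaped string inside the JSON body, it is easy to get the quoting wrong by hand. The following is a minimal Python sketch of producing that body, assuming only the two field names shown above; the helper name build_project_body is hypothetical and no request is sent.

```python
import json

# The schema as an ordinary dict (same schema as in the request above).
schema = {
    "type": "object",
    "properties": {
        "author": {"type": "string"},
        "year": {"type": "integer"},
    },
    "required": ["author"],
}


def build_project_body(project_handle: str, schema: dict) -> str:
    """Return the JSON request body with the schema embedded as a string."""
    body = {
        "project_handle": project_handle,
        # The schema is serialized to a string first, so it ends up
        # JSON-encoded twice: once here and once with the whole body.
        "metadataScheme": json.dumps(schema, separators=(",", ":")),
    }
    return json.dumps(body)


body = build_project_body("research", schema)
```

Decoding the body and then decoding metadataScheme again should give back the original schema, which is a quick way to check the escaping before sending it.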

Schema Format#

Use JSON Schema (draft-07+):

{
  "type": "object",
  "properties": {
    "author": {
      "type": "string",
      "minLength": 1
    },
    "year": {
      "type": "integer",
      "minimum": 1000,
      "maximum": 2100
    },
    "genre": {
      "type": "string",
      "enum": ["poetry", "prose", "drama"]
    }
  },
  "required": ["author", "year"]
}

Validation Behavior#

With schema defined:

All embeddings validated on upload
Invalid metadata rejected with detailed error
Schema enforced consistently

Without schema:

Any JSON metadata accepted
No validation performed
Maximum flexibility

Common Patterns#

See Metadata Validation Guide for examples.

Using Metadata#

On Upload#

Include metadata with each embedding:

POST /v1/embeddings/alice/research

{
  "embeddings": [
    {
      "text_id": "doc1",
      "instance_handle": "my-embeddings",
      "vector": [...],
      "metadata": {
        "author": "Shakespeare",
        "title": "Hamlet",
        "year": 1603
      }
    }
  ]
}

In Responses#

Metadata is returned when retrieving embeddings:

GET /v1/embeddings/alice/research/doc1

{
  "text_id": "doc1",
  "metadata": {
    "author": "Shakespeare",
    "title": "Hamlet",
    "year": 1603
  },
  "vector": [...],
  ...
}

Metadata Filtering#

Exclusion Filter#

Exclude documents whose metadata field matches a given value:

GET /v1/similars/alice/research/doc1?metadata_path=author&metadata_value=Shakespeare

Result: Returns similar documents excluding those with metadata.author == "Shakespeare".

Path Syntax#

Use dot-separated path notation:

Simple field:

metadata_path=author

Nested field:

metadata_path=author.name

Array element (not currently supported):

metadata_path=tags[0]

URL Encoding#

Encode special characters:

# Space
metadata_value=John%20Doe

# Quotes (if needed)
metadata_value=%22quoted%20value%22
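Rather than encoding values by hand, the query string can be built with a standard URL library. A short Python sketch, assuming the similars path layout shown earlier; the helper name similars_url is hypothetical.

```python
from urllib.parse import quote, urlencode


def similars_url(owner: str, project: str, text_id: str,
                 metadata_path: str, metadata_value: str) -> str:
    """Build the path and query string for an exclusion-filtered search."""
    segments = (owner, project, text_id)
    path = "/v1/similars/" + "/".join(quote(s, safe="") for s in segments)
    # quote_via=quote encodes spaces as %20 (the default would use "+").
    query = urlencode(
        {"metadata_path": metadata_path, "metadata_value": metadata_value},
        quote_via=quote,
    )
    return f"{path}?{query}"


url = similars_url("alice", "research", "doc1", "author", "John Doe")
quoted_url = similars_url("alice", "research", "doc1", "title", '"quoted value"')
```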

Use Cases#

Exclude same work:

?metadata_path=title&metadata_value=Hamlet

Exclude same author:

?metadata_path=author&metadata_value=Shakespeare

Exclude same source:

?metadata_path=source_id&metadata_value=corpus-a

Exclude same category:

?metadata_path=category&metadata_value=draft

See Metadata Filtering Guide for detailed examples.

Validation Examples#

Simple Schema#

{
  "type": "object",
  "properties": {
    "author": {"type": "string"},
    "year": {"type": "integer"}
  },
  "required": ["author"]
}

Valid metadata:

{"author": "Shakespeare", "year": 1603}
{"author": "Milton"}

Invalid metadata:

{"year": 1603}  // Missing required 'author'
{"author": 123}  // Wrong type (should be string)
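The valid and invalid cases above can be reproduced locally with a small check. The Python sketch below covers only "required" and "type" and is not a JSON Schema implementation; for real use, a full validator library is the better choice. The function name and the message wording are illustrative assumptions, not the service's output.

```python
def check_metadata(metadata: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the checks passed."""
    type_map = {
        "string": str,
        "integer": int,
        "number": (int, float),
        "boolean": bool,
        "array": list,
        "object": dict,
    }
    problems = []
    for field in schema.get("required", []):
        if field not in metadata:
            problems.append(f"{field} is required")
    for field, rules in schema.get("properties", {}).items():
        if field not in metadata:
            continue
        value = metadata[field]
        expected = rules.get("type")
        if expected in type_map:
            # bool is a subclass of int in Python, so reject it for numbers.
            bool_as_number = isinstance(value, bool) and expected in ("integer", "number")
            if bool_as_number or not isinstance(value, type_map[expected]):
                problems.append(f"{field} must be {expected}")
    return problems


simple_schema = {
    "type": "object",
    "properties": {"author": {"type": "string"}, "year": {"type": "integer"}},
    "required": ["author"],
}
```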

Schema with Constraints#

{
  "type": "object",
  "properties": {
    "title": {
      "type": "string",
      "minLength": 1,
      "maxLength": 200
    },
    "rating": {
      "type": "number",
      "minimum": 0,
      "maximum": 5
    },
    "tags": {
      "type": "array",
      "items": {"type": "string"},
      "minItems": 1,
      "maxItems": 10
    }
  }
}

Schema with Enums#

{
  "type": "object",
  "properties": {
    "language": {
      "type": "string",
      "enum": ["en", "de", "fr", "es", "la"]
    },
    "status": {
      "type": "string",
      "enum": ["draft", "review", "published"]
    }
  }
}

Storage and Performance#

Storage#

Metadata is stored as JSONB in PostgreSQL, which is:

Efficient: Binary storage format
Indexable: Can create indexes on fields
Queryable: Use PostgreSQL JSON operators

Size Considerations#

Typical metadata sizes:

Simple: 50-200 bytes
Moderate: 200-1000 bytes
Complex: 1-5KB
Very large: >5KB (consider storing elsewhere)
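To see where a given metadata object falls in these bands, its serialized size can be measured before upload. A rough Python sketch: it measures the compact UTF-8 JSON text, which only approximates the JSONB on-disk size, and it assumes 1KB = 1000 bytes and treats the band edges as shown in the code. Both function names are hypothetical.

```python
import json


def metadata_size_bytes(metadata: dict) -> int:
    """Size of the compact UTF-8 JSON encoding, in bytes."""
    text = json.dumps(metadata, separators=(",", ":"), ensure_ascii=False)
    return len(text.encode("utf-8"))


def size_category(size: int) -> str:
    """Map a byte count onto the bands listed above."""
    if size < 200:
        return "simple"
    if size < 1000:
        return "moderate"
    if size <= 5000:
        return "complex"
    return "very large"


hamlet = {
    "author": "William Shakespeare",
    "title": "Hamlet",
    "year": 1603,
    "genre": "drama",
}
hamlet_size = metadata_size_bytes(hamlet)
```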

Performance#

Metadata filtering:

JSONB queries are efficient
Add indexes for frequently filtered fields
Keep metadata reasonably sized

Example index (if needed):

CREATE INDEX idx_embeddings_author 
ON embeddings ((metadata->>'author'));

Common Patterns#

Document Provenance#

Track document source and history:

{
  "source": {
    "corpus": "Shakespeare Works",
    "collection": "Tragedies",
    "document_id": "hamlet",
    "version": 2
  },
  "imported_at": "2024-01-15T10:30:00Z",
  "imported_by": "researcher1"
}

Hierarchical Documents#

Structure for nested documents:

{
  "work": "Paradise Lost",
  "book": 1,
  "line": 1,
  "chapter": null,
  "section": "Invocation"
}

Multi-Language Content#

Track language and translation info:

{
  "language": "en",
  "original_language": "la",
  "translated_by": "John Smith",
  "translation_year": 1850
}

Research Metadata#

Academic paper metadata:

{
  "doi": "10.1234/example.2024.001",
  "authors": ["Alice Smith", "Bob Jones"],
  "journal": "Digital Humanities Review",
  "year": 2024,
  "keywords": ["NLP", "embeddings", "RAG"]
}

Updating Metadata#

Current Limitation#

Metadata cannot be updated in place. To change it:

Delete embedding
Re-upload with updated metadata

# Delete
DELETE /v1/embeddings/alice/research/doc1

# Re-upload with new metadata
POST /v1/embeddings/alice/research
{
  "embeddings": [{
    "text_id": "doc1",
    "metadata": {...updated...},
    ...
  }]
}
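The delete-then-re-upload sequence can be scripted. The Python sketch below only constructs the two requests and does not send them; the host name is a placeholder, authentication is omitted, and the helper name is hypothetical.

```python
import json
from urllib.request import Request

BASE_URL = "https://api.example.invalid"  # hypothetical host, replace as needed


def replace_metadata_requests(owner: str, project: str,
                              embedding: dict, new_metadata: dict) -> list:
    """Build the DELETE and POST requests that swap in new metadata."""
    # Copy the embedding so the caller's dict is not modified.
    updated = {**embedding, "metadata": new_metadata}
    delete = Request(
        f"{BASE_URL}/v1/embeddings/{owner}/{project}/{embedding['text_id']}",
        method="DELETE",
    )
    upload = Request(
        f"{BASE_URL}/v1/embeddings/{owner}/{project}",
        data=json.dumps({"embeddings": [updated]}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return [delete, upload]


existing = {
    "text_id": "doc1",
    "instance_handle": "my-embeddings",
    "vector": [0.1, 0.2],  # shortened placeholder vector
    "metadata": {"author": "Shakespeare"},
}
new_metadata = {"author": "Shakespeare", "title": "Hamlet", "year": 1603}
requests_to_send = replace_metadata_requests("alice", "research", existing, new_metadata)
```

Since the embedding does not exist between the two calls, it is worth keeping the full embedding (including its vector) on hand before issuing the DELETE.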

Schema Updates#

Updating Project Schema#

Use PATCH to update schema:

PATCH /v1/projects/alice/research

{
  "metadataScheme": "{...new schema...}"
}

Effect on Existing Embeddings#

Existing embeddings: Not revalidated
New embeddings: Validated against new schema
Re-uploaded embeddings: Validated against the current schema

Migration Strategy#

When updating schema:

Update project schema
Verify new embeddings work
Optionally re-upload existing embeddings
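Since existing embeddings are not revalidated, it can help to find out which of them would need re-uploading before tightening a schema. A minimal Python sketch that checks only the "required" list of the new schema against metadata you already hold locally; the function name and the text_id-to-metadata mapping are assumptions made for illustration.

```python
def missing_required(stored: dict, new_schema: dict) -> dict:
    """Map text_id -> required fields absent from its stored metadata."""
    required = new_schema.get("required", [])
    report = {}
    for text_id, metadata in stored.items():
        missing = [field for field in required if field not in metadata]
        if missing:
            report[text_id] = missing
    return report


stored = {
    "doc1": {"author": "Shakespeare", "year": 1603},
    "doc2": {"author": "Milton"},
}
new_schema = {"type": "object", "required": ["author", "year"]}
report = missing_required(stored, new_schema)
```

Here only doc2 would be flagged, because it has no year field.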

Validation Errors#

Common Errors#

Missing required field:

{
  "status": 400,
  "detail": "metadata validation failed: author is required"
}

Wrong type:

{
  "status": 400,
  "detail": "metadata validation failed: year must be integer"
}

Enum violation:

{
  "status": 400,
  "detail": "metadata validation failed: genre must be one of [poetry, prose, drama]"
}
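Client code can pull the field-level message out of such a response. The Python sketch below assumes that metadata validation errors always carry status 400 and the "metadata validation failed: " prefix seen in the three examples above; that is inferred from these examples only, and the function name is hypothetical.

```python
import json

PREFIX = "metadata validation failed: "


def validation_message(response_body: str):
    """Return the field-level message, or None for any other response."""
    payload = json.loads(response_body)
    detail = payload.get("detail", "")
    if payload.get("status") == 400 and detail.startswith(PREFIX):
        return detail[len(PREFIX):]
    return None


example = '{"status": 400, "detail": "metadata validation failed: author is required"}'
message = validation_message(example)
```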

Debugging#

To debug validation errors:

Check project schema: GET /v1/projects/owner/project
Validate metadata with online tool: jsonschemavalidator.net
Review error message for specific field
Update metadata or schema as needed

Best Practices#

Schema Design#

Start simple, add complexity as needed
Use required fields for critical data
Use enums for controlled vocabularies
Document your schema

Metadata Content#

Keep metadata focused and relevant
Avoid redundant data
Use consistent field names
Consider future queries and filters

Performance#

Keep metadata reasonably sized (<5KB)
Index frequently queried fields
Avoid deeply nested structures when possible

Troubleshooting#

Validation Fails#

Problem: Metadata doesn’t validate

Solutions:

Check project schema
Verify metadata structure
Test with JSON Schema validator
Review error message details

Filtering Not Working#

Problem: Metadata filter doesn’t exclude documents

Solutions:

Verify field path is correct
Check value matches exactly (case-sensitive)
URL-encode special characters
Confirm metadata field exists

Schema Too Restrictive#

Problem: Cannot upload valid documents

Solutions:

Make fields optional (remove from required)
Broaden type constraints
Use oneOf for multiple valid formats
Remove unnecessary validations
