Database Schema Reference#

Complete reference for the embapi PostgreSQL database schema. This document describes all tables, columns, types, constraints, relationships, and indexes.

Overview#

The database uses PostgreSQL 12+ with the pgvector extension for vector similarity search. The schema is managed through migrations in internal/database/migrations/.

Key Features:

  • Vector embeddings stored as halfvec for efficient storage
  • HNSW indexes for fast approximate nearest neighbor search
  • Automatic timestamp tracking (created_at, updated_at)
  • Foreign key constraints with CASCADE deletion
  • Role-based access control through association tables
  • Multi-tenancy support (user-owned resources)

Schema Migrations#

Current schema version is defined by 4 migration files:

  1. 001_create_initial_scheme.sql - Core tables and relationships
  2. 002_create_emb_index.sql - HNSW vector indexes
  3. 003_add_public_read_flag.sql - Public access support
  4. 004_refactor_llm_services_architecture.sql - LLM service architecture refactor

Core Tables#

users#

Stores user accounts with API authentication.

ColumnTypeConstraintsDescription
user_handleVARCHAR(20)PRIMARY KEYUnique user identifier (username)
nameTEXTFull name or display name
emailTEXTUNIQUE, NOT NULLEmail address
embapi_keyCHAR(64)UNIQUE, NOT NULLAPI key (64-character hex)
created_atTIMESTAMPNOT NULLCreation timestamp
updated_atTIMESTAMPNOT NULLLast update timestamp

Indexes:

  • Primary key on user_handle
  • Unique constraint on email
  • Unique constraint on embapi_key

Special Users:

  • _system - Reserved system user for global LLM service definitions

Relationships:

  • Owns projects (1:N)
  • Owns LLM service instances (1:N)
  • Owns embeddings (1:N)
  • Can share projects (N:M via users_projects)
  • Can share LLM service instances (N:M via instances_shared_with)

Notes:

  • embapi_key is generated during user creation and returned once
  • Cannot be recovered if lost
  • Transmitted as Authorization: Bearer token

projects#

Stores embedding projects owned by users.

ColumnTypeConstraintsDescription
project_idSERIALPRIMARY KEYAuto-incrementing project ID
project_handleVARCHAR(20)NOT NULLProject identifier (unique per owner)
ownerVARCHAR(20)NOT NULL, FK→usersProject owner user handle
descriptionTEXTProject description
metadata_schemeTEXTJSON Schema for validating embedding metadata
public_readBOOLEANDEFAULT FALSEAllow unauthenticated read access
instance_idINTEGERFK→instancesLLM service instance used for embeddings
created_atTIMESTAMPNOT NULLCreation timestamp
updated_atTIMESTAMPNOT NULLLast update timestamp

Constraints:

  • UNIQUE (owner, project_handle) - Project handles unique per owner
  • ON DELETE CASCADE - Delete project when owner deleted
  • ON DELETE RESTRICT - Prevent instance deletion if used by project

Relationships:

  • Owned by one user (N:1)
  • Uses one LLM service instance (N:1)
  • Has many embeddings (1:N)
  • Can be shared with users (N:M via users_projects)

Metadata Schema:

  • Stored as JSON Schema string
  • Used to validate embedding metadata
  • Optional (can be NULL)
  • See Data Validation for details

Public Access:

  • When public_read=TRUE, allows unauthenticated embedding queries
  • Controlled via API or by setting shared_with to ["*"]

embeddings#

Stores vector embeddings with text identifiers and metadata.

ColumnTypeConstraintsDescription
embeddings_idSERIALPRIMARY KEYAuto-incrementing embedding ID
text_idTEXTINDEXEDText identifier (URL, DOI, custom ID)
ownerVARCHAR(20)NOT NULL, FK→usersEmbedding owner user handle
project_idSERIALNOT NULL, FK→projectsProject the embedding belongs to
instance_idSERIALNOT NULL, FK→instancesLLM service instance used
textTEXTOptional full text of the embedded content
vectorhalfvecNOT NULLEmbedding vector (half-precision float)
vector_dimINTEGERNOT NULLVector dimensionality
metadatajsonbOptional metadata (validated if schema defined)
created_atTIMESTAMPNOT NULLCreation timestamp
updated_atTIMESTAMPNOT NULLLast update timestamp

Constraints:

  • UNIQUE (text_id, owner, project_id, instance_id) - Unique text IDs per project/instance
  • ON DELETE CASCADE - Delete embeddings when owner, project, or instance deleted

Indexes:

  • Primary key on embeddings_id
  • B-tree index on text_id
  • HNSW vector indexes for dimensions: 384, 768, 1024, 1536, 3072 (see Vector Indexes)

Vector Storage:

  • Uses halfvec type (16-bit floating point) for efficient storage
  • Dimensions must match LLM service instance configuration
  • Validated on upload against instance dimensions

Relationships:

  • Owned by one user (N:1)
  • Belongs to one project (N:1)
  • Created with one LLM service instance (N:1)

LLM Service Architecture#

The LLM service architecture separates service definitions (templates) from user-specific instances.

definitions#

Templates for LLM embedding services (shared or user-specific).

ColumnTypeConstraintsDescription
definition_idSERIALPRIMARY KEYAuto-incrementing definition ID
definition_handleVARCHAR(20)NOT NULLDefinition identifier (unique per owner)
ownerVARCHAR(20)NOT NULL, FK→usersDefinition owner (_system for global)
endpointTEXTNOT NULLAPI endpoint URL
descriptionTEXTService description
api_standardVARCHAR(20)NOT NULL, FK→api_standardsAPI standard handle
modelTEXTNOT NULLModel name (e.g., text-embedding-3-large)
dimensionsINTEGERNOT NULLVector dimensions produced by model
context_limitINTEGERNOT NULLMaximum context length (tokens/chars)
is_publicBOOLEANNOT NULL, DEFAULT FALSEShare with all users if true
created_atTIMESTAMPNOT NULLCreation timestamp
updated_atTIMESTAMPNOT NULLLast update timestamp

Constraints:

  • UNIQUE (owner, definition_handle) - Definition handles unique per owner
  • ON DELETE CASCADE - Delete definition when owner deleted

Built-in Definitions:

System-provided definitions (owned by _system, is_public=TRUE):

HandleModelDimensionsContext LimitAPI Standard
openai-largetext-embedding-3-large30728192openai
openai-smalltext-embedding-3-small15368191openai
cohere-v4embed-v4.01536128000cohere
gemini-embedding-001gemini-embedding-00130722048gemini

Relationships:

  • Owned by one user (N:1)
  • References one API standard (N:1)
  • Can be shared with users (N:M via definitions_shared_with)
  • Used by instances (1:N)

instances#

User-specific configurations of LLM services with API keys.

ColumnTypeConstraintsDescription
instance_idSERIALPRIMARY KEYAuto-incrementing instance ID
instance_handleVARCHAR(20)NOT NULLInstance identifier (unique per owner)
ownerVARCHAR(20)NOT NULL, FK→usersInstance owner user handle
endpointTEXTNOT NULLAPI endpoint URL
descriptionTEXTInstance description
api_standardVARCHAR(20)NOT NULL, FK→api_standardsAPI standard handle
modelTEXTNOT NULLModel name
dimensionsINTEGERNOT NULLVector dimensions
context_limitINTEGERNOT NULLMaximum context length
definition_idINTEGERFK→definitionsReference to definition template
api_key_encryptedBYTEAEncrypted API key for service authentication
created_atTIMESTAMPNOT NULLCreation timestamp
updated_atTIMESTAMPNOT NULLLast update timestamp

Constraints:

  • UNIQUE (owner, instance_handle) - Instance handles unique per owner
  • ON DELETE CASCADE - Delete instance when owner deleted
  • ON DELETE SET NULL - Keep instance if definition deleted

Indexes:

  • Composite index on (owner, instance_handle)

API Key Encryption:

  • Encrypted using AES-256-GCM
  • Encryption key from ENCRYPTION_KEY environment variable
  • Cannot be recovered if encryption key is lost
  • Stored as base64-encoded BYTEA

Relationships:

  • Owned by one user (N:1)
  • References one API standard (N:1)
  • Based on one definition template (N:1, optional)
  • Used by projects (1:N)
  • Used by embeddings (1:N)
  • Can be shared with users (N:M via instances_shared_with)

api_standards#

Defines API specifications for LLM embedding services.

ColumnTypeConstraintsDescription
api_standard_handleVARCHAR(20)PRIMARY KEYStandard identifier (e.g., openai, cohere)
descriptionTEXTAPI description and version info
key_methodVARCHAR(20)NOT NULL, FK→key_methodsAuthentication method
key_fieldVARCHAR(20)Header/field name for API key
created_atTIMESTAMPNOT NULLCreation timestamp
updated_atTIMESTAMPNOT NULLLast update timestamp

Built-in Standards:

HandleDescriptionKey MethodKey Field
openaiOpenAI Embeddings API v1auth_bearerAuthorization
cohereCohere Embed API v2auth_bearerAuthorization
geminiGemini Embeddings APIauth_bearerx-goog-api-key

Relationships:

  • Used by definitions (1:N)
  • Used by instances (1:N)
  • References one key_method (N:1)

Association Tables#

users_projects#

Defines project sharing and access control.

ColumnTypeConstraintsDescription
user_handleVARCHAR(20)FK→users, PKUser being granted access
project_idSERIALFK→projects, PKProject being shared
roleVARCHAR(20)NOT NULL, FK→vdb_rolesAccess level (owner/editor/reader)
created_atTIMESTAMPNOT NULLShare creation timestamp
updated_atTIMESTAMPNOT NULLLast update timestamp

Constraints:

  • PRIMARY KEY (user_handle, project_id) - One role per user per project
  • ON DELETE CASCADE - Remove sharing when user or project deleted

Roles:

  • owner - Full control (only project creator)
  • editor - Can add/modify/delete embeddings
  • reader - Read-only access to embeddings

instances_shared_with#

Defines LLM service instance sharing.

ColumnTypeConstraintsDescription
user_handleVARCHAR(20)FK→users, PKUser being granted access
instance_idINTEGERFK→instances, PKInstance being shared
roleVARCHAR(20)NOT NULL, FK→vdb_rolesAccess level
created_atTIMESTAMPNOT NULLShare creation timestamp
updated_atTIMESTAMPNOT NULLLast update timestamp

Constraints:

  • PRIMARY KEY (user_handle, instance_id) - One role per user per instance
  • ON DELETE CASCADE - Remove sharing when user or instance deleted

definitions_shared_with#

Defines LLM service definition sharing.

ColumnTypeConstraintsDescription
user_handleVARCHAR(20)FK→users, PKUser being granted access
definition_idINTEGERFK→definitions, PKDefinition being shared
created_atTIMESTAMPNOT NULLShare creation timestamp
updated_atTIMESTAMPNOT NULLLast update timestamp

Constraints:

  • PRIMARY KEY (user_handle, definition_id) - One share per user per definition
  • ON DELETE CASCADE - Remove sharing when user or definition deleted

Indexes:

  • Index on user_handle for efficient user lookups
  • Index on definition_id for efficient definition lookups

Reference Tables#

vdb_roles#

Enumeration of valid access roles.

ColumnTypeConstraintsDescription
vdb_roleVARCHAR(20)PRIMARY KEYRole name

Values:

  • owner - Full control over resource
  • editor - Read and write access
  • reader - Read-only access

key_methods#

Enumeration of API authentication methods.

ColumnTypeConstraintsDescription
key_methodVARCHAR(20)PRIMARY KEYAuthentication method

Values:

  • auth_bearer - Bearer token in Authorization header
  • body_form - API key in request body
  • query_param - API key in URL query parameter
  • custom_header - API key in custom header

Vector Indexes#

HNSW (Hierarchical Navigable Small World) indexes for fast approximate nearest neighbor search.

Index Configuration#

All vector indexes use HNSW with these parameters:

  • m = 24 - Number of neighbors per node
  • ef_construction = 200 - Build-time accuracy parameter
  • ef_search = 100 - Query-time accuracy parameter
  • Expected recall: ~99.8%

Dimension-Specific Indexes#

Index NameDimensionsCommon Models
embeddings_vector_384384Cohere embed-multilingual-light-v3.0, embed-english-light-v3.0
embeddings_vector_768768BERT base, Cohere embed-multilingual-v2.0, Gemini Embeddings
embeddings_vector_10241024BERT large, SBERT, Cohere embed-multilingual-v3.0, embed-english-v3.0
embeddings_vector_15361536OpenAI text-embedding-ada-002, text-embedding-3-small, Cohere embed-v4.0
embeddings_vector_30723072OpenAI text-embedding-3-large, Gemini embedding-001

Index Structure:

CREATE INDEX embeddings_vector_1536 ON embeddings 
USING hnsw ((vector::halfvec(1536)) halfvec_cosine_ops) 
WITH (m = 24, ef_construction = 200) 
WHERE (vector_dim = 1536);

Usage:

  • Indexes are partial - only include vectors of matching dimension
  • Automatically used for similarity queries with matching dimensions
  • Use cosine distance for similarity calculation

Relationships Diagram#

users (1) ──owns──> (N) projects (1) ──contains──> (N) embeddings
  │                      │                              │
  │                      └─> uses (1) instances (N) <───┘
  │                                    │
  │                                    └─> based on (1) definitions
  │                                              │
  ├──owns──> (N) instances ──uses──> (1) api_standards
  │               │                          │
  │               └─> references (1) definitions
  │
  └──owns──> (N) definitions ──references──> (1) api_standards

Sharing:
  users (N) <──shares─> (M) projects (via users_projects)
  users (N) <──shares─> (M) instances (via instances_shared_with)
  users (N) <──shares─> (M) definitions (via definitions_shared_with)

Data Validation#

Dimension Validation#

Embeddings must have dimensions matching their LLM service instance:

  • embeddings.vector_dim must equal instances.dimensions
  • Enforced at API level during upload
  • Similarity queries automatically filter by matching dimensions

Metadata Validation#

Projects can define JSON Schema in metadata_scheme:

  • Validates all embedding metadata on upload
  • Optional - if NULL, metadata not validated
  • Enforced at API level
  • Admin sanity check validates all existing metadata

Sanity Check Queries#

The /v1/admin/sanity-check endpoint verifies:

  1. All embeddings have dimensions matching their instance
  2. All metadata conforms to project schemas (if defined)

Common Queries#

Find User’s Projects#

SELECT p.project_handle, p.description, p.created_at
FROM projects p
WHERE p.owner = 'alice'
ORDER BY p.created_at DESC;

Find Shared Projects#

SELECT p.project_handle, p.owner, up.role
FROM users_projects up
JOIN projects p ON up.project_id = p.project_id
WHERE up.user_handle = 'bob'
ORDER BY p.owner, p.project_handle;

Find Similar Embeddings#

SELECT text_id, vector <-> '[0.1, 0.2, ...]'::vector AS distance
FROM embeddings
WHERE project_id = 123 AND vector_dim = 1536
ORDER BY vector <-> '[0.1, 0.2, ...]'::vector
LIMIT 10;

Count Embeddings by Project#

SELECT p.project_handle, COUNT(e.embeddings_id) AS embedding_count
FROM projects p
LEFT JOIN embeddings e ON p.project_id = e.project_id
WHERE p.owner = 'alice'
GROUP BY p.project_id, p.project_handle
ORDER BY embedding_count DESC;

Database Maintenance#

Backup Recommendations#

Critical Data:

  • User accounts (users)
  • API keys (embapi_key in users, api_key_encrypted in instances)
  • ENCRYPTION_KEY environment variable (backup separately!)
  • Projects and embeddings

Backup Strategy:

# Full database backup
pg_dump -U postgres embapi > backup.sql

# Backup encryption key separately
echo "$ENCRYPTION_KEY" > encryption_key.backup
chmod 400 encryption_key.backup

Vacuum and Analyze#

-- Regular maintenance
VACUUM ANALYZE embeddings;
VACUUM ANALYZE projects;

-- Rebuild vector indexes if needed
REINDEX INDEX embeddings_vector_1536;

Monitoring Queries#

-- Check index usage
SELECT schemaname, tablename, indexname, idx_scan
FROM pg_stat_user_indexes
WHERE tablename = 'embeddings'
ORDER BY idx_scan DESC;

-- Check table sizes
SELECT relname, pg_size_pretty(pg_total_relation_size(oid))
FROM pg_class
WHERE relname IN ('embeddings', 'projects', 'users')
ORDER BY pg_total_relation_size(oid) DESC;