Extensibility

NLQL is designed to be highly extensible. You can customize almost every aspect of query execution.

Registration Levels

NLQL supports two levels of registration:

  1. Global Registration: Functions/operators/providers are registered globally and available to all NLQL instances
  2. Instance-Level Registration: Functions/operators/providers are registered to a specific NLQL instance only

Instance-level registrations take precedence over global registrations, allowing you to override global behavior for specific instances.

Custom Operators

Register domain-specific operators:

from nlql import register_operator
import re

@register_operator("HAS_EMAIL")
def has_email(text: str) -> bool:
    """Check if text contains an email address."""
    pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
    return bool(re.search(pattern, text))

@register_operator("HAS_URL")
def has_url(text: str) -> bool:
    """Check if text contains a URL."""
    pattern = r'https?://[^\s]+'
    return bool(re.search(pattern, text))

Use in queries:

SELECT CHUNK
WHERE HAS_EMAIL(content) AND HAS_URL(content)

Custom Functions

Add query functions:

from nlql import register_function

@register_function("word_count")
def word_count(text: str) -> int:
    """Count words in text."""
    return len(text.split())

@register_function("days_ago")
def days_ago(days: int) -> str:
    """Get date N days ago."""
    from datetime import datetime, timedelta
    date = datetime.now() - timedelta(days=days)
    return date.strftime("%Y-%m-%d")

Use in queries:

SELECT CHUNK
WHERE word_count(content) > 100
  AND META("date") > days_ago(7)

Instance-Level Registration

You can register functions, operators, and embedding providers to specific NLQL instances instead of globally. This is useful for:

  • Multi-tenant applications with different business logic per tenant
  • A/B testing different implementations
  • Isolating test environments from production
  • Domain-specific query engines with specialized functions

Example: Different Function Implementations per Instance

from nlql import NLQL
from nlql.adapters import MemoryAdapter

# Create two NLQL instances, each with its own in-memory adapter
adapter1 = MemoryAdapter()
adapter2 = MemoryAdapter()
nlql1 = NLQL(adapter=adapter1)
nlql2 = NLQL(adapter=adapter2)

# Register different implementations to each instance
@nlql1.register_function("WORD_COUNT")
def word_count_total(text: str) -> int:
    """Count total words."""
    return len(text.split())

@nlql2.register_function("WORD_COUNT")
def word_count_unique(text: str) -> int:
    """Count unique words."""
    return len(set(text.lower().split()))

# Each instance uses its own implementation
results1 = nlql1.execute("SELECT CHUNK WHERE WORD_COUNT(content) > 10")
results2 = nlql2.execute("SELECT CHUNK WHERE WORD_COUNT(content) > 10")

Example: Different Operator Implementations per Instance

# Register different operators to each instance
@nlql1.register_operator("CUSTOM_FILTER")
def filter_python(text: str) -> bool:
    return "Python" in text

@nlql2.register_operator("CUSTOM_FILTER")
def filter_ai(text: str) -> bool:
    return "AI" in text

# Each instance uses its own operator
results1 = nlql1.execute("SELECT CHUNK WHERE CUSTOM_FILTER(content)")
results2 = nlql2.execute("SELECT CHUNK WHERE CUSTOM_FILTER(content)")

Example: Different Embedding Providers per Instance

# Register different embedding providers to each instance
@nlql1.register_embedding_provider
def embedding_word_based(texts: list[str]) -> list[list[float]]:
    return [[len(text.split()) / 10.0, 0.5, 0.5] for text in texts]

@nlql2.register_embedding_provider
def embedding_char_based(texts: list[str]) -> list[list[float]]:
    return [[len(text) / 50.0, 0.5, 0.5] for text in texts]

# Each instance uses its own embedding provider
results1 = nlql1.execute('SELECT CHUNK WHERE SIMILAR_TO("query") > 0.5')
results2 = nlql2.execute('SELECT CHUNK WHERE SIMILAR_TO("query") > 0.5')

Priority Rules:

  • Instance-level registrations take precedence over global registrations
  • If a function/operator is registered both globally and to an instance, the instance-level version is used
  • Instance-level registrations do not affect the global registry or other instances
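
For example, a function registered globally can be overridden on a single instance without affecting other instances. This is a minimal sketch reusing the register_function APIs shown above; the adapters and query are placeholders:

from nlql import NLQL, register_function
from nlql.adapters import MemoryAdapter

# Global implementation: the default for every instance
@register_function("word_count")
def word_count_global(text: str) -> int:
    return len(text.split())

nlql_default = NLQL(adapter=MemoryAdapter())
nlql_custom = NLQL(adapter=MemoryAdapter())

# Instance-level override: only nlql_custom counts unique words
@nlql_custom.register_function("word_count")
def word_count_unique(text: str) -> int:
    return len(set(text.lower().split()))

# nlql_default resolves word_count from the global registry;
# nlql_custom uses its own instance-level registration
results_default = nlql_default.execute("SELECT CHUNK WHERE word_count(content) > 10")
results_custom = nlql_custom.execute("SELECT CHUNK WHERE word_count(content) > 10")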

See the examples/instance_registry_demo.py file in the repository for a complete working example.

Custom Types

Define metadata field types for type-safe comparisons:

from nlql import register_meta_field
from nlql.types import BaseType, NumberType, DateType, TextType

# Register built-in types
register_meta_field("score", NumberType)
register_meta_field("created_at", DateType)
register_meta_field("status", TextType)

# Create custom type
class PriorityType(BaseType):
    """Custom priority type with special comparison logic."""

    LEVELS = {"low": 1, "medium": 2, "high": 3, "critical": 4}

    def __init__(self, value: str | int):
        if isinstance(value, str):
            value = self.LEVELS.get(value.lower(), 0)
        super().__init__(value)

    def __lt__(self, other):
        other_val = other.value if isinstance(other, BaseType) else other
        return self.value < other_val

    def __gt__(self, other):
        other_val = other.value if isinstance(other, BaseType) else other
        return self.value > other_val

    def __eq__(self, other):
        other_val = other.value if isinstance(other, BaseType) else other
        return self.value == other_val

# Register custom type
register_meta_field("priority", PriorityType)

Use in queries:

SELECT CHUNK
WHERE META("priority") > "medium"

Custom Splitters

Implement language-specific or domain-specific text splitting:

from nlql import register_splitter

@register_splitter("SENTENCE")
def german_sentence_splitter(text: str) -> list[str]:
    """Split German text into sentences."""
    import nltk
    return nltk.sent_tokenize(text, language='german')

@register_splitter("PARAGRAPH")
def paragraph_splitter(text: str) -> list[str]:
    """Split text into paragraphs."""
    return [p.strip() for p in text.split('\n\n') if p.strip()]

Use in queries:

-- Uses custom German sentence splitter
SELECT SENTENCE
WHERE SIMILAR_TO("Künstliche Intelligenz")

Custom Embedding Provider

NLQL uses embedding providers for semantic search (SIMILAR_TO operator). You can customize the embedding model used for vectorization.

Default Provider

By default, NLQL uses sentence-transformers with the all-MiniLM-L6-v2 model:

from nlql import NLQL
from nlql.adapters import MemoryAdapter

# Uses default embedding provider automatically
adapter = MemoryAdapter()
nlql = NLQL(adapter=adapter)

# SIMILAR_TO will use all-MiniLM-L6-v2
results = nlql.execute('SELECT CHUNK WHERE SIMILAR_TO("AI") > 0.7')

Custom Provider with OpenAI

Use OpenAI's embedding API:

from nlql.registry.embedding import register_embedding_provider

# Use decorator syntax (recommended)
@register_embedding_provider
def openai_embedding_provider(texts: list[str]) -> list[list[float]]:
    """Generate embeddings using OpenAI API."""
    import openai

    # Configure your API key (this example targets the legacy pre-1.0 openai client)
    openai.api_key = "your-api-key"

    response = openai.Embedding.create(
        input=texts,
        model="text-embedding-ada-002"
    )

    return [item["embedding"] for item in response["data"]]

# Now SIMILAR_TO will use OpenAI embeddings
nlql = NLQL(adapter=adapter)
results = nlql.execute('SELECT CHUNK WHERE SIMILAR_TO("AI") > 0.7')

Note: Embedding providers must be functions with signature (list[str]) -> list[list[float]]. They receive a batch of texts and return a batch of embedding vectors. You can use either decorator syntax (@register_embedding_provider) or function call syntax (register_embedding_provider(my_func)).
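
For example, the function-call form is convenient when the provider is defined elsewhere or chosen at runtime. A brief sketch with a toy provider (the vector values are placeholders):

from nlql.registry.embedding import register_embedding_provider

def my_embedding(texts: list[str]) -> list[list[float]]:
    """Toy provider: one fixed-size vector per input text."""
    return [[len(t) / 100.0, 0.0, 0.0] for t in texts]

# Function call syntax, equivalent to applying the decorator
register_embedding_provider(my_embedding)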

Custom Provider with Different Sentence-Transformers Model

Use a different sentence-transformers model:

from nlql.registry.embedding import register_embedding_provider
from sentence_transformers import SentenceTransformer

# Load model once (lazy loading)
_model = None

@register_embedding_provider
def custom_embedding_provider(texts: list[str]) -> list[list[float]]:
    """Generate embeddings using a different sentence-transformers model."""
    global _model
    if _model is None:
        _model = SentenceTransformer("all-mpnet-base-v2")

    embeddings = _model.encode(texts, convert_to_numpy=True)
    return embeddings.tolist()

Custom Provider with Hugging Face Models

Use any Hugging Face model:

from nlql.registry.embedding import register_embedding_provider
from transformers import AutoTokenizer, AutoModel
import torch

# Load model once (lazy loading)
_tokenizer = None
_model = None

@register_embedding_provider
def huggingface_embedding_provider(texts: list[str]) -> list[list[float]]:
    """Generate embeddings using Hugging Face transformers."""
    global _tokenizer, _model

    if _tokenizer is None or _model is None:
        _tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        _model = AutoModel.from_pretrained("bert-base-uncased")

    # Tokenize
    encoded = _tokenizer(
        texts,
        padding=True,
        truncation=True,
        return_tensors="pt"
    )

    # Get model output
    with torch.no_grad():
        output = _model(**encoded)

    # Mean pooling
    embeddings = output.last_hidden_state.mean(dim=1)

    return embeddings.tolist()

Provider Interface

All embedding providers must be functions with this signature:

def embedding_provider(texts: list[str]) -> list[list[float]]:
    """Generate embeddings for a list of texts.

    Args:
        texts: List of text strings to embed

    Returns:
        List of embedding vectors (each vector is a list of floats)

    Example:
        >>> embedding_provider(["hello", "world"])
        [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
    """
    # Your embedding logic here
    pass

Best Practices

  1. Lazy Loading: Load models only when needed to save memory:

    _model = None
    
    def my_embedding_provider(texts: list[str]) -> list[list[float]]:
        global _model
        if _model is None:
            _model = load_expensive_model()
        return _model.encode(texts)
    

  2. Batch Processing: Process multiple texts at once for efficiency:

    def my_embedding_provider(texts: list[str]) -> list[list[float]]:
        # Process all texts in one batch (not one by one)
        return model.encode(texts)
    

  3. Error Handling: Handle errors gracefully:

    from nlql.errors import NLQLConfigError
    
    def my_embedding_provider(texts: list[str]) -> list[list[float]]:
        try:
            return model.encode(texts)
        except Exception as e:
            raise NLQLConfigError(f"Embedding failed: {e}") from e
    

  4. Caching: Consider caching embeddings for frequently used texts:

    from nlql.registry.embedding import register_embedding_provider
    
    # Cache for storing embeddings
    _cache = {}
    _base_model = None
    
    @register_embedding_provider
    def cached_embedding_provider(texts: list[str]) -> list[list[float]]:
        """Embedding provider with caching."""
        global _cache, _base_model
    
        if _base_model is None:
            from sentence_transformers import SentenceTransformer
            _base_model = SentenceTransformer("all-MiniLM-L6-v2")
    
        # Pre-allocate so cached and newly computed embeddings keep input order
        results = [None] * len(texts)
        uncached = []
        uncached_indices = []

        for i, text in enumerate(texts):
            if text in _cache:
                results[i] = _cache[text]
            else:
                uncached.append(text)
                uncached_indices.append(i)

        if uncached:
            new_embeddings = _base_model.encode(uncached).tolist()
            for i, text, emb in zip(uncached_indices, uncached, new_embeddings):
                _cache[text] = emb
                results[i] = emb

        return results
    

Custom Adapters

Create adapters for new data sources:

from nlql.adapters import BaseAdapter, QueryPlan
from nlql.text.units import Chunk, TextUnit

class ElasticsearchAdapter(BaseAdapter):
    """Adapter for Elasticsearch."""

    def __init__(self, es_client, index_name: str):
        self.client = es_client
        self.index = index_name

    def query(self, plan: QueryPlan) -> list[TextUnit]:
        # Build Elasticsearch query
        es_query = {"bool": {"must": []}}

        # Add metadata filters
        if plan.filters:
            for field, value in plan.filters.items():
                es_query["bool"]["must"].append({
                    "term": {field: value}
                })

        # Add semantic search (if using vector field)
        if plan.query_text:
            # Implement vector search
            pass

        # Execute query
        response = self.client.search(
            index=self.index,
            query=es_query,
            size=plan.limit or 10,
        )

        # Convert to TextUnit
        results = []
        for hit in response["hits"]["hits"]:
            chunk = Chunk(
                content=hit["_source"]["content"],
                metadata=hit["_source"].get("metadata", {}),
                chunk_id=hit["_id"],
                position=0,
            )
            results.append(chunk)

        return results

    def supports_semantic_search(self) -> bool:
        return True  # If you have vector fields

    def supports_metadata_filter(self) -> bool:
        return True

Configuration

Customize NLQL behavior with NLQLConfig:

from nlql import NLQL, NLQLConfig
from nlql.adapters import MemoryAdapter

# Create configuration
config = NLQLConfig(
    default_limit=100,  # Default LIMIT when query doesn't specify one
    debug_mode=True,    # Enable debug logging
)

# Create NLQL instance with config
adapter = MemoryAdapter()
nlql = NLQL(adapter=adapter, config=config)

# Query without LIMIT will use default_limit=100
results = nlql.execute("SELECT CHUNK WHERE CONTAINS('AI')")

Available Configuration Options:

  • default_limit (int | None): Default LIMIT value when query doesn't specify one. Default: None (no limit)
  • debug_mode (bool): Enable debug logging for query execution steps. Default: False
  • enable_caching (bool): Reserved for future caching implementation. Default: False
  • custom_settings (dict): Reserved for future extensibility. Default: {}

Best Practices

1. Naming Conventions

  • Operators: UPPERCASE (e.g., HAS_EMAIL, CONTAINS_CODE)
  • Functions: lowercase (e.g., word_count, days_ago)
  • Types: PascalCase (e.g., PriorityType, CustomDateType)
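
A short sketch tying the three conventions together (the names below are purely illustrative):

from nlql import register_operator, register_function, register_meta_field
from nlql.types import BaseType

@register_operator("HAS_DIGIT")       # operators: UPPERCASE
def has_digit(text: str) -> bool:
    return any(ch.isdigit() for ch in text)

@register_function("char_count")      # functions: lowercase
def char_count(text: str) -> int:
    return len(text)

class SeverityType(BaseType):         # types: PascalCase
    """Illustrative custom metadata type."""

register_meta_field("severity", SeverityType)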

2. Type Hints

Always use type hints for better IDE support:

@register_function("my_func")
def my_func(text: str, threshold: int) -> bool:
    ...

3. Documentation

Document custom extensions:

@register_operator("CUSTOM_OP")
def custom_op(text: str) -> bool:
    """Check if text matches custom criteria.

    Args:
        text: Input text to check

    Returns:
        True if criteria is met
    """
    ...

4. Error Handling

Handle errors gracefully:

from nlql.errors import NLQLExecutionError

@register_function("safe_func")
def safe_func(value: str) -> int:
    try:
        return int(value)
    except ValueError as e:
        raise NLQLExecutionError(f"Cannot convert '{value}' to int") from e

Complete Example

For a comprehensive demonstration of all extensibility features, see the examples/extensibility_demo.py file in the repository.

This example shows:

  • Custom Functions: WORD_COUNT(), UPPERCASE(), EXTRACT_YEAR()
  • Custom Operators: STARTS_WITH(), HAS_DIGIT(), REGEX_MATCH()
  • Custom Embedding Provider: Simple statistics-based embedding
  • Integration: How custom extensions work with built-in NLQL features

Run the example:

python examples/extensibility_demo.py

Common Issues and Solutions

1. Function/Operator Name Conflicts

Problem: Custom function names containing built-in keywords (like COUNT, IS) may cause parsing errors.

Solution: Avoid using built-in keywords as prefixes in custom names:

# ❌ Bad - contains built-in keyword "COUNT"
@register_function("COUNTWORDS")
def count_words(text: str) -> int:
    return len(text.split())

# ✅ Good - no keyword conflicts
@register_function("NUMWORDS")
def count_words(text: str) -> int:
    return len(text.split())

Built-in keywords to avoid:

  • Functions: LENGTH, NOW, COUNT
  • Operators: MATCH, SIMILAR_TO, CONTAINS, IS, META

2. Custom Operators Must Be Uppercase

Problem: Lowercase operator names cause registration errors.

Solution: Always use UPPERCASE names for operators:

# ❌ Bad - lowercase name
@register_operator("starts_with")
def starts_with(text: str, prefix: str) -> bool:
    return text.startswith(prefix)

# ✅ Good - uppercase name
@register_operator("STARTS_WITH")
def starts_with(text: str, prefix: str) -> bool:
    return text.startswith(prefix)

3. Embedding Provider Signature

Problem: Custom embedding provider has wrong signature.

Solution: Embedding providers must accept list[str] and return list[list[float]]:

# ❌ Bad - wrong signature (single text)
def my_embedding(text: str) -> list[float]:
    return [0.1, 0.2, 0.3]

# ✅ Good - correct signature (batch processing)
def my_embedding(texts: list[str]) -> list[list[float]]:
    return [[0.1, 0.2, 0.3] for _ in texts]

4. Handling None Values in Functions

Problem: Functions returning None cause comparison errors.

Solution: Return a default value instead of None:

# ❌ Bad - returns None
@register_function("EXTRACT_YEAR")
def extract_year(text: str) -> int | None:
    match = re.search(r'\b(19|20)\d{2}\b', text)
    return int(match.group()) if match else None

# ✅ Good - returns default value
@register_function("EXTRACT_YEAR")
def extract_year(text: str) -> int:
    match = re.search(r'\b(19|20)\d{2}\b', text)
    return int(match.group()) if match else 0

5. Operator Argument Evaluation

Problem: Operator receives unevaluated AST nodes instead of values.

Solution: The evaluator automatically evaluates arguments before passing them to operators. Just define your operator with the expected types:

@register_operator("STARTS_WITH")
def starts_with(text: str, prefix: str) -> bool:
    # text and prefix are already evaluated strings
    return text.startswith(prefix)

Next Steps

  • Check API Reference for detailed API docs
  • See Architecture for system design
  • Run the example files in the examples/ directory for hands-on learning