Extensibility¶
NLQL is designed to be highly extensible. You can customize almost every aspect of query execution.
Registration Levels¶
NLQL supports two levels of registration:
- Global Registration: Functions/operators/providers are registered globally and available to all NLQL instances
- Instance-Level Registration: Functions/operators/providers are registered to a specific NLQL instance only
Instance-level registrations take precedence over global registrations, allowing you to override global behavior for specific instances.
Custom Operators¶
Register domain-specific operators:
from nlql import register_operator
import re
@register_operator("HAS_EMAIL")
def has_email(text: str) -> bool:
    """Check if text contains an email address."""
    pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
    return bool(re.search(pattern, text))

@register_operator("HAS_URL")
def has_url(text: str) -> bool:
    """Check if text contains a URL."""
    pattern = r'https?://[^\s]+'
    return bool(re.search(pattern, text))
Use in queries:
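Once registered, the operators are available in query conditions, e.g. SELECT CHUNK WHERE HAS_EMAIL(content) (the content field name is illustrative). The predicate logic itself is plain Python and can be sanity-checked standalone:

```python
import re

# Same patterns as the registered operators above
EMAIL_PATTERN = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
URL_PATTERN = r'https?://[^\s]+'

def has_email(text: str) -> bool:
    return bool(re.search(EMAIL_PATTERN, text))

def has_url(text: str) -> bool:
    return bool(re.search(URL_PATTERN, text))

print(has_email("Contact us at support@example.com"))  # True
print(has_url("Docs live at https://example.com/docs"))  # True
print(has_email("No address here"))  # False
```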
Custom Functions¶
Add query functions:
from nlql import register_function
@register_function("word_count")
def word_count(text: str) -> int:
    """Count words in text."""
    return len(text.split())

@register_function("days_ago")
def days_ago(days: int) -> str:
    """Get date N days ago."""
    from datetime import datetime, timedelta
    date = datetime.now() - timedelta(days=days)
    return date.strftime("%Y-%m-%d")
Use in queries:
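Registered functions can then appear in query expressions, e.g. SELECT CHUNK WHERE word_count(content) > 100 (field name illustrative). Their behavior is ordinary Python and easy to verify on its own:

```python
from datetime import datetime, timedelta

def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())

def days_ago(days: int) -> str:
    """Return the date N days ago as YYYY-MM-DD."""
    return (datetime.now() - timedelta(days=days)).strftime("%Y-%m-%d")

print(word_count("one two three"))  # 3
print(days_ago(7))  # a date string 7 days back, e.g. "2024-01-01"
```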
Instance-Level Registration¶
You can register functions, operators, and embedding providers to specific NLQL instances instead of globally. This is useful for:
- Multi-tenant applications with different business logic per tenant
- A/B testing different implementations
- Isolating test environments from production
- Domain-specific query engines with specialized functions
Example: Different Function Implementations per Instance
from nlql import NLQL
from nlql.adapters import MemoryAdapter

# Create two NLQL instances with separate adapters
nlql1 = NLQL(adapter=MemoryAdapter())
nlql2 = NLQL(adapter=MemoryAdapter())
# Register different implementations to each instance
@nlql1.register_function("WORD_COUNT")
def word_count_total(text: str) -> int:
    """Count total words."""
    return len(text.split())

@nlql2.register_function("WORD_COUNT")
def word_count_unique(text: str) -> int:
    """Count unique words."""
    return len(set(text.lower().split()))
# Each instance uses its own implementation
results1 = nlql1.execute("SELECT CHUNK WHERE WORD_COUNT(content) > 10")
results2 = nlql2.execute("SELECT CHUNK WHERE WORD_COUNT(content) > 10")
Example: Different Operator Implementations per Instance
# Register different operators to each instance
@nlql1.register_operator("CUSTOM_FILTER")
def filter_python(text: str) -> bool:
    return "Python" in text

@nlql2.register_operator("CUSTOM_FILTER")
def filter_ai(text: str) -> bool:
    return "AI" in text
# Each instance uses its own operator
results1 = nlql1.execute("SELECT CHUNK WHERE CUSTOM_FILTER(content)")
results2 = nlql2.execute("SELECT CHUNK WHERE CUSTOM_FILTER(content)")
Example: Different Embedding Providers per Instance
# Register different embedding providers to each instance
@nlql1.register_embedding_provider
def embedding_word_based(texts: list[str]) -> list[list[float]]:
    return [[len(text.split()) / 10.0, 0.5, 0.5] for text in texts]

@nlql2.register_embedding_provider
def embedding_char_based(texts: list[str]) -> list[list[float]]:
    return [[len(text) / 50.0, 0.5, 0.5] for text in texts]
# Each instance uses its own embedding provider
results1 = nlql1.execute('SELECT CHUNK WHERE SIMILAR_TO("query") > 0.5')
results2 = nlql2.execute('SELECT CHUNK WHERE SIMILAR_TO("query") > 0.5')
Priority Rules:
- Instance-level registrations take precedence over global registrations
- If a function/operator is registered both globally and to an instance, the instance-level version is used
- Instance-level registrations do not affect the global registry or other instances
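The resolution rule above amounts to a two-level lookup: check the instance registry first, then fall back to the global one. A minimal sketch of that rule (not NLQL's actual internals, just the lookup semantics):

```python
from collections import ChainMap

# Hypothetical registries: instance entries shadow global ones
global_registry = {"WORD_COUNT": lambda t: len(t.split())}
instance_registry = {"WORD_COUNT": lambda t: len(set(t.lower().split()))}

# ChainMap searches its maps in order, so the instance registry wins
resolver = ChainMap(instance_registry, global_registry)

text = "the quick the lazy the"
print(global_registry["WORD_COUNT"](text))  # 5 (total words)
print(resolver["WORD_COUNT"](text))         # 3 (unique words; instance version used)
```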
See the examples/instance_registry_demo.py file in the repository for a complete working example.
Custom Types¶
Define metadata field types for type-safe comparisons:
from nlql import register_meta_field
from nlql.types import BaseType, NumberType, DateType, TextType
# Register built-in types
register_meta_field("score", NumberType)
register_meta_field("created_at", DateType)
register_meta_field("status", TextType)
# Create custom type
class PriorityType(BaseType):
    """Custom priority type with special comparison logic."""

    LEVELS = {"low": 1, "medium": 2, "high": 3, "critical": 4}

    def __init__(self, value: str | int):
        if isinstance(value, str):
            value = self.LEVELS.get(value.lower(), 0)
        super().__init__(value)

    def __lt__(self, other):
        other_val = other.value if isinstance(other, BaseType) else other
        return self.value < other_val

    def __gt__(self, other):
        other_val = other.value if isinstance(other, BaseType) else other
        return self.value > other_val

    def __eq__(self, other):
        other_val = other.value if isinstance(other, BaseType) else other
        return self.value == other_val
# Register custom type
register_meta_field("priority", PriorityType)
Use in queries:
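With the priority field registered, a comparison like META(priority) > "medium" in a query compares by level rather than alphabetically. The comparison logic can be exercised on its own (using a plain stand-in for BaseType, since the real base class lives in nlql.types):

```python
class BaseType:
    """Plain stand-in for nlql.types.BaseType."""
    def __init__(self, value):
        self.value = value

class PriorityType(BaseType):
    LEVELS = {"low": 1, "medium": 2, "high": 3, "critical": 4}

    def __init__(self, value):
        if isinstance(value, str):
            value = self.LEVELS.get(value.lower(), 0)
        super().__init__(value)

    def __gt__(self, other):
        other_val = other.value if isinstance(other, BaseType) else other
        return self.value > other_val

# "high" outranks "medium" by level, even though "high" < "medium" alphabetically
print(PriorityType("high") > PriorityType("medium"))  # True
print("high" > "medium")                              # False
```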
Custom Splitters¶
Implement language-specific or domain-specific text splitting:
from nlql import register_splitter
@register_splitter("SENTENCE")
def german_sentence_splitter(text: str) -> list[str]:
    """Split German text into sentences."""
    import nltk
    return nltk.sent_tokenize(text, language='german')

@register_splitter("PARAGRAPH")
def paragraph_splitter(text: str) -> list[str]:
    """Split text into paragraphs."""
    return [p.strip() for p in text.split('\n\n') if p.strip()]
Use in queries:
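Splitters are selected by the unit named in the query, e.g. SELECT PARAGRAPH WHERE CONTAINS("AI"). The paragraph splitter above is pure Python and can be checked directly (the German sentence splitter is omitted here because it needs nltk):

```python
def paragraph_splitter(text: str) -> list[str]:
    """Split text into paragraphs on blank lines, dropping empty ones."""
    return [p.strip() for p in text.split('\n\n') if p.strip()]

doc = "First paragraph.\n\nSecond paragraph.\n\n\n\n"
print(paragraph_splitter(doc))  # ['First paragraph.', 'Second paragraph.']
```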
Custom Embedding Provider¶
NLQL uses embedding providers for semantic search (SIMILAR_TO operator). You can customize the embedding model used for vectorization.
Default Provider¶
By default, NLQL uses sentence-transformers with the all-MiniLM-L6-v2 model:
from nlql import NLQL
from nlql.adapters import MemoryAdapter
# Uses default embedding provider automatically
adapter = MemoryAdapter()
nlql = NLQL(adapter=adapter)
# SIMILAR_TO will use all-MiniLM-L6-v2
results = nlql.execute('SELECT CHUNK WHERE SIMILAR_TO("AI") > 0.7')
Custom Provider with OpenAI¶
Use OpenAI's embedding API:
from nlql.registry.embedding import register_embedding_provider
# Use decorator syntax (recommended)
@register_embedding_provider
def openai_embedding_provider(texts: list[str]) -> list[list[float]]:
    """Generate embeddings using OpenAI API."""
    import openai
    # Configure your API key
    openai.api_key = "your-api-key"
    response = openai.Embedding.create(
        input=texts,
        model="text-embedding-ada-002"
    )
    return [item["embedding"] for item in response["data"]]
# Now SIMILAR_TO will use OpenAI embeddings
nlql = NLQL(adapter=adapter)
results = nlql.execute('SELECT CHUNK WHERE SIMILAR_TO("AI") > 0.7')
Note: Embedding providers must be functions with signature (list[str]) -> list[list[float]]. They receive a batch of texts and return a batch of embedding vectors. You can use either decorator syntax (@register_embedding_provider) or function call syntax (register_embedding_provider(my_func)).
Custom Provider with Different Sentence-Transformers Model¶
Use a different sentence-transformers model:
from nlql.registry.embedding import register_embedding_provider
from sentence_transformers import SentenceTransformer
# Load model once (lazy loading)
_model = None

@register_embedding_provider
def custom_embedding_provider(texts: list[str]) -> list[list[float]]:
    """Generate embeddings using a different sentence-transformers model."""
    global _model
    if _model is None:
        _model = SentenceTransformer("all-mpnet-base-v2")
    embeddings = _model.encode(texts, convert_to_numpy=True)
    return embeddings.tolist()
Custom Provider with Hugging Face Models¶
Use any Hugging Face model:
from nlql.registry.embedding import register_embedding_provider
from transformers import AutoTokenizer, AutoModel
import torch
# Load model once (lazy loading)
_tokenizer = None
_model = None

@register_embedding_provider
def huggingface_embedding_provider(texts: list[str]) -> list[list[float]]:
    """Generate embeddings using Hugging Face transformers."""
    global _tokenizer, _model
    if _tokenizer is None or _model is None:
        _tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        _model = AutoModel.from_pretrained("bert-base-uncased")
    # Tokenize
    encoded = _tokenizer(
        texts,
        padding=True,
        truncation=True,
        return_tensors="pt"
    )
    # Get model output
    with torch.no_grad():
        output = _model(**encoded)
    # Mean pooling
    embeddings = output.last_hidden_state.mean(dim=1)
    return embeddings.tolist()
Provider Interface¶
All embedding providers must be functions with this signature:
def embedding_provider(texts: list[str]) -> list[list[float]]:
    """Generate embeddings for a list of texts.

    Args:
        texts: List of text strings to embed

    Returns:
        List of embedding vectors (each vector is a list of floats)

    Example:
        >>> embedding_provider(["hello", "world"])
        [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
    """
    # Your embedding logic here
    pass
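Downstream, SIMILAR_TO presumably scores each chunk by comparing its embedding with the query embedding using a similarity measure such as cosine similarity. A dependency-free sketch of that comparison (an illustration of the scoring idea, not NLQL's internals):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as having no similarity
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```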
Best Practices¶
- Lazy Loading: Load models only when needed to save memory.
- Batch Processing: Process multiple texts at once for efficiency.
- Error Handling: Handle errors gracefully.
- Caching: Consider caching embeddings for frequently used texts:

from nlql.registry.embedding import register_embedding_provider

# Cache for storing embeddings
_cache = {}
_base_model = None

@register_embedding_provider
def cached_embedding_provider(texts: list[str]) -> list[list[float]]:
    """Embedding provider with caching."""
    global _cache, _base_model
    if _base_model is None:
        from sentence_transformers import SentenceTransformer
        _base_model = SentenceTransformer("all-MiniLM-L6-v2")
    # Pre-size results so embeddings stay aligned with input order
    results = [None] * len(texts)
    uncached = []
    uncached_indices = []
    for i, text in enumerate(texts):
        if text in _cache:
            results[i] = _cache[text]
        else:
            uncached.append(text)
            uncached_indices.append(i)
    if uncached:
        new_embeddings = _base_model.encode(uncached).tolist()
        for i, text, emb in zip(uncached_indices, uncached, new_embeddings):
            _cache[text] = emb
            results[i] = emb
    return results
Custom Adapters¶
Create adapters for new data sources:
from nlql.adapters import BaseAdapter, QueryPlan
from nlql.text.units import Chunk, TextUnit
class ElasticsearchAdapter(BaseAdapter):
    """Adapter for Elasticsearch."""

    def __init__(self, es_client, index_name: str):
        self.client = es_client
        self.index = index_name

    def query(self, plan: QueryPlan) -> list[TextUnit]:
        # Build Elasticsearch query
        es_query = {"bool": {"must": []}}
        # Add metadata filters
        if plan.filters:
            for field, value in plan.filters.items():
                es_query["bool"]["must"].append({
                    "term": {field: value}
                })
        # Add semantic search (if using vector field)
        if plan.query_text:
            # Implement vector search
            pass
        # Execute query
        response = self.client.search(
            index=self.index,
            query=es_query,
            size=plan.limit or 10,
        )
        # Convert to TextUnit
        results = []
        for hit in response["hits"]["hits"]:
            chunk = Chunk(
                content=hit["_source"]["content"],
                metadata=hit["_source"].get("metadata", {}),
                chunk_id=hit["_id"],
                position=0,
            )
            results.append(chunk)
        return results

    def supports_semantic_search(self) -> bool:
        return True  # If you have vector fields

    def supports_metadata_filter(self) -> bool:
        return True
Configuration¶
Customize NLQL behavior with NLQLConfig:
from nlql import NLQL, NLQLConfig
from nlql.adapters import MemoryAdapter
# Create configuration
config = NLQLConfig(
    default_limit=100,  # Default LIMIT when query doesn't specify one
    debug_mode=True,    # Enable debug logging
)
# Create NLQL instance with config
adapter = MemoryAdapter()
nlql = NLQL(adapter=adapter, config=config)
# Query without LIMIT will use default_limit=100
results = nlql.execute("SELECT CHUNK WHERE CONTAINS('AI')")
Available Configuration Options:
- default_limit (int | None): Default LIMIT value when the query doesn't specify one. Default: None (no limit)
- debug_mode (bool): Enable debug logging for query execution steps. Default: False
- enable_caching (bool): Reserved for future caching implementation. Default: False
- custom_settings (dict): Reserved for future extensibility. Default: {}
Best Practices¶
1. Naming Conventions¶
- Operators: UPPERCASE (e.g., HAS_EMAIL, CONTAINS_CODE)
- Functions: lowercase (e.g., word_count, days_ago)
- Types: PascalCase (e.g., PriorityType, CustomDateType)
2. Type Hints¶
Always use type hints for better IDE support.
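For example, a fully annotated custom function (the function itself is illustrative):

```python
def average_word_length(text: str) -> float:
    """Average length of whitespace-separated words; 0.0 for empty text."""
    words: list[str] = text.split()
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)

print(average_word_length("ab abcd"))  # 3.0
```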
3. Documentation¶
Document custom extensions:
@register_operator("CUSTOM_OP")
def custom_op(text: str) -> bool:
    """Check if text matches custom criteria.

    Args:
        text: Input text to check

    Returns:
        True if criteria is met
    """
    ...
4. Error Handling¶
Handle errors gracefully:
from nlql.errors import NLQLExecutionError
@register_function("safe_func")
def safe_func(value: str) -> int:
    try:
        return int(value)
    except ValueError as e:
        raise NLQLExecutionError(f"Cannot convert '{value}' to int") from e
Complete Example¶
For a comprehensive demonstration of all extensibility features, see the examples/extensibility_demo.py file in the repository.
This example shows:
- Custom Functions: WORD_COUNT(), UPPERCASE(), EXTRACT_YEAR()
- Custom Operators: STARTS_WITH(), HAS_DIGIT(), REGEX_MATCH()
- Custom Embedding Provider: Simple statistics-based embedding
- Integration: How custom extensions work with built-in NLQL features
Run the example: python examples/extensibility_demo.py
Common Issues and Solutions¶
1. Function/Operator Name Conflicts¶
Problem: Custom function names containing built-in keywords (like COUNT, IS) may cause parsing errors.
Solution: Avoid using built-in keywords as prefixes in custom names:
# ❌ Bad - contains built-in keyword "COUNT"
@register_function("COUNTWORDS")
def count_words(text: str) -> int:
    return len(text.split())

# ✅ Good - no keyword conflicts
@register_function("NUMWORDS")
def count_words(text: str) -> int:
    return len(text.split())
Built-in keywords to avoid:
- Functions: LENGTH, NOW, COUNT
- Operators: MATCH, SIMILAR_TO, CONTAINS, IS, META
2. Custom Operators Must Be Uppercase¶
Problem: Lowercase operator names cause registration errors.
Solution: Always use UPPERCASE names for operators:
# ❌ Bad - lowercase name
@register_operator("starts_with")
def starts_with(text: str, prefix: str) -> bool:
    return text.startswith(prefix)

# ✅ Good - uppercase name
@register_operator("STARTS_WITH")
def starts_with(text: str, prefix: str) -> bool:
    return text.startswith(prefix)
3. Embedding Provider Signature¶
Problem: Custom embedding provider has wrong signature.
Solution: Embedding providers must accept list[str] and return list[list[float]]:
# ❌ Bad - wrong signature (single text)
def my_embedding(text: str) -> list[float]:
    return [0.1, 0.2, 0.3]

# ✅ Good - correct signature (batch processing)
def my_embedding(texts: list[str]) -> list[list[float]]:
    return [[0.1, 0.2, 0.3] for _ in texts]
4. Handling None Values in Functions¶
Problem: Functions returning None cause comparison errors.
Solution: Return a default value instead of None:
# ❌ Bad - returns None
@register_function("EXTRACT_YEAR")
def extract_year(text: str) -> int | None:
    match = re.search(r'\b(19|20)\d{2}\b', text)
    return int(match.group()) if match else None

# ✅ Good - returns default value
@register_function("EXTRACT_YEAR")
def extract_year(text: str) -> int:
    match = re.search(r'\b(19|20)\d{2}\b', text)
    return int(match.group()) if match else 0
5. Operator Argument Evaluation¶
Problem: Operator receives unevaluated AST nodes instead of values.
Solution: The evaluator automatically evaluates arguments before passing them to operators. Just define your operator with the expected types:
@register_operator("STARTS_WITH")
def starts_with(text: str, prefix: str) -> bool:
    # text and prefix are already evaluated strings
    return text.startswith(prefix)
Next Steps¶
- Check API Reference for detailed API docs
- See Architecture for system design
- Run the example files in the examples/ directory for hands-on learning