Text Embedding Service
Overview
This feature integrates semantic search capabilities by computing text embeddings for products using DJL (Deep Java Library).
These embeddings are stored in ElasticSearch as dense_vector fields, enabling semantic similarity queries.
Components
1. DjlTextEmbeddingService
Provided by the shared embedding-djl module in org.open4goods.embedding.service and auto-configured for API and front-api.
- Default models: intfloat/multilingual-e5-small (text) with a multimodal CLIP fallback.
- Tokenizer: Uses ai.djl.huggingface:tokenizers.
- Output: Normalised embeddings (default dimension 512) with mean pooling.
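To illustrate the "mean pooling" and "normalised" output steps, here is a minimal plain-Java sketch. It is not the module's actual code: the class name PoolingSketch, the method names, and the use of an attention mask to skip padding tokens are assumptions made for illustration.

```java
// Hypothetical helper, not part of embedding-djl: shows mean pooling followed by L2 normalisation.
final class PoolingSketch {

    private PoolingSketch() {
    }

    /** Averages token vectors, counting only positions whose attention mask is non-zero. */
    static float[] meanPool(float[][] tokenEmbeddings, long[] attentionMask) {
        if (tokenEmbeddings.length == 0 || tokenEmbeddings.length != attentionMask.length) {
            throw new IllegalArgumentException("Expected one mask entry per token vector");
        }
        int dim = tokenEmbeddings[0].length;
        float[] pooled = new float[dim];
        int counted = 0;
        for (int t = 0; t < tokenEmbeddings.length; t++) {
            if (attentionMask[t] == 0) {
                continue; // assumed convention: 0 marks a padding token
            }
            for (int d = 0; d < dim; d++) {
                pooled[d] += tokenEmbeddings[t][d];
            }
            counted++;
        }
        if (counted == 0) {
            return pooled; // only padding: return the zero vector
        }
        for (int d = 0; d < dim; d++) {
            pooled[d] /= counted;
        }
        return pooled;
    }

    /** Scales the vector to unit length; a zero vector is returned unchanged. */
    static float[] l2Normalize(float[] vector) {
        double sumOfSquares = 0.0;
        for (float x : vector) {
            sumOfSquares += (double) x * x;
        }
        double norm = Math.sqrt(sumOfSquares);
        float[] out = vector.clone();
        if (norm == 0.0) {
            return out;
        }
        for (int i = 0; i < out.length; i++) {
            out[i] = (float) (vector[i] / norm);
        }
        return out;
    }
}
```

Normalising to unit length means the cosine similarity between two embeddings reduces to their dot product.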
2. NamesAggregationService Integration
The embedding is computed during the product aggregation phase (onProduct).
- Trigger: Runs whenever a product has enough descriptive text (vertical is optional).
- Input Text: Built as Vertical Prefix (Category) + Product Name + Top Offer Names + Popular Attribute name/value pairs.
- Truncation: Text is truncated to 1000 characters (model limit ~512 tokens).
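The input construction and the 1000-character cap could look roughly like the sketch below. The class EmbeddingTextSketch, the space separator, and the "name: value" rendering of attributes are assumptions; the actual NamesAggregationService code may assemble the text differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: assembles the text to embed and caps its length.
final class EmbeddingTextSketch {

    /** Character cap taken from the description above. */
    static final int MAX_CHARS = 1000;

    private EmbeddingTextSketch() {
    }

    static String build(String verticalPrefix, String productName,
                        List<String> topOfferNames, Map<String, String> popularAttributes) {
        List<String> parts = new ArrayList<>();
        addIfPresent(parts, verticalPrefix); // the vertical is optional, so null is skipped
        addIfPresent(parts, productName);
        for (String offerName : topOfferNames) {
            addIfPresent(parts, offerName);
        }
        for (Map.Entry<String, String> attribute : popularAttributes.entrySet()) {
            // "name: value" is an assumed rendering of an attribute pair.
            addIfPresent(parts, attribute.getKey() + ": " + attribute.getValue());
        }
        return truncate(String.join(" ", parts), MAX_CHARS);
    }

    static String truncate(String text, int maxChars) {
        if (text.length() <= maxChars) {
            return text;
        }
        int end = maxChars;
        // Avoid cutting a surrogate pair in half.
        if (Character.isHighSurrogate(text.charAt(end - 1))) {
            end--;
        }
        return text.substring(0, end);
    }

    private static void addIfPresent(List<String> parts, String value) {
        if (value != null && !value.isBlank()) {
            parts.add(value.trim());
        }
    }
}
```

Truncating by characters is only a cheap guard. The tokenizer still decides how many tokens the model actually sees.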
3. Product Model & Storage
- Java Model: Product class has a float[] embedding field.
- ElasticSearch: product-mappings.json defines embedding as dense_vector with cosine similarity.
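For reference, the cosine similarity that the mapping selects can be computed in plain Java as sketched below. This is an illustration of the metric only, not of how ElasticSearch evaluates it internally; the class name CosineSketch is hypothetical.

```java
// Hypothetical helper: cosine similarity between two embedding vectors.
final class CosineSketch {

    private CosineSketch() {
    }

    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // convention chosen here: similarity with a zero vector is 0
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

A value close to 1 means the two texts point in nearly the same direction in embedding space, and 0 means they are orthogonal.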
Configuration
Configure via embedding.* properties (see embedding-djl module):
- embedding.text-model-url / embedding.multimodal-model-url to choose remote identifiers (primary + fallback).
- embedding.fail-on-missing-model=true makes startup fail if neither model loads.
- Health details are exposed by DjlEmbeddingHealthIndicator when Spring Boot Actuator is enabled.
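The primary-plus-fallback behaviour and the fail-on-missing-model switch can be pictured with the plain-Java sketch below. It does not use DJL and is not the module's implementation: ModelFallbackSketch, the loader function, and returning an empty Optional when the flag is false are all assumptions for illustration.

```java
import java.util.Optional;
import java.util.function.Function;

// Hypothetical helper: tries the text model first, then the multimodal fallback.
final class ModelFallbackSketch {

    private ModelFallbackSketch() {
    }

    static <M> Optional<M> resolve(String textModelUrl, String multimodalModelUrl,
                                   boolean failOnMissingModel, Function<String, M> loader) {
        for (String url : new String[] {textModelUrl, multimodalModelUrl}) {
            if (url == null || url.isBlank()) {
                continue; // property not set
            }
            try {
                M model = loader.apply(url);
                if (model != null) {
                    return Optional.of(model);
                }
            } catch (RuntimeException e) {
                // Loading failed: fall through to the next candidate.
            }
        }
        if (failOnMissingModel) {
            throw new IllegalStateException("No embedding model could be loaded");
        }
        // Assumed behaviour when the flag is false: start without a model.
        return Optional.empty();
    }
}
```

In this sketch the loader stands in for whatever actually fetches a model from a URL, so the control flow can be read without any DJL types.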
Deployment
- Dependencies: Requires ai.djl.huggingface:tokenizers and the PyTorch engine.
- Resources: First run downloads models from the configured DJL model URLs.
- Performance: Computation is CPU-based. Latency per item is approx 10-50ms depending on CPU.
Troubleshooting
- Missing Tokenizer: Ensure ai.djl.huggingface:tokenizers is in the classpath.
- Memory: Vectors are large. ElasticSearch storage size will increase.
- Logs: Check DjlTextEmbeddingService logs for initialization errors and health indicator output.
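To put a rough number on the memory note, the raw size of the vectors can be estimated as dimension × 4 bytes per product, assuming each component is stored as a 32-bit float. The sketch below covers only that raw payload; any index structures ElasticSearch builds on top add further overhead. The class name and the product count in the usage note are hypothetical.

```java
// Hypothetical helper: raw float payload of stored embeddings, excluding index overhead.
final class VectorStorageEstimate {

    private VectorStorageEstimate() {
    }

    static long rawBytes(long productCount, int dimension) {
        // Float.BYTES is 4: one 32-bit float per vector component.
        return productCount * dimension * Float.BYTES;
    }
}
```

With the default dimension of 512, one vector takes 2048 bytes, so a hypothetical catalogue of 1,000,000 products would carry about 2 GB of raw vector data.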