Published

Text Embedding Service

Text Embedding Service

Overview

This feature integrates semantic search capabilities by computing text embeddings for products using DJL (Deep Java Library). These embeddings are stored in ElasticSearch as dense_vector fields, enabling semantic similarity queries.

Components

1. DjlTextEmbeddingService

Provided by the shared embedding-djl module in org.open4goods.embedding.service and auto-configured for API and front-api.

  • Default models: intfloat/multilingual-e5-small (text) with a multimodal CLIP fallback.
  • Tokenizer: Uses ai.djl.huggingface:tokenizers.
  • Output: Normalised embeddings (default dimension 512) with mean pooling.

2. NamesAggregationService Integration

The embedding is computed during the product aggregation phase (onProduct).

  • Trigger: Runs whenever a product has enough descriptive text (vertical is optional).
  • Input Text: Construction of Vertical Prefix (Category) + Product Name + Top Offer Names + Popular Attribute name/value pairs.
  • Throttling: Text truncated to 1000 characters (model limit ~512 tokens).

3. Product Model & Storage

  • Java Model: Product class has a float[] embedding field.
  • ElasticSearch: product-mappings.json defines embedding as dense_vector with cosine similarity.

Configuration

Configure via embedding.* properties (see embedding-djl module):

  • embedding.text-model-url / embedding.multimodal-model-url to choose remote identifiers (primary + fallback).
  • embedding.fail-on-missing-model=true makes startup fail if neither model loads.
  • Health details are exposed by DjlEmbeddingHealthIndicator when Spring Boot Actuator is enabled.

Deployment

  • Dependencies: Requires ai.djl.huggingface:tokenizers and the PyTorch engine.
  • Resources: First run downloads models from the configured DJL model URLs.
  • Performance: Computation is CPU-based. Latency per item is approx 10-50ms depending on CPU.

Troubleshooting

  • Missing Tokenizer: Ensure ai.djl.huggingface:tokenizers is in the classpath.
  • Memory: Vectors are large. ElasticSearch storage size will increase.
  • Logs: Check DjlTextEmbeddingService logs for initialization errors and health indicator output.