Current Status
The current implementation of ChromaDBVectorMemory in the AutoGen extension package doesn't expose parameters for setting custom embedding functions. It relies entirely on ChromaDB's default embedding function (Sentence Transformers all-MiniLM-L6-v2).
Goal
Allow users to customize the embedding function used by ChromaDBVectorMemory through a flexible, declarative configuration system that supports:
Default embedding function (current behavior)
Alternative Sentence Transformer models
OpenAI embeddings
Custom user-defined embedding functions
Rough Sketch of an Implementation Plan
1. Create Base Configuration Classes
Create a hierarchy of embedding function configurations:
from typing import Literal

from pydantic import BaseModel, Field


class BaseEmbeddingFunctionConfig(BaseModel):
    """Base configuration for embedding functions."""

    function_type: Literal["default", "sentence_transformer", "openai", "custom"]


class DefaultEmbeddingFunctionConfig(BaseEmbeddingFunctionConfig):
    """Configuration for the default embedding function."""

    function_type: Literal["default", "sentence_transformer", "openai", "custom"] = "default"


class SentenceTransformerEmbeddingFunctionConfig(BaseEmbeddingFunctionConfig):
    """Configuration for SentenceTransformer embedding functions."""

    function_type: Literal["default", "sentence_transformer", "openai", "custom"] = "sentence_transformer"
    model_name: str = Field(default="all-MiniLM-L6-v2", description="Model name to use")


class OpenAIEmbeddingFunctionConfig(BaseEmbeddingFunctionConfig):
    """Configuration for OpenAI embedding functions."""

    function_type: Literal["default", "sentence_transformer", "openai", "custom"] = "openai"
    api_key: str = Field(default="", description="OpenAI API key")
    model_name: str = Field(default="text-embedding-ada-002", description="Model name")
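Since the goal is a declarative configuration, it may help to see how the function_type tag could select the right config class when a config is loaded from a plain dict. The following is a minimal stdlib-only sketch, not the proposed implementation: the dataclasses stand in for the pydantic models above, and CONFIG_TYPES and load_embedding_config are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Dict, Type


# Stand-ins for the pydantic config classes sketched in this issue.
@dataclass
class DefaultConfig:
    function_type: str = "default"


@dataclass
class SentenceTransformerConfig:
    function_type: str = "sentence_transformer"
    model_name: str = "all-MiniLM-L6-v2"


@dataclass
class OpenAIConfig:
    function_type: str = "openai"
    api_key: str = ""
    model_name: str = "text-embedding-ada-002"


# Hypothetical lookup table: function_type tag -> config class.
CONFIG_TYPES: Dict[str, Type[Any]] = {
    "default": DefaultConfig,
    "sentence_transformer": SentenceTransformerConfig,
    "openai": OpenAIConfig,
}


def load_embedding_config(data: Dict[str, Any]) -> Any:
    """Pick the config class from the function_type tag and build it."""
    tag = data.get("function_type", "default")
    try:
        config_cls = CONFIG_TYPES[tag]
    except KeyError:
        raise ValueError(f"Unsupported embedding function type: {tag}") from None
    return config_cls(**data)
```

Without some tag-based selection like this, a field typed as the base class would likely lose the subclass-specific fields (such as model_name) when a config is reloaded.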
2. Support Custom Embedding Functions
Add a configuration for custom embedding functions using the direct function approach:
from typing import Any, Callable, Dict


class CustomEmbeddingFunctionConfig(BaseEmbeddingFunctionConfig):
    """Configuration for custom embedding functions."""

    function_type: Literal["default", "sentence_transformer", "openai", "custom"] = "custom"
    function: Callable[..., Any] = Field(description="Function that returns an embedding function")
    params: Dict[str, Any] = Field(default_factory=dict, description="Parameters")
Note: Using a direct function in the configuration will make it non-serializable. The implementation should include appropriate warnings when users attempt to serialize configurations that contain function references.
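One way the serialization warning from the note could look is sketched below. This is an assumption about behavior, not a settled design: CustomEmbeddingConfigSketch and its dump method are hypothetical, and a dataclass stands in for the pydantic model so the example runs with the standard library alone.

```python
import warnings
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class CustomEmbeddingConfigSketch:
    """Hypothetical stand-in for CustomEmbeddingFunctionConfig."""

    function: Callable[..., Any]
    params: Dict[str, Any] = field(default_factory=dict)
    function_type: str = "custom"

    def dump(self) -> Dict[str, Any]:
        """Return a JSON-safe dict, warning that the function is dropped."""
        warnings.warn(
            "Custom embedding function cannot be serialized; "
            "the 'function' field is omitted from the output.",
            UserWarning,
            stacklevel=2,
        )
        return {"function_type": self.function_type, "params": dict(self.params)}
```

Dropping the field with a warning keeps the rest of the config serializable; raising an error instead would be the stricter alternative.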
3. Update ChromaDBVectorMemory Configuration
Extend the existing ChromaDBVectorMemoryConfig class to include the embedding function configuration:
class ChromaDBVectorMemoryConfig(BaseModel):
    # Existing fields...
    embedding_function_config: BaseEmbeddingFunctionConfig = Field(
        default_factory=DefaultEmbeddingFunctionConfig,
        description="Configuration for the embedding function",
    )
4. Implement Embedding Function Creation
Add a method to ChromaDBVectorMemory that creates embedding functions based on configuration:
def _create_embedding_function(self):
    """Create an embedding function based on the configuration."""
    from typing import cast

    from chromadb.utils import embedding_functions

    config = self._config.embedding_function_config
    if config.function_type == "default":
        return embedding_functions.DefaultEmbeddingFunction()
    elif config.function_type == "sentence_transformer":
        cfg = cast(SentenceTransformerEmbeddingFunctionConfig, config)
        return embedding_functions.SentenceTransformerEmbeddingFunction(
            model_name=cfg.model_name
        )
    elif config.function_type == "openai":
        cfg = cast(OpenAIEmbeddingFunctionConfig, config)
        return embedding_functions.OpenAIEmbeddingFunction(
            api_key=cfg.api_key,
            model_name=cfg.model_name,
        )
    elif config.function_type == "custom":
        cfg = cast(CustomEmbeddingFunctionConfig, config)
        return cfg.function(**cfg.params)
    else:
        raise ValueError(f"Unsupported embedding function type: {config.function_type}")
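The custom branch trusts whatever the user-supplied factory returns. A small guard could catch the most obvious mistake (a factory that returns a non-callable) before the object reaches ChromaDB. This is only a sketch of that idea: build_custom_embedder is a hypothetical helper, and a real check would probably also look at the call signature.

```python
from typing import Any, Callable, Dict


def build_custom_embedder(function: Callable[..., Any], params: Dict[str, Any]) -> Any:
    """Call the user factory and verify that it produced a callable embedder."""
    embedder = function(**params)
    if not callable(embedder):
        raise TypeError(
            f"Custom embedding factory returned {type(embedder).__name__}, "
            "expected a callable embedding function"
        )
    return embedder
```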
5. Update Collection Initialization
Modify the _ensure_initialized method to use the embedding function:
def _ensure_initialized(self) -> None:
    # ... existing client initialization code ...
    if self._collection is None:
        try:
            # Create embedding function
            embedding_function = self._create_embedding_function()
            # Create or get collection with embedding function
            self._collection = self._client.get_or_create_collection(
                name=self._config.collection_name,
                metadata={"distance_metric": self._config.distance_metric},
                embedding_function=embedding_function,
            )
        except Exception as e:
            logger.error(f"Failed to get/create collection: {e}")
            raise
Example Usage
# Using default embedding function
memory = ChromaDBVectorMemory(
    config=PersistentChromaDBVectorMemoryConfig()
)

# Using a specific Sentence Transformer model
memory = ChromaDBVectorMemory(
    config=PersistentChromaDBVectorMemoryConfig(
        embedding_function_config=SentenceTransformerEmbeddingFunctionConfig(
            model_name="paraphrase-multilingual-mpnet-base-v2"
        )
    )
)

# Using OpenAI embeddings
memory = ChromaDBVectorMemory(
    config=PersistentChromaDBVectorMemoryConfig(
        embedding_function_config=OpenAIEmbeddingFunctionConfig(
            api_key="sk-...",
            model_name="text-embedding-3-small"
        )
    )
)

# Using a custom embedding function (direct function approach)
def create_my_embedder(param1="default"):
    # Return a ChromaDB-compatible embedding function
    class MyCustomEmbeddingFunction(EmbeddingFunction):
        def __init__(self, param1):
            # Store the parameter so the constructor call below is valid
            self.param1 = param1

        def __call__(self, input: Documents) -> Embeddings:
            # Custom embedding logic here
            return embeddings

    return MyCustomEmbeddingFunction(param1)

memory = ChromaDBVectorMemory(
    config=PersistentChromaDBVectorMemoryConfig(
        embedding_function_config=CustomEmbeddingFunctionConfig(
            function=create_my_embedder,
            params={"param1": "custom_value"}
        )
    )
)
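For trying out the custom path without downloading a model or calling an API, a deterministic toy embedder can be handy. The sketch below is an illustration only: HashEmbeddingFunction and create_hash_embedder are hypothetical names, the vectors carry no semantic meaning, and the class deliberately does not subclass ChromaDB's EmbeddingFunction, so depending on the ChromaDB version it may need adapting before it can be passed to a collection.

```python
import hashlib
from typing import List


class HashEmbeddingFunction:
    """Toy embedder: maps each document to a fixed-length vector in [0, 1]."""

    def __init__(self, dim: int = 8) -> None:
        if not 1 <= dim <= 32:
            raise ValueError("dim must be between 1 and 32 (SHA-256 yields 32 bytes)")
        self.dim = dim

    def __call__(self, input: List[str]) -> List[List[float]]:
        vectors = []
        for doc in input:
            # Same text always hashes to the same bytes, so output is reproducible.
            digest = hashlib.sha256(doc.encode("utf-8")).digest()
            vectors.append([b / 255.0 for b in digest[: self.dim]])
        return vectors


def create_hash_embedder(dim: int = 8) -> HashEmbeddingFunction:
    """Factory in the shape expected by the `function` field sketched above."""
    return HashEmbeddingFunction(dim)
```

A factory like create_hash_embedder would be passed as function, with params={"dim": 8}, in the same way as create_my_embedder in the example above.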
@mpegram3rd
Thanks for the help with the types PR!
Is this an issue you might be interested in working on?
I have a rough sketch above mostly just as an initial design ... more work will be needed to arrive at a clean implementation with no side effects.
@victordibia Thanks for thinking of me. I doubt I could do anything with it this week, but if it is not a rush I could possibly take a look into it sometime next week (no promises).
Reading through your suggested fixes, I'd also need some time to really digest how this all fits together in the big picture.