
Commit 33f122f

Merge pull request #2 from mpskex/master
Expanded Self-Query Retriever and Self-Query Retriever with MyScale
2 parents 5b6bbf4 + 7759afd commit 33f122f

17 files changed

+808
-432
lines changed

@@ -0,0 +1,370 @@
1+
{
2+
"cells": [
3+
{
4+
"attachments": {},
5+
"cell_type": "markdown",
6+
"id": "13afcae7",
7+
"metadata": {},
8+
"source": [
9+
"# Self-querying with MyScale\n",
10+
"\n",
11+
">[MyScale](https://docs.myscale.com/en/) is an integrated vector database. You can access your database in SQL and also from here, LangChain. MyScale can make use of [various data types and functions for filters](https://blog.myscale.com/2023/06/06/why-integrated-database-solution-can-boost-your-llm-apps/#filter-on-anything-without-constraints). It will boost your LLM app whether you are scaling up your data or expanding your system to broader applications.\n",
12+
"\n",
13+
"In this notebook we'll demo the `SelfQueryRetriever` wrapped around a MyScale vector store, with some extra pieces we contributed to LangChain. In short, they come down to four points:\n",
14+
"1. Add a `contain` comparator that matches a list when one or more of its elements match\n",
15+
"2. Add `timestamp` data type for datetime match (ISO-format, or YYYY-MM-DD)\n",
16+
"3. Add `like` comparator for string pattern search\n",
17+
"4. Add arbitrary function capability"
18+
]
19+
},
20+
{
21+
"attachments": {},
22+
"cell_type": "markdown",
23+
"id": "68e75fb9",
24+
"metadata": {},
25+
"source": [
26+
"## Creating a MyScale vectorstore\n",
27+
"MyScale has been integrated into LangChain for a while, so you can follow [this notebook](../../vectorstores/examples/myscale.ipynb) to create your own vectorstore for a self-query retriever.\n",
28+
"\n",
29+
"NOTE: All self-query retrievers require you to have `lark` installed (`pip install lark`). We use `lark` for grammar definition. Before you proceed to the next step, note that `clickhouse-connect` is also needed to interact with your MyScale backend."
30+
]
31+
},
32+
{
33+
"cell_type": "code",
34+
"execution_count": null,
35+
"id": "63a8af5b",
36+
"metadata": {
37+
"tags": []
38+
},
39+
"outputs": [],
40+
"source": [
41+
"! pip install lark clickhouse-connect"
42+
]
43+
},
44+
{
45+
"attachments": {},
46+
"cell_type": "markdown",
47+
"id": "83811610-7df3-4ede-b268-68a6a83ba9e2",
48+
"metadata": {},
49+
"source": [
50+
"In this tutorial we follow the other examples' setup and use `OpenAIEmbeddings`. Remember to get an OpenAI API key for valid access to LLMs."
51+
]
52+
},
53+
{
54+
"cell_type": "code",
55+
"execution_count": null,
56+
"id": "dd01b61b-7d32-4a55-85d6-b2d2d4f18840",
57+
"metadata": {
58+
"tags": []
59+
},
60+
"outputs": [],
61+
"source": [
62+
"import os\n",
63+
"import getpass\n",
64+
"\n",
65+
"os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')\n",
66+
"os.environ['MYSCALE_HOST'] = getpass.getpass('MyScale URL:')\n",
67+
"os.environ['MYSCALE_PORT'] = getpass.getpass('MyScale Port:')\n",
68+
"os.environ['MYSCALE_USERNAME'] = getpass.getpass('MyScale Username:')\n",
69+
"os.environ['MYSCALE_PASSWORD'] = getpass.getpass('MyScale Password:')"
70+
]
71+
},
72+
{
73+
"cell_type": "code",
74+
"execution_count": null,
75+
"id": "cb4a5787",
76+
"metadata": {
77+
"tags": []
78+
},
79+
"outputs": [],
80+
"source": [
81+
"from langchain.schema import Document\n",
82+
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
83+
"from langchain.vectorstores import MyScale\n",
84+
"\n",
85+
"embeddings = OpenAIEmbeddings()"
86+
]
87+
},
88+
{
89+
"attachments": {},
90+
"cell_type": "markdown",
91+
"id": "bf7f6fc4",
92+
"metadata": {},
93+
"source": [
94+
"## Create some sample data\n",
95+
"As you can see, the data we created differs somewhat from that of other self-query retrievers. We replaced the keyword `year` with `date`, which gives you finer control over timestamps. We also changed the type of the keyword `genre` to a list of strings, so the LLM can use the new `contain` comparator to construct filters. We also provide a `like` comparator and arbitrary function support for filters, which are introduced in the next few cells.\n",
96+
"\n",
97+
"Now let's look at the data first."
98+
]
99+
},
100+
{
101+
"cell_type": "code",
102+
"execution_count": null,
103+
"id": "bcbe04d9",
104+
"metadata": {
105+
"tags": []
106+
},
107+
"outputs": [],
108+
"source": [
109+
"docs = [\n",
110+
" Document(page_content=\"A bunch of scientists bring back dinosaurs and mayhem breaks loose\", metadata={\"date\": \"1993-07-02\", \"rating\": 7.7, \"genre\": [\"science fiction\"]}),\n",
111+
" Document(page_content=\"Leo DiCaprio gets lost in a dream within a dream within a dream within a ...\", metadata={\"date\": \"2010-12-30\", \"director\": \"Christopher Nolan\", \"rating\": 8.2}),\n",
112+
" Document(page_content=\"A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea\", metadata={\"date\": \"2006-04-23\", \"director\": \"Satoshi Kon\", \"rating\": 8.6}),\n",
113+
" Document(page_content=\"A bunch of normal-sized women are supremely wholesome and some men pine after them\", metadata={\"date\": \"2019-08-22\", \"director\": \"Greta Gerwig\", \"rating\": 8.3}),\n",
114+
" Document(page_content=\"Toys come alive and have a blast doing so\", metadata={\"date\": \"1995-02-11\", \"genre\": [\"animated\"]}),\n",
115+
" Document(page_content=\"Three men walk into the Zone, three men walk out of the Zone\", metadata={\"date\": \"1979-09-10\", \"director\": \"Andrei Tarkovsky\", \"genre\": [\"science fiction\", \"adventure\"], \"rating\": 9.9})\n",
116+
"]\n",
117+
"vectorstore = MyScale.from_documents(\n",
118+
" docs, \n",
119+
" embeddings, \n",
120+
")"
121+
]
122+
},
123+
{
124+
"attachments": {},
125+
"cell_type": "markdown",
126+
"id": "5ecaab6d",
127+
"metadata": {},
128+
"source": [
129+
"## Creating our self-querying retriever\n",
130+
"Just like other retrievers... Simple and nice."
131+
]
132+
},
133+
{
134+
"cell_type": "code",
135+
"execution_count": null,
136+
"id": "86e34dbf",
137+
"metadata": {
138+
"tags": []
139+
},
140+
"outputs": [],
141+
"source": [
142+
"from langchain.llms import OpenAI\n",
143+
"from langchain.retrievers.self_query.base import SelfQueryRetriever\n",
144+
"from langchain.chains.query_constructor.base import AttributeInfo\n",
145+
"\n",
146+
"metadata_field_info=[\n",
147+
" AttributeInfo(\n",
148+
" name=\"genre\",\n",
149+
" description=\"The genres of the movie\", \n",
150+
" type=\"list[string]\", \n",
151+
" ),\n",
152+
" # If you want to include length of a list, just define it as a new column\n",
153+
" # This will teach the LLM to use it as a column when constructing filter.\n",
154+
" AttributeInfo(\n",
155+
" name=\"length(genre)\",\n",
156+
" description=\"The number of genres of the movie\", \n",
157+
" type=\"integer\", \n",
158+
" ),\n",
159+
" # Now you can define a column as a timestamp simply by setting the type to timestamp.\n",
160+
" AttributeInfo(\n",
161+
" name=\"date\",\n",
162+
" description=\"The date the movie was released\", \n",
163+
" type=\"timestamp\", \n",
164+
" ),\n",
165+
" AttributeInfo(\n",
166+
" name=\"director\",\n",
167+
" description=\"The name of the movie director\", \n",
168+
" type=\"string\", \n",
169+
" ),\n",
170+
" AttributeInfo(\n",
171+
" name=\"rating\",\n",
172+
" description=\"A 1-10 rating for the movie\",\n",
173+
" type=\"float\"\n",
174+
" ),\n",
175+
"]\n",
176+
"document_content_description = \"Brief summary of a movie\"\n",
177+
"llm = OpenAI(temperature=0)\n",
178+
"retriever = SelfQueryRetriever.from_llm(llm, vectorstore, document_content_description, metadata_field_info, verbose=True)"
179+
]
180+
},
181+
{
182+
"attachments": {},
183+
"cell_type": "markdown",
184+
"id": "ea9df8d4",
185+
"metadata": {},
186+
"source": [
187+
"## Testing it out with self-query retriever's existing functionalities\n",
188+
"And now we can try actually using our retriever!"
189+
]
190+
},
191+
{
192+
"cell_type": "code",
193+
"execution_count": null,
194+
"id": "38a126e9",
195+
"metadata": {},
196+
"outputs": [],
197+
"source": [
198+
"# This example only specifies a relevant query\n",
199+
"retriever.get_relevant_documents(\"What are some movies about dinosaurs\")"
200+
]
201+
},
202+
{
203+
"cell_type": "code",
204+
"execution_count": null,
205+
"id": "fc3f1e6e",
206+
"metadata": {
207+
"scrolled": false
208+
},
209+
"outputs": [],
210+
"source": [
211+
"# This example only specifies a filter\n",
212+
"retriever.get_relevant_documents(\"I want to watch a movie rated higher than 8.5\")"
213+
]
214+
},
215+
{
216+
"cell_type": "code",
217+
"execution_count": null,
218+
"id": "b19d4da0",
219+
"metadata": {},
220+
"outputs": [],
221+
"source": [
222+
"# This example specifies a query and a filter\n",
223+
"retriever.get_relevant_documents(\"Has Greta Gerwig directed any movies about women\")"
224+
]
225+
},
226+
{
227+
"cell_type": "code",
228+
"execution_count": null,
229+
"id": "f900e40e",
230+
"metadata": {},
231+
"outputs": [],
232+
"source": [
233+
"# This example specifies a composite filter\n",
234+
"retriever.get_relevant_documents(\"What's a highly rated (above 8.5) science fiction film?\")"
235+
]
236+
},
237+
{
238+
"cell_type": "code",
239+
"execution_count": null,
240+
"id": "12a51522",
241+
"metadata": {},
242+
"outputs": [],
243+
"source": [
244+
"# This example specifies a query and composite filter\n",
245+
"retriever.get_relevant_documents(\"What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated\")"
246+
]
247+
},
248+
{
249+
"attachments": {},
250+
"cell_type": "markdown",
251+
"id": "86371ac8",
252+
"metadata": {},
253+
"source": [
254+
"# Wait a second... What else?\n",
255+
"\n",
256+
"Self-query retriever with MyScale can do more! Let's find out."
257+
]
258+
},
259+
{
260+
"cell_type": "code",
261+
"execution_count": null,
262+
"id": "1d043096",
263+
"metadata": {},
264+
"outputs": [],
265+
"source": [
266+
"# You can use length(genre) like any other column in a filter\n",
267+
"retriever.get_relevant_documents(\"What's a movie that has more than 1 genre?\")"
268+
]
269+
},
270+
{
271+
"cell_type": "code",
272+
"execution_count": null,
273+
"id": "d570d33c",
274+
"metadata": {},
275+
"outputs": [],
276+
"source": [
277+
"# Fine-grained datetime? You got it already.\n",
278+
"retriever.get_relevant_documents(\"What's a movie that was released after Feb 1995?\")"
279+
]
280+
},
281+
{
282+
"cell_type": "code",
283+
"execution_count": null,
284+
"id": "fbe0b21b",
285+
"metadata": {},
286+
"outputs": [],
287+
"source": [
288+
"# Don't know what your exact filter should be? Use string pattern match!\n",
289+
"retriever.get_relevant_documents(\"What's a movie whose director's name is like Andrei?\")"
290+
]
291+
},
292+
{
293+
"cell_type": "code",
294+
"execution_count": null,
295+
"id": "6a514104",
296+
"metadata": {},
297+
"outputs": [],
298+
"source": [
299+
"# contain works for lists, so you can match list elements with the contain comparator!\n",
300+
"retriever.get_relevant_documents(\"What's a movie that has the genres science fiction and adventure?\")"
301+
]
302+
},
303+
{
304+
"attachments": {},
305+
"cell_type": "markdown",
306+
"id": "39bd1de1-b9fe-4a98-89da-58d8a7a6ae51",
307+
"metadata": {},
308+
"source": [
309+
"## Filter k\n",
310+
"\n",
311+
"We can also use the self query retriever to specify `k`: the number of documents to fetch.\n",
312+
"\n",
313+
"We can do this by passing `enable_limit=True` to the constructor."
314+
]
315+
},
316+
{
317+
"cell_type": "code",
318+
"execution_count": null,
319+
"id": "bff36b88-b506-4877-9c63-e5a1a8d78e64",
320+
"metadata": {
321+
"tags": []
322+
},
323+
"outputs": [],
324+
"source": [
325+
"retriever = SelfQueryRetriever.from_llm(\n",
326+
" llm, \n",
327+
" vectorstore, \n",
328+
" document_content_description, \n",
329+
" metadata_field_info, \n",
330+
" enable_limit=True,\n",
331+
" verbose=True\n",
332+
")"
333+
]
334+
},
335+
{
336+
"cell_type": "code",
337+
"execution_count": null,
338+
"id": "2758d229-4f97-499c-819f-888acaf8ee10",
339+
"metadata": {
340+
"tags": []
341+
},
342+
"outputs": [],
343+
"source": [
344+
"# This example specifies a relevant query and a limit on the number of documents\n",
345+
"retriever.get_relevant_documents(\"what are two movies about dinosaurs\")"
346+
]
347+
}
348+
],
349+
"metadata": {
350+
"kernelspec": {
351+
"display_name": "Python 3 (ipykernel)",
352+
"language": "python",
353+
"name": "python3"
354+
},
355+
"language_info": {
356+
"codemirror_mode": {
357+
"name": "ipython",
358+
"version": 3
359+
},
360+
"file_extension": ".py",
361+
"mimetype": "text/x-python",
362+
"name": "python",
363+
"nbconvert_exporter": "python",
364+
"pygments_lexer": "ipython3",
365+
"version": "3.8.8"
366+
}
367+
},
368+
"nbformat": 4,
369+
"nbformat_minor": 5
370+
}
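
The four additions listed at the top of the notebook (`contain`, `timestamp`, `like`, and function columns such as `length(genre)`) can be hard to picture without a running MyScale instance. Below is a minimal, pure-Python sketch of what each filter is meant to express, run over a trimmed copy of the sample metadata. It is an illustration only: the helper names (`contain`, `like`, `after`, `list_length`) are hypothetical and are not part of LangChain or MyScale, and the real retriever translates filters to SQL that MyScale evaluates, so details such as case sensitivity may differ.

```python
import re
from datetime import datetime

# Trimmed, in-memory copy of the notebook's sample metadata (illustration only).
movies = [
    {"date": "1993-07-02", "rating": 7.7, "genre": ["science fiction"]},
    {"date": "2010-12-30", "director": "Christopher Nolan", "rating": 8.2},
    {"date": "1995-02-11", "genre": ["animated"]},
    {"date": "1979-09-10", "director": "Andrei Tarkovsky", "rating": 9.9,
     "genre": ["science fiction", "adventure"]},
]


def contain(meta, field, value):
    """Hypothetical helper: True when the list-valued field includes `value`."""
    return value in meta.get(field, [])


def like(meta, field, pattern):
    """Hypothetical helper: SQL-style LIKE, where % is any run and _ is one character."""
    regex = "".join(
        ".*" if c == "%" else "." if c == "_" else re.escape(c) for c in pattern
    )
    return re.fullmatch(regex, meta.get(field, ""), flags=re.DOTALL) is not None


def after(meta, field, iso_date):
    """Hypothetical helper: compare full dates rather than bare years."""
    return datetime.fromisoformat(meta[field]) > datetime.fromisoformat(iso_date)


def list_length(meta, field):
    """Hypothetical helper: stands in for the `length(genre)` function column."""
    return len(meta.get(field, []))


# "genres science fiction and adventure" -> two contain checks combined with AND
both_genres = [
    m for m in movies
    if contain(m, "genre", "science fiction") and contain(m, "genre", "adventure")
]

# "director's name is like Andrei" -> a pattern match on the director field
andrei = [m for m in movies if like(m, "director", "%Andrei%")]

# "released after Feb 1995" -> a timestamp comparison
recent = [m for m in movies if after(m, "date", "1995-02-01")]

# "more than 1 genre" -> a comparison on a function of the column
multi_genre = [m for m in movies if list_length(m, "genre") > 1]
```

In the notebook itself none of these helpers are needed: the LLM builds the filter from the natural-language query and the `AttributeInfo` descriptions, and the vector store applies it.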
