|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "attachments": {}, |
| 5 | + "cell_type": "markdown", |
| 6 | + "id": "13afcae7", |
| 7 | + "metadata": {}, |
| 8 | + "source": [ |
| 9 | + "# Self-querying with MyScale\n", |
| 10 | + "\n", |
| 11 | + ">[MyScale](https://docs.myscale.com/en/) is an integrated vector database. You can access your database with SQL and also from here, in LangChain. MyScale can make use of [various data types and functions for filters](https://blog.myscale.com/2023/06/06/why-integrated-database-solution-can-boost-your-llm-apps/#filter-on-anything-without-constraints). It can help your LLM app whether you are scaling up your data or expanding your system to broader applications.\n", |
| 12 | + "\n", |
| 13 | + "In this notebook we'll demo the `SelfQueryRetriever` wrapped around a MyScale vector store, with some extra pieces we contributed to LangChain. In short, there are four additions:\n", |
| 14 | + "1. A `contain` comparator that matches a list field when at least one of its elements matches\n", |
| 15 | + "2. A `timestamp` data type for datetime matching (ISO format, or YYYY-MM-DD)\n", |
| 16 | + "3. A `like` comparator for string pattern search\n", |
| 17 | + "4. Support for arbitrary functions in filters" |
| 18 | + ] |
| 19 | + }, |
| 20 | + { |
| 21 | + "attachments": {}, |
| 22 | + "cell_type": "markdown", |
| 23 | + "id": "68e75fb9", |
| 24 | + "metadata": {}, |
| 25 | + "source": [ |
| 26 | + "## Creating a MyScale vectorstore\n", |
| 27 | + "MyScale has been integrated into LangChain for a while, so you can follow [this notebook](../../vectorstores/examples/myscale.ipynb) to create your own vectorstore for a self-query retriever.\n", |
| 28 | + "\n", |
| 29 | + "NOTE: All self-query retrievers require you to have `lark` installed (`pip install lark`). We use `lark` for grammar definition. Before you proceed to the next step, note that `clickhouse-connect` is also needed to interact with your MyScale backend." |
| 30 | + ] |
| 31 | + }, |
| 32 | + { |
| 33 | + "cell_type": "code", |
| 34 | + "execution_count": null, |
| 35 | + "id": "63a8af5b", |
| 36 | + "metadata": { |
| 37 | + "tags": [] |
| 38 | + }, |
| 39 | + "outputs": [], |
| 40 | + "source": [ |
| 41 | + "! pip install lark clickhouse-connect" |
| 42 | + ] |
| 43 | + }, |
| 44 | + { |
| 45 | + "attachments": {}, |
| 46 | + "cell_type": "markdown", |
| 47 | + "id": "83811610-7df3-4ede-b268-68a6a83ba9e2", |
| 48 | + "metadata": {}, |
| 49 | + "source": [ |
| 50 | + "In this tutorial we follow the other examples' setup and use `OpenAIEmbeddings`. Remember to get an OpenAI API key for valid access to LLMs." |
| 51 | + ] |
| 52 | + }, |
| 53 | + { |
| 54 | + "cell_type": "code", |
| 55 | + "execution_count": null, |
| 56 | + "id": "dd01b61b-7d32-4a55-85d6-b2d2d4f18840", |
| 57 | + "metadata": { |
| 58 | + "tags": [] |
| 59 | + }, |
| 60 | + "outputs": [], |
| 61 | + "source": [ |
| 62 | + "import os\n", |
| 63 | + "import getpass\n", |
| 64 | + "\n", |
| 65 | + "os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')\n", |
| 66 | + "os.environ['MYSCALE_HOST'] = getpass.getpass('MyScale URL:')\n", |
| 67 | + "os.environ['MYSCALE_PORT'] = getpass.getpass('MyScale Port:')\n", |
| 68 | + "os.environ['MYSCALE_USERNAME'] = getpass.getpass('MyScale Username:')\n", |
| 69 | + "os.environ['MYSCALE_PASSWORD'] = getpass.getpass('MyScale Password:')" |
| 70 | + ] |
| 71 | + }, |
| 72 | + { |
| 73 | + "cell_type": "code", |
| 74 | + "execution_count": null, |
| 75 | + "id": "cb4a5787", |
| 76 | + "metadata": { |
| 77 | + "tags": [] |
| 78 | + }, |
| 79 | + "outputs": [], |
| 80 | + "source": [ |
| 81 | + "from langchain.schema import Document\n", |
| 82 | + "from langchain.embeddings.openai import OpenAIEmbeddings\n", |
| 83 | + "from langchain.vectorstores import MyScale\n", |
| 84 | + "\n", |
| 85 | + "embeddings = OpenAIEmbeddings()" |
| 86 | + ] |
| 87 | + }, |
| 88 | + { |
| 89 | + "attachments": {}, |
| 90 | + "cell_type": "markdown", |
| 91 | + "id": "bf7f6fc4", |
| 92 | + "metadata": {}, |
| 93 | + "source": [ |
| 94 | + "## Create some sample data\n", |
| 95 | + "As you will see, the data we created differs somewhat from the data used with other self-query retrievers. We replaced the keyword `year` with `date`, which gives you finer control over timestamps. We also changed the type of the keyword `genre` to a list of strings, so the LLM can use the new `contain` comparator to construct filters. We also provide the `like` comparator and arbitrary function support for filters, which are introduced in the next few cells.\n", |
| 96 | + "\n", |
| 97 | + "Now let's look at the data first." |
| 98 | + ] |
| 99 | + }, |
| 100 | + { |
| 101 | + "cell_type": "code", |
| 102 | + "execution_count": null, |
| 103 | + "id": "bcbe04d9", |
| 104 | + "metadata": { |
| 105 | + "tags": [] |
| 106 | + }, |
| 107 | + "outputs": [], |
| 108 | + "source": [ |
| 109 | + "docs = [\n", |
| 110 | + " Document(page_content=\"A bunch of scientists bring back dinosaurs and mayhem breaks loose\", metadata={\"date\": \"1993-07-02\", \"rating\": 7.7, \"genre\": [\"science fiction\"]}),\n", |
| 111 | + " Document(page_content=\"Leo DiCaprio gets lost in a dream within a dream within a dream within a ...\", metadata={\"date\": \"2010-12-30\", \"director\": \"Christopher Nolan\", \"rating\": 8.2}),\n", |
| 112 | + " Document(page_content=\"A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea\", metadata={\"date\": \"2006-04-23\", \"director\": \"Satoshi Kon\", \"rating\": 8.6}),\n", |
| 113 | + " Document(page_content=\"A bunch of normal-sized women are supremely wholesome and some men pine after them\", metadata={\"date\": \"2019-08-22\", \"director\": \"Greta Gerwig\", \"rating\": 8.3}),\n", |
| 114 | + " Document(page_content=\"Toys come alive and have a blast doing so\", metadata={\"date\": \"1995-02-11\", \"genre\": [\"animated\"]}),\n", |
| 115 | + " Document(page_content=\"Three men walk into the Zone, three men walk out of the Zone\", metadata={\"date\": \"1979-09-10\", \"director\": \"Andrei Tarkovsky\", \"genre\": [\"science fiction\", \"adventure\"], \"rating\": 9.9})\n", |
| 116 | + "]\n", |
| 117 | + "vectorstore = MyScale.from_documents(\n", |
| 118 | + " docs, \n", |
| 119 | + " embeddings, \n", |
| 120 | + ")" |
| 121 | + ] |
| 122 | + }, |
| 123 | + { |
| 124 | + "attachments": {}, |
| 125 | + "cell_type": "markdown", |
| 126 | + "id": "5ecaab6d", |
| 127 | + "metadata": {}, |
| 128 | + "source": [ |
| 129 | + "## Creating our self-querying retriever\n", |
| 130 | + "The setup is the same as for other self-query retrievers: describe the metadata fields, then build the retriever from an LLM and the vectorstore." |
| 131 | + ] |
| 132 | + }, |
| 133 | + { |
| 134 | + "cell_type": "code", |
| 135 | + "execution_count": null, |
| 136 | + "id": "86e34dbf", |
| 137 | + "metadata": { |
| 138 | + "tags": [] |
| 139 | + }, |
| 140 | + "outputs": [], |
| 141 | + "source": [ |
| 142 | + "from langchain.llms import OpenAI\n", |
| 143 | + "from langchain.retrievers.self_query.base import SelfQueryRetriever\n", |
| 144 | + "from langchain.chains.query_constructor.base import AttributeInfo\n", |
| 145 | + "\n", |
| 146 | + "metadata_field_info=[\n", |
| 147 | + " AttributeInfo(\n", |
| 148 | + " name=\"genre\",\n", |
| 149 | + " description=\"The genres of the movie\", \n", |
| 150 | + " type=\"list[string]\", \n", |
| 151 | + " ),\n", |
| 152 | + " # If you want to include the length of a list, just define it as a new column.\n", |
| 153 | + " # This teaches the LLM to use it as a column when constructing a filter.\n", |
| 154 | + " AttributeInfo(\n", |
| 155 | + " name=\"length(genre)\",\n", |
| 156 | + " description=\"The number of genres of the movie\", \n", |
| 157 | + " type=\"integer\", \n", |
| 158 | + " ),\n", |
| 159 | + " # Now you can define a column as a timestamp by simply setting the type to timestamp.\n", |
| 160 | + " AttributeInfo(\n", |
| 161 | + " name=\"date\",\n", |
| 162 | + " description=\"The date the movie was released\", \n", |
| 163 | + " type=\"timestamp\", \n", |
| 164 | + " ),\n", |
| 165 | + " AttributeInfo(\n", |
| 166 | + " name=\"director\",\n", |
| 167 | + " description=\"The name of the movie director\", \n", |
| 168 | + " type=\"string\", \n", |
| 169 | + " ),\n", |
| 170 | + " AttributeInfo(\n", |
| 171 | + " name=\"rating\",\n", |
| 172 | + " description=\"A 1-10 rating for the movie\",\n", |
| 173 | + " type=\"float\"\n", |
| 174 | + " ),\n", |
| 175 | + "]\n", |
| 176 | + "document_content_description = \"Brief summary of a movie\"\n", |
| 177 | + "llm = OpenAI(temperature=0)\n", |
| 178 | + "retriever = SelfQueryRetriever.from_llm(llm, vectorstore, document_content_description, metadata_field_info, verbose=True)" |
| 179 | + ] |
| 180 | + }, |
| 181 | + { |
| 182 | + "attachments": {}, |
| 183 | + "cell_type": "markdown", |
| 184 | + "id": "ea9df8d4", |
| 185 | + "metadata": {}, |
| 186 | + "source": [ |
| 187 | + "## Testing it out with the self-query retriever's existing functionality\n", |
| 188 | + "And now we can try actually using our retriever!" |
| 189 | + ] |
| 190 | + }, |
| 191 | + { |
| 192 | + "cell_type": "code", |
| 193 | + "execution_count": null, |
| 194 | + "id": "38a126e9", |
| 195 | + "metadata": {}, |
| 196 | + "outputs": [], |
| 197 | + "source": [ |
| 198 | + "# This example only specifies a relevant query\n", |
| 199 | + "retriever.get_relevant_documents(\"What are some movies about dinosaurs\")" |
| 200 | + ] |
| 201 | + }, |
| 202 | + { |
| 203 | + "cell_type": "code", |
| 204 | + "execution_count": null, |
| 205 | + "id": "fc3f1e6e", |
| 206 | + "metadata": { |
| 207 | + "scrolled": false |
| 208 | + }, |
| 209 | + "outputs": [], |
| 210 | + "source": [ |
| 211 | + "# This example only specifies a filter\n", |
| 212 | + "retriever.get_relevant_documents(\"I want to watch a movie rated higher than 8.5\")" |
| 213 | + ] |
| 214 | + }, |
| 215 | + { |
| 216 | + "cell_type": "code", |
| 217 | + "execution_count": null, |
| 218 | + "id": "b19d4da0", |
| 219 | + "metadata": {}, |
| 220 | + "outputs": [], |
| 221 | + "source": [ |
| 222 | + "# This example specifies a query and a filter\n", |
| 223 | + "retriever.get_relevant_documents(\"Has Greta Gerwig directed any movies about women\")" |
| 224 | + ] |
| 225 | + }, |
| 226 | + { |
| 227 | + "cell_type": "code", |
| 228 | + "execution_count": null, |
| 229 | + "id": "f900e40e", |
| 230 | + "metadata": {}, |
| 231 | + "outputs": [], |
| 232 | + "source": [ |
| 233 | + "# This example specifies a composite filter\n", |
| 234 | + "retriever.get_relevant_documents(\"What's a highly rated (above 8.5) science fiction film?\")" |
| 235 | + ] |
| 236 | + }, |
| 237 | + { |
| 238 | + "cell_type": "code", |
| 239 | + "execution_count": null, |
| 240 | + "id": "12a51522", |
| 241 | + "metadata": {}, |
| 242 | + "outputs": [], |
| 243 | + "source": [ |
| 244 | + "# This example specifies a query and composite filter\n", |
| 245 | + "retriever.get_relevant_documents(\"What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated\")" |
| 246 | + ] |
| 247 | + }, |
| 248 | + { |
| 249 | + "attachments": {}, |
| 250 | + "cell_type": "markdown", |
| 251 | + "id": "86371ac8", |
| 252 | + "metadata": {}, |
| 253 | + "source": [ |
| 254 | + "## Wait a second... What else?\n", |
| 255 | + "\n", |
| 256 | + "The self-query retriever with MyScale can do more! Let's find out." |
| 257 | + ] |
| 258 | + }, |
| 259 | + { |
| 260 | + "cell_type": "code", |
| 261 | + "execution_count": null, |
| 262 | + "id": "1d043096", |
| 263 | + "metadata": {}, |
| 264 | + "outputs": [], |
| 265 | + "source": [ |
| 266 | + "# You can use length(genre) to filter on the number of genres\n", |
| 267 | + "retriever.get_relevant_documents(\"What's a movie that has more than 1 genre?\")" |
| 268 | + ] |
| 269 | + }, |
| 270 | + { |
| 271 | + "cell_type": "code", |
| 272 | + "execution_count": null, |
| 273 | + "id": "d570d33c", |
| 274 | + "metadata": {}, |
| 275 | + "outputs": [], |
| 276 | + "source": [ |
| 277 | + "# Fine-grained datetime? You got it already.\n", |
| 278 | + "retriever.get_relevant_documents(\"What's a movie that was released after Feb 1995?\")" |
| 279 | + ] |
| 280 | + }, |
| 281 | + { |
| 282 | + "cell_type": "code", |
| 283 | + "execution_count": null, |
| 284 | + "id": "fbe0b21b", |
| 285 | + "metadata": {}, |
| 286 | + "outputs": [], |
| 287 | + "source": [ |
| 288 | + "# Don't know what your exact filter should be? Use string pattern match!\n", |
| 289 | + "retriever.get_relevant_documents(\"What's a movie whose director's name is like Andrei?\")" |
| 290 | + ] |
| 291 | + }, |
| 292 | + { |
| 293 | + "cell_type": "code", |
| 294 | + "execution_count": null, |
| 295 | + "id": "6a514104", |
| 296 | + "metadata": {}, |
| 297 | + "outputs": [], |
| 298 | + "source": [ |
| 299 | + "# contain works for lists, so you can match list elements with the contain comparator!\n", |
| 300 | + "retriever.get_relevant_documents(\"What's a movie that has genres science fiction and adventure?\")" |
| 301 | + ] |
| 302 | + }, |
| 303 | + { |
| 304 | + "attachments": {}, |
| 305 | + "cell_type": "markdown", |
| 306 | + "id": "39bd1de1-b9fe-4a98-89da-58d8a7a6ae51", |
| 307 | + "metadata": {}, |
| 308 | + "source": [ |
| 309 | + "## Filter k\n", |
| 310 | + "\n", |
| 311 | + "We can also use the self-query retriever to specify `k`: the number of documents to fetch.\n", |
| 312 | + "\n", |
| 313 | + "We can do this by passing `enable_limit=True` to the constructor." |
| 314 | + ] |
| 315 | + }, |
| 316 | + { |
| 317 | + "cell_type": "code", |
| 318 | + "execution_count": null, |
| 319 | + "id": "bff36b88-b506-4877-9c63-e5a1a8d78e64", |
| 320 | + "metadata": { |
| 321 | + "tags": [] |
| 322 | + }, |
| 323 | + "outputs": [], |
| 324 | + "source": [ |
| 325 | + "retriever = SelfQueryRetriever.from_llm(\n", |
| 326 | + " llm, \n", |
| 327 | + " vectorstore, \n", |
| 328 | + " document_content_description, \n", |
| 329 | + " metadata_field_info, \n", |
| 330 | + " enable_limit=True,\n", |
| 331 | + " verbose=True\n", |
| 332 | + ")" |
| 333 | + ] |
| 334 | + }, |
| 335 | + { |
| 336 | + "cell_type": "code", |
| 337 | + "execution_count": null, |
| 338 | + "id": "2758d229-4f97-499c-819f-888acaf8ee10", |
| 339 | + "metadata": { |
| 340 | + "tags": [] |
| 341 | + }, |
| 342 | + "outputs": [], |
| 343 | + "source": [ |
| 344 | + "# This example specifies a relevant query and a limit of two documents\n", |
| 345 | + "retriever.get_relevant_documents(\"what are two movies about dinosaurs\")" |
| 346 | + ] |
| 347 | + } |
| 348 | + ], |
| 349 | + "metadata": { |
| 350 | + "kernelspec": { |
| 351 | + "display_name": "Python 3 (ipykernel)", |
| 352 | + "language": "python", |
| 353 | + "name": "python3" |
| 354 | + }, |
| 355 | + "language_info": { |
| 356 | + "codemirror_mode": { |
| 357 | + "name": "ipython", |
| 358 | + "version": 3 |
| 359 | + }, |
| 360 | + "file_extension": ".py", |
| 361 | + "mimetype": "text/x-python", |
| 362 | + "name": "python", |
| 363 | + "nbconvert_exporter": "python", |
| 364 | + "pygments_lexer": "ipython3", |
| 365 | + "version": "3.8.8" |
| 366 | + } |
| 367 | + }, |
| 368 | + "nbformat": 4, |
| 369 | + "nbformat_minor": 5 |
| 370 | +} |
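
The notebook's four additions (`contain`, `timestamp`, `like`, and function columns such as `length(genre)`) can be hard to picture without a running MyScale instance. The sketch below approximates their intended semantics in plain Python over metadata copied from three of the sample documents. The helper names `contain`, `like`, and `released_after` are hypothetical and exist only for this illustration. The sketch assumes that `like` behaves as a case-insensitive substring match and that `contain` tests for a single list element. It is not the SQL that LangChain generates for MyScale.

```python
from datetime import date

# Metadata copied from three of the notebook's sample documents.
movies = [
    {"date": "1993-07-02", "rating": 7.7, "genre": ["science fiction"]},
    {"date": "1995-02-11", "genre": ["animated"]},
    {"date": "1979-09-10", "rating": 9.9, "director": "Andrei Tarkovsky",
     "genre": ["science fiction", "adventure"]},
]


def contain(meta, field, value):
    """Hypothetical helper: True if the list stored under `field` includes `value`."""
    return value in meta.get(field, [])


def like(meta, field, pattern):
    """Hypothetical helper: case-insensitive substring match (an assumed reading of `like`)."""
    return pattern.lower() in meta.get(field, "").lower()


def released_after(meta, iso_day):
    """Hypothetical helper: compare YYYY-MM-DD strings as real dates, not as text."""
    return date.fromisoformat(meta["date"]) > date.fromisoformat(iso_day)


# "genres science fiction and adventure": two contain checks joined by AND
sci_fi_adventure = [
    m for m in movies
    if contain(m, "genre", "science fiction") and contain(m, "genre", "adventure")
]

# "director's name is like Andrei"
by_andrei = [m for m in movies if like(m, "director", "Andrei")]

# "released after Feb 1995", read here as after 1995-02-01
after_feb_1995 = [m for m in movies if released_after(m, "1995-02-01")]

# "more than 1 genre": the plain-Python analogue of the length(genre) column
multi_genre = [m for m in movies if len(m.get("genre", [])) > 1]

print(len(sci_fi_adventure), len(by_andrei), len(after_feb_1995), len(multi_genre))
```

Documents that lack a field simply fail the corresponding check here. How the real retriever treats missing metadata depends on the table schema, which this sketch does not model.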