
Embeddings and RAG with Azure OpenAI API

(Article written by guest author Dr. Gerd Kortemeyer)

This blog post describes my personal experiences with the Azure OpenAI API and embeddings, which can be used to efficiently implement Retrieval Augmented Generation (RAG). I am very sure that several things could be done more elegantly; feel free to drop me an email with suggestions for improvements. In any case: it works.

What is it?

Text embeddings are a way of representing documents in a “continuous, high-dimensional space.” Sounds like Star Trek, but the key idea behind embeddings is to capture the semantic meaning of the document in a way that reflects the relationships and similarities between different words and sentences (behind the scenes, between smaller units called tokens). This makes it possible to query a document for text passages that semantically match a query, rather than just looking for keywords. Retrieval Augmented Generation (RAG) is used to inject relevant text passages into a prompt for an LLM. When the LLM answers a question in a prompt, it can then draw on those text passages to give a more specific and hopefully more correct answer.
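As a toy illustration of “semantically match” (a minimal sketch that uses the make_llm helper set up later in this post; the sentences are made up), a passage and a question about it end up closer together in embedding space than an unrelated passage, which can be measured with cosine similarity:

import numpy as np
from openaicalls import make_llm  # helper defined later in this post

_, emb = make_llm()

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

v_passage = emb.embed_query("Students may use calculators during the exam.")
v_query   = emb.embed_query("Which tools are allowed in the test?")
v_other   = emb.embed_query("The cafeteria serves pasta on Fridays.")

print(cosine_similarity(v_passage, v_query))  # semantically related: comparatively high
print(cosine_similarity(v_passage, v_other))  # unrelated: lower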

Typically, the workflow is to chunk up a document into passages (sections, subsections, paragraphs, …), generate embeddings for these passages, and store those in a vector database; that step has to be done once. When an LLM is queried, an embedding of the original prompt is generated, and semantically similar passages are retrieved from the vector database (the “retrieval” in “RAG”). These are then added to the query (the “augmentation” in “RAG”), and a response is generated based on the original prompt with additional background (the “generation” in “RAG”).

Ingredients

The Azure OpenAI API works well, but slightly differently from the API provided directly by OpenAI. Much of the documentation found on the web applies to OpenAI, and it took me a while to figure out how to do the same things with Azure. The best starting point is the article on the CSC blog on how to set up deployments.

After following the steps in the article “Getting started with Azure Open AI”, you should already have a deployment for an LLM; I named my deployment of gpt-4-32k “EthelTest”, which is for Project Ethel.

Getting Connected

The standard way of setting the connection parameters for the deployments is via environment variables, but I personally find configuration files less cumbersome. Here is what I used:

server = https://XXXXXXXXXXX.api.cognitive.microsoft.com/
key = XXXXXXXXXXX
deployment = EthelTest
embed = EthelTestEmb
api = 2023-03-15-preview
type = azure

The XXXX of course needs to be replaced by your access credentials, which I am selfishly withholding from you. As a server, I am using switzerlandnorth.api…, since it seems prudent based on data privacy considerations.

I then wrote a function to establish the connections, which returns the handles for both the LLM and the embedding (this grew over time, as I realized I needed more and more parameters; there are certainly better ways than a growing elif chain to interpret the file … oh, well):

import os
from langchain.llms import AzureOpenAI
from langchain.chat_models import AzureChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
import logging

def make_llm(filename="../configs/azure_ai_config.txt"):
    deployment = None
    embed = None
    try:
        with open(filename, 'r') as file:
            for line in file:
                line = line.strip()
                # Skip blank lines and anything that is not a key = value pair
                if '=' not in line:
                    continue
                value = line.split('=', 1)[1].strip()

                if line.startswith("server ="):
                    os.environ["OPENAI_API_BASE"] = value

                elif line.startswith("key ="):
                    os.environ["OPENAI_API_KEY"] = value

                elif line.startswith("type ="):
                    os.environ["OPENAI_API_TYPE"] = value

                elif line.startswith("api ="):
                    os.environ["OPENAI_API_VERSION"] = value

                elif line.startswith("deployment ="):
                    deployment = value

                elif line.startswith("embed ="):
                    embed = value
                # […] further configuration keys handled the same way
        if deployment is None or embed is None:
            logging.error("Could not get configuration variables")
            exit()
    except OSError:
        logging.error("Could not open configuration file")
        exit()
    return AzureChatOpenAI(deployment_name=deployment), OpenAIEmbeddings(deployment=embed, chunk_size=1)

Notice the last line, which shows some of those nasty little differences between the interfaces, like deployment_name versus deployment. By the time this gets published, this may have been remedied, as things seem to be in constant flux. Note that at least in my experience, using ChatOpenAI does not work with Azure, and one must use AzureChatOpenAI, while there is no Azure-flavored equivalent that I could find for OpenAIEmbeddings. I wrapped this into a package and import it every time I want to talk to my deployments.
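Once wrapped into a package (mine is called openaicalls, as the later snippets show), using it is a one-liner per script; a minimal smoke test, with made-up sample strings:

from openaicalls import make_llm

model, emb = make_llm()

# Quick check that both handles respond
print(model.invoke("Say hello to Ethel").content)
print(len(emb.embed_query("Hello, Ethel!")))  # dimensionality of the embedding vector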

Update 2023-02-23 – Users who read this article have reported that the code below also works now:

from langchain_openai import AzureOpenAIEmbeddings
[…]
embeddings = AzureOpenAIEmbeddings(
    azure_deployment="your-embeddings-deployment-name",
    openai_api_version="2023-05-15",
)
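As a quick sanity check that the embeddings deployment responds (a minimal sketch; the sample string is just a placeholder), you can embed a single string and look at the vector length:

vector = embeddings.embed_query("Hello, Ethel!")
print(len(vector))  # dimensionality of the embedding, e.g. 1536 for text-embedding-ada-002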



Dissecting and Digesting Documents

The first step toward RAG is splitting a document into chunks, generating the embeddings, and storing them in the database. I used ChromaDB as a database, which works fine, but might not be the best choice for production. The following is an early version of my digester, which is based on the examples that can be found on the web:

from openaicalls import make_llm
from langchain.schema import StrOutputParser
from langchain.prompts import ChatPromptTemplate
from langchain.llms import AzureOpenAI

from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain_core.runnables import RunnableParallel, RunnablePassthrough, RunnableLambda

from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter


model, emb = make_llm()

# Load the LaTeX source of the document
loader = TextLoader("./docs/ai_survey.tex")
docs = loader.load()

# Split into overlapping chunks, keeping track of where each chunk starts
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

# Generate embeddings for all chunks and persist them in a local ChromaDB
Chroma.from_documents(documents=all_splits, embedding=emb, persist_directory="./chroma_db")

This uses the out-of-the-box recursive text splitter from LangChain, which generates quite reasonable chunks, but there are other splitters. Which splitter, chunk size, and overlap work best depends on the document and requires some experimentation.
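When experimenting with splitters and chunk sizes, it helps to inspect a few chunks before writing anything to the database; a small sketch along those lines:

# Look at the chunking result before committing it to the vector store
print(f"{len(all_splits)} chunks")
for split in all_splits[:3]:
    print(split.metadata.get("start_index"), len(split.page_content))
    print(split.page_content[:200].replace("\n", " "), "…")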

When using Azure, an important consideration is the token rate, i.e., the number of tokens per minute. Each of your deployments has a rate limit, which you can crank up a little in the interface, but not high enough for rapid-fire bundles of hundreds of chunks. Frustratingly, for a long document, the above code runs for a while (wasting CPU cycles and costing money) and then crashes, complaining about the token rate and asking the user to wait a second. After some experimentation, it turns out that it works to do this in smaller batches with artificial delays between calls:

import time  # needed for the delays between batches

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

# Set the batch size and delay time
batch_size = 50  # Adjust this based on your rate limit
delay_time = 5   # Adjust the delay time in seconds

# Function to process documents in batches
def process_in_batches(documents, batch_size, delay_time):
    for i in range(0, len(documents), batch_size):
        batch = documents[i:i + batch_size]
        ids = [str(j) for j in range(i, i + batch_size)]
        try:
            Chroma.from_documents(batch, emb, ids=ids, persist_directory="./chroma_db")
        except Exception as error:
            print(f"Handling request: {error}")
            exit()
        print(i)
        time.sleep(delay_time)  # Pause between batches

# Call the batch processing function
process_in_batches(all_splits, batch_size, delay_time)

 

The above code is dirty; by dumb luck, the total number of chunks generated by the splitter was divisible by my batch size. More universal code needs to deal with the leftover chunks, but you get the idea.
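For completeness, a slightly more universal variant would derive the ids from the actual batch length, so the last (shorter) batch is handled as well; this is just a sketch along the lines of the code above, not a production-ready digester:

import time

def process_in_batches_robust(documents, batch_size, delay_time):
    for i in range(0, len(documents), batch_size):
        batch = documents[i:i + batch_size]
        # Derive the ids from the actual batch length, so a shorter final batch is fine
        ids = [str(j) for j in range(i, i + len(batch))]
        try:
            Chroma.from_documents(batch, emb, ids=ids, persist_directory="./chroma_db")
        except Exception as error:
            print(f"Handling request: {error}")
            return
        print(f"Stored chunks {i} to {i + len(batch) - 1}")
        time.sleep(delay_time)  # Pause to stay below the tokens-per-minute limit

process_in_batches_robust(all_splits, batch_size, delay_time)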

Augmentation

The augmentation step is the easy part, as it can be done pretty much according to the tutorial. The crucial steps for the ChromaDB and Azure connection are:

model, emb = make_llm()
#
# Retrieve the vectorstore with the documents
#
os.environ["ANONYMIZED_TELEMETRY"] = "False"
try:
    vectorstore = Chroma(persist_directory="../database/chroma_db", embedding_function=emb)
    retriever = vectorstore.as_retriever()
except Exception as error:
    logging.error(f"Opening vectorstore: {error}")
    exit()

The handle on the embedding needs to be passed to ChromaDB as embedding_function. It is important that the embedding function used here is the same as the one used in the digester, so do not simply upgrade your deployment to a newer version without redoing the digester step.
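From there, the retriever plugs into the usual retrieval chain from the tutorial mentioned above; roughly like this (the prompt wording and the example question are made up):

from langchain.prompts import ChatPromptTemplate
from langchain.schema import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Hypothetical prompt; adjust the wording to your use case
prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the following context:\n\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    # Concatenate the retrieved passages into a single context string
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

print(rag_chain.invoke("What does the survey say about using AI in exams?"))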

Anyway, that’s it. Enjoy!

Gerd Kortemeyer, Ph.D.

Rectorate and AI Center, ETH Zurich

Associate Professor Emeritus, Michigan State University
