Generating data for fine-tuning an LLM using LlamaIndex

Introduction

Many modern applications are built using large language models (LLMs). The main problem with LLMs is that they do not understand our data out of the box. We can use RAG to pass our data as context to the LLM, but there are always limitations on context size and on understanding domain-specific language. The solution to these problems is fine-tuning. Most developers want to fine-tune either an open-source model or an OpenAI model. Availability of data in the required format is a difficult problem. We can use open-source data from Hugging Face, but its applications are limited. In this blog post, we are going to discuss how we can generate training data from our own domain-specific documents.

Use Case

We are going to build a chatbot named Dolly. We will fine-tune Dolly using national health policy documents. In your case the specialisation will be different, but you can follow the same steps: you can use any PDF documents you want to fine-tune the model on.

OpenAI fine-tuning

In this example, we are going to fine-tune the OpenAI chat model GPT-3.5 Turbo. The training dataset should follow the specific structure given below.

{"messages": [{"role": "system", "content": "Dolly is a chatbot who can answer questions about Indian national policy"}, {"role": "user", "content": "1. What are the major changes in the health sector since the last health policy in 2002?"}, {"role": "assistant", "content": "The major changes in the health sector since the last health policy in 2002 include: \n1. Changing health priorities, with a decline in maternal and child mortality but a growing burden of non-communicable diseases and some infectious diseases.\n2. Emergence of a robust healthcare industry that is estimated to be growing at a double-digit rate.\n3. Increasing incidences of catastrophic expenditure due to healthcare costs, which are a major contributor to poverty.\n4. Rising economic growth, which has enabled enhanced fiscal capacity."}]}

Each training example should have a system role that defines what kind of chatbot we are going to build, a user role containing the question we ask, and an assistant role containing the answer.
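
Before uploading the generated file for fine-tuning, it is worth sanity-checking every line. Here is a minimal sketch, not part of the original pipeline; the training_data.jsonl file name is just a placeholder for wherever you stored the output:

import json

# Verify each line parses as JSON and contains the three expected roles
with open("training_data.jsonl") as f:  # placeholder file name
    for i, line in enumerate(f, start=1):
        example = json.loads(line)
        roles = [m["role"] for m in example["messages"]]
        assert roles == ["system", "user", "assistant"], f"unexpected roles on line {i}"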

LlamaIndex Pipeline

We are using a LlamaIndex ingestion pipeline to generate questions for every text chunk. First, we read the given PDF document and convert it into multiple LlamaIndex documents.

from pathlib import Path
from llama_index import download_loader  # import path may differ across llama_index versions

# Download the PDF loader and read the file into LlamaIndex documents
# (file_path points to the PDF you want to process)
PDFReader = download_loader("PDFReader")
loader = PDFReader()
documents = loader.load_data(file=Path(file_path))

We split the PDF into multiple chunks and extract questions from every chunk. Using the ingestion pipeline, we chain the text splitting and the question extraction. We can also run the ingestion pipeline in parallel to reduce the question generation time, as shown after the next snippet.

# Import paths may differ slightly across llama_index versions
from llama_index.text_splitter import TokenTextSplitter
from llama_index.extractors import QuestionsAnsweredExtractor
from llama_index.ingestion import IngestionPipeline

# Logic for splitting the text into small nodes
text_splitter = TokenTextSplitter(separator=" ", chunk_size=512, chunk_overlap=128)
# Extracting 10 questions from each node
qa_extractor = QuestionsAnsweredExtractor(questions=10)
pipeline = IngestionPipeline(transformations=[text_splitter, qa_extractor])

nodes = pipeline.run(
    documents=documents,
    in_place=True,
    show_progress=True,
)
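
To run the pipeline in parallel, IngestionPipeline.run also accepts a num_workers argument that spreads the work across worker processes. A minimal sketch follows; the worker count is illustrative:

# num_workers > 1 uses multiprocessing under the hood, so on some platforms
# this call needs to live under an `if __name__ == "__main__":` guard
nodes = pipeline.run(
    documents=documents,
    in_place=True,
    show_progress=True,
    num_workers=4,
)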

LlamaIndex RAG

Using an in-memory vector store, we build a RAG pipeline over the given documents. We then iterate through the generated questions and produce an answer for each using the RAG pipeline.

# Creating a vector index for generating the answers
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=2)
for node in nodes:
    for question in node.metadata["questions_this_excerpt_can_answer"].split("\n"):
        # Querying the index and generating the answer
        # (path is the output file defined elsewhere)
        response = query_engine.query(question)
        store_generated_question_answer(question, response.response, path)

We store the output of the RAG pipeline in the specific format required by the GPT-3.5 Turbo chat model.

import json

def store_generated_question_answer(question, answer, path):
    """
    Store the generated question and answers to a text file
    :param question: Generated Question
    :param answer: Generated Answer
    :param path: Path of the output file
    :return: None
    """
    with open(path, "a") as f:
        f.write(
            json.dumps(
                {
                    "messages": [
                        {
                            "role": "system",
                            "content": "Dolly is a chatbot who can answer questions about Indian national policy",
                        },
                        {"role": "user", "content": question},
                        {
                            "role": "assistant",
                            "content": answer,
                        },
                    ]
                }
            )+"\n"
        )

Sample Generated data

{"messages": [{"role": "system", "content": "Dolly is a chatbot who can answer questions about Indian national policy"}, {"role": "user", "content": "1. What are the major changes in the health sector since the last health policy in 2002?"}, {"role": "assistant", "content": "The major changes in the health sector since the last health policy in 2002 include: \n1. Changing health priorities, with a decline in maternal and child mortality but a growing burden of non-communicable diseases and some infectious diseases.\n2. Emergence of a robust healthcare industry that is estimated to be growing at a double-digit rate.\n3. Increasing incidences of catastrophic expenditure due to healthcare costs, which are a major contributor to poverty.\n4. Rising economic growth, which has enabled enhanced fiscal capacity."}]}
{"messages": [{"role": "system", "content": "Dolly is a chatbot who can answer questions about Indian national policy"}, {"role": "user", "content": "2. What are the current health priorities in India?"}, {"role": "assistant", "content": "The current health priorities in India include increasing life expectancy, reducing mortality rates (particularly for children and mothers), achieving and maintaining elimination status for diseases like HIV/AIDS, Leprosy, Kala-Azar, and Lymphatic Filariasis, reducing the prevalence of diseases like tuberculosis and blindness, and reducing premature mortality from cardiovascular diseases, cancer, diabetes, and chronic respiratory diseases. Additionally, there is a focus on improving health systems performance by increasing the utilization of public health facilities, ensuring high coverage of antenatal care, skilled attendance at birth, immunization, and family planning services. There are also goals related to health finance, health infrastructure, human resources, and health management information."}]}
{"messages": [{"role": "system", "content": "Dolly is a chatbot who can answer questions about Indian national policy"}, {"role": "user", "content": "3. How has the healthcare industry in India been growing?"}, {"role": "assistant", "content": "The healthcare industry in India has been growing by aligning the growth of the private healthcare sector with public health goals. The government aims to influence the operation and growth of the private healthcare sector and medical technologies to ensure alignment with public health goals. This includes strategic purchasing by the government to fill critical gaps in public health facilities, which creates a demand for the private healthcare sector. The goal is to make healthcare systems more effective, efficient, rational, safe, affordable, and ethical."}]}

Conclusion

In this tutorial, we discussed how to generate training data from a PDF file using LlamaIndex. We can generate data from any PDF file that is specific to your domain and use case. In the next tutorial, we will discuss how to use the generated data to fine-tune OpenAI models. If you have any questions, feel free to post your comments. The full source code is available on GitHub.

OpenAI: Selecting, calling, and executing functions

Introduction

We can pass multiple functions to OpenAI. Based on the input chat message, OpenAI will do the following:

  1. Choose which function or functions to execute
  2. Pass the parameters based on the chat message
  3. Make sure that the data types are correct for the function

The actual function will be executed on the user's computer or server, not by the OpenAI completion API. The models which support function calling are

  1. gpt-3.5-turbo-0125
  2. gpt-4-turbo-preview

How did we call functions before?

Before the function calling feature, we used to pass the message to the ChatGPT API and ask it to extract the information from the input message in JSON format. The problem with this approach is that it sometimes returns invalid JSON or the wrong data types, so we had to build an agent that retries multiple times. Function calling improves the accuracy of the system.
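
For context, here is a rough sketch of that older prompt-based approach; the prompt wording and the leave_request fields are illustrative, not from the original post. Nothing guarantees the reply is valid JSON, which is exactly the weakness function calling removes:

import json
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# Ask the model to extract structured data using nothing but prompt instructions
prompt = (
    "Extract the leave request from the message below as JSON with the keys "
    '"number_of_days", "reason" and "type_of_leave".\n'
    "Message: I am not well today. I want to apply for leave today."
)
response = client.chat.completions.create(
    model="gpt-3.5-turbo-0125",
    messages=[{"role": "user", "content": prompt}],
)
try:
    leave_request = json.loads(response.choices[0].message.content)
except json.JSONDecodeError:
    leave_request = None  # the model did not return valid JSON; retry or give up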

Use case

To explain function calling, let's take an example. Assume we are building a question-answer chatbot for school students. The two kinds of questions students can ask the chatbot are:

  1. They can apply for leave by stating the reason, for example: “I am not well today. I want to apply for leave today”. We have to extract the information from the chat message to call the below-mentioned function.
import json

def apply_for_leave(number_of_days, reason, type_of_leave):
    """
    A function which calls the internal API to apply for leave
    :param number_of_days: Number of days required
    :param reason: Reason for leave
    :param type_of_leave: Sick or Holiday leave
    :return: Returns status for the leave
    """
    # Call the internal API for apply for the leave
    return json.dumps({"days": number_of_days, "reason": reason, "type": type_of_leave, "status": "approved"})

  2. They can view their marks by providing their student ID, which will execute the below-mentioned function.

def get_my_marks(registration_number):
    """
    Returns the marks for the given registration number
    :param registration_number: Registration id of the student
    :return: Returns the marks
    """
    # Call the internal API to get the marks
    return json.dumps({"registration_number": registration_number, "marks": {"CS": 90, "English": 89}})

Let’s see how we can select and execute the above-mentioned functions using OpenAI function calling.

Steps of function calling

  1. Define the functions and their parameters in the format shown below
  2. Call the OpenAI model with the above parameters
  3. The model will choose which function(s) to execute, along with the JSON input parameters
  4. Parse the JSON input and make sure that it contains the right functions and parameters
  5. Execute the function with the required parameters
  6. Pass the output back to the OpenAI API to generate the final response from the function's return value
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

def run_conversation(message):
    # Step 1
    messages = [{"role": "user", "content": message}]
    tools = [
        {
            "type": "function",
            "function": {
                "name": "apply_for_leave",
                "description": "A function which calls the internal API to apply for leave ",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "number_of_days": {
                            "type": "integer",
                            "description": "Number of days required",
                        },
                        "reason": {
                            "type": "string",
                            "description": "Reason for leave",
                        },
                        "type_of_leave": {"type": "string", "enum": ["Sick Leave", "Holiday Leave"]},
                    },
                    "required": ["number_of_days", "reason", "type_of_leave"],
                },
            },
        },
        {
            "type": "function",
            "function": {
                "name": "get_my_marks",
                "description": "Returns the marks for the given registration number",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "registration_number": {
                            "type": "integer",
                            "description": "Registration id of the student",
                        }
                    },
                    "required": ["registration_number"],
                },
            },
        }
    ]
    # Step 2
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        messages=messages,
        tools=tools,
        tool_choice="auto",  # auto is default, but we'll be explicit
    )
    response_message = response.choices[0].message
    tool_calls = response_message.tool_calls
    # Step 3
    if tool_calls:
        # Note: the JSON response may not always be valid; be sure to handle errors
        # Step 4
        available_functions = {
            "apply_for_leave": apply_for_leave,
            "get_my_marks": get_my_marks,
        }  # map tool names to the two local Python functions the model can choose from
        messages.append(response_message)  # extend conversation with assistant's reply
        # Step 5
        for tool_call in tool_calls:
            function_name = tool_call.function.name
            function_to_call = available_functions[function_name]
            function_args = json.loads(tool_call.function.arguments)
            if function_name == "apply_for_leave":
                function_response = function_to_call(
                    number_of_days=function_args.get("number_of_days"),
                    reason=function_args.get("reason"),
                    type_of_leave=function_args.get("type_of_leave")
                )
            elif function_name == "get_my_marks":
                function_response = function_to_call(
                    registration_number=function_args.get("registration_number"),
                )
            messages.append(
                {
                    "tool_call_id": tool_call.id,
                    "role": "tool",
                    "name": function_name,
                    "content": function_response,
                }
            )  # extend conversation with function response
        second_response = client.chat.completions.create(
            model="gpt-3.5-turbo-0125",
            messages=messages,
        )  # get a new response from the model where it can see the function response
        return second_response
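
As a quick usage sketch (the chat message is illustrative), calling the helper end to end looks like this:

# The model should select apply_for_leave, we execute it locally,
# and the second completion turns the function result into a natural reply
result = run_conversation("I am not well today. I want to apply for one day of sick leave.")
print(result.choices[0].message.content)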

Output

[Screenshot: function calling output]

Conclusion 

With the help of function calling, we can choose and call Python functions without errors, which improves the accuracy of the system. Under the hood, it simply passes the function skeletons to the ChatGPT API, which generates the JSON that selects the function and fills in its arguments. It can become an expensive solution if you have a large number of functions and lots of parameters to extract from the chat message. Function calling also helps us build state machines using LangGraph, which we will discuss in upcoming articles.

You can download the full code from this git repo. Feel free to comment if you have any questions.

Welcome to GenAI Explorer


Hi Everyone,

Welcome to GenAI Explorer!!
I am Nithin, a software engineer based in India. I mainly work on Generative AI-related applications, and I also have experience in cloud and big data. In this blog, I share my knowledge and experience of building Generative AI applications. If you need any help in implementing solutions or applications, or any guidance, you can contact me at [email protected].

Some of the topics I am planning to cover in this blog are

  1. OpenAI APIs
  2. LangChain
  3. LlamaIndex
  4. Vector databases
  5. Google Gemini
  6. Open-source LLMs
  7. Hugging Face
  8. Fine-tuning
  9. Building Gen AI applications
  10. Text Embeddings

Feel free to mail me if you want to include any topics related to Generative AI.

We appreciate your feedback!!

Thank you.

Nithin