Introduction
Many modern applications are built on top of large language models (LLMs). A common problem with LLMs is that they do not understand our data out of the box. We can use RAG to pass our data as context to the LLM, but there are always limits on context size and on how well the model understands our domain-specific language. The practical solution to these problems is fine-tuning. Most developers want to fine-tune either an open-source model or an OpenAI chat model, but getting data in the required format is difficult. We can use open-source datasets from Hugging Face, but their usefulness for a specific domain is limited. In this blog post, we will discuss how to generate training data from our own domain-specific documents.
Use Case
We are going to build a chatbot called Dolly and fine-tune it using national health policy documents. Your specialisation will be different, but you can follow the same steps and use any PDF documents you want to fine-tune the model on.
OpenAI fine-tuning
In this example, we are going to fine-tune the OpenAI chat model GPT-3.5 Turbo. The training dataset must follow the specific structure shown below.
{"messages": [{"role": "system", "content": "Dolly is a chatbot who can answer questions about Indian national policy"}, {"role": "user", "content": "1. What are the major changes in the health sector since the last health policy in 2002?"}, {"role": "assistant", "content": "The major changes in the health sector since the last health policy in 2002 include: \n1. Changing health priorities, with a decline in maternal and child mortality but a growing burden of non-communicable diseases and some infectious diseases.\n2. Emergence of a robust healthcare industry that is estimated to be growing at a double-digit rate.\n3. Increasing incidences of catastrophic expenditure due to healthcare costs, which are a major contributor to poverty.\n4. Rising economic growth, which has enabled enhanced fiscal capacity."}]}
Each training example has a system role that defines what kind of chatbot we are building, a user role that carries the question we asked, and an assistant role that carries the answer.
LlamaIndex Pipeline
We use a LlamaIndex ingestion pipeline to generate questions for every text chunk. First, we read the given PDF document and convert it into multiple LlamaIndex documents.
from pathlib import Path
from llama_index import download_loader

PDFReader = download_loader("PDFReader")
loader = PDFReader()
documents = loader.load_data(file=Path(file_path))
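To sanity-check the load before moving on, you can inspect how many documents were created and preview the text of the first one. A minimal sketch, assuming file_path points to your own PDF:
# Quick sanity check on the loaded documents
print(f"Loaded {len(documents)} documents")
print(documents[0].text[:200])  # preview the first 200 characters of the first document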
We split the PDF into chunks and extract questions from every chunk. Using the ingestion pipeline, we chain the text splitting and question extraction steps. The pipeline can also be run in parallel to reduce the question-generation time (see the sketch after the code block).
from llama_index.extractors import QuestionsAnsweredExtractor
from llama_index.ingestion import IngestionPipeline
from llama_index.text_splitter import TokenTextSplitter

# Logic for splitting the text into small nodes
text_splitter = TokenTextSplitter(separator=" ", chunk_size=512, chunk_overlap=128)
# Extracting 10 questions from each node
qa_extractor = QuestionsAnsweredExtractor(questions=10)

pipeline = IngestionPipeline(transformations=[text_splitter, qa_extractor])
nodes = pipeline.run(
    documents=documents,
    in_place=True,
    show_progress=True,
)
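In recent llama_index releases, IngestionPipeline.run also accepts a num_workers argument, which is one way to run the extraction in parallel as mentioned above. A hedged sketch, where the worker count is an assumption you should tune to your machine and OpenAI rate limits:
# Run the same pipeline with multiple workers (supported in recent llama_index versions)
nodes = pipeline.run(
    documents=documents,
    in_place=True,
    show_progress=True,
    num_workers=4,  # assumption: adjust to your CPU cores and API rate limits
)
# Each node now carries its generated questions in the metadata
print(nodes[0].metadata["questions_this_excerpt_can_answer"])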
LlamaIndex RAG
Using an in-memory vector store, we build a RAG pipeline over the same documents. We then iterate through the generated questions and answer each one using the RAG query engine.
from llama_index import VectorStoreIndex

# Creating a vector index for generating the answers
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=2)

for node in nodes:
    for question in node.metadata["questions_this_excerpt_can_answer"].split("\n"):
        # Querying the index and generating the answer
        response = query_engine.query(question)
        store_generated_question_answer(question, response.response, path)
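As the sample output below shows, the extracted questions keep their list numbering ("1.", "2.", ...), and splitting on newlines can also yield empty strings. If you want cleaner prompts, a small optional helper can strip both before querying. This is a sketch of my own, not part of the original pipeline:
import re

def clean_questions(raw_questions):
    """Split the extractor output into individual questions, dropping blanks and numbering."""
    questions = []
    for line in raw_questions.split("\n"):
        line = re.sub(r"^\s*\d+[.)]\s*", "", line).strip()  # remove a leading "1. " or "2) "
        if line:
            questions.append(line)
    return questions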
We store the output of the RAG in the JSONL format required for fine-tuning the GPT-3.5 Turbo chat model.
import json


def store_generated_question_answer(question, answer, path):
    """
    Store the generated question and answer in a text file
    :param question: Generated Question
    :param answer: Generated Answer
    :param path: Path of the output file
    :return: None
    """
    with open(path, "a") as f:
        f.write(
            json.dumps(
                {
                    "messages": [
                        {
                            "role": "system",
                            "content": "Dolly is a chatbot who can answer questions about Indian national policy",
                        },
                        {"role": "user", "content": question},
                        {
                            "role": "assistant",
                            "content": answer,
                        },
                    ]
                }
            )
            + "\n"
        )
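Before using the file for fine-tuning, it is worth checking that every line is valid JSON with the three expected roles. A minimal validation sketch, assuming path is the same output file used above:
import json

def validate_training_file(path):
    """Check that every JSONL line parses and contains system, user and assistant messages."""
    count = 0
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            roles = [message["role"] for message in record["messages"]]
            assert roles == ["system", "user", "assistant"], f"Unexpected roles: {roles}"
            count += 1
    print(f"Validated {count} training examples")

validate_training_file(path)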
Sample Generated data
{"messages": [{"role": "system", "content": "Dolly is a chatbot who can answer questions about Indian national policy"}, {"role": "user", "content": "1. What are the major changes in the health sector since the last health policy in 2002?"}, {"role": "assistant", "content": "The major changes in the health sector since the last health policy in 2002 include: \n1. Changing health priorities, with a decline in maternal and child mortality but a growing burden of non-communicable diseases and some infectious diseases.\n2. Emergence of a robust healthcare industry that is estimated to be growing at a double-digit rate.\n3. Increasing incidences of catastrophic expenditure due to healthcare costs, which are a major contributor to poverty.\n4. Rising economic growth, which has enabled enhanced fiscal capacity."}]}
{"messages": [{"role": "system", "content": "Dolly is a chatbot who can answer questions about Indian national policy"}, {"role": "user", "content": "2. What are the current health priorities in India?"}, {"role": "assistant", "content": "The current health priorities in India include increasing life expectancy, reducing mortality rates (particularly for children and mothers), achieving and maintaining elimination status for diseases like HIV/AIDS, Leprosy, Kala-Azar, and Lymphatic Filariasis, reducing the prevalence of diseases like tuberculosis and blindness, and reducing premature mortality from cardiovascular diseases, cancer, diabetes, and chronic respiratory diseases. Additionally, there is a focus on improving health systems performance by increasing the utilization of public health facilities, ensuring high coverage of antenatal care, skilled attendance at birth, immunization, and family planning services. There are also goals related to health finance, health infrastructure, human resources, and health management information."}]}
{"messages": [{"role": "system", "content": "Dolly is a chatbot who can answer questions about Indian national policy"}, {"role": "user", "content": "3. How has the healthcare industry in India been growing?"}, {"role": "assistant", "content": "The healthcare industry in India has been growing by aligning the growth of the private healthcare sector with public health goals. The government aims to influence the operation and growth of the private healthcare sector and medical technologies to ensure alignment with public health goals. This includes strategic purchasing by the government to fill critical gaps in public health facilities, which creates a demand for the private healthcare sector. The goal is to make healthcare systems more effective, efficient, rational, safe, affordable, and ethical."}]}
Conclusion
In this tutorial, we discussed how to generate training data from a PDF file using LlamaIndex. You can generate data from any PDF file that is specific to your domain. In the next tutorial, we will discuss how to use the generated data to fine-tune the OpenAI models. If you have any questions, feel free to post them in the comments. The full source code is available on GitHub.