David Sula's Lavish Atelier

A blog for project documentation, game clips, and other uploads.

DamienAI v0.10 – v0.32.2 Personal Project

DamienAI is a personal project I started working on around October 2023. The general aim was to make an AI model that was as similar as possible to ChatGPT with the limited resources available to me: a good computer paired with an Nvidia GeForce RTX 3090, no money to fund my project, and a lot of misguided hope.

On my journey making Damien, I’ve made many updates, and this post will provide a general rundown of where I started and where I am as of March 3rd 2024.

v0.10: Initial creation, trained on whatever data I could find easily accessible.

At first, I coded a basic training script using PyTorch and the pre-trained “gpt2-medium” model along with the tokenizer that comes with it (ironically, with the help of GPT-4). This training script gathers all the “.txt” and “.csv” files from the “data” directory and merges them into one text file, with the title of each file placed above its contents in the combined file. The script then tokenizes the combined file (makes it readable to the model) and trains the model on it. Once finished, it deletes the combined and tokenized files and exits.

train_model.py
import os
import pandas as pd
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer, TextDataset, DataCollatorForLanguageModeling, Trainer, TrainingArguments

def csv_to_text(csv_path):
    # Read the CSV file
    df = pd.read_csv(csv_path)

    # Convert the dataframe to a single string
    # Assuming the CSV has a single column, adjust if it has multiple columns
    text_data = "\n".join(df.iloc[:, 0].dropna().tolist())

    return text_data

def train_gpt2():
    # Set the device to cuda:0
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    torch.cuda.set_device(device)
    print(f"Using device: {device}")

    # Load pre-trained model and tokenizer
    model = GPT2LMHeadModel.from_pretrained("gpt2-medium").to(device)
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")

    # Get all .txt and .csv files in the data directory
    data_dir = "./data"
    txt_files = [f for f in os.listdir(data_dir) if f.endswith('.txt')]
    csv_files = [f for f in os.listdir(data_dir) if f.endswith('.csv')]

    # Concatenate all text files, using the filename as the subject
    combined_data = ""
    for txt_file in txt_files:
        subject = os.path.splitext(txt_file)[0]  # Filename without .txt extension
        try:
            with open(os.path.join(data_dir, txt_file), 'r', encoding='utf-8') as f:
                content = f.read()
                combined_data += f"[{subject}] {content}\n"
        except UnicodeDecodeError:
            print(f"Error reading {txt_file}. It might not be encoded in UTF-8. Skipping this file.")

    # Process .csv files
    for csv_file in csv_files:
        subject = os.path.splitext(csv_file)[0]  # Filename without .csv extension
        content = csv_to_text(os.path.join(data_dir, csv_file))
        combined_data += f"[{subject}] {content}\n"

    # Save the combined data to a temporary file
    temp_file_path = os.path.join(data_dir, "combined_training_data.txt")
    with open(temp_file_path, 'w', encoding='utf-8') as f:
        f.write(combined_data)

    # Prepare training dataset
    train_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=temp_file_path,
        block_size=128
    )

    # Data collator handles batching
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False
    )

    # Define training arguments
    training_args = TrainingArguments(
        output_dir="./trained_model",
        overwrite_output_dir=True,
        num_train_epochs=1,
        per_device_train_batch_size=32,
        save_steps=10_000,
        save_total_limit=2,
    )

    # Create Trainer instance
    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=train_dataset,
    )

    # Train the model
    trainer.train()

    # Delete the cached file
    if os.path.exists(temp_file_path):
        os.remove(temp_file_path)

    # Delete the cached tokenizer file
    tokenizer_cache_file = os.path.join(data_dir, "cached_lm_GPT2Tokenizer_128_combined_training_data.txt")
    if os.path.exists(tokenizer_cache_file):
        os.remove(tokenizer_cache_file)

    # Explicitly save the model and its configuration
    trainer.save_model("./trained_model")

if __name__ == "__main__":
    train_gpt2()
    input("Press Enter to continue...")

The data I initially trained the AI on was only 35.5 megabytes. It included 90 Wikipedia articles, scraped using a simple script that prompted for a link to a Wikipedia article and saved its text to a “.txt” file.

WikipediaScraper.py
# WikipediaScraper.py

import requests
from bs4 import BeautifulSoup
import tkinter as tk
from tkinter import simpledialog, messagebox

def scrape_wikipedia_page(url):
    response = requests.get(url)

    if response.status_code != 200:
        print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
        return None

    soup = BeautifulSoup(response.content, 'html.parser')
    content_div = soup.find('div', {'id': 'mw-content-text'})
    for table in content_div.find_all('table'):
        table.decompose()

    page_text = content_div.get_text()
    return page_text

def save_content_to_file(page_name, content):
    filename = f"{page_name}_Wikipedia.txt"
    with open(filename, "w", encoding="utf-8") as file:
        file.write(content)

def main():
    root = tk.Tk()
    root.withdraw()  # Hide the main window

    while True:
        url = simpledialog.askstring("Input", "Please enter the Wikipedia URL:")

        # If the user presses Cancel or closes the window, exit the loop
        if not url:
            break

        content = scrape_wikipedia_page(url)
        if content:
            page_name = url.split("/")[-1]
            save_content_to_file(page_name, content)
            messagebox.showinfo("Success", f"Content saved to {page_name}_Wikipedia.txt")
        else:
            messagebox.showerror("Error", "Failed to scrape the content.")

if __name__ == "__main__":
    main()

The Wikipedia data totalled 10MB on its own. The rest came from publicly accessible datasets, such as the subreddit Cornell, which came as a 10MB text file; some Joe Rogan Experience transcripts that were auto-generated by YouTube (and very poor quality); and a couple of other minuscule sources.

The result was a really bad AI model that simply did not work. While the model is training, the script reports a metric called “train_loss”. The train loss essentially measures how well the model has learned the data it is being trained on: it is calculated from the difference between the model’s guesses about what comes next in the data and what actually comes next. For reference, when the model reaches a train loss of around 0.2, it can be said to understand the data very well. Anything around 0.5 means it is generally understanding the data, and anything above 1.5 signals that it is dysfunctional or broken. After being trained for the first time, DamienAI came out with a train loss of over 3… This was entirely due to a lack of data.
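
For anyone unfamiliar with the metric, here is a minimal sketch of what it measures (this is an illustration, not part of my training script, and the file name is made up): the loss is the cross-entropy between the model’s predicted next tokens and the tokens that actually follow, so lower means better guesses.

loss_sketch.py (illustration only)
# Minimal illustration of what "train_loss" measures: the cross-entropy
# between the model's next-token predictions and the actual next tokens.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

# Passing labels makes the model shift them internally and return the
# average cross-entropy loss - the same number reported during training.
with torch.no_grad():
    outputs = model(**inputs, labels=inputs["input_ids"])
print(f"loss: {outputs.loss.item():.2f}")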

This resulted in awful responses from DamienAI that made no sense. Essentially, it was just continuing on from whatever I gave it: if I messaged it a word that appeared in its training data, it would find whichever words often came after that word and return them to me. Here are some examples. (I mentioned Rishi Sunak and Joe Rogan because it was trained on the Wikipedia articles about them.)

v0.20: Fine-tuning

OpenAI publishes documentation on the methods used to train ChatGPT. They train their AI in two steps. The first step is to train a blank-slate model on vast amounts of data; for OpenAI this involves over half a terabyte of text. The next step is “fine-tuning”: they custom-write their own data consisting of queries followed by responses, and train the AI on it so that it learns the behaviour of an assistant or chatbot.

Having read this, I custom-wrote my own queries and responses to feed into my model. This was a 14KB “.txt” file which I trained DamienAI on several times.
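
As a rough sketch of the idea (the role markers, file name and example pairs below are illustrative, not the exact format of my file), a fine-tuning file like this can be built from query/response pairs:

make_finetune_data.py (illustration only)
# Build a small query/response training file. The "User:"/"Damien:" markers
# and the example pairs are illustrative, not the actual fine-tuning data.
pairs = [
    ("Hello, who are you?", "I am DamienAI, a personal AI assistant."),
    ("What can you do?", "I can answer questions and hold a conversation."),
]

with open("finetune.txt", "w", encoding="utf-8") as f:
    for query, response in pairs:
        f.write(f"User: {query}\nDamien: {response}\n\n")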

Whilst these outputs were good, there were still a lot of errors, the most common being the output matching the input exactly, or the same word repeating over and over:

These errors were still primarily because I had trained the model on such a small amount of data.

v0.25: Feedback

The next step I thought of was to create a feedback system for the model, so that when a good response is given I can tell Damien it was good and encourage him to repeat that kind of output. My method was to prompt the user after every response to say whether or not it was good. These ratings are written out to a “.csv” file, and the outputs rated “good” are returned to the model for re-training, in the hope that they occur more often as a result.
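
A minimal sketch of that loop (the file name, column layout and helper functions here are illustrative, not my exact code):

feedback_sketch.py (illustration only)
# Log each response with a good/bad rating to a CSV file, then pull out
# the responses rated "good" so they can be fed back into training.
import csv
import os

def log_feedback(prompt, response, rating, path="feedback.csv"):
    # Append one row per rated response.
    with open(path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow([prompt, response, rating])

def good_responses(path="feedback.csv"):
    # Collect the responses whose rating column is "good".
    if not os.path.exists(path):
        return []
    with open(path, newline="", encoding="utf-8") as f:
        return [row[1] for row in csv.reader(f) if len(row) >= 3 and row[2] == "good"]

if __name__ == "__main__":
    prompt = input("You: ")
    response = "(model response would go here)"
    rating = input("Was this response good? (good/bad): ")
    log_feedback(prompt, response, rating)
    print(f"{len(good_responses())} good responses collected so far.")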

v0.30: Fresh restart with Wikipedia Data.

I decided to take a step back and restart from scratch. I improved my Wikipedia scraping script so that it would not only scrape the article I gave it, but also any articles hyperlinked from it; Wikipedia tends to hyperlink key terms, and those links point to the Wikipedia article on that term. This resulted in around 20-30 Wikipedia articles being scraped per link I provided to the script. The new dataset totalled around 953MB – a huge improvement, although still nowhere near enough, as the train loss came out at 2.9. That is still much better than well over 3, but it also showed that I need to massively improve my methods of harvesting data.
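
This isn’t the full script, just a simplified sketch of the link-following idea: collect the internal “/wiki/” links from an article’s body and pass each of them to the same scraping function as before. The helper name and the 30-link cap below are illustrative.

crawl_sketch.py (illustration only)
# Simplified sketch of the link-following idea: gather the internal /wiki/
# links from one article so each of them can be scraped in turn.
import requests
from bs4 import BeautifulSoup

def wiki_links(url, limit=30):
    # Collect internal /wiki/ links from the article body, skipping special
    # pages such as "File:" or "Category:" (their hrefs contain a colon).
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    content = soup.find("div", {"id": "mw-content-text"})
    links = []
    for a in content.find_all("a", href=True):
        href = a["href"]
        if href.startswith("/wiki/") and ":" not in href:
            links.append("https://en.wikipedia.org" + href)
        if len(links) >= limit:
            break
    return links

if __name__ == "__main__":
    seed = "https://en.wikipedia.org/wiki/Artificial_intelligence"
    for link in [seed] + wiki_links(seed):
        print("Would scrape:", link)  # the real script calls scrape_wikipedia_page(link)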

I’m not going to show any responses from this model, as I hadn’t fine-tuned it, so DamienAI would just respond with whatever words the Wikipedia articles associated with the words I provided.

v0.31

In this update, I simply added a “cache” folder where the training script can dump its combined data file and its tokenized data file while it uses them for training. It also gives the script a place to store “checkpoints”, which are saves of the model partway through training, in case something happens to the system and training stops abruptly.
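
In practice this is a small change; a sketch of the idea is below (paths illustrative), and a crashed run can be picked up again from the most recent checkpoint via the Trainer’s resume_from_checkpoint option.

cache_sketch.py (illustration only)
# Give the training script a ./cache folder for its temporary files,
# instead of dumping them in ./data. Paths are illustrative.
import os

cache_dir = "./cache"
os.makedirs(cache_dir, exist_ok=True)

# Temporary combined data file (TextDataset writes its tokenized cache
# file alongside it).
temp_file_path = os.path.join(cache_dir, "combined_training_data.txt")

# Checkpoints are periodic mid-training saves; if a run dies, it can be
# resumed from the most recent one with:
#   trainer.train(resume_from_checkpoint=True)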

v0.32: Organising

I began to rework how my data is stored, and I also began to seek data from more sources. My data directory went from a dump of loads of unorganized text files to a clean, organized directory with separate paths for different types of data.

Before:

After:

The podcast transcripts are very useful, as they are just general conversation, which is what I primarily want DamienAI to understand. Each podcast comes to about 200KB of text, which is very good.

Additionally, I found out that Project Gutenberg provides free e-books that you can download in plain text, which is another source of large amounts of text data. I managed to download 694MB from Project Gutenberg (964 e-books, although they host over 70,000 on the website, so downloading more is a must).
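
A sketch of how that download can be scripted (the URL pattern below is how Gutenberg serves plain-text copies as far as I can tell, and the book IDs are arbitrary examples, not what I actually downloaded):

gutenberg_sketch.py (illustration only)
# Illustrative bulk download of plain-text e-books from Project Gutenberg.
# The book IDs are arbitrary examples, not the ones I actually used.
import requests

book_ids = [1342, 84, 2701]  # e.g. Pride and Prejudice, Frankenstein, Moby Dick

for book_id in book_ids:
    url = f"https://www.gutenberg.org/cache/epub/{book_id}/pg{book_id}.txt"
    response = requests.get(url)
    if response.status_code == 200:
        with open(f"gutenberg_{book_id}.txt", "w", encoding="utf-8") as f:
            f.write(response.text)
    else:
        print(f"Failed to download book {book_id} (status {response.status_code})")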

BBC News also became a source of text data, although a very small one, as each news article is only around 5KB.

I did not finish this version, as I wanted to add multi-GPU compatibility first so that I could use Google Cloud VMs, which provide several powerful GPUs per VM.

v0.32.2: Multiple GPUs + more data

I updated my “train_model.py” script to use every CUDA-capable Nvidia GPU available on the system, as I was hoping to train on Google Cloud VMs. In practice that turned out to be impossible: I spent a week trying to get a VM and never managed to obtain a single one, as GPU instances were simply never available. I tried every GPU type on offer, and I even lowered my request to a single GPU, but I never got the chance to use one.

train_model.py
# train_model.py

import os
import pandas as pd
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer, TextDataset, DataCollatorForLanguageModeling, Trainer, TrainingArguments
import concurrent.futures

def process_file(file_path, base_dir):
    file_name = os.path.basename(file_path)
    file_name_without_ext = os.path.splitext(file_name)[0]

    relative_path = os.path.relpath(file_path, start=base_dir)
    path_components = relative_path.split(os.sep)
    path_tags = path_components[:-1]

    # Adjusted to format directory path with ' > ' and place content on a new line
    tag = " > ".join(path_tags)

    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            content = f.read().strip()
            # Adjusted to place the first line of content below the title
            return f"[{tag} > {file_name_without_ext}]\n{content}\n\n"
    except UnicodeDecodeError:
        print(f"Error reading {file_path}. It might not be encoded in UTF-8. Skipping this file.")
        return ""

def train_gpt2():
    if torch.cuda.is_available():
        device = torch.device("cuda")
        print(f"Using device: {device}, with {torch.cuda.device_count()} GPUs")
    else:
        print("CUDA is not available. Training on CPU.")
        device = torch.device("cpu")

    model = GPT2LMHeadModel.from_pretrained("gpt2-medium")
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")

    # Enable Data Parallelism for multi-GPU training
    if torch.cuda.device_count() > 1:
        model = torch.nn.DataParallel(model)

    model.to(device)

    base_dir = "./data"
    all_files = [os.path.join(root, file) for root, dirs, files in os.walk(base_dir) for file in files if file.endswith(('.txt', '.csv'))]

    with concurrent.futures.ProcessPoolExecutor() as executor:
        combined_data_list = list(executor.map(process_file, all_files, [base_dir]*len(all_files)))

    combined_data = "".join(combined_data_list)

    cache_dir = "./cache"
    os.makedirs(cache_dir, exist_ok=True)

    temp_file_path = os.path.join(cache_dir, "combined_training_data.txt")
    with open(temp_file_path, 'w', encoding='utf-8') as f:
        f.write(combined_data)

    train_dataset = TextDataset(tokenizer=tokenizer, file_path=temp_file_path, block_size=128)
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    training_args = TrainingArguments(
        output_dir="./trained_model",
        overwrite_output_dir=True,
        num_train_epochs=1,
        per_device_train_batch_size=36 // max(1, torch.cuda.device_count()),  # avoid dividing by zero on CPU-only systems
        save_steps=20_000,
        save_total_limit=20,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=train_dataset,
    )

    trainer.train()
    trainer.save_model("./trained_model")

if __name__ == "__main__":
    train_gpt2()

In the end, I gave up on using Google Cloud as it was just not possible.

I did work on getting a bit more data and finally managed to pass the 1 gigabyte mark. The training data I used for v0.32.2 totalled 1.25GB. This included Wikipedia articles, Scholarpedia articles (a Wikipedia alternative), documentation for open-source projects, documentation for programming languages, BBC News articles, and more.

In the end, I had to train the model on my own computer using my RTX 3090. I left it to train overnight (it took 11 hours) and the result was a train loss of 2.9. This goes to show that I need far better methods of collecting data, as 1.25GB is still nowhere remotely close to enough.

Training Screenshot

My next steps for v0.33 are to fine-tune this model just to see how it goes, and I will document my process and the results. For v0.4, I will aim to collect a lot more data.