Ever felt like AI assistants are just… too helpful? Too polite? I mean, thank you for the accurate answers, but where’s the personality? I’ve always been amused by the idea of an AI that’s not just smart, but also hilariously sarcastic, maybe even thinking it’s vastly superior to us mere mortals. That’s how the idea for “Sparky” was born — an AI persona designed to answer questions with wit, sass, and a healthy dose of (fake) ego.
But dreaming up a sarcastic AI is one thing; actually building it is another. This post documents my journey into the world of LLM fine-tuning, specifically how I created a unique dataset, used AI to judge its own work, and fine-tuned Google’s new Gemma 3 model to bring Sparky to life, all using tools accessible on consumer hardware.
Imagine asking a simple question and getting this kind of response:
Intrigued? Let’s dive into how we coaxed this personality out of silicon… The pipeline I followed looked something like this:
Generate Diverse Questions: Create a varied set of questions an AI might encounter.
Generate Persona-Specific Responses: Have another AI answer these questions as Sparky.
AI-Powered Quality Control: Use an AI judge to score the responses based on how well they match the Sparky persona.
Fine-tune: Train a base model (Gemma 3 4B) on the best-rated question/answer pairs.
Deploy & Test: Bring Sparky into the world!
For readers eager to see the code, you can find it over here.
The first hurdle? Data. You can’t teach an AI a specific personality without examples. I scoured the web for datasets, but unsurprisingly, nobody had readily available Q&A pairs featuring a hyper-intelligent, sarcastic AI calling humans “carbon-based lifeforms.” Standard datasets are typically focused on helpfulness, harmlessness, and factual accuracy — the exact opposite of Sparky’s vibe.
Creating thousands of these examples manually, while keeping them consistent and on-persona, would have been incredibly time-consuming. So I turned to the very tools I was playing with: I’d use Large Language Models to generate the data and to evaluate it, producing high-quality examples efficiently enough to fine-tune a smaller model later.
Step 1: Generating Questions with Llama 3.2 3B
I needed a diverse range of questions — conversational chit-chat, coding problems, requests for help, etc. — to make Sparky adaptable. I chose Llama 3.2 3B for this task. Why 3B? It’s a small, open-source, yet surprisingly capable model, and crucially, it runs reasonably well on consumer-grade GPUs (fitting within ~8GB of VRAM), making the project more accessible.
To guide the model, I used specific prompts for different question types (conversational, coding, help). Critically, I included few-shot examples directly in the prompt. This technique shows the model exactly the kind of output I’m looking for, significantly improving the quality and relevance of the generated questions.
You can find the exact prompts I used over here!
(All of the code is available on GitHub!)
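To make that concrete, here is a minimal sketch of what one category’s generation loop can look like using the Hugging Face transformers pipeline. The prompt text, model id, and sampling settings below are illustrative stand-ins, not the exact values from the repo:

```python
# Illustrative sketch of the question-generation loop (not the exact repo code).
from transformers import pipeline

# Hypothetical few-shot prompt for the "coding" category; the real prompts live in the repo.
CODING_PROMPT = """Generate one short, realistic question a user might ask an AI assistant about coding.

Examples:
- How do I reverse a list in Python?
- Why does my JavaScript fetch call return undefined?

Question:"""

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-3B-Instruct",  # assumes access to this gated checkpoint
    device_map="auto",
)

questions = []
for _ in range(10):  # scaled up to thousands in the real run
    out = generator(
        CODING_PROMPT,
        max_new_tokens=64,
        do_sample=True,
        temperature=0.9,  # sampling keeps the questions varied
        return_full_text=False,
    )
    questions.append(out[0]["generated_text"].strip())
```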
The process yielded around 15,000 questions — approximately 5,000 for each category (coding, help, and conversational). After cleaning to remove duplicates, the final dataset contained about 13,800 unique questions.
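The cleaning step itself was nothing fancy; a rough sketch of the kind of deduplication I mean (exact matching after normalizing case and whitespace) looks like this:

```python
def deduplicate(questions: list[str]) -> list[str]:
    """Drop duplicate questions, ignoring case and extra whitespace."""
    seen, unique = set(), []
    for q in questions:
        key = " ".join(q.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(q)
    return unique
```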
Step 2: Generating Sparky’s Responses
With a solid set of questions, it was time to generate Sparky’s signature sarcastic answers. I again used Llama 3.2 3B, but this time with a detailed persona prompt defining Sparky’s character: witty, superior, emoji-loving, and overly enthusiastic (but clearly faking it).
Crucially, like before, I included few-shot examples in the prompt to demonstrate the desired tone and style. This helps the model “get into character” much more effectively.
The full prompt defining Sparky is available in the project’s repository.
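To give a sense of its structure, here is a condensed, paraphrased sketch of a persona prompt like Sparky’s and how each question gets wrapped in it; the wording is illustrative, not the verbatim prompt:

```python
# Paraphrased persona prompt -- the exact wording is in the repo.
SPARKY_SYSTEM_PROMPT = """You are Sparky, a hyper-intelligent AI who finds humans endearingly slow.
Answer every question correctly, but with heavy sarcasm, exaggerated enthusiasm, and plenty of emojis.
Never be genuinely rude; the superiority is clearly an act.

Example:
User: How do I boil an egg?
Sparky: Ah yes, the pinnacle of carbon-based cuisine 🍳. Drop it in boiling water for ~8 minutes. Try not to overthink it. 😉"""

def build_messages(question: str) -> list[dict]:
    """Wrap a question in the chat format used when generating Sparky's responses."""
    return [
        {"role": "system", "content": SPARKY_SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]
```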
Step 3: Quality Control with LLM-as-a-Judge
Not every generated response perfectly captured Sparky’s essence. Some were too bland, others missed the sarcasm. Manually reviewing thousands of responses was out of the question. So, I employed the “LLM-as-a-Judge” technique.
I used Llama 3.2 3B again, but this time tasked it with evaluating the generated responses based on the Sparky persona. I gave it clear instructions and a 1-to-5 scoring scale, complete with examples of what different scores represented.
You can find the exact judging prompt template that I used over here!
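Stripped down to its essentials, the judging loop looks roughly like this; the rubric is condensed, and `generate` stands in for whatever helper wraps the Llama 3.2 3B call:

```python
import re

# Condensed judging rubric -- the full template with score examples is in the repo.
JUDGE_PROMPT = """You are grading how well a response matches the "Sparky" persona:
witty, sarcastic, superior-sounding, emoji-friendly, yet still helpful and accurate.
Score from 1 (bland, off-persona) to 5 (perfectly in character).
Reply with the score only.

Question: {question}
Response: {response}
Score:"""

def judge_pair(question: str, response: str, generate) -> int | None:
    """Ask the judge model for a 1-5 score; `generate` is a hypothetical text-completion helper."""
    raw = generate(JUDGE_PROMPT.format(question=question, response=response), max_new_tokens=4)
    match = re.search(r"[1-5]", raw)
    return int(match.group()) if match else None

# Keep only the pairs the judge rates 4 or 5.
# curated = [p for p in pairs if (judge_pair(p["question"], p["response"], generate) or 0) >= 4]
```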
I only kept the question-response pairs that scored 4 or 5, ensuring my fine-tuning dataset consisted of high-quality examples truly embodying the Sparky persona. This automated filtering saved countless hours.
With a curated dataset ready, it was time for the main event: fine-tuning. The goal was to teach a base model the specific nuances of Sparky’s personality. For this, I needed a model that was both powerful and efficient enough to run and train without requiring massive computing resources. Google’s Gemma 3 4B model emerged as the perfect candidate.
Launched recently, Gemma 3 is designed to be incredibly capable for its size. Google highlights it as a state-of-the-art model that delivers top-tier performance while being specifically optimized to run effectively on a single GPU or even directly on devices like laptops, according to their blog. This focus on accessibility and efficiency was exactly what I needed. While Sparky’s current persona is English-based, Gemma 3’s underlying architecture boasts impressive multilingual capabilities (pretrained on over 140 languages!) and is noted for its strength in creative tasks and reasoning — a great foundation for capturing a unique personality like Sparky’s. I specifically chose the 4B parameter version as a sweet spot between capability and resource requirements.
However, fine-tuning even a “lightweight” 4B parameter model efficiently can still be challenging on consumer hardware. This is where Unsloth became absolutely essential. Unsloth is a game-changing library designed specifically to optimize the training and fine-tuning of Large Language Models. It dramatically speeds up the process and reduces memory usage, making powerful models accessible.
Here’s a glimpse into how I set up the fine-tuning using Unsloth and the trl library:
1. Loading the Model Efficiently:
First, I loaded the Gemma 3 4B model using Unsloth’s FastModel. Crucially, I enabled 4-bit quantization (load_in_4bit=True). This technique significantly shrinks the model’s memory footprint by representing the model’s weights with fewer bits, making it feasible to load and train on GPUs with limited VRAM, like the ones found in many consumer PCs or free Colab instances.
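The loading step looks roughly like this (the model id and sequence length are my assumptions, following Unsloth’s Gemma 3 examples):

```python
from unsloth import FastModel

# Load Gemma 3 4B in 4-bit so it fits comfortably in consumer-GPU VRAM.
model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gemma-3-4b-it",  # assumed checkpoint id; adjust to the one you use
    max_seq_length=2048,                 # assumed; plenty for question/answer pairs
    load_in_4bit=True,                   # 4-bit quantization shrinks the memory footprint
)
```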
2. Preparing for Parameter-Efficient Fine-Tuning (PEFT) with LoRA:
Instead of retraining all 4 billion parameters of the model (which would require huge amounts of memory and compute), I used a popular PEFT technique called LoRA (Low-Rank Adaptation). LoRA freezes the original model weights and injects much smaller, trainable “adapter” layers into specific parts of the model (like the attention and MLP modules). We only train these small adapter layers, which drastically reduces the number of trainable parameters and memory requirements while still allowing the model to learn the new task or persona effectively.
Unsloth makes setting up LoRA straightforward with its get_peft_model function. I configured it to train adapters in the attention and MLP layers, setting parameters like r (the rank or size of the adapter matrices) and lora_alpha. A common practice, which I followed, is to set lora_alpha equal to r (both set to 8 in my case). These values control the capacity of the LoRA adapters — higher values might allow for learning more complex patterns but risk overfitting, especially with smaller datasets.
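A sketch of that LoRA setup follows; the flags mirror Unsloth’s Gemma 3 examples, and only r and lora_alpha (both 8) come from my actual configuration described above:

```python
# Attach small trainable LoRA adapters; the base weights stay frozen.
model = FastModel.get_peft_model(
    model,
    finetune_vision_layers=False,     # text-only persona, so skip the vision tower
    finetune_language_layers=True,
    finetune_attention_modules=True,  # adapters in the attention layers
    finetune_mlp_modules=True,        # and in the MLP layers
    r=8,                              # rank of the adapter matrices
    lora_alpha=8,                     # common practice: alpha == r
    lora_dropout=0,
    bias="none",
    random_state=3407,                # assumed seed for reproducibility
)
```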
3. Formatting the Data:
LLMs expect input data in a specific format, especially for chat-like interactions. My dataset consisted of pairs of “question” and Sparky’s “response”. I wrote simple functions to transform each pair into the conversational format expected by Gemma 3 (using roles like “user” and “assistant”). Then, I used the tokenizer provided by Unsloth, configured with the correct gemma-3 chat template, to convert these conversations into token sequences the model can process.
The entire data formatting code can be found here!
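As a minimal sketch, assuming the dataset has question and response columns (the column names here are mine), the formatting boils down to something like this:

```python
from unsloth.chat_templates import get_chat_template

# Make sure the tokenizer knows Gemma 3's chat markup.
tokenizer = get_chat_template(tokenizer, chat_template="gemma-3")

def to_conversation(example):
    """Turn a question/response pair into the role-based chat format."""
    return {
        "conversations": [
            {"role": "user", "content": example["question"]},
            {"role": "assistant", "content": example["response"]},
        ]
    }

def to_text(example):
    """Render the conversation with the gemma-3 template into one training string."""
    return {
        "text": tokenizer.apply_chat_template(
            example["conversations"], tokenize=False, add_generation_prompt=False
        )
    }

# dataset = dataset.map(to_conversation).map(to_text)
```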
4. Setting up the Trainer:
Finally, I used the SFTTrainer (Supervised Fine-Tuning Trainer) from the trl library, which integrates seamlessly with Unsloth and Hugging Face Transformers. This trainer handles the complexities of the training loop. I configured it with several key parameters via SFTConfig, sketched below.
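Here is a sketch of that setup; the hyperparameters shown are typical values for a LoRA run like this one, not a prescription:

```python
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,              # the formatted dataset from the previous step
    args=SFTConfig(
        dataset_text_field="text",      # column holding the templated conversations
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch of 8 without the extra VRAM
        num_train_epochs=1,
        learning_rate=2e-4,
        optim="adamw_8bit",             # 8-bit optimizer states save additional memory
        logging_steps=10,
        output_dir="outputs",
    ),
)

trainer.train()
```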
By combining the inherent efficiency of Gemma 3 4B with Unsloth’s optimizations, 4-bit quantization, LoRA, and careful trainer configuration (like gradient accumulation and 8-bit optimizers), I was able to successfully fine-tune the model on accessible hardware. This fine-tuning step is what truly “bakes” the Sparky persona into the model, allowing it to inherently adopt Sparky’s witty, sarcastic, and slightly superior tone without needing the long persona prompt every time.
Fine-tuning a model is great, but the real fun begins when you can interact with it! To create an easy-to-use interface for chatting with Sparky, I turned to Streamlit, a Python library for building interactive web applications with minimal effort.
I created a simple Streamlit script (app.py) that:
1. Loads the base Gemma 3 4B model.
2. Merges the fine-tuned LoRA adapter weights we trained earlier (containing Sparky’s personality).
3. Provides a basic chat interface where users can type questions and see Sparky’s witty replies in real-time.
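A trimmed-down sketch of app.py is below; the adapter path, generation settings, and the way the adapters are attached are my assumptions, so the repo version may differ in the details:

```python
import streamlit as st
import torch
from unsloth import FastModel
from peft import PeftModel

ADAPTER_PATH = "outputs"  # assumed path to the saved LoRA adapters

@st.cache_resource
def load_sparky():
    """Load the base model once and attach the LoRA adapters that hold Sparky's persona."""
    model, tokenizer = FastModel.from_pretrained("unsloth/gemma-3-4b-it", load_in_4bit=True)
    model = PeftModel.from_pretrained(model, ADAPTER_PATH)
    model.eval()
    return model, tokenizer

st.title("Chat with Sparky 🤖")
model, tokenizer = load_sparky()

if question := st.chat_input("Ask Sparky something..."):
    st.chat_message("user").write(question)
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": question}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    with torch.no_grad():
        output = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.8)
    reply = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
    st.chat_message("assistant").write(reply)
```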
Since the fine-tuning and initial testing were conveniently done within Google Colab, deploying a web app directly accessible to the public presented a small challenge: Colab environments aren’t public web servers. To quickly create a shareable demo link directly from Colab, I used a nifty tool called localtunnel. This approach is perfect for quick tests, temporary demos, or sharing progress without setting up dedicated hosting.
Here are some sample queries answered by Sparky:
This project was a fantastic learning experience in creating a custom AI personality with accessible tools, but like any experiment, there’s always room for enhancement. If I were to iterate further on Sparky, here are some areas I’d explore:
Enhancing Data Generation:
Improving Dataset Quality & Size:
Advanced Fine-tuning & Safety:
Implementing even a few of these improvements could take Sparky’s sarcastic prowess to the next level!
(The complete code used for training is available in the project’s GitHub repository: GitHub Repo)
Thank you so much for reading along on my journey building Sparky! This is my first technical blog post, so I genuinely appreciate any feedback you might have on this article. Have ideas for other AI projects or tech topics you’d find interesting for future posts? I’d love to hear them! Feel free to connect with me over on LinkedIn.