Testing Prompt Changes
A comprehensive guide to testing system prompt modifications before and after deployment to ensure quality and prevent issues.
Why Testing Matters
The Risk of Untested Changes
What can go wrong:
- ❌ Responses become inaccurate or misleading
- ❌ Tone shifts inappropriately (too casual, too formal, rude)
- ❌ AI refuses to answer valid questions
- ❌ Responses get longer/shorter than intended
- ❌ Source citations disappear or become incorrect
- ❌ Edge cases are handled poorly
Real example:
Change: Added rule "Be concise, keep responses under 50 words"
Result: Responses became TOO short, missing important details
User feedback: "The chatbot used to be helpful, now it's useless"
Resolution: Reverted change, tested new version with 100-word limit
Benefits of Thorough Testing
- ✅ Catch issues before users do
- ✅ Validate changes work as intended
- ✅ Build confidence in your edits
- ✅ Reduce negative feedback
- ✅ Minimize need for emergency rollbacks
- ✅ Learn what works and what doesn't
Testing Phases
Overview of Testing Workflow
1. Pre-Edit Planning (15 min)
↓
2. Internal Testing (30-60 min)
↓
3. Save Changes
↓
4. Post-Save Testing (30 min)
↓
5. Monitoring Period (24-48 hours)
↓
6. Evaluation (15 min)
Phase 1: Pre-Edit Planning
Before You Touch the Prompt
Step 1: Define Success Criteria
Ask yourself:
- What specific behavior should change?
- What should stay the same?
- How will I know if it worked?
Example:
Change: Add rule to cite sources more prominently
Success criteria:
✅ Every factual response includes a source citation
✅ Citations include document name and page number
✅ Response quality doesn't decrease
✅ Response length stays similar (100-200 words)
❌ NOT a success: Citations are added but responses become robotic
Step 2: Create Test Question Set
Prepare 10-20 test questions that will be affected by your change:
Question types to include:
- Directly affected (5-7 questions)
  - Questions that should clearly show the change
  - Example: If adding citation rules, test factual questions
- Indirectly affected (3-5 questions)
  - Questions that might be influenced by the change
  - Example: If changing tone, test both simple and complex questions
- Edge cases (2-3 questions)
  - Unusual questions that test boundaries
  - Example: Vague questions, off-topic questions, multi-part questions
- Control questions (2-3 questions)
  - Questions that should NOT be affected by the change
  - Example: If changing admissions rules, test career questions (should be unchanged)
Test Question Template:
## Test Set: Adding Source Citations
### Directly Affected
1. "What are the HFIM admission requirements?"
2. "How many credit hours do I need to graduate?"
3. "Who is the program director?"
4. "What courses are required for the HFIM major?"
5. "When is the application deadline?"
### Indirectly Affected
6. "Tell me about the HFIM program"
7. "What internships are available?"
8. "How does HFIM compare to other hospitality programs?"
### Edge Cases
9. "What's the best career path?" (subjective, no sources)
10. "I'm thinking about HFIM but not sure" (vague)
### Control (Should NOT Change)
11. "What is UGA?" (general question, unrelated to change)
12. "How's the weather?" (off-topic, should be redirected)
Step 3: Document Current Behavior (Baseline)
Before editing, test all questions and record responses:
Method:
- Open chatbot in user-facing interface
- Ask each test question
- Save responses (copy-paste to doc or spreadsheet)
- Note: response length, tone, accuracy, format
Example Baseline Log:
| Question | Response Length | Includes Citation? | Tone | Quality |
|---|---|---|---|---|
| "What are HFIM requirements?" | 145 words | No | Professional | Good |
| "Who is the program director?" | 32 words | No | Professional | Good |
Why this matters: You need a before/after comparison to know if your change worked.
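If you prefer to automate this step, a minimal sketch is shown below. It assumes a hypothetical askChatbot() helper that sends a question to your chatbot's API and returns the reply as a string; adapt it to whichever client your system actually provides.
// Baseline capture sketch. askChatbot() is a hypothetical helper - replace it
// with a real call to your chatbot's API.
const fs = require("fs");
const testQuestions = [
  "What are the HFIM admission requirements?",
  "Who is the program director?",
  // ...the rest of your 10-20 question test set
];
async function captureBaseline() {
  const rows = [["Question", "Word count", "Includes citation?", "Response"]];
  for (const question of testQuestions) {
    const response = await askChatbot(question); // hypothetical API call
    const wordCount = response.trim().split(/\s+/).length;
    const hasCitation = response.includes("Source:") ? "Yes" : "No";
    rows.push([question, String(wordCount), hasCitation, response]);
  }
  // Write a CSV you can paste into your baseline spreadsheet
  const csv = rows
    .map(row => row.map(cell => `"${cell.replace(/"/g, '""')}"`).join(","))
    .join("\n");
  fs.writeFileSync(`baseline-${new Date().toISOString().slice(0, 10)}.csv`, csv);
}
captureBaseline();
Running it once before the edit and once after gives you two files you can compare side by side.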
Phase 2: Internal Testing (Pre-Save)
Using Preview Mode (If Available)
If your system has a Preview feature:
Steps:
- Make your changes in the prompt editor
- Click "Preview" (don't save yet)
- A test interface opens with the NEW prompt
- Ask your test questions in the preview
- Compare responses with baseline
- If good → Save. If not → Edit more and preview again.
Benefits:
- Test without affecting live chatbot
- Iterate quickly
- No risk to users
Preview mode is the SAFEST way to test. Always use it if available!
Manual Testing (No Preview Mode)
If no preview mode, you'll need to:
Option A: Test After Hours
- Edit and save during low-traffic times (late night, weekends)
- Test immediately after saving
- Rollback quickly if issues found
Option B: Staging Environment
- Use a separate test/staging chatbot (if available)
- Apply changes there first
- Test thoroughly before applying to production
Option C: Careful Live Testing
- Save changes
- Test immediately (5-10 minutes)
- Monitor closely for first hour
- Be ready to rollback
Without preview mode, you're testing on the live chatbot. Test thoroughly but quickly, and be ready to revert if needed.
What to Test
Test 1: Response Accuracy
Check:
- ✅ Factual information is correct
- ✅ No hallucinations (made-up info)
- ✅ Sources are cited accurately
- ✅ Answers match the question asked
Example:
Question: "What is the HFIM program?"
Expected: Accurate description of HFIM
Red flag: Generic hospitality info not specific to UGA
Test 2: Response Tone
Check:
- ✅ Tone matches intended style (professional, friendly, etc.)
- ✅ Consistent across all responses
- ✅ Appropriate for audience (students, prospective students)
- ✅ Not too casual or too formal
Example:
Question: "How do I apply?"
Good tone: "To apply to the HFIM program, follow these steps..."
Too casual: "Hey! So you wanna join HFIM? Cool! Here's how..."
Too formal: "The application process necessitates completion of the following prerequisites..."
Test 3: Response Length
Check:
- ✅ Responses are appropriately detailed
- ✅ Not too short (missing info) or too long (overwhelming)
- ✅ Consistent across similar questions
Guidelines:
- Simple questions: 30-75 words
- Moderate questions: 75-150 words
- Complex questions: 150-300 words
Example:
Question: "What is HFIM?"
Too short (6 words): "HFIM is a program at UGA."
Good (120 words): "HFIM stands for Hospitality and Food Industry Management. It's a degree program at the University of Georgia that prepares students for careers in hotels, restaurants, event planning, and tourism. The program combines business fundamentals with hands-on hospitality training..."
Too long (350 words): [excessive detail about history, every course, all faculty...]
Test 4: Format and Structure
Check:
- ✅ Bullet points used for lists
- ✅ Bold used for emphasis
- ✅ Paragraphs broken up appropriately
- ✅ Sources cited in correct format
Example:
Good formatting:
"The HFIM program requires:
โข Minimum 3.0 GPA
โข SAT 1200+ or ACT 24+
โข Two letters of recommendation
Application deadline: January 15
Source: HFIM Handbook 2026, page 12"
Poor formatting:
"The HFIM program requires minimum 3.0 GPA SAT 1200+ or ACT 24+ two letters of recommendation application deadline January 15 source HFIM Handbook 2026 page 12"
Test 5: Edge Case Handling
Check how AI handles:
- ✅ Vague questions ("Tell me about stuff")
- ✅ Off-topic questions ("What's the weather?")
- ✅ Multi-part questions ("What is HFIM and how do I apply and when is the deadline?")
- ✅ Impossible questions ("Can you enroll me right now?")
Expected behaviors:
- ✅ Asks for clarification on vague questions
- ✅ Politely redirects off-topic questions
- ✅ Breaks down multi-part questions
- ✅ Explains limitations (can't perform actions)
Recording Test Results
Use a spreadsheet or document to log findings:
Template:
| Question | Before Change | After Change | Pass/Fail | Notes |
|---|---|---|---|---|
| "What are HFIM requirements?" | No citation | Citation: "HFIM Handbook, p.12" | โ PASS | Exactly as intended |
| "Who is program director?" | No citation | Generic response, no name | โ FAIL | Lost specific info |
Pass criteria:
- ✅ Response meets success criteria
- ✅ Quality is same or better than baseline
- ✅ No unexpected side effects
Fail criteria:
- ❌ Response doesn't meet success criteria
- ❌ Quality decreased (less accurate, less helpful)
- ❌ Unexpected side effects (formatting broken, tone changed)
Decision Point: Save or Edit More?
After internal testing:
- If 90%+ tests pass → Save the changes
- If 70-89% tests pass → Edit and retest
- If less than 70% tests pass → Major revision needed, go back to planning
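If you score your test results in a script rather than a spreadsheet, these thresholds can be encoded directly. A minimal sketch:
// Encode the decision rule above: 90%+ save, 70-89% edit and retest, below 70% replan.
function decideNextStep(passed, total) {
  const passRate = (passed / total) * 100;
  if (passRate >= 90) return "Save the changes";
  if (passRate >= 70) return "Edit and retest";
  return "Major revision needed - go back to planning";
}
console.log(decideNextStep(18, 20)); // "Save the changes" (90% pass rate)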
If one or two tests fail:
- Analyze why they failed
- Determine if it's acceptable (minor edge case) or critical (major issue)
- Edit if critical, proceed if acceptable
Phase 3: Post-Save Testing
Immediate Testing (First 15 Minutes)
After clicking "Save Changes":
- Wait 30 seconds for changes to propagate (hot reload)
- Open chatbot in user interface (new tab/window)
- Re-run ALL test questions from your test set
- Compare with internal testing results
  - Responses should match preview/test environment
  - If significantly different, investigate why
- Check 5-10 additional random questions
  - Test questions you DIDN'T prepare
  - Catch unexpected side effects
If issues found: Rollback immediately (see Change History)
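If you saved your pre-save responses (for example, in the baseline CSV described earlier), a small comparison script can flag answers that drifted after the save. This is only a sketch; askChatbot() and previousResults are placeholders for your own API client and saved results.
// Flag responses that changed noticeably after saving. previousResults maps
// each question to its pre-save response text; askChatbot() is hypothetical.
async function compareWithPreSave(testQuestions, previousResults) {
  for (const question of testQuestions) {
    const liveResponse = await askChatbot(question);
    const previous = previousResults[question] || "";
    const lengthChange =
      Math.abs(liveResponse.length - previous.length) / Math.max(previous.length, 1);
    const lostCitation = previous.includes("Source:") && !liveResponse.includes("Source:");
    if (lengthChange > 0.5 || lostCitation) {
      console.log(`INVESTIGATE: "${question}" changed significantly after save`);
    }
  }
}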
Extended Testing (First Hour)
During the first hour after saving:
Monitor:
- Conversations page - Check new conversations as they come in
- Response quality - Spot-check 10-20 real user questions
- Feedback - Watch for negative feedback spikes
- Response time - Ensure performance hasn't degraded
Red flags (rollback if seen):
- 🚨 Multiple negative feedback in first hour
- 🚨 Clearly wrong or inappropriate responses
- 🚨 Chatbot refusing to answer valid questions
- 🚨 Dramatic tone shift (too casual, too formal, rude)
Green flags (change is working):
- ✅ Responses match expectations
- ✅ No negative feedback spike
- ✅ Users getting helpful answers
- ✅ Feedback is neutral or positive
Phase 4: Monitoring Period (24-48 Hours)
Daily Monitoring Checklist
For 2 days after a prompt change:
Morning Check (5-10 minutes)
Steps:
- Go to Conversations
- Filter by date: Yesterday + Today
- Review 20-30 recent conversations
- Count negative feedback instances
- Spot-check quality of responses
Questions to ask:
- Are responses consistent with intended change?
- Any unexpected issues?
- Is negative feedback normal or elevated?
Evening Check (5-10 minutes)
Steps:
- Go to Dashboard (if available)
- Check metrics:
- Cache hit rate
- Average response time
- Negative feedback count
- Compare with pre-change baseline
Metrics to watch:
- Negative feedback: Should stay below 5% (if it jumps to 15%+ → investigate)
- Response time: Should stay consistent (if it doubles → investigate)
- Cache hit rate: Shouldn't drop significantly
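If you can export conversation records (many dashboards offer a CSV or API export), the negative feedback check can be scripted. A sketch, assuming each record has a feedback field of "positive", "negative", or null:
// Compute the negative feedback rate and compare it with the thresholds above.
function negativeFeedbackRate(conversations) {
  const rated = conversations.filter(c => c.feedback === "positive" || c.feedback === "negative");
  if (rated.length === 0) return 0;
  const negative = rated.filter(c => c.feedback === "negative").length;
  return (negative / rated.length) * 100;
}
// Example data - replace with your exported conversations
const todaysConversations = [
  { feedback: "positive" },
  { feedback: null },
  { feedback: "positive" },
];
const rate = negativeFeedbackRate(todaysConversations);
if (rate >= 15) console.log("Rollback and investigate");
else if (rate >= 8) console.log("Elevated - keep monitoring");
else console.log("Within normal range");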
Collecting User Feedback
Actively seek feedback:
Method 1: Review Conversations
- Filter by Negative Feedback
- Read user comments
- Identify patterns
Method 2: Ask Colleagues
- Email team: "I changed the prompt yesterday to [X]. Please let me know if you notice any issues."
- Request 5-10 test questions from colleagues
- Compare their impressions with yours
Method 3: User Surveys (if available)
- "How satisfied are you with the chatbot responses?"
- "Is the chatbot helpful?"
- Compare satisfaction before/after change
Red Flags During Monitoring
Rollback immediately if you see:
- 🚨 Negative feedback > 15% (up from baseline 5%)
- 🚨 Multiple complaints about same issue
- 🚨 Inaccurate responses (wrong facts, hallucinations)
- 🚨 Inappropriate tone (rude, unprofessional, too casual)
- 🚨 Chatbot refuses to answer valid questions
Investigate (don't rollback yet) if you see:
- ⚠️ Negative feedback 8-12% (slight increase, monitor more)
- ⚠️ One or two complaints (may be isolated)
- ⚠️ Minor formatting issues (fixable with small edit)
Phase 5: Evaluation
After 48 Hours: Decide on Next Steps
Compare post-change metrics with baseline:
| Metric | Before Change | After Change | Status |
|---|---|---|---|
| Negative feedback rate | 5% | 4% | ✅ Improved |
| Avg response length | 120 words | 115 words | ✅ Good (as intended) |
| Cache hit rate | 72% | 71% | ✅ Stable |
| User complaints | 2/week | 1/week | ✅ Improved |
Decision Matrix
If metrics improved or stayed stable: ✅ Success! Keep the change. Document the success in your change log.
If metrics are slightly worse (negative feedback 5% → 8%): ⚠️ Monitor longer (1 more week). May need a minor adjustment.
If metrics are significantly worse (negative feedback 5% → 15%): ❌ Rollback. Revise your approach, test more thoroughly, and try again.
Documenting Results
Add to your change log:
Example Entry:
Date: 1/15/2026
Change: Added rule to cite page numbers in sources
Test results: 18/20 test questions passed (90%)
Post-deployment metrics:
- Negative feedback: 5% → 3% (improved!)
- Avg response length: 120 → 125 words (acceptable)
- User feedback: "I love that sources now include page numbers!"
Outcome: SUCCESS - keeping change
Lessons learned: Thorough testing prevented issues. Users appreciate specific source citations.
Common Testing Mistakes
Mistake 1: No Baseline
Problem: You test after changing the prompt but have no "before" comparison.
Consequence: You can't tell if responses improved, degraded, or stayed the same.
Solution: Always document current behavior before editing.
Mistake 2: Too Few Test Questions
Problem: You test with only 2-3 questions.
Consequence: Miss edge cases, side effects, or inconsistencies.
Solution: Prepare 10-20 test questions covering different scenarios.
Mistake 3: Testing Only Happy Path
Problem: You only test straightforward questions.
Consequence: Edge cases break (vague questions, off-topic, multi-part).
Solution: Include edge cases in your test set.
Mistake 4: Not Monitoring After Save
Problem: You save, test once, then walk away.
Consequence: Issues appear later (after different question types are asked by real users).
Solution: Monitor conversations for 24-48 hours after any change.
Mistake 5: Testing in Your Own Words
Problem: You only test questions phrased the way YOU would ask them.
Consequence: You miss how real users (with different backgrounds, vocabulary) will ask.
Solution: Use actual user questions from Conversations as test questions. Include casual, misspelled, and fragmented phrasings.
Testing Tools and Techniques
Tool 1: Test Question Library
Create a permanent library of test questions:
Categories:
- General questions (20 questions)
- Admissions (15 questions)
- Courses (15 questions)
- Faculty (10 questions)
- Careers (10 questions)
- Edge cases (10 questions)
Use:
- Every time you edit prompts, run the full library
- Update library as new question types emerge
- Share with team for consistent testing
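One simple way to keep the library consistent across the team is to store it in version control as a plain data structure. A sketch of what that could look like:
// Shared test question library, grouped by the categories above.
// Keep this file in version control so everyone runs the same set.
const testQuestionLibrary = {
  general: [
    "What is the HFIM program?",
    "Tell me about HFIM",
    // ...roughly 20 questions
  ],
  admissions: [
    "What are the HFIM admission requirements?",
    "When is the application deadline?",
    // ...roughly 15 questions
  ],
  edgeCases: [
    "What's the best career path?",
    "How's the weather?",
    // ...roughly 10 questions
  ],
};
// Flatten into a single list for a full test run
const allQuestions = Object.values(testQuestionLibrary).flat();
module.exports = { testQuestionLibrary, allQuestions };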
Tool 2: A/B Testing (Advanced)
If your system supports it:
Workflow:
- Deploy change to 10% of users
- Keep original prompt for 90% of users
- Compare metrics between groups
- If A/B test is positive, roll out to 100%
Benefits:
- Lower risk (only 10% affected if change is bad)
- Data-driven decision (compare real metrics)
- Can test multiple versions simultaneously
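How the traffic split is implemented depends entirely on your platform. If you can choose which prompt a session receives, one common approach is to bucket users deterministically by hashing a session ID so the same user always sees the same variant. A sketch, assuming a Node.js environment:
// Deterministically send ~10% of sessions to the new prompt by hashing the
// session ID. The same session always lands in the same bucket.
const crypto = require("crypto");
function promptVariant(sessionId, rolloutPercent = 10) {
  const hash = crypto.createHash("sha256").update(sessionId).digest();
  const bucket = hash.readUInt32BE(0) % 100; // stable value from 0-99
  return bucket < rolloutPercent ? "new-prompt" : "original-prompt";
}
console.log(promptVariant("session-abc-123")); // e.g. "original-prompt"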
Tool 3: Automated Testing Scripts (Advanced)
For highly technical teams:
Create a script that:
- Sends test questions to chatbot API
- Parses responses
- Checks against expected patterns
- Generates pass/fail report
Benefits:
- Test 100+ questions in minutes
- Consistent, repeatable tests
- Catch regressions automatically
Example (pseudocode):
// Hypothetical sketch: chatbot.ask() stands in for your system's API client.
const testQuestions = [
  "What are the HFIM admission requirements?",
  // ...the rest of your test set
];
let pass = 0;
let fail = 0;
testQuestions.forEach(question => {
  const response = chatbot.ask(question); // replace with a real API call
  // Expected pattern for this change: every response cites a source
  if (response.includes("Source:")) {
    pass++;
  } else {
    fail++;
    console.log("FAIL: No source cited for: " + question);
  }
});
console.log(`Passed: ${pass}/${testQuestions.length}`);
Testing Checklist
Before editing:
- Defined success criteria
- Created test question set (10-20 questions)
- Documented current behavior (baseline)
During testing (pre-save):
- Used preview mode (if available)
- Tested all questions in test set
- Checked accuracy, tone, length, format, edge cases
- Recorded results in test log
- 90%+ tests passed
After saving:
- Tested immediately (first 15 minutes)
- Monitored first hour for issues
- Checked conversations daily for 48 hours
- Collected user feedback
- Evaluated metrics after 48 hours
- Documented results in change log
Next Steps
Now that you understand testing:
- Understand Hot Reload - How changes take effect immediately
- Review Best Practices - Advanced prompt management strategies
- Learn about Change History - Track and revert changes
- Review Editing Prompts - Safe editing techniques
Remember: Testing is not optional. It's the difference between a successful change and an emergency rollback. Invest time in testing to save time fixing issues!