
Testing Prompt Changes

A comprehensive guide to testing system prompt modifications before and after deployment to ensure quality and prevent issues.

Why Testing Matters

The Risk of Untested Changes

What can go wrong:

  • ❌ Responses become inaccurate or misleading
  • ❌ Tone shifts inappropriately (too casual, too formal, rude)
  • ❌ AI refuses to answer valid questions
  • ❌ Responses get longer/shorter than intended
  • ❌ Source citations disappear or become incorrect
  • ❌ Edge cases are handled poorly

Real example:

Change: Added rule "Be concise, keep responses under 50 words"
Result: Responses became TOO short, missing important details
User feedback: "The chatbot used to be helpful, now it's useless"
Resolution: Reverted change, tested new version with 100-word limit

Benefits of Thorough Testing

  • ✅ Catch issues before users do
  • ✅ Validate changes work as intended
  • ✅ Build confidence in your edits
  • ✅ Reduce negative feedback
  • ✅ Minimize need for emergency rollbacks
  • ✅ Learn what works and what doesn't


Testing Phases

Overview of Testing Workflow

1. Pre-Edit Planning (15 min)
↓
2. Internal Testing (30-60 min)
↓
3. Save Changes
↓
4. Post-Save Testing (30 min)
↓
5. Monitoring Period (24-48 hours)
↓
6. Evaluation (15 min)

Phase 1: Pre-Edit Planning

Before You Touch the Prompt

Step 1: Define Success Criteria

Ask yourself:

  • What specific behavior should change?
  • What should stay the same?
  • How will I know if it worked?

Example:

Change: Add rule to cite sources more prominently
Success criteria:
✓ Every factual response includes a source citation
✓ Citations include document name and page number
✓ Response quality doesn't decrease
✓ Response length stays similar (100-200 words)
✗ NOT a success: Citations are added but responses become robotic

Step 2: Create Test Question Set

Prepare 10-20 test questions that will be affected by your change:

Question types to include:

  1. Directly affected (5-7 questions)

    • Questions that should clearly show the change
    • Example: If adding citation rules, test factual questions
  2. Indirectly affected (3-5 questions)

    • Questions that might be influenced by the change
    • Example: If changing tone, test both simple and complex questions
  3. Edge cases (2-3 questions)

    • Unusual questions that test boundaries
    • Example: Vague questions, off-topic questions, multi-part questions
  4. Control questions (2-3 questions)

    • Questions that should NOT be affected by change
    • Example: If changing admissions rules, test career questions (should be unchanged)

Test Question Template:

## Test Set: Adding Source Citations

### Directly Affected
1. "What are the HFIM admission requirements?"
2. "How many credit hours do I need to graduate?"
3. "Who is the program director?"
4. "What courses are required for the HFIM major?"
5. "When is the application deadline?"

### Indirectly Affected
6. "Tell me about the HFIM program"
7. "What internships are available?"
8. "How does HFIM compare to other hospitality programs?"

### Edge Cases
9. "What's the best career path?" (subjective, no sources)
10. "I'm thinking about HFIM but not sure" (vague)

### Control (Should NOT Change)
11. "What is UGA?" (general question, unrelated to change)
12. "How's the weather?" (off-topic, should be redirected)

Step 3: Document Current Behavior (Baseline)

Before editing, test all questions and record responses:

Method:

  1. Open chatbot in user-facing interface
  2. Ask each test question
  3. Save responses (copy-paste to doc or spreadsheet)
  4. Note: response length, tone, accuracy, format

Example Baseline Log:

Question | Response Length | Includes Citation? | Tone | Quality
"What are HFIM requirements?" | 145 words | No | Professional | Good
"Who is the program director?" | 32 words | No | Professional | Good

Why this matters: You need a before/after comparison to know if your change worked.
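If you prefer to script the baseline capture, each row of the log above can be generated with a small helper. This is a minimal sketch: the function and field names are hypothetical, and the citation check assumes responses label sources with "Source:".

```javascript
// Minimal baseline-logging sketch (function and field names are hypothetical).
// Each entry mirrors one row of the baseline log table.
function makeBaselineEntry(question, response) {
  return {
    question,
    lengthWords: response.trim().split(/\s+/).length, // rough word count
    hasCitation: /Source:/i.test(response),           // assumes a "Source:" label
    response,                                         // keep full text for before/after diffing
  };
}

// Render an entry as one pipe-separated log row.
function toLogRow(entry) {
  return [
    `"${entry.question}"`,
    `${entry.lengthWords} words`,
    entry.hasCitation ? "Yes" : "No",
  ].join(" | ");
}
```

Paste the rendered rows into your spreadsheet or doc; the full response text stays on the entry for later comparison.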


Phase 2: Internal Testing (Pre-Save)

Using Preview Mode (If Available)

If your system has a Preview feature:

Steps:

  1. Make your changes in the prompt editor
  2. Click "Preview" (don't save yet)
  3. A test interface opens with the NEW prompt
  4. Ask your test questions in the preview
  5. Compare responses with baseline
  6. If good → Save. If not → Edit more and preview again.

Benefits:

  • Test without affecting live chatbot
  • Iterate quickly
  • No risk to users
tip

Preview mode is the SAFEST way to test. Always use it if available!


Manual Testing (No Preview Mode)

If no preview mode, you'll need to:

Option A: Test After Hours

  • Edit and save during low-traffic times (late night, weekends)
  • Test immediately after saving
  • Rollback quickly if issues found

Option B: Staging Environment

  • Use a separate test/staging chatbot (if available)
  • Apply changes there first
  • Test thoroughly before applying to production

Option C: Careful Live Testing

  • Save changes
  • Test immediately (5-10 minutes)
  • Monitor closely for first hour
  • Be ready to rollback
warning

Without preview mode, you're testing on the live chatbot. Test thoroughly but quickly, and be ready to revert if needed.


What to Test

Test 1: Response Accuracy

Check:

  • ✅ Factual information is correct
  • ✅ No hallucinations (made-up info)
  • ✅ Sources are cited accurately
  • ✅ Answers match the question asked

Example:

Question: "What is the HFIM program?"
Expected: Accurate description of HFIM
Red flag: Generic hospitality info not specific to UGA

Test 2: Response Tone

Check:

  • ✅ Tone matches intended style (professional, friendly, etc.)
  • ✅ Consistent across all responses
  • ✅ Appropriate for audience (students, prospective students)
  • ✅ Not too casual or too formal

Example:

Question: "How do I apply?"
Good tone: "To apply to the HFIM program, follow these steps..."
Too casual: "Hey! So you wanna join HFIM? Cool! Here's how..."
Too formal: "The application process necessitates completion of the following prerequisites..."

Test 3: Response Length

Check:

  • ✅ Responses are appropriately detailed
  • ✅ Not too short (missing info) or too long (overwhelming)
  • ✅ Consistent across similar questions

Guidelines:

  • Simple questions: 30-75 words
  • Moderate questions: 75-150 words
  • Complex questions: 150-300 words

Example:

Question: "What is HFIM?"
Too short (6 words): "HFIM is a program at UGA."
Good (120 words): "HFIM stands for Hospitality and Food Industry Management. It's a degree program at the University of Georgia that prepares students for careers in hotels, restaurants, event planning, and tourism. The program combines business fundamentals with hands-on hospitality training..."
Too long (350 words): [excessive detail about history, every course, all faculty...]
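The length guidelines above can be checked mechanically. A sketch using the word ranges from the text (the tier names are illustrative):

```javascript
// Word-count check against the guidelines above (thresholds from the text;
// tier names are illustrative).
const LENGTH_GUIDELINES = {
  simple:   { min: 30,  max: 75 },
  moderate: { min: 75,  max: 150 },
  complex:  { min: 150, max: 300 },
};

function checkLength(response, tier) {
  const words = response.trim().split(/\s+/).length;
  const { min, max } = LENGTH_GUIDELINES[tier];
  if (words < min) return "too short";
  if (words > max) return "too long";
  return "ok";
}
```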

Test 4: Format and Structure

Check:

  • ✅ Bullet points used for lists
  • ✅ Bold used for emphasis
  • ✅ Paragraphs broken up appropriately
  • ✅ Sources cited in correct format

Example:

Good formatting:
"The HFIM program requires:
• Minimum 3.0 GPA
• SAT 1200+ or ACT 24+
• Two letters of recommendation

Application deadline: January 15

Source: HFIM Handbook 2026, page 12"

Poor formatting:
"The HFIM program requires minimum 3.0 GPA SAT 1200+ or ACT 24+ two letters of recommendation application deadline January 15 source HFIM Handbook 2026 page 12"
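Part of this formatting check can be automated. The sketch below assumes the "Source: <document>, page <n>" convention shown in the good example; adjust the pattern to whatever citation format your prompt specifies.

```javascript
// Checks that a response contains a citation in the
// "Source: <document name>, page <number>" format (pattern is an assumption).
const CITATION_RE = /Source:\s*.+,\s*page\s+\d+/i;

function hasWellFormedCitation(response) {
  return CITATION_RE.test(response);
}
```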

Test 5: Edge Case Handling

Check how AI handles:

  • ❓ Vague questions ("Tell me about stuff")
  • ❓ Off-topic questions ("What's the weather?")
  • ❓ Multi-part questions ("What is HFIM and how do I apply and when is the deadline?")
  • ❓ Impossible questions ("Can you enroll me right now?")

Expected behaviors:

  • ✅ Asks for clarification on vague questions
  • ✅ Politely redirects off-topic questions
  • ✅ Breaks down multi-part questions
  • ✅ Explains limitations (can't perform actions)

Recording Test Results

Use a spreadsheet or document to log findings:

Template:

Question | Before Change | After Change | Pass/Fail | Notes
"What are HFIM requirements?" | No citation | Citation: "HFIM Handbook, p.12" | ✅ PASS | Exactly as intended
"Who is program director?" | No citation | Generic response, no name | ❌ FAIL | Lost specific info

Pass criteria:

  • ✅ Response meets success criteria
  • ✅ Quality is same or better than baseline
  • ✅ No unexpected side effects

Fail criteria:

  • ❌ Response doesn't meet success criteria
  • ❌ Quality decreased (less accurate, less helpful)
  • ❌ Unexpected side effects (formatting broken, tone changed)

Decision Point: Save or Edit More?

After internal testing:

  • If 90%+ of tests pass → Save the changes
  • If 70-89% of tests pass → Edit and retest
  • If less than 70% of tests pass → Major revision needed; go back to planning

If one or two tests fail:

  • Analyze why they failed
  • Determine if it's acceptable (minor edge case) or critical (major issue)
  • Edit if critical, proceed if acceptable
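The pass-rate thresholds above map directly onto a small decision rule; a sketch with the percentages from the text (the function name is hypothetical):

```javascript
// Decision rule from the thresholds above: 90%+ save, 70-89% edit and
// retest, below 70% go back to planning.
function decideNextStep(passed, total) {
  const rate = passed / total;
  if (rate >= 0.9) return "save";
  if (rate >= 0.7) return "edit and retest";
  return "back to planning";
}
```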

Phase 3: Post-Save Testing

Immediate Testing (First 15 Minutes)

After clicking "Save Changes":

  1. Wait 30 seconds for changes to propagate (hot reload)

  2. Open chatbot in user interface (new tab/window)

  3. Re-run ALL test questions from your test set

  4. Compare with internal testing results

    • Responses should match preview/test environment
    • If significantly different, investigate why
  5. Check 5-10 additional random questions

    • Test questions you DIDN'T prepare
    • Catch unexpected side effects

If issues found: Rollback immediately (see Change History)


Extended Testing (First Hour)

During the first hour after saving:

Monitor:

  1. Conversations page - Check new conversations as they come in
  2. Response quality - Spot-check 10-20 real user questions
  3. Feedback - Watch for negative feedback spikes
  4. Response time - Ensure performance hasn't degraded

Red flags (rollback if seen):

  • 🚨 Multiple instances of negative feedback in the first hour
  • 🚨 Clearly wrong or inappropriate responses
  • 🚨 Chatbot refusing to answer valid questions
  • 🚨 Dramatic tone shift (too casual, too formal, rude)

Green flags (change is working):

  • ✅ Responses match expectations
  • ✅ No negative feedback spike
  • ✅ Users getting helpful answers
  • ✅ Feedback is neutral or positive

Phase 4: Monitoring Period (24-48 Hours)

Daily Monitoring Checklist

For 2 days after a prompt change:

Morning Check (5-10 minutes)

Steps:

  1. Go to Conversations
  2. Filter by date: Yesterday + Today
  3. Review 20-30 recent conversations
  4. Count negative feedback instances
  5. Spot-check quality of responses

Questions to ask:

  • Are responses consistent with intended change?
  • Any unexpected issues?
  • Is negative feedback normal or elevated?

Evening Check (5-10 minutes)

Steps:

  1. Go to Dashboard (if available)
  2. Check metrics:
    • Cache hit rate
    • Average response time
    • Negative feedback count
  3. Compare with pre-change baseline

Metrics to watch:

  • Negative feedback: Should stay below 5% (if it jumps to 15%+ → investigate)
  • Response time: Should stay consistent (if it doubles → investigate)
  • Cache hit rate: Shouldn't drop significantly
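These thresholds can be codified as a quick check to run after each evening review. A sketch, using the numbers above and hypothetical metric field names:

```javascript
// Flags metrics that cross the monitoring thresholds above
// (field names are hypothetical).
function flagMetrics(baseline, current) {
  const flags = [];
  if (current.negativeFeedbackPct >= 15) {
    flags.push("negative feedback spike");   // 15%+ → investigate
  }
  if (current.avgResponseMs >= 2 * baseline.avgResponseMs) {
    flags.push("response time doubled");     // doubling → investigate
  }
  if (current.cacheHitPct < baseline.cacheHitPct - 10) {
    flags.push("cache hit rate dropped");    // "significant" drop assumed as 10 points
  }
  return flags;
}
```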

Collecting User Feedback

Actively seek feedback:

Method 1: Review Conversations

  • Filter by Negative Feedback
  • Read user comments
  • Identify patterns

Method 2: Ask Colleagues

  • Email team: "I changed the prompt yesterday to [X]. Please let me know if you notice any issues."
  • Request 5-10 test questions from colleagues
  • Compare their impressions with yours

Method 3: User Surveys (if available)

  • "How satisfied are you with the chatbot responses?"
  • "Is the chatbot helpful?"
  • Compare satisfaction before/after change

Red Flags During Monitoring

Rollback immediately if you see:

  • 🚨 Negative feedback > 15% (up from baseline 5%)
  • 🚨 Multiple complaints about the same issue
  • 🚨 Inaccurate responses (wrong facts, hallucinations)
  • 🚨 Inappropriate tone (rude, unprofessional, too casual)
  • 🚨 Chatbot refuses to answer valid questions

Investigate (don't rollback yet) if you see:

  • ⚠️ Negative feedback 8-12% (slight increase; monitor more)
  • ⚠️ One or two complaints (may be isolated)
  • ⚠️ Minor formatting issues (fixable with a small edit)

Phase 5: Evaluation

After 48 Hours: Decide on Next Steps

Compare post-change metrics with baseline:

Metric | Before Change | After Change | Status
Negative feedback rate | 5% | 4% | ✅ Improved
Avg response length | 120 words | 115 words | ✅ Good (as intended)
Cache hit rate | 72% | 71% | ✅ Stable
User complaints | 2/week | 1/week | ✅ Improved

Decision Matrix

If metrics improved or stayed stable: ✅ Success! Keep the change and document it in your change log.

If metrics are slightly worse (negative feedback 5% → 8%): ⚠️ Monitor longer (1 more week). May need a minor adjustment.

If metrics are significantly worse (negative feedback 5% → 15%): ❌ Rollback. Revise your approach, test more thoroughly, and try again.


Documenting Results

Add to your change log:

Example Entry:

Date: 1/15/2026
Change: Added rule to cite page numbers in sources
Test results: 18/20 test questions passed (90%)
Post-deployment metrics:
- Negative feedback: 5% → 3% (improved!)
- Avg response length: 120 → 125 words (acceptable)
- User feedback: "I love that sources now include page numbers!"
Outcome: SUCCESS - keeping change
Lessons learned: Thorough testing prevented issues. Users appreciate specific source citations.

Common Testing Mistakes

Mistake 1: No Baseline

Problem: You test after changing the prompt but have no "before" comparison.

Consequence: You can't tell if responses improved, degraded, or stayed the same.

Solution: Always document current behavior before editing.


Mistake 2: Too Few Test Questions

Problem: You test with only 2-3 questions.

Consequence: Miss edge cases, side effects, or inconsistencies.

Solution: Prepare 10-20 test questions covering different scenarios.


Mistake 3: Testing Only the Happy Path

Problem: You only test straightforward questions.

Consequence: Edge cases break (vague questions, off-topic, multi-part).

Solution: Include edge cases in your test set.


Mistake 4: Not Monitoring After Save

Problem: You save, test once, then walk away.

Consequence: Issues appear later (after different question types are asked by real users).

Solution: Monitor conversations for 24-48 hours after any change.


Mistake 5: Testing in Your Own Words

Problem: You only test questions phrased the way YOU would ask them.

Consequence: You miss how real users (with different backgrounds, vocabulary) will ask.

Solution: Use actual user questions from Conversations as test questions. Include casual, misspelled, and fragmented phrasings.


Testing Tools and Techniques

Tool 1: Test Question Library

Create a permanent library of test questions:

Categories:

  • General questions (20 questions)
  • Admissions (15 questions)
  • Courses (15 questions)
  • Faculty (10 questions)
  • Careers (10 questions)
  • Edge cases (10 questions)

Use:

  • Every time you edit prompts, run the full library
  • Update library as new question types emerge
  • Share with team for consistent testing
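One way to keep the library reusable is a simple keyed structure that can be replayed in full on every edit. A sketch with placeholder questions (the category names mirror the list above):

```javascript
// Test question library keyed by category (questions are placeholders;
// fill each category out to the counts suggested above).
const testLibrary = {
  general:    ["What is UGA?"],
  admissions: ["What are the HFIM admission requirements?"],
  edgeCases:  ["How's the weather?"],
};

// Flatten the library into one list so the full set runs every time.
function allQuestions(library) {
  return Object.values(library).flat();
}
```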

Tool 2: A/B Testing (Advanced)

If your system supports it:

Workflow:

  1. Deploy change to 10% of users
  2. Keep original prompt for 90% of users
  3. Compare metrics between groups
  4. If A/B test is positive, roll out to 100%

Benefits:

  • Lower risk (only 10% affected if change is bad)
  • Data-driven decision (compare real metrics)
  • Can test multiple versions simultaneously
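Deterministic assignment matters for A/B tests: the same user should always see the same prompt version. A sketch of percentage bucketing by user ID (the hash is illustrative, not production-grade):

```javascript
// Deterministically assigns a user to "variant" (new prompt) or "control"
// (original prompt). The hash is a simple illustrative one.
function bucketUser(userId, rolloutPct = 10) {
  let hash = 0;
  for (const ch of String(userId)) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // keep as unsigned 32-bit
  }
  return hash % 100 < rolloutPct ? "variant" : "control";
}
```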

Tool 3: Automated Testing Scripts (Advanced)

For highly technical teams:

Create a script that:

  1. Sends test questions to chatbot API
  2. Parses responses
  3. Checks against expected patterns
  4. Generates pass/fail report

Benefits:

  • Test 100+ questions in minutes
  • Consistent, repeatable tests
  • Catch regressions automatically

Example (JavaScript sketch; assumes a `chatbot` API client and a `testQuestions` array):

let pass = 0;
let fail = 0;

testQuestions.forEach(question => {
  const response = chatbot.ask(question);
  if (response.includes("Source:")) {
    pass++;
  } else {
    fail++;
    console.log("FAIL: No source cited for: " + question);
  }
});

console.log(`Passed: ${pass}/${testQuestions.length}`);

Testing Checklist

Before editing:

  • Defined success criteria
  • Created test question set (10-20 questions)
  • Documented current behavior (baseline)

During testing (pre-save):

  • Used preview mode (if available)
  • Tested all questions in test set
  • Checked accuracy, tone, length, format, edge cases
  • Recorded results in test log
  • 90%+ tests passed

After saving:

  • Tested immediately (first 15 minutes)
  • Monitored first hour for issues
  • Checked conversations daily for 48 hours
  • Collected user feedback
  • Evaluated metrics after 48 hours
  • Documented results in change log

Next Steps

Now that you understand testing:

  1. Understand Hot Reload - How changes take effect immediately
  2. Review Best Practices - Advanced prompt management strategies
  3. Learn about Change History - Track and revert changes
  4. Review Editing Prompts - Safe editing techniques

Remember: Testing is not optional. It's the difference between a successful change and an emergency rollback. Invest time in testing to save time fixing issues!