
Testing Prompt Changes

A comprehensive guide to testing system prompt modifications before and after deployment to ensure quality and prevent issues.

Why Testing Matters

The Risk of Untested Changes

What can go wrong:

  • ❌ Responses become inaccurate or misleading
  • ❌ Tone shifts inappropriately (too casual, too formal, rude)
  • ❌ AI refuses to answer valid questions
  • ❌ Responses get longer/shorter than intended
  • ❌ Source citations disappear or become incorrect
  • ❌ Edge cases are handled poorly

Real example:

Change: Added rule "Be concise, keep responses under 50 words"
Result: Responses became TOO short, missing important details
User feedback: "The chatbot used to be helpful, now it's useless"
Resolution: Reverted change, tested new version with 100-word limit

Benefits of Thorough Testing

  • ✅ Catch issues before users do
  • ✅ Validate changes work as intended
  • ✅ Build confidence in your edits
  • ✅ Reduce negative feedback
  • ✅ Minimize need for emergency rollbacks
  • ✅ Learn what works and what doesn't


Testing Phases

Overview of Testing Workflow

1. Pre-Edit Planning (15 min)
↓
2. Internal Testing (30-60 min)
↓
3. Save Changes
↓
4. Post-Save Testing (30 min)
↓
5. Monitoring Period (24-48 hours)
↓
6. Evaluation (15 min)

Phase 1: Pre-Edit Planning

Before You Touch the Prompt

Step 1: Define Success Criteria

Ask yourself:

  • What specific behavior should change?
  • What should stay the same?
  • How will I know if it worked?

Example:

Change: Add rule to cite sources more prominently
Success criteria:
✓ Every factual response includes a source citation
✓ Citations include document name and page number
✓ Response quality doesn't decrease
✓ Response length stays similar (100-200 words)
✗ NOT a success: Citations are added but responses become robotic

Step 2: Create Test Question Set

Prepare 10-20 test questions that will be affected by your change:

Question types to include:

  1. Directly affected (5-7 questions)

    • Questions that should clearly show the change
    • Example: If adding citation rules, test factual questions
  2. Indirectly affected (3-5 questions)

    • Questions that might be influenced by the change
    • Example: If changing tone, test both simple and complex questions
  3. Edge cases (2-3 questions)

    • Unusual questions that test boundaries
    • Example: Vague questions, off-topic questions, multi-part questions
  4. Control questions (2-3 questions)

    • Questions that should NOT be affected by change
    • Example: If changing admissions rules, test career questions (should be unchanged)

Test Question Template:

## Test Set: Adding Source Citations

### Directly Affected
1. "What are the HFIM admission requirements?"
2. "How many credit hours do I need to graduate?"
3. "Who is the program director?"
4. "What courses are required for the HFIM major?"
5. "When is the application deadline?"

### Indirectly Affected
6. "Tell me about the HFIM program"
7. "What internships are available?"
8. "How does HFIM compare to other hospitality programs?"

### Edge Cases
9. "What's the best career path?" (subjective, no sources)
10. "I'm thinking about HFIM but not sure" (vague)

### Control (Should NOT Change)
11. "What is UGA?" (general question, unrelated to change)
12. "How's the weather?" (off-topic, should be redirected)

Step 3: Document Current Behavior (Baseline)

Before editing, test all questions and record responses:

Method:

  1. Open chatbot in user-facing interface
  2. Ask each test question
  3. Save responses (copy-paste to doc or spreadsheet)
  4. Note: response length, tone, accuracy, format

Example Baseline Log:

Question | Response Length | Includes Citation? | Tone | Quality
"What are HFIM requirements?" | 145 words | No | Professional | Good
"Who is the program director?" | 32 words | No | Professional | Good

Why this matters: You need a before/after comparison to know if your change worked.
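If you prefer to script the baseline capture, each row of the log above can be generated with a small helper. This is a minimal sketch: the function and field names are hypothetical, and the citation check assumes responses label sources with "Source:".

```javascript
// Minimal baseline-logging sketch (function and field names are hypothetical).
// Each entry mirrors one row of the baseline log table.
function makeBaselineEntry(question, response) {
  return {
    question,
    lengthWords: response.trim().split(/\s+/).length, // rough word count
    hasCitation: /Source:/i.test(response),           // assumes a "Source:" label
    response,                                         // keep full text for before/after diffing
  };
}

// Render an entry as one pipe-separated log row.
function toLogRow(entry) {
  return [
    `"${entry.question}"`,
    `${entry.lengthWords} words`,
    entry.hasCitation ? "Yes" : "No",
  ].join(" | ");
}
```

Paste the rendered rows into your spreadsheet or doc; the full response text stays on the entry for later comparison.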


Phase 2: Internal Testing (Pre-Save)

Using Preview Mode (If Available)

If your system has a Preview feature:

Steps:

  1. Make your changes in the prompt editor
  2. Click "Preview" (don't save yet)
  3. A test interface opens with the NEW prompt
  4. Ask your test questions in the preview
  5. Compare responses with baseline
  6. If good → Save. If not → Edit more and preview again.

Benefits:

  • Test without affecting live chatbot
  • Iterate quickly
  • No risk to users
tip

Preview mode is the SAFEST way to test. Always use it if available!


Manual Testing (No Preview Mode)

If no preview mode, you'll need to:

Option A: Test After Hours

  • Edit and save during low-traffic times (late night, weekends)
  • Test immediately after saving
  • Rollback quickly if issues found

Option B: Staging Environment

  • Use a separate test/staging chatbot (if available)
  • Apply changes there first
  • Test thoroughly before applying to production

Option C: Careful Live Testing

  • Save changes
  • Test immediately (5-10 minutes)
  • Monitor closely for first hour
  • Be ready to rollback
warning

Without preview mode, you're testing on the live chatbot. Test thoroughly but quickly, and be ready to revert if needed.


What to Test

Test 1: Response Accuracy

Check:

  • ✅ Factual information is correct
  • ✅ No hallucinations (made-up info)
  • ✅ Sources are cited accurately
  • ✅ Answers match the question asked

Example:

Question: "What is the HFIM program?"
Expected: Accurate description of HFIM
Red flag: Generic hospitality info not specific to UGA

Test 2: Response Tone

Check:

  • ✅ Tone matches intended style (professional, friendly, etc.)
  • ✅ Consistent across all responses
  • ✅ Appropriate for audience (students, prospective students)
  • ✅ Not too casual or too formal

Example:

Question: "How do I apply?"
Good tone: "To apply to the HFIM program, follow these steps..."
Too casual: "Hey! So you wanna join HFIM? Cool! Here's how..."
Too formal: "The application process necessitates completion of the following prerequisites..."

Test 3: Response Length

Check:

  • ✅ Responses are appropriately detailed
  • ✅ Not too short (missing info) or too long (overwhelming)
  • ✅ Consistent across similar questions

Guidelines:

  • Simple questions: 30-75 words
  • Moderate questions: 75-150 words
  • Complex questions: 150-300 words

Example:

Question: "What is HFIM?"
Too short (6 words): "HFIM is a program at UGA."
Good (120 words): "HFIM stands for Hospitality and Food Industry Management. It's a degree program at the University of Georgia that prepares students for careers in hotels, restaurants, event planning, and tourism. The program combines business fundamentals with hands-on hospitality training..."
Too long (350 words): [excessive detail about history, every course, all faculty...]
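The length guidelines above can be checked mechanically. A sketch using the word ranges from the text (the tier names are illustrative):

```javascript
// Word-count check against the guidelines above (thresholds from the text;
// tier names are illustrative).
const LENGTH_GUIDELINES = {
  simple:   { min: 30,  max: 75 },
  moderate: { min: 75,  max: 150 },
  complex:  { min: 150, max: 300 },
};

function checkLength(response, tier) {
  const words = response.trim().split(/\s+/).length;
  const { min, max } = LENGTH_GUIDELINES[tier];
  if (words < min) return "too short";
  if (words > max) return "too long";
  return "ok";
}
```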

Test 4: Format and Structure

Check:

  • ✅ Bullet points used for lists
  • ✅ Bold used for emphasis
  • ✅ Paragraphs broken up appropriately
  • ✅ Sources cited in correct format

Example:

Good formatting:
"The HFIM program requires:
• Minimum 3.0 GPA
• SAT 1200+ or ACT 24+
• Two letters of recommendation

Application deadline: January 15

Source: HFIM Handbook 2026, page 12"

Poor formatting:
"The HFIM program requires minimum 3.0 GPA SAT 1200+ or ACT 24+ two letters of recommendation application deadline January 15 source HFIM Handbook 2026 page 12"
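Part of this formatting check can be automated. The sketch below assumes the "Source: <document>, page <n>" convention shown in the good example; adjust the pattern to whatever citation format your prompt specifies.

```javascript
// Checks that a response contains a citation in the
// "Source: <document name>, page <number>" format (pattern is an assumption).
const CITATION_RE = /Source:\s*.+,\s*page\s+\d+/i;

function hasWellFormedCitation(response) {
  return CITATION_RE.test(response);
}
```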

Test 5: Edge Case Handling

Check how AI handles:

  • ❓ Vague questions ("Tell me about stuff")
  • ❓ Off-topic questions ("What's the weather?")
  • ❓ Multi-part questions ("What is HFIM and how do I apply and when is the deadline?")
  • ❓ Impossible questions ("Can you enroll me right now?")

Expected behaviors:

  • ✅ Asks for clarification on vague questions
  • ✅ Politely redirects off-topic questions
  • ✅ Breaks down multi-part questions
  • ✅ Explains limitations (can't perform actions)

Recording Test Results

Use a spreadsheet or document to log findings:

Template:

Question | Before Change | After Change | Pass/Fail | Notes
"What are HFIM requirements?" | No citation | Citation: "HFIM Handbook, p.12" | ✅ PASS | Exactly as intended
"Who is program director?" | No citation | Generic response, no name | ❌ FAIL | Lost specific info

Pass criteria:

  • ✅ Response meets success criteria
  • ✅ Quality is same or better than baseline
  • ✅ No unexpected side effects

Fail criteria:

  • ❌ Response doesn't meet success criteria
  • ❌ Quality decreased (less accurate, less helpful)
  • ❌ Unexpected side effects (formatting broken, tone changed)

Decision Point: Save or Edit More?

After internal testing:

  • If 90%+ of tests pass → Save the changes
  • If 70-89% of tests pass → Edit and retest
  • If less than 70% of tests pass → Major revision needed; go back to planning

If one or two tests fail:

  • Analyze why they failed
  • Determine if it's acceptable (minor edge case) or critical (major issue)
  • Edit if critical, proceed if acceptable
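The pass-rate thresholds above map directly onto a small decision rule; a sketch with the percentages from the text (the function name is hypothetical):

```javascript
// Decision rule from the thresholds above: 90%+ save, 70-89% edit and
// retest, below 70% go back to planning.
function decideNextStep(passed, total) {
  const rate = passed / total;
  if (rate >= 0.9) return "save";
  if (rate >= 0.7) return "edit and retest";
  return "back to planning";
}
```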

Phase 3: Post-Save Testing

Immediate Testing (First 15 Minutes)

After clicking "Save Changes":

  1. Wait 30 seconds for changes to propagate (hot reload)

  2. Open chatbot in user interface (new tab/window)

  3. Re-run ALL test questions from your test set

  4. Compare with internal testing results

    • Responses should match preview/test environment
    • If significantly different, investigate why
  5. Check 5-10 additional random questions

    • Test questions you DIDN'T prepare
    • Catch unexpected side effects

If issues found: Rollback immediately (see Change History)


Extended Testing (First Hour)

During the first hour after saving:

Monitor:

  1. Conversations page - Check new conversations as they come in
  2. Response quality - Spot-check 10-20 real user questions
  3. Feedback - Watch for negative feedback spikes
  4. Response time - Ensure performance hasn't degraded

Red flags (rollback if seen):

  • 🚨 Multiple instances of negative feedback in the first hour
  • 🚨 Clearly wrong or inappropriate responses
  • 🚨 Chatbot refusing to answer valid questions
  • 🚨 Dramatic tone shift (too casual, too formal, rude)

Green flags (change is working):

  • ✅ Responses match expectations
  • ✅ No negative feedback spike
  • ✅ Users getting helpful answers
  • ✅ Feedback is neutral or positive

Phase 4: Monitoring Period (24-48 Hours)

Daily Monitoring Checklist

For 2 days after a prompt change:

Morning Check (5-10 minutes)

Steps:

  1. Go to Conversations
  2. Filter by date: Yesterday + Today
  3. Review 20-30 recent conversations
  4. Count negative feedback instances
  5. Spot-check quality of responses

Questions to ask:

  • Are responses consistent with intended change?
  • Any unexpected issues?
  • Is negative feedback normal or elevated?

Evening Check (5-10 minutes)

Steps:

  1. Go to Dashboard (if available)
  2. Check metrics:
    • Cache hit rate
    • Average response time
    • Negative feedback count
  3. Compare with pre-change baseline

Metrics to watch:

  • Negative feedback: Should stay below 5% (if it jumps to 15%+ → investigate)
  • Response time: Should stay consistent (if it doubles → investigate)
  • Cache hit rate: Shouldn't drop significantly
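These thresholds can be codified as a quick check to run after each evening review. A sketch, using the numbers above and hypothetical metric field names:

```javascript
// Flags metrics that cross the monitoring thresholds above
// (field names are hypothetical).
function flagMetrics(baseline, current) {
  const flags = [];
  if (current.negativeFeedbackPct >= 15) {
    flags.push("negative feedback spike");   // 15%+ → investigate
  }
  if (current.avgResponseMs >= 2 * baseline.avgResponseMs) {
    flags.push("response time doubled");     // doubling → investigate
  }
  if (current.cacheHitPct < baseline.cacheHitPct - 10) {
    flags.push("cache hit rate dropped");    // "significant" drop assumed as 10 points
  }
  return flags;
}
```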

Collecting User Feedback

Actively seek feedback:

Method 1: Review Conversations

  • Filter by Negative Feedback
  • Read user comments
  • Identify patterns

Method 2: Ask Colleagues

  • Email team: "I changed the prompt yesterday to [X]. Please let me know if you notice any issues."
  • Request 5-10 test questions from colleagues
  • Compare their impressions with yours

Method 3: User Surveys (if available)

  • "How satisfied are you with the chatbot responses?"
  • "Is the chatbot helpful?"
  • Compare satisfaction before/after change

Red Flags During Monitoring

Rollback immediately if you see:

  • 🚨 Negative feedback > 15% (up from baseline 5%)
  • 🚨 Multiple complaints about the same issue
  • 🚨 Inaccurate responses (wrong facts, hallucinations)
  • 🚨 Inappropriate tone (rude, unprofessional, too casual)
  • 🚨 Chatbot refuses to answer valid questions

Investigate (don't rollback yet) if you see:

  • ⚠️ Negative feedback 8-12% (slight increase; monitor more)
  • ⚠️ One or two complaints (may be isolated)
  • ⚠️ Minor formatting issues (fixable with a small edit)

Phase 5: Evaluation

After 48 Hours: Decide on Next Steps

Compare post-change metrics with baseline:

Metric | Before Change | After Change | Status
Negative feedback rate | 5% | 4% | ✅ Improved
Avg response length | 120 words | 115 words | ✅ Good (as intended)
Cache hit rate | 72% | 71% | ✅ Stable
User complaints | 2/week | 1/week | ✅ Improved

Decision Matrix

If metrics improved or stayed stable: ✅ Success! Keep the change and document it in your change log.

If metrics are slightly worse (negative feedback 5% → 8%): ⚠️ Monitor longer (1 more week). May need a minor adjustment.

If metrics are significantly worse (negative feedback 5% → 15%): ❌ Rollback. Revise your approach, test more thoroughly, and try again.


Documenting Results

Add to your change log:

Example Entry:

Date: 1/15/2026
Change: Added rule to cite page numbers in sources
Test results: 18/20 test questions passed (90%)
Post-deployment metrics:
- Negative feedback: 5% → 3% (improved!)
- Avg response length: 120 → 125 words (acceptable)
- User feedback: "I love that sources now include page numbers!"
Outcome: SUCCESS - keeping change
Lessons learned: Thorough testing prevented issues. Users appreciate specific source citations.

Common Testing Mistakes

Mistake 1: No Baseline

Problem: You test after changing the prompt but have no "before" comparison.

Consequence: You can't tell if responses improved, degraded, or stayed the same.

Solution: Always document current behavior before editing.


Mistake 2: Too Few Test Questions

Problem: You test with only 2-3 questions.

Consequence: Miss edge cases, side effects, or inconsistencies.

Solution: Prepare 10-20 test questions covering different scenarios.


Mistake 3: Testing Only the Happy Path

Problem: You only test straightforward questions.

Consequence: Edge cases break (vague questions, off-topic, multi-part).

Solution: Include edge cases in your test set.


Mistake 4: Not Monitoring After Save

Problem: You save, test once, then walk away.

Consequence: Issues appear later (after different question types are asked by real users).

Solution: Monitor conversations for 24-48 hours after any change.


Mistake 5: Testing in Your Own Words

Problem: You only test questions phrased the way YOU would ask them.

Consequence: You miss how real users (with different backgrounds, vocabulary) will ask.

Solution: Use actual user questions from Conversations as test questions. Include casual, misspelled, and fragmented phrasings.


Testing Tools and Techniques

Tool 1: Test Question Library

Create a permanent library of test questions:

Categories:

  • General questions (20 questions)
  • Admissions (15 questions)
  • Courses (15 questions)
  • Faculty (10 questions)
  • Careers (10 questions)
  • Edge cases (10 questions)

Use:

  • Every time you edit prompts, run the full library
  • Update library as new question types emerge
  • Share with team for consistent testing
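One way to keep the library reusable is a simple keyed structure that can be replayed in full on every edit. A sketch with placeholder questions (the category names mirror the list above):

```javascript
// Test question library keyed by category (questions are placeholders;
// fill each category out to the counts suggested above).
const testLibrary = {
  general:    ["What is UGA?"],
  admissions: ["What are the HFIM admission requirements?"],
  edgeCases:  ["How's the weather?"],
};

// Flatten the library into one list so the full set runs every time.
function allQuestions(library) {
  return Object.values(library).flat();
}
```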

Tool 2: A/B Testing (Advanced)

If your system supports it:

Workflow:

  1. Deploy change to 10% of users
  2. Keep original prompt for 90% of users
  3. Compare metrics between groups
  4. If A/B test is positive, roll out to 100%

Benefits:

  • Lower risk (only 10% affected if change is bad)
  • Data-driven decision (compare real metrics)
  • Can test multiple versions simultaneously
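Deterministic assignment matters for A/B tests: the same user should always see the same prompt version. A sketch of percentage bucketing by user ID (the hash is illustrative, not production-grade):

```javascript
// Deterministically assigns a user to "variant" (new prompt) or "control"
// (original prompt). The hash is a simple illustrative one.
function bucketUser(userId, rolloutPct = 10) {
  let hash = 0;
  for (const ch of String(userId)) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // keep as unsigned 32-bit
  }
  return hash % 100 < rolloutPct ? "variant" : "control";
}
```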

Tool 3: Automated Testing Scripts (Advanced)

For highly technical teams:

Create a script that:

  1. Sends test questions to chatbot API
  2. Parses responses
  3. Checks against expected patterns
  4. Generates pass/fail report

Benefits:

  • Test 100+ questions in minutes
  • Consistent, repeatable tests
  • Catch regressions automatically

Example (JavaScript sketch; assumes a `chatbot` API client and a `testQuestions` array):

let pass = 0;
let fail = 0;

testQuestions.forEach(question => {
  const response = chatbot.ask(question);
  if (response.includes("Source:")) {
    pass++;
  } else {
    fail++;
    console.log("FAIL: No source cited for: " + question);
  }
});

console.log(`Passed: ${pass}/${testQuestions.length}`);

Testing Checklist

Before editing:

  • Defined success criteria
  • Created test question set (10-20 questions)
  • Documented current behavior (baseline)

During testing (pre-save):

  • Used preview mode (if available)
  • Tested all questions in test set
  • Checked accuracy, tone, length, format, edge cases
  • Recorded results in test log
  • 90%+ tests passed

After saving:

  • Tested immediately (first 15 minutes)
  • Monitored first hour for issues
  • Checked conversations daily for 48 hours
  • Collected user feedback
  • Evaluated metrics after 48 hours
  • Documented results in change log

Next Steps

Now that you understand testing:

  1. Understand Hot Reload - How changes take effect immediately
  2. Review Best Practices - Advanced prompt management strategies
  3. Learn about Change History - Track and revert changes
  4. Review Editing Prompts - Safe editing techniques

Remember: Testing is not optional. It's the difference between a successful change and an emergency rollback. Invest time in testing to save time fixing issues!