The Pull Request That Saved Our Project: A Post-Mortem Turned Success Story

It started with a failed deployment on a Friday afternoon. The charity's donor management system—used by fifty staff members and hundreds of volunteers—had been down for three hours. The lead developer was on leave. The only person available was a junior volunteer who had never touched the production environment. What happened next turned a crisis into a case study in how open-source collaboration can rescue a project, even when the team is stretched thin.

This guide walks through that pull request: what went wrong, how the fix evolved from a hotfix into a permanent improvement, and the lessons that any charity tech team can apply. We will cover the mechanics of a post-mortem that leads to real change, the common mistakes that keep teams stuck in firefighting mode, and the long-term practices that turn one good PR into a sustainable workflow.

1. The Field Context: Where This Scenario Plays Out

Nonprofit software projects often run on a shoestring. A typical charity tech stack might include a WordPress site, a custom donor database built by a former intern, and a handful of integrations with fundraising platforms. The team is a mix of paid staff and volunteers, many of whom contribute in their spare time. Code reviews are rare, documentation is sparse, and the person who wrote the original code may have moved on.

In this environment, a pull request is not just a code change—it is a communication tool. It forces the author to explain their reasoning, invites scrutiny from others, and creates a permanent record of why a change was made. The PR that saved our project started as a one-line fix for a database connection timeout. By the time it was merged, it had triggered a full audit of the deployment pipeline, a rewrite of the testing strategy, and a new policy for handling urgent patches.

The Trigger Event

The original bug was mundane: a query that ran too long against the donor database, causing the application to hang. The volunteer who picked up the ticket—let us call them Alex—found that the timeout setting in the configuration file was hardcoded to five seconds, even though the average query took eight seconds to complete during peak hours. Alex submitted a PR that increased the timeout to fifteen seconds, which would have fixed the immediate symptom.

Why the PR Almost Failed

But the reviewer, a senior developer who had been with the charity for three years, noticed something else. The query itself was inefficient: it was pulling every donor record and filtering in application memory instead of using a database index. Increasing the timeout would only mask the underlying performance problem. The reviewer rejected the PR with a comment: 'This will break again next month when the donor list grows.'

From Rejection to Collaboration

Rather than getting frustrated, Alex and the reviewer started a conversation. They pair-debugged the query, added an index, and rewrote the function to paginate results. The revised PR included a migration script for the index, raised the timeout to a more reasonable ten seconds, and added a unit test that would fail if the query ever regressed. The revised PR took three days to finalize, but it fixed the root cause as well as the timeout.

2. Foundations That Readers Often Confuse

Many teams treat a post-mortem as a blame exercise or a documentation chore. The real foundation is a blameless culture that separates the person from the problem. In charity projects, where contributors are often volunteers, blame can drive people away. The post-mortem that saved our project started with a simple rule: every comment on the PR had to start with 'What if we tried…' instead of 'You should have…'

Post-Mortem vs. Retrospective

These terms are often used interchangeably, but they serve different purposes. A post-mortem is a structured analysis after an incident, focused on what happened and how to prevent recurrence. A retrospective is a regular team reflection on process, not tied to a specific failure. For the PR story, we conducted a post-mortem after the deployment outage, but the lessons fed into the team's monthly retrospective.

Root Cause vs. Trigger

The trigger was the timeout setting. The root cause was the unindexed query and the lack of performance testing. Many post-mortems stop at the trigger and apply a superficial fix. The deep work is identifying the systemic issues—in this case, no code review policy for database changes, no performance benchmarks, and no staging environment that mirrored production load.

Blameless ≠ No Accountability

A blameless post-mortem does not mean everyone walks away without responsibility. It means the investigation focuses on systems, not individuals. After the PR was merged, the team agreed that the person who deployed on a Friday without a rollback plan would write the runbook for emergency deployments. That was accountability, not blame.

3. Patterns That Usually Work

From this experience and many similar stories across the charity sector, several patterns consistently help turn a failing project around. These are not silver bullets, but they raise the odds of success.

Small, Frequent Pull Requests

The PR that saved the project started small—a one-line config change—but grew as the reviewer pushed for deeper fixes. The ideal pattern is to keep PRs small (under 200 lines) and frequent (daily or every other day). Small PRs are easier to review, less likely to introduce conflicts, and less intimidating for volunteer contributors. The team now enforces a rule: any PR that touches the database must include a migration and a performance test.

Mandatory Code Review for All Changes

Even the most trivial fix benefits from a second pair of eyes. In the charity, the policy had been that only production deployments needed review. After the outage, every PR—including documentation changes—must have at least one approval from someone who did not write the code. This slowed down trivial changes but caught dozens of subtle bugs in the first month.

Automated Testing as a Safety Net

The original project had no test suite. The volunteer who wrote the PR added a unit test for the query function. That test became the seed of a larger test suite. Within three months, the team had coverage for all critical paths: donation processing, user authentication, and report generation. Automated tests do not prevent all bugs, but they catch regressions before they reach production.
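The article does not say how the original test detected a regression. One possible approach, sketched here with Python's sqlite3 and unittest modules, is to assert that the database's query plan still uses the index. It checks the plan rather than timing, so it does not depend on how fast the test machine is. The table, index, and query below are hypothetical stand-ins.

```python
import sqlite3
import unittest

# Hypothetical query; the real one is not shown in the article.
QUERY = "SELECT id, name FROM donors WHERE region = ? ORDER BY id LIMIT ?"

def uses_index(conn, sql, params, index_name):
    # Each EXPLAIN QUERY PLAN row ends with a human-readable detail string
    # that names the index when one is used.
    plan = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return any(index_name in row[-1] for row in plan)

class DonorQueryPlanTest(unittest.TestCase):
    def setUp(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE donors (id INTEGER PRIMARY KEY, name TEXT, region TEXT)"
        )
        self.conn.execute("CREATE INDEX idx_donors_region ON donors (region)")

    def tearDown(self):
        self.conn.close()

    def test_query_uses_index(self):
        # Fails if the index is dropped or the query stops being able to use it.
        self.assertTrue(
            uses_index(self.conn, QUERY, ("north", 100), "idx_donors_region")
        )
```

A plan check like this will not catch every slowdown, but it does catch the specific failure in this story: a query that silently falls back to scanning the whole table.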

Staging Environment That Mirrors Production

One reason the timeout issue reached production was that the staging environment used a small subset of donor data. Queries that ran fine on 1,000 records failed on 50,000. The charity invested in a staging environment with anonymized production data and the same server configuration. This was not expensive (they used a cheaper cloud tier and scheduled it to run only during business hours), and it meant that load-dependent problems could surface before a release rather than after.
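How the charity anonymized its data is not described. As a minimal sketch, assuming a donor record is a dictionary with the hypothetical fields shown, identifying fields can be replaced with a keyed hash while the columns that affect query load stay intact. Strictly speaking this is pseudonymization, so the secret key has to stay out of the staging environment.

```python
import hashlib
import hmac

def pseudonymize(value, secret):
    # Keyed hash: the same input always maps to the same token,
    # but the token cannot be recomputed without the secret.
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:12]

def anonymize_donor(row, secret):
    # Hypothetical field names. Non-identifying columns are copied as-is
    # so that staging queries see a realistic data distribution.
    token = pseudonymize(row["email"], secret)
    return {
        "id": row["id"],
        "name": f"Donor {token}",
        "email": f"{token}@example.invalid",
        "region": row["region"],
        "total_given": row["total_given"],
    }
```

Because the mapping is deterministic, the same donor gets the same token across tables, so joins in staging behave as they do in production.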

4. Anti-Patterns and Why Teams Revert

Even with good intentions, teams often fall back into old habits. Recognizing these anti-patterns early can save a project from sliding back into crisis mode.

The Heroic Hotfix

The most common anti-pattern is the 'hero' who deploys a fix directly to production without review, often late at night. This feels efficient in the moment, but it bypasses all safeguards. In our story, the original timeout change could have been hotfixed in minutes, but that would have left the underlying query broken. The team now has a policy: hotfixes are allowed only if a PR is opened within one hour and reviewed within 24 hours. Otherwise, the change must go through the normal pipeline.
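A policy like this is easier to keep when a script can check it. The sketch below assumes both deadlines are measured from the moment of deployment, which the policy as stated leaves open. The function name and the violation messages are made up for illustration.

```python
from datetime import datetime, timedelta

PR_DEADLINE = timedelta(hours=1)       # PR must be opened within one hour
REVIEW_DEADLINE = timedelta(hours=24)  # and reviewed within 24 hours

def hotfix_violations(deployed_at, pr_opened_at=None, reviewed_at=None):
    # Returns a list of policy violations; an empty list means compliant.
    # Assumption: both deadlines count from the deployment time.
    violations = []
    if pr_opened_at is None or pr_opened_at - deployed_at > PR_DEADLINE:
        violations.append("PR not opened within one hour")
    if reviewed_at is None or reviewed_at - deployed_at > REVIEW_DEADLINE:
        violations.append("PR not reviewed within 24 hours")
    return violations
```

Returning the list of violations, rather than a bare true or false, gives whoever follows up something concrete to say in the team chat.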

Post-Mortem Fatigue

After the first few post-mortems, teams often stop doing them. The meetings feel repetitive, and the action items pile up without being completed. To avoid this, the charity limited post-mortems to incidents that caused data loss, downtime over 30 minutes, or a security breach. Everything else was handled in the regular retrospective. This kept the process meaningful.
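Written down as code, the threshold is a single condition. The Incident record below is a hypothetical structure, not something from the charity's tooling.

```python
from dataclasses import dataclass

@dataclass
class Incident:
    # Hypothetical record of one incident.
    downtime_minutes: float
    data_loss: bool = False
    security_breach: bool = False

def needs_post_mortem(incident):
    # Full post-mortem only for data loss, a security breach,
    # or downtime over 30 minutes; the rest goes to the retrospective.
    return (
        incident.data_loss
        or incident.security_breach
        or incident.downtime_minutes > 30
    )
```

By this rule the three-hour outage that opens the article qualifies, while a ten-minute blip with no data loss would wait for the regular retrospective.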

Blaming the Volunteer

Volunteers are the lifeblood of charity tech, but they are often scapegoated when things go wrong. In the PR story, the volunteer who made the initial fix could have been blamed for not catching the performance issue. Instead, the team thanked them for starting the conversation. The post-mortem explicitly stated that the system failed the volunteer, not the other way around.

Over-Engineering the Fix

After an outage, there is a temptation to rewrite everything. The team considered migrating to a different database entirely. But the post-mortem showed that the existing database was fine—it just needed better indexing and query patterns. Over-engineering wastes volunteer time and introduces new bugs. The principle: fix the smallest thing that will prevent recurrence, then iterate.

5. Maintenance, Drift, and Long-Term Costs

Even a successful post-mortem and PR do not guarantee long-term health. Systems drift. New volunteers join and do not know the history. Old tests become brittle. The charity learned this the hard way six months after the incident.

Documentation Decay

The team wrote a detailed post-mortem document, but nobody updated it when the deployment process changed. A new volunteer tried to follow the runbook and hit a dead end because a command had been deprecated. The solution was to treat documentation like code: store it in the same repository, review it alongside PRs, and test it periodically by having someone new follow it from scratch.

Test Suite Rot

The unit tests that were added after the incident worked for months, but they were never expanded to cover edge cases. When a new feature was added, the tests still passed, but they did not catch a regression in the donation flow. The team now has a rule: every PR that adds a feature must also add or update at least one test. This keeps the test suite alive.

Volunteer Turnover

The volunteer who wrote the original PR left after a year. Their knowledge of the query optimization was not transferred. The team now requires that every significant PR include a comment explaining the rationale, and they record a short video walkthrough for complex changes. This is not a perfect solution, but it reduces the risk that comes with a low bus factor.

Cost of Maintenance

Maintaining the improvements takes time. The charity estimates that the code review policy adds about 10% overhead to development time, but it has reduced production incidents by 70%. The trade-off is worth it, but it requires buy-in from leadership. The key is to frame maintenance as an investment, not a tax.

6. When Not to Use This Approach

The post-mortem-driven PR approach is not suitable for every situation. Knowing when to skip it can save time and frustration.

One-Time Scripts

If you are running a one-time data migration or a script that will never be used again, a full post-mortem and review process is overkill. Use a simple checklist instead: test on a copy, have one person verify, and delete the script after use.

Emergency Security Patches

When a zero-day vulnerability is being exploited, speed matters more than process. In those cases, deploy the fix immediately, then conduct a post-mortem afterward. The charity had a security incident where they applied a patch within 30 minutes, then spent two days documenting what happened and improving the response process.

Teams Without Review Capacity

If your team has only one developer, mandatory code review is impossible. In that case, focus on automated testing and pair programming when possible. The post-mortem can still be done solo, but the insights will be limited. Consider joining a community of practice where you can exchange reviews with other charity developers.

Projects That Are Being Retired

If a system is scheduled for decommissioning within a few months, investing in a deep post-mortem and refactoring is wasteful. Apply the minimal fix to keep it running, document the known issues for the migration, and move on. The charity had an old volunteer management system that they decided to replace; they stopped doing full post-mortems for it and focused on data extraction.

7. Open Questions and FAQ

This section addresses common questions that arise when teams try to replicate the success of the PR story.

How do you get volunteers to participate in code review?

Make it easy and rewarding. Use a bot that assigns reviewers randomly, and set an SLA of 24 hours for review. Thank reviewers publicly in the team chat. Some charities offer small recognition, such as a shout-out in the newsletter or a digital badge. The key is to make review an expected part of contribution, not an optional extra.
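The core of such a bot is small. The sketch below covers only the selection step, under the assumption that the bot already has a list of contributor names. Hooking it up to a code-hosting platform is left out, and the function name is hypothetical.

```python
import random

def pick_reviewer(author, contributors, rng=random):
    # Choose a random reviewer who is not the author of the PR.
    candidates = [name for name in contributors if name != author]
    if not candidates:
        raise ValueError("no eligible reviewer other than the author")
    return rng.choice(candidates)
```

Passing a seeded random.Random instance as rng makes the choice reproducible, which helps when testing the bot itself.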

What if the post-mortem reveals that a specific person caused the outage?

Frame it as a system failure. Ask: 'What in our process allowed this mistake to reach production?' If the same person makes the same mistake repeatedly, it is a training or tooling issue, not a character flaw. The charity had a volunteer who accidentally deleted a production table; the post-mortem led to adding a confirmation prompt and restricting delete permissions to a separate role.

How do you prioritize post-mortem action items?

Use a simple impact-effort matrix. Fixes that prevent data loss or downtime over 30 minutes are high priority. Cosmetic issues or nice-to-haves go into a backlog. The charity limits each post-mortem to three action items, to avoid overwhelming the team.
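An impact-effort matrix can be reduced to a sort. In this sketch each action item carries an impact score and an effort score from 1 (low) to 5 (high). The scoring scale and the function name are assumptions, not the charity's actual process.

```python
def top_action_items(items, limit=3):
    # items: list of (description, impact, effort) tuples, scored 1 to 5.
    # Highest impact first; among equal impact, lowest effort first.
    ranked = sorted(items, key=lambda item: (-item[1], item[2]))
    return [description for description, _, _ in ranked[:limit]]
```

Everything past the limit goes to the backlog instead of the post-mortem's action list.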

Should you involve the whole team in the post-mortem?

Only the people directly involved in the incident and those responsible for the affected systems. A large meeting wastes time and discourages honest conversation. The charity invites the incident commander, the person who deployed, the reviewer, and one representative from each affected team. The results are shared broadly, but the discussion is kept small.

How do you measure success?

Track incident frequency, mean time to recover, and the percentage of changes that go through review. The charity saw incidents drop from two per month to one per quarter within six months. But the most important metric is team morale: do people feel safe to raise concerns? The post-mortem survey includes a question: 'Would you feel comfortable reporting a mistake you made?' If the answer is no, the culture needs work.
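Two of these metrics are simple to compute once incidents and merges are logged. The sketch below assumes an incident is a pair of start and resolution timestamps and a change is a flag saying whether it was reviewed before merge. Both data shapes and both function names are hypothetical.

```python
from datetime import datetime, timedelta

def mean_time_to_recover(incidents):
    # incidents: list of (started_at, resolved_at) datetime pairs.
    if not incidents:
        return None
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)

def review_rate(changes):
    # changes: list of booleans, True if the change was reviewed before merge.
    if not changes:
        return None
    return 100.0 * sum(changes) / len(changes)
```

Both functions return None when there is nothing to measure, so a quiet month is not reported as a zero.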

8. Summary and Next Experiments

The pull request that saved our project was not a single line of code—it was a catalyst for a cultural shift. The charity moved from reactive firefighting to proactive maintenance, from blame to learning, and from isolation to collaboration. The technical changes—indexing, testing, staging—were important, but the human changes were what made them stick.

Three Experiments to Try This Month

First, pick one recent incident—even a minor one—and run a blameless post-mortem. Write down the trigger, the root cause, and one action item. Second, enforce code review for all PRs for one week, even if it slows things down. See how many issues are caught before merge. Third, add a performance test to the most critical query in your application. If you do not have a test suite, start with one test for the most common user action. These small steps compound into a culture that turns post-mortems into success stories.

Every charity project has a moment where a crisis could become a turning point. The difference is whether the team treats it as a failure to be forgotten or a lesson to be shared. The PR that saved our project was not the end of problems—it was the beginning of a better way of working together.
