Understanding Data Replication in High Availability Configurations for GitHub Enterprise Server #190702
Replies: 2 comments
Context: I signed up for a GitHub trial (GitHub Pro / Copilot), where I was told a small verification charge (~$10) would be momentary and immediately released. GitHub’s handling of this charge is frankly unacceptable.

This was not a “temporary” hold in any meaningful sense. The authorized amount has now been locked for a week—and can reportedly last up to 30 days—despite being presented as something that would be reversed almost instantly.

GitHub Support claims the refund was processed “right away,” but this is misleading. In practice, GitHub delegates the release of the authorization to a financial intermediary, fully aware that this is not an immediate process. Describing that as “momentary” is simply inaccurate.

After speaking with my bank, the situation is clear: the hold has not been released in a way that makes the funds available. There are only two ways forward—either wait for the authorization to expire (which can take up to 30 days), or have GitHub explicitly confirm to the bank that the hold should be released.

This is the key issue: it should not be the customer’s responsibility to resolve this. If GitHub initiates the authorization, then GitHub should also ensure its timely release—without requiring users to chase their own money through banking procedures. In practice, GitHub claims it’s been handled, yet the funds remain unavailable.

For anyone on a tight budget, this is not a minor inconvenience—it directly affects day-to-day expenses. If GitHub knows these authorizations can persist for weeks, then calling them “momentary” is misleading and should be corrected. Users deserve clear, honest expectations about how long their money may actually be unavailable. This situation reflects poorly on both the transparency and accountability of GitHub’s billing practices.
Abstract
Running GitHub Enterprise Server in a High Availability (HA) configuration? You need to understand how data replicates between your primary and replica appliances - for capacity planning, troubleshooting, and ensuring clean failovers. The official documentation covers HA setup. This article goes deeper: how replication works under the hood, what affects performance, and what to watch to keep things healthy.
Problem Statement
GitHub Enterprise Server administrators running HA configurations often ask how replication works under the hood, what affects its performance, and how to troubleshoot it when it lags. The public docs cover setup and failover, but don't explain replication internals, performance factors, or troubleshooting in depth.
How Replication Works
GitHub Enterprise Server uses different replication strategies for different data types. Understanding these distinctions helps you troubleshoot issues and plan capacity.
The hub-and-spoke model
Replication follows a hub-and-spoke architecture: the primary appliance acts as the hub, and each replica receives its data from the primary rather than from other replicas.
Each node has a unique UUID, shown as `git-server-{UUID}` in diagnostic output. To identify your primary, run `ghe-repl-status -vv` on a replica and look for the node with `voting: true`, or check `/etc/github/repl-state` directly on any appliance.

Git repository replication
Git repositories replicate through the built-in "spokes" system:
Replication is asynchronous: your push succeeds without waiting for replicas to catch up. Maintenance operations run with a two-hour timeout. If they don't finish, the system retries. Backups pause repository maintenance to ensure a clean state.
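The asynchronous behavior described above can be sketched with a toy model: the primary acknowledges a push immediately, and a background worker applies it to the replica afterwards. The class, names, and timings below are illustrative assumptions, not GHES internals.

```python
import queue
import threading
import time

class AsyncReplica:
    """Toy model of asynchronous replication: writes are acknowledged
    on the primary before the replica has applied them."""

    def __init__(self):
        self.primary = []
        self.replica = []
        self._backlog = queue.Queue()
        threading.Thread(target=self._apply_loop, daemon=True).start()

    def push(self, commit):
        self.primary.append(commit)   # primary acknowledges immediately
        self._backlog.put(commit)     # replica applies later

    def _apply_loop(self):
        while True:
            commit = self._backlog.get()
            time.sleep(0.01)          # simulated network/apply delay
            self.replica.append(commit)
            self._backlog.task_done()

    def lag(self):
        """Commits acknowledged by the primary but not yet on the replica."""
        return len(self.primary) - len(self.replica)

repl = AsyncReplica()
for i in range(5):
    repl.push(f"commit-{i}")        # returns without waiting for the replica

print("lag right after pushing:", repl.lag())  # usually > 0 while the replica catches up
repl._backlog.join()                           # wait for the backlog to drain
print("lag after catch-up:", repl.lag())       # 0
```

The key property to notice is that `push` never blocks on the replica, which is why a healthy primary can still show nonzero replication lag under load.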
Route updates happen automatically after configuration changes, but you may need `ghe-spokesctl routes` to check and repair routes manually.

MySQL database replication
MySQL uses binary log (binlog) replication: the primary records changes to its binlog, and the replica applies them in order.
You can check MySQL replication lag with `ghe-repl-status -vv`, which shows the `seconds_behind_primary` metric. This is the most common lag indicator to watch.

Redis replication
Redis uses asynchronous replication. Since Redis is an in-memory store, its replication is typically fast.
Elasticsearch replication
Elasticsearch runs on both primary and replica in HA configurations, maintaining index replication across nodes. The replica keeps a synchronized copy of all search indexes. For more on how GitHub Enterprise Server rebuilt search replication for high availability, see this engineering blog post.
Storage and asset replication
File-based storage (user avatars, release assets, Git LFS objects) replicates with rsync.
Pages replication
GitHub Pages sites replicate through the spokes system, just like Git repositories.
Factors Affecting Replication Performance
Network bandwidth and latency
Since all replication traffic flows through the VPN between appliances, the bandwidth and latency of that link directly limit replication throughput.
Monitor network throughput between your appliances, especially if you see persistent lag.
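As a quick sanity check on that link, the sketch below times a TCP handshake to estimate round-trip latency. The hostname is a placeholder, and probing port 122 (the GHES administrative SSH port) is an assumption for illustration; this measures latency, not bandwidth.

```python
import socket
import time

def tcp_rtt_ms(host, port, timeout=2.0):
    """Rough latency probe: time a TCP handshake to the peer.

    A handshake time consistently well above your normal baseline is a
    hint that replication lag may be network-related.
    """
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.monotonic() - start) * 1000.0

# Example (replace with your replica's address):
# print(f"RTT to replica: {tcp_rtt_ms('replica.example.com', 122):.1f} ms")
```

Run it periodically from the primary toward each replica and compare against the lag you observe in `ghe-repl-status`.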
Repository size and activity
Larger repos and higher push rates increase replication load.
Maintenance operations
Repository maintenance directly affects replication: running garbage collection (`git gc`) on large repos generates significant traffic as the optimized data replicates. The longer you wait between maintenance runs, the more stale refs accumulate - making the next run take even longer.
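A back-of-the-envelope model of that accumulation effect (the linear cost assumption and the numbers are purely illustrative, not measured GHES behavior):

```python
def refs_to_process(pushes_per_day, days_between_runs):
    """Toy model: each push leaves behind stale refs and loose objects
    that the next maintenance run must walk, so deferring maintenance
    makes every run bigger."""
    return pushes_per_day * days_between_runs

# The same repository, maintained weekly vs. monthly:
print(refs_to_process(200, 7))    # 1400
print(refs_to_process(200, 30))   # 6000
```

Even under this simplistic linear model, a monthly schedule makes each run several times larger than a weekly one, and the resulting repack must then replicate in full.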
Resource constraints
CPU, memory, and disk I/O on both primary and replica affect replication.
If `ghe-spokesctl check --fix` fails with "too busy" errors, you're hitting resource constraints.

Monitoring Replication Health
Using ghe-repl-status
Run `ghe-repl-status` from any replica to check replication health. Add `-vv` for verbose output with detailed metrics.

Checking repository network health
Use `ghe-spokesctl` to check repository network health. Several `ghe-spokes` subcommands have already been replaced by `ghe-spokesctl` equivalents, and the remaining ones are expected to follow.

Watch for bad checksums - they mean replica data doesn't match the primary.
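Conceptually, a checksum comparison works like the sketch below: each node independently digests its repository contents and the results are compared. This is a conceptual stand-in, not the actual spokes algorithm.

```python
import hashlib

def repo_checksum(objects):
    """Order-independent digest over a node's object payloads."""
    digest = hashlib.sha256()
    for obj in sorted(objects):
        digest.update(hashlib.sha256(obj).digest())
    return digest.hexdigest()

primary_objects = [b"commit-a", b"commit-b", b"commit-c"]
replica_objects = [b"commit-a", b"commit-b"]   # one object missing

print(repo_checksum(primary_objects) == repo_checksum(primary_objects))  # True: in sync
print(repo_checksum(primary_objects) == repo_checksum(replica_objects))  # False: "bad checksum"
```

The point of the digest is cheap comparison: the nodes exchange one small hash instead of their full object sets, and any divergence anywhere in the data changes the result.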
Job queue monitoring
Check job queue backlogs, especially the maintenance queue. Large backlogs signal system stress and can cause replication delays. The older `ghe-resque-info` command still works, but most background jobs now run through aqueduct.

Key metrics to watch
When analyzing support bundles or monitoring HA health, check these in order: `ghe-repl-status` output, spokes replication health, job queue depth, and resource utilization.
HA Configuration Patterns
Two-node high availability
In a standard two-node configuration, a single primary replicates all datastores to a single replica.
Geo-replication
With geo-replication, multiple replicas in different locations receive data from the primary.

Trade-offs: replication traffic must cross longer, higher-latency links, so distant replicas are more likely to lag.
Best Practices
Network design
Capacity planning
Repository management
Run `ghe-spokesctl check -v` proactively on your largest repos.

Backup coordination
Backups pause repository maintenance to ensure a clean state, so plan your backup schedule around that pause.
Troubleshooting Common Issues
Repositories with bad checksums
Symptom: `ghe-spokesctl status` shows checksum mismatches

Common causes:
Resolution:
1. `ghe-spokesctl check -v owner/repo`
2. `ghe-spokesctl check -v --fix owner/repo`

Repository network route issues
Symptom: Repositories not replicating after configuration changes
Common causes:
Resolution: Run `ghe-spokesctl routes owner/repo` to check and repair routes.

Persistent replication lag
Symptom: `ghe-repl-status` consistently shows lag across multiple datastores

Common causes:
Resolution: Check background job backlogs with `ghe-aqueduct-info`.

Summary
Understanding how data replicates in your HA configuration helps you plan capacity, troubleshoot performance, and ensure clean failovers. Key takeaways:
Monitor `ghe-repl-status`, spokes info, job queues, and resource utilization.

When you hit replication issues, work through the diagnostics systematically: resource utilization first, then spokes health, then specific repository networks. Most problems stem from resource constraints or network limitations - not bugs in the replication system.
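That triage can begin with something as simple as scanning status output for trouble lines. The sample output and field names below are made up for illustration (real `ghe-repl-status -vv` output varies by GHES version), so treat this purely as a sketch.

```python
import re

# Hypothetical sample of verbose replication-status output.
SAMPLE = """\
OK: mysql replication is in sync
OK: redis replication is in sync
CRITICAL: git replication is behind the primary
  seconds_behind_primary: 147
"""

def replication_problems(status_text, lag_threshold=60):
    """Collect warning/critical lines and flag excessive lag."""
    problems = [line.strip() for line in status_text.splitlines()
                if line.startswith(("WARNING", "CRITICAL"))]
    match = re.search(r"seconds_behind_primary:\s*(\d+)", status_text)
    if match and int(match.group(1)) > lag_threshold:
        problems.append(f"lag {match.group(1)}s exceeds {lag_threshold}s threshold")
    return problems

print(replication_problems(SAMPLE))
```

A script like this, run from cron on a replica and pointed at the real command's output, gives you an early warning before lag becomes a failover risk.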
Planning a new HA deployment, dealing with persistent lag, or need help troubleshooting? Reach out to GitHub Support with a support bundle. We're happy to help analyze your configuration.