Analysis of a Housekeeping Performance Issue in HPE StoreOnce Catalyst Store and Steps to Resolve It

Recently, I worked on a case in which Veeam backup jobs writing to a customer’s HPE StoreOnce infrastructure slowed down unexpectedly, and nearly every job kept reporting a “Target Bottleneck” warning.

At first, the issue looked like a classic storage performance bottleneck. A layer-by-layer analysis, however, showed that the problem was caused neither by disk performance nor by the network, but by a blockage in the StoreOnce housekeeping mechanism.

In such cases, the key is to look beyond the surface symptoms and include the background maintenance processes running on StoreOnce in the analysis. When housekeeping stops or fails to progress, the system is affected not only in terms of capacity, but also in terms of deduplication efficiency, metadata access, and overall write performance.

Initial Signs of the Problem

The symptoms observed in the field were quite typical, although at first glance they did not directly point to housekeeping. During the checks performed on StoreOnce, the following findings stood out:

Pending Housekeeping had reached 51.9 TB.
The Processed Housekeeping graph was almost at zero.
Housekeeping had clearly not progressed for a long time.
The following error recurred in the kernel log: “hpilo… could not dequeue a packet”
A blockage resembling a queue stall was observed at the operating system level.
Veeam backup jobs waited for unusually long periods, especially during the finalize stage.
The backup window grew longer, throughput dropped, and the “Target Bottleneck” warning appeared continuously on the Veeam side.

Taken together, these symptoms showed that this was more than a simple case of “backups are running slowly.” Housekeeping on StoreOnce was not progressing, which meant that the backend cleanup and metadata organization mechanisms had effectively become stuck.

What Is Housekeeping and Why Is It Critical?

Within the StoreOnce architecture, the housekeeping mechanism is often overlooked. In reality, however, it is one of the most important background processes for maintaining sustainable system performance.

Housekeeping is primarily responsible for the following:

Organizing the deduplication metadata structure
Physically cleaning deleted or no longer used blocks
Reducing fragmentation
Contributing to the capacity reclamation process
Helping maintain balanced read and write performance

Because deduplication makes StoreOnce heavily dependent on metadata processing, a housekeeping failure affects far more than free space reclamation. Metadata access costs increase, delays appear in the I/O chain, and the overall responsiveness of the appliance declines.

This directly results in slower backup jobs, prolonged finalize times, repository-side bottlenecks, and in some cases, degraded restore performance as well.

In short, a stalled housekeeping process is a critical but often invisible failure state that impacts the entire performance layer of StoreOnce.

Root Cause Analysis

At the beginning of the investigation, traditional areas such as storage utilization, disk latency, Catalyst Store behavior, Veeam repository connectivity, and network traffic were checked. However, none of these layers showed an anomaly large or persistent enough to explain the issue.

The real turning point came when StoreOnce housekeeping metrics and system logs were examined more closely.

Two indicators were especially decisive:

The pending housekeeping amount was extremely high (51.9 TB)
The processed housekeeping value had shown almost no progress for an extended period

In addition, kernel log messages such as “hpilo could not dequeue a packet” suggested that the system was getting stuck at a lower level during certain queue operations.

This error may not indicate exactly the same root cause in every StoreOnce case. However, when considered together with the observed behavior in this case, it strongly pointed to a queue blockage or service-level lockup preventing housekeeping from progressing.

As a result, the root cause became clear:

The StoreOnce housekeeping mechanism was unable to progress, pending workload was accumulating, the system could not complete metadata and cleanup operations, and consequently Veeam jobs were experiencing severe performance degradation on the repository side.

Resolution Steps Applied

Once the issue was diagnosed, a controlled and low-risk intervention plan was followed. The steps taken were as follows.

1. Verification of Housekeeping Pending / Processed Metrics

First, the pending and processed housekeeping values were compared through the StoreOnce interface and relevant monitoring screens. The purpose was to determine whether this was only a temporary backlog or a genuinely stuck housekeeping scenario.
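
The distinction between a temporary backlog and a stuck queue can be made explicit with a few readings taken over time. The following is a minimal sketch, assuming the pending and processed values are noted manually from the StoreOnce interface at intervals. The `Reading` structure, the category names, and the `min_processed_tb` threshold are illustrative assumptions, not part of any StoreOnce tooling.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    hours: float         # hours since the first reading
    pending_tb: float    # "Pending Housekeeping" value noted at that time
    processed_tb: float  # amount processed since the previous reading


def classify_housekeeping(readings, min_processed_tb=0.1):
    """Classify a series of manually collected readings (oldest first).

    min_processed_tb is an arbitrary illustrative threshold, not a vendor value.
    """
    if len(readings) < 2:
        raise ValueError("need at least two readings to judge progress")
    first, last = readings[0], readings[-1]
    if last.pending_tb == 0:
        return "idle"
    # Work done across the observation period (the first reading is the baseline).
    processed = sum(r.processed_tb for r in readings[1:])
    if processed < min_processed_tb:
        return "stalled"   # backlog exists but nothing is being processed
    if last.pending_tb < first.pending_tb:
        return "draining"  # temporary backlog that is shrinking
    return "backlog"       # work is being done, but the queue is not shrinking


# Hypothetical readings: only the 51.9 TB figure comes from the case itself.
stuck = [Reading(0, 51.9, 0.0), Reading(12, 51.9, 0.0), Reading(24, 51.9, 0.01)]
print(classify_housekeeping(stuck))  # stalled
```

A result of "stalled" over a long enough observation period corresponds to the situation in this case: a large pending value with a processed value that stays near zero.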

2. Review of System, Kernel, and iLO Logs

Logs related to queue operations, packet handling, service stalls, and hardware management were reviewed carefully.

The recurring “hpilo could not dequeue a packet” messages indicated that the issue might involve not only the application layer, but also system services.
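
To judge how often the message recurs, an exported copy of the kernel log can be scanned offline. The sketch below assumes the log has been saved as plain text lines. It matches only the two fragments quoted in this article, and the sample lines are placeholders rather than real log output.

```python
import re

# Matches only the two fragments quoted in the article. Whatever appears
# between them in the real message is deliberately left open.
HPILO_ERROR = re.compile(r"hpilo.*could not dequeue a packet")


def find_hpilo_errors(lines):
    """Return the 1-based line numbers on which the message appears."""
    return [n for n, line in enumerate(lines, start=1) if HPILO_ERROR.search(line)]


# Placeholder lines for illustration only; they do not reproduce real log output.
sample = [
    "placeholder: unrelated kernel message",
    "placeholder: hpilo <elided> could not dequeue a packet",
    "placeholder: hpilo <elided> could not dequeue a packet",
]
hits = find_hpilo_errors(sample)
print(len(hits), hits)  # 2 [2, 3]
```

Comparing the positions of the hits with the period in which processed housekeeping stayed flat helps decide whether the two observations are related.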

3. Validation of Housekeeping Service Behavior

It was important to determine whether housekeeping was fully stopped, appeared active but was not actually progressing, or had become blocked in the process queue.

The analysis confirmed that housekeeping was effectively making no progress.

4. Controlled Reboot Plan

Considering the backlog size, service behavior, and log findings together, a controlled reboot was chosen as the fastest, most reliable, and most predictable recovery method.

The reboot was scheduled with active workloads, the maintenance window, and the state of the backup chain in mind.
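
These three considerations can be written down as a simple pre-reboot gate. The following sketch is only an illustration of that reasoning. The function name, its inputs, and the example window are hypothetical and would have to be supplied by the operator from Veeam and the change calendar.

```python
from datetime import datetime, time


def reboot_blockers(active_jobs, now, window_start, window_end, last_chain_ok):
    """Return a list of reasons not to reboot yet (empty list means go ahead)."""
    blockers = []
    if active_jobs > 0:
        blockers.append(f"{active_jobs} job(s) still running")
    t = now.time()
    if window_start <= window_end:
        in_window = window_start <= t < window_end
    else:
        # The maintenance window crosses midnight, e.g. 22:00 to 02:00.
        in_window = t >= window_start or t < window_end
    if not in_window:
        blockers.append("outside maintenance window")
    if not last_chain_ok:
        blockers.append("latest backup chain not verified")
    return blockers


# Hypothetical example: no running jobs, 23:30 inside a 22:00-02:00 window.
print(reboot_blockers(0, datetime(2024, 1, 6, 23, 30), time(22, 0), time(2, 0), True))  # []
```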

5. Monitoring Housekeeping After Reboot

After the system came back online, housekeeping behavior was monitored again.

The expected outcome was that the stuck queue would be cleared and housekeeping would resume processing.

6. Confirmation That the Pending Queue Was Cleared

Following the reboot, the pending housekeeping value gradually decreased and eventually dropped to zero. Processed housekeeping also returned to normal operation.
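
While the queue drains, two readings of the pending value are enough for a rough estimate of how long the remaining backlog will take. This is a minimal sketch that assumes a roughly constant drain rate, which real housekeeping may not follow. The readings shown are hypothetical.

```python
def drain_eta_hours(readings):
    """Estimate hours until pending reaches zero.

    readings: list of (hours_since_reboot, pending_tb), oldest first.
    Returns None when the pending value is not decreasing.
    """
    (t0, p0), (t1, p1) = readings[0], readings[-1]
    if t1 <= t0:
        raise ValueError("readings must span a positive time interval")
    rate_tb_per_hour = (p0 - p1) / (t1 - t0)
    if rate_tb_per_hour <= 0:
        return None  # no progress, so no meaningful estimate
    return p1 / rate_tb_per_hour


# Hypothetical readings: 50 TB pending at reboot, 48 TB four hours later.
print(drain_eta_hours([(0.0, 50.0), (4.0, 48.0)]))  # 96.0
```

A return value of None after the reboot would indicate that the queue is still not moving and that the reboot alone did not resolve the blockage.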

Results Observed After the Intervention

Following the reboot, the appliance showed a clear recovery in its overall behavior. The observed results were as follows:

Pending Housekeeping = 0 TB
Housekeeping resumed active operation
Target Bottleneck warnings in Veeam jobs disappeared
Finalize times returned to normal
Backup throughput values returned to expected levels
Repository-side latency decreased noticeably

This improvement strongly supports the conclusion that housekeeping congestion was the direct cause of the issue.

Alternative Technical Interventions If Reboot Is Not Possible

A reboot is not feasible in every environment. Especially in 24/7 environments, during critical backup windows, or in systems that require formal change approval, more controlled methods may need to be considered first.

In such cases, the following alternatives may be evaluated.

1. Manual Restart of the Housekeeping Service

In some StoreOnce versions, the housekeeping process may be stuck at the service level. In such cases, a controlled service restart via SSH may be attempted.

services housekeeping --stop
services housekeeping --start

This method may recover a housekeeping process that is stalled. However, it may not produce the same result in every version, so the relevant version documentation and vendor recommendations should always be reviewed beforehand.

2. Soft Reset of the Housekeeping Queue

In some cases, instead of a full stop/start operation, the service can be aborted and restarted.

services housekeeping --abort
services housekeeping --start

This method is less radical than a reboot, but may still be effective in stuck queue scenarios. Even so, the overall health of the system should be carefully assessed before applying it.

3. Validation at Catalyst Store or Dedupe Store Level

The issue may not be limited solely to the housekeeping service. It may also be influenced by inconsistency in the underlying store structures or a condition close to metadata corruption.

In such cases, the following checks may be meaningful:

Metadata validation
Chunkstore consistency check
Store-level health checks

These validation operations can be especially useful in high-capacity environments that have been running for long periods under heavy retention and deletion activity.

4. Reviewing Capacity Levels

On HPE StoreOnce appliances, housekeeping and general performance become more sensitive at high utilization rates.

In particular, capacity usage above 85% can significantly reduce housekeeping efficiency. Therefore, the following actions may help reduce pressure on the system:

Cleaning up old restore points
Removing unused recovery chains
Optimizing retention policies
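
The arithmetic behind the capacity check is simple. The sketch below computes the utilization percentage and how much space would need to be released to get back under the 85% level mentioned above. The function names and the example figures are hypothetical.

```python
def utilization_pct(used_tb, total_tb):
    """Return used capacity as a percentage of total capacity."""
    if total_tb <= 0:
        raise ValueError("total capacity must be positive")
    return 100.0 * used_tb / total_tb


def tb_to_release(used_tb, total_tb, threshold_pct=85.0):
    """TB that must be freed to reach the threshold (0.0 if already below it)."""
    return max(0.0, used_tb - total_tb * threshold_pct / 100.0)


# Hypothetical appliance: 90 TB used out of 100 TB.
print(utilization_pct(90.0, 100.0))  # 90.0
print(tb_to_release(90.0, 100.0))    # 5.0
```

Space released by deleting restore points only becomes usable once housekeeping has processed it, so this figure is a target for cleanup rather than an immediate gain.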

5. Checking HPE iLO and Driver/Firmware Compatibility

Although the “hpilo could not dequeue a packet” error does not by itself prove a specific root cause, it raises several possibilities:

An iLO firmware bug
Incompatibility between the driver and the operating system
Queue overflow
Delay or deadlock in the hardware management layer

For this reason, verifying firmware and related component versions according to vendor recommendations is extremely important. In versions affected by known bugs, a patch or update may provide a permanent fix.

6. Evaluating Dedupe Ratio and Store Structure

In some environments, housekeeping congestion may combine with degraded store balance or irregular chunk placement, further worsening performance issues.

In such cases, store-level compaction or maintenance operations recommended by the vendor may be considered. However, these actions should be planned and executed carefully.

The Importance of Symptoms on the Veeam Side

This case also clearly demonstrated how valuable bottleneck information on the Veeam side can be. In many environments, messages such as “Source bottleneck,” “Proxy bottleneck,” or “Target bottleneck” are treated as generic warnings.

However, when interpreted correctly, they provide a highly valuable starting point for identifying the layer in which the problem is concentrated.

In this case, the fact that Veeam continuously reported Target Bottleneck clearly indicated that the issue should be investigated primarily on the repository or target side, rather than on the proxy, network, or source side.

Of course, this information alone does not reveal the full root cause. However, it helps direct the investigation toward the correct layer much more quickly.
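
One way to make this habit repeatable is a small lookup from the reported bottleneck to the first things to check. The sketch below is an illustration only. The checklist entries are hypothetical starting points, and only the "target" direction reflects what was done in this case.

```python
# Hypothetical first-look checklists; only the "target" direction is taken from the case.
FIRST_CHECKS = {
    "source": ["production storage read latency", "load on the source system"],
    "proxy": ["proxy CPU and memory", "concurrent task limits"],
    "network": ["throughput between proxy and repository"],
    "target": [
        "repository write latency",
        "StoreOnce pending vs. processed housekeeping",
        "appliance capacity level",
    ],
}


def first_checks(bottleneck_label):
    """Map a label such as 'Target Bottleneck' to a list of first checks."""
    key = bottleneck_label.strip().lower().replace("bottleneck", "").strip()
    return FIRST_CHECKS.get(key, [])


for item in first_checks("Target Bottleneck"):
    print(item)
```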

Conclusion

This case served as a strong example that backup performance issues cannot always be explained by disk I/O, network latency, or proxy resource constraints. In some situations, congestion in the background housekeeping processes of HPE StoreOnce can slow down the entire system in a cascading manner.

Several key lessons stand out from this experience:

Housekeeping metrics should always be checked during StoreOnce performance investigations
Pending and Processed housekeeping values are critical indicators of system health
Correct interpretation of Veeam bottleneck warnings can significantly reduce diagnosis time
A reboot is often the fastest recovery method, but where a reboot is not possible, alternative technical interventions should also be evaluated
Firmware compatibility, driver alignment, capacity level, and store health must not be overlooked for long-term stability

Through correct analysis, layered investigation, and timely intervention, the system was fully stabilized, backup speeds returned to normal levels, and the operational risk was eliminated.