⏩ TL;DR: How do you design OTA firmware updates that won't brick your devices in the field?
Answer: Follow this five-step process:
- Don't mark an update as successful until the application is running and reporting back to your server. Until that point, any reset should revert to the last known working image or reinstall.
- Verify at block level (checksums) AND package level (SHA256). Corrupted downloads are more common than you think – and weak verification is why they brick devices.
- Design your bootloader to handle power loss at any point during installation. Power interruption isn't rare – it's the #1 cause of individual device failure.
- Disable rollbacks in production. Rollback is essential during development but a security risk in the field.
- Secure all keys – signing keys, command keys, certificates – in a hardware root of trust. No exceptions under CRA.
What this solves: Power interruption causes most single-device bricks. Corrupted downloads and no rollback path cause most mass failures. This process addresses both.
You’re designing OTA-capable firmware. Or you’re already in the field and something’s gone wrong.
I’ve been there. Over 15 years of fixing OTA failures, I’ve seen what breaks and what survives.
Power loss mid-update.
Corrupted downloads.
Bootloaders that can’t roll back.
Production images that should have had rollbacks disabled.
And the occasional “we didn’t test staged rollouts” disaster.
Here are the most common questions I get and the honest answers.
No theory. Just what I’ve seen work, and what I’ve seen brick devices.
Section 1: The Real Cost of OTA Failures
What's the most catastrophic OTA failure you've seen?
The Fitbit OTA bricking incident is the one everyone remembers. Thousands of devices rendered unusable because an update failed and there was no recovery path.
The root cause? The system marked the update as successful before it was fully running. That's the pattern I see over and over: the firmware assumes that writing to flash = done.
It's not.
How common are OTA failures really?
I've seen a statistic floating around: 8.5% of devices fail within three years due to poor OTA design.
That seems high to me – nearly 1 in 12 devices? I'd want to see the methodology.
But even if it's half that, it's still unacceptable for industrial, medical, or automotive systems.
The failure modes I actually see in the field:
• Power interruption during installation (the #1 cause of individual device failure)
• Corrupted downloads that pass a weak checksum
• Staged rollouts that hit the same edge-case devices every time
• No rollback mechanism when the new firmware crashes on boot
When a manufacturer comes to you after bricking hundreds of devices, what's the root cause?
For a single device? Power loss or interruption during installation. For hundreds of devices? Inadequate testing of the rollout, combined with:
• No architectural rollback (e.g., no dual-bank or factory fallback)
• Untested edge cases (e.g., what happens if the device resets at byte 4096 of the write?)
• Operational failures (e.g., pushing to 100% of devices on a Friday afternoon)
It's rarely one thing. It's a cascade of small decisions that each seemed reasonable at the time.
Section 2: Architecture choices that save your devices
Dual-bank (A/B) partitioning: essential or overkill?
Dual-bank is the industry's gold standard. But here's the truth: dual-bank is actually the silver standard. The real gold standard is dual-bank plus a factory fallback image, so that if both A and B are corrupted or faulty, the device can still recover.
That requires three times the program memory. Most embedded systems don't have that luxury.
So work with what you have. If you only have space for a single image and a bootloader, design the bootloader to re-fetch and re-install on any failure. It's not perfect, but it's better than a brick.
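To make that concrete, here's a minimal boot-time selection sketch in C. The metadata layout and helper names (`slot_meta_t`, `image_valid`, `jump_to_image`) are illustrative stand-ins, not any particular bootloader's API:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative slot metadata; real layouts are bootloader-specific. */
typedef struct {
    uint32_t start_addr;   /* flash address of the image              */
    uint32_t length;       /* image length in bytes                   */
    uint32_t crc32;        /* expected CRC over the image             */
    bool     confirmed;    /* set only after the app reports home     */
} slot_meta_t;

extern slot_meta_t slot_a, slot_b, slot_factory;
extern bool image_valid(const slot_meta_t *s);   /* CRC/signature check */
extern void jump_to_image(uint32_t start_addr);  /* never returns       */

void boot_select(void)
{
    /* Prefer a confirmed image (A checked first for brevity; a real
     * bootloader compares version counters), then any valid image,
     * and only fall back to the factory image as a last resort. */
    if (image_valid(&slot_a) && slot_a.confirmed) jump_to_image(slot_a.start_addr);
    if (image_valid(&slot_b) && slot_b.confirmed) jump_to_image(slot_b.start_addr);
    if (image_valid(&slot_a)) jump_to_image(slot_a.start_addr);
    if (image_valid(&slot_b)) jump_to_image(slot_b.start_addr);
    if (image_valid(&slot_factory)) jump_to_image(slot_factory.start_addr);
    /* Nothing bootable: stay in the bootloader and re-fetch. */
}
```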
What's your bootloader strategy for industrial systems?
Two hard rules:
1. Rollback is essential during development. You'll need it constantly as you iterate.
2. Rollback must be disabled for production images. A device in the field should not be able to revert to an older, potentially vulnerable firmware version.
There are exceptions for systems with a very small attack surface, but for anything running an operating system or connected to a network, disable rollback in production.
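A common way to enforce rule 2 is a monotonic minimum-version check that's only bypassed in development builds. This is a sketch: the counter storage (eFuse, RPMB, or a secure-element counter) and the `DEVELOPMENT_BUILD` flag are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical version source: a monotonic counter in eFuse, RPMB,
 * or a secure element, depending on the hardware. */
extern uint32_t stored_min_version(void);

bool update_version_allowed(uint32_t candidate_version)
{
#ifdef DEVELOPMENT_BUILD
    (void)candidate_version;
    return true;   /* development: iterate freely, roll back at will */
#else
    /* production: never accept firmware older than the stored minimum */
    return candidate_version >= stored_min_version();
#endif
}
```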
How do you architect power-loss resilience?
The best approach is simple: don't mark an update as successful until the application code is up and running and has reported back to the server. Until that point, any reset should cause the bootloader to either:
• Revert to the previous known working image, or
• Reinstall the update from scratch.
Note: Reinstalling doesn't help if the update package itself was faulty. That's why you verify before you even start writing.
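Here's roughly what that confirmation handshake looks like from the application side. The helper names are hypothetical; the point is that only a server acknowledgement flips the "trial" flag.

```c
#include <stdbool.h>

/* Hypothetical helpers for slot state and server check-in. */
extern bool booted_from_trial_image(void);  /* set by the bootloader     */
extern bool report_version_to_server(void); /* true once the server acks */
extern void mark_image_confirmed(void);     /* clears the trial flag     */

void app_confirm_update(void)
{
    /* Until mark_image_confirmed() runs, the bootloader treats any
     * reset as a failed trial boot and reverts (or reinstalls).
     * Writing to flash is not success; a server ack is. */
    if (booted_from_trial_image() && report_version_to_server()) {
        mark_image_confirmed();
    }
}
```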
How do you handle firmware version mismatches across different hardware revisions?
I prefer the firmware to handle all variants itself, as long as the MCU hasn't changed.
On startup, the firmware identifies the PCB revision by reading a resistor divider or checking GPIO strapping. Or it probes for different component versions during initialisation: "Is there a BME280 here, or an older SHT30?"
That way, you maintain a single firmware image across a population of devices with different peripheral configurations.
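A sketch of that probing, assuming a hypothetical I2C HAL. The BME280 details are real (chip-ID register 0xD0 reads back 0x60); the SHT30 has no chip-ID register, so an address ACK at 0x44 has to do here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical HAL: read one register, or just check for an address ACK. */
extern bool i2c_read_reg(uint8_t addr, uint8_t reg, uint8_t *val);
extern bool i2c_probe(uint8_t addr);

typedef enum { SENSOR_BME280, SENSOR_SHT30, SENSOR_NONE } sensor_t;

sensor_t detect_humidity_sensor(void)
{
    uint8_t id;

    /* BME280 at 0x76: chip-ID register 0xD0 reads back 0x60. */
    if (i2c_read_reg(0x76, 0xD0, &id) && id == 0x60)
        return SENSOR_BME280;

    /* SHT30 has no chip-ID register; an ACK at 0x44 is enough here. */
    if (i2c_probe(0x44))
        return SENSOR_SHT30;

    return SENSOR_NONE;
}
```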

Section 3: Security & Compliance (CRA / EN 18031)
What does the Cyber Resilience Act actually require for OTA?
It all comes down to the root of trust. Signing keys, command keys, and certificates need to be stored in a Secure Element, not in flash or firmware. This EN 18031 Cybersecurity Requirements Assessment Tool helps to identify gaps early. Those keys must remain confidential throughout the production process.
If you don't have a hardware root of trust and you're rolling your own key management, the CRA will cause you real problems. Start now.
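In code, that means the firmware only ever hands the secure element a hash and a signature – the keys never leave the device. The `se_*` driver API below is entirely hypothetical, purely to show the shape:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical secure-element driver: the signing keys are provisioned
 * into the SE at manufacture and are never readable by firmware. */
extern bool se_sha256(const uint8_t *data, size_t len, uint8_t hash[32]);
extern bool se_verify_ecdsa(uint8_t key_slot,
                            const uint8_t hash[32],
                            const uint8_t sig[64]);

#define OTA_SIGNING_KEY_SLOT 0   /* provisioned at manufacture */

bool ota_package_trusted(const uint8_t *pkg, size_t len,
                         const uint8_t sig[64])
{
    uint8_t hash[32];
    /* Hash the package, then verify against the key that lives inside
     * the hardware root of trust – never against a key in flash. */
    return se_sha256(pkg, len, hash) &&
           se_verify_ecdsa(OTA_SIGNING_KEY_SLOT, hash, sig);
}
```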
What happens when TLS certificates expire on deployed devices?
Honest answer: I'm not aware of my own team handling this poorly, but I know it's a blind spot for many manufacturers. The solution is one of:
• Automatic renewal using a protocol like ACME (if your device has enough connectivity and logic)
• Building devices with certificate lifetimes that exceed the expected service life
• A manual but well-documented field update procedure.
Don't ignore this. Expired certificates mean no secure OTA, which means no updates at all.
For ATEX-certified environments: can you do wireless OTA in Zone 1 or Zone 2?
Yes, it's possible, but you need to ensure the update mechanism itself doesn't introduce a spark risk. Wireless in a hazardous area is about the radio, not the firmware. If the device is already ATEX-certified for wireless operation, an OTA update doesn't change that, provided the update doesn't alter power or RF behaviour.
That said, it's best not to assume anything. I'd want to run any OTA design for a Zone 1 device past the certifying body.
Section 4: Testing - the boring bit that saves your star ratings 🌟🌟🌟
What failure scenarios do you always simulate before deployment?
Always:
• Intermittent connectivity (drop the connection randomly)
• Corrupted downloads (flip a few bits in the package)
• Mid-flash power loss (cut power at 10%, 50%, 90%)
• Unexpected device resets during installation
Design level:
• Block-level retries with checksums (see the sketch below)
• Full-package verification with SHA256 (MD5 is acceptable for less critical systems)
• Acquire the update package as quickly as power, storage, and bandwidth allow – the acquisition stage is the most vulnerable.
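Here's a compact sketch of the first two design-level checks working together, with hypothetical transport, CRC, flash, and hash helpers:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE   1024u
#define MAX_RETRIES  5

/* Hypothetical transport, CRC, flash, and hash helpers. */
extern bool fetch_block(uint32_t index, uint8_t *buf, uint32_t *crc_expected);
extern uint32_t crc32(const uint8_t *buf, size_t len);
extern bool flash_write(uint32_t offset, const uint8_t *buf, size_t len);
extern bool sha256_of_flash(uint32_t offset, size_t len, uint8_t out[32]);

bool download_and_verify(uint32_t n_blocks, const uint8_t expected_sha[32])
{
    uint8_t buf[BLOCK_SIZE];

    for (uint32_t i = 0; i < n_blocks; i++) {
        bool ok = false;
        /* Block level: retry until the per-block CRC matches. */
        for (int tries = 0; tries < MAX_RETRIES && !ok; tries++) {
            uint32_t crc_expected;
            ok = fetch_block(i, buf, &crc_expected) &&
                 crc32(buf, BLOCK_SIZE) == crc_expected;
        }
        if (!ok || !flash_write(i * BLOCK_SIZE, buf, BLOCK_SIZE))
            return false;   /* give up rather than write junk */
    }

    /* Package level: hash what actually landed in flash, not the RAM copy. */
    uint8_t hash[32];
    if (!sha256_of_flash(0, n_blocks * BLOCK_SIZE, hash))
        return false;
    return memcmp(hash, expected_sha, 32) == 0;
}
```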
How do you test OTA on 5-year-old devices with degraded flash?
On every release, we re-test OTA. Every single time. What worked six months ago might break after a compiler upgrade or a change in package size.
That said, I'm not aware of anyone realistically simulating degraded flash beyond normal integrity checks. The mitigation is: verify everything, and if a write fails due to a bad block, your bootloader should handle it gracefully and fall back.
Staged rollouts for small populations (dozens or hundreds, not millions)?
Yes, still do staged rollouts. I use a combination of fixed and temporal variables (see the sketch below):
• Fixed: device ID modulo something
• Temporal: 1% of devices per day, or 10% per week
The temporal component means it's not always the same devices that get the update first. That catches edge cases you'd otherwise miss.
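A sketch of that eligibility gate. The constants are standard multiplicative-mixing values; mixing in a release ID is what rotates which devices go first on each release.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative staged-rollout gate: roughly 1% of the fleet per day. */
static uint32_t mix(uint32_t device_id, uint32_t release_id)
{
    /* Knuth multiplicative hash plus an avalanche step, so the
     * device-to-cohort mapping changes with every release. */
    uint32_t h = device_id ^ (release_id * 2654435761u);
    h ^= h >> 16;
    h *= 2246822519u;
    h ^= h >> 13;
    return h;
}

bool eligible_for_update(uint32_t device_id, uint32_t release_id,
                         uint32_t days_since_release)
{
    uint32_t cohort = mix(device_id, release_id) % 100u;  /* fixed: 0..99   */
    return cohort < days_since_release;                   /* temporal: +1%/day;
                                                             day 100 = everyone */
}
```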

Section 5: Build it yourself or bring us in?
When should you build OTA in-house vs outsource?
We work with whatever OTA distribution method you already prefer: FTP(S), HTTP, MQTT, etc. We don't enforce a one-size-fits-all approach.
You should call ByteSnap Design when:
• You've already bricked devices and need a Design Rescue
• You're not sure your bootloader can handle power loss during installation
• You need CRA compliance and don't have a secure element strategy
• You want an independent audit before shipping 10,000 units.
You can probably handle this in-house if your team has done this before, you have all the architectural pieces in place, and you've already tested against the failure scenarios above.
Once OTA is deployed, don't ignore long-term patching. Our managed security service for embedded Linux devices handles ongoing vulnerability updates, because an OTA mechanism is only useful if someone actually ships updates through it.
What do people get wrong when they roll their own OTA?
Three things:
1. Connectivity resilience. It's always harder than you think.
2. Storage constraints. Many systems don't have room for dual-bank, let alone triple-bank.
3. Key management. Changing regulatory requirements are making this the biggest headache.
One piece of advice for someone designing their first OTA-capable product?
Design OTA at the very start of development, rather than halfway through or as an afterthought. Build it into the first prototype. Test it alongside everything else. By the time you reach alpha testing, the OTA system should already be battle-tested.
Consider acquisition, verification, update, and failure handling from day one. Those four things will determine whether your product survives in the field.
You need a system that gracefully handles real-world failures.

Martin is ByteSnap Design’s Principal Engineer and has been writing embedded software since 1987 – before most of today’s connected devices existed.
A Bath-trained engineer with an M.Eng. in Electrical and Electronics Engineering, his career spans satellite communications at Marconi, digital video for gaming systems, and mobile middleware development for global telecoms platforms.
Since joining ByteSnap Design in 2009, he has been the technical backbone behind some of the UK’s most complex embedded firmware and software projects.


