terraform plan -light

Published Jun 4, 2024 by Ricard Bejarano

TL;DR

Add a terraform plan -light flag such that only resources modified in code are targeted for planning.

This would reduce the scope of the pre-plan refresh down to the set of resources we know changed, which reduces overall plan times without the consistency risk of -refresh=false.

For Terraform to know which resources were modified in code, it would store, for each successfully applied resource, the hash of its serialized, sorted attribute map. Diffing "last-applied code" against "new code" then yields the scope of the next -light plan.

Basically, -light autogenerates the -target list from code changes.
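
To make that concrete, here is a minimal Go sketch of the bookkeeping involved. The resource addresses and flat string attribute maps are made up for illustration, and none of this is actual Terraform internals:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// attrHash serializes a resource's attribute map in sorted-key order and hashes it.
func attrHash(attrs map[string]string) string {
	keys := make([]string, 0, len(attrs))
	for k := range attrs {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	for _, k := range keys {
		fmt.Fprintf(h, "%s=%s\n", k, attrs[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

// changedResources diffs the hashes stored at last apply against the hashes of
// the new code and returns the addresses that would form the autogenerated -target list.
func changedResources(lastApplied, newCode map[string]string) []string {
	var targets []string
	for addr, newHash := range newCode {
		if lastApplied[addr] != newHash {
			targets = append(targets, addr)
		}
	}
	return targets
}

func main() {
	// Made-up resource addresses and attributes, purely for illustration.
	lastApplied := map[string]string{
		"aws_instance.web":   attrHash(map[string]string{"ami": "ami-123", "instance_type": "t3.micro"}),
		"aws_s3_bucket.logs": attrHash(map[string]string{"bucket": "my-logs"}),
	}
	newCode := map[string]string{
		"aws_instance.web":   attrHash(map[string]string{"ami": "ami-123", "instance_type": "t3.large"}), // modified
		"aws_s3_bucket.logs": attrHash(map[string]string{"bucket": "my-logs"}),                           // untouched
	}

	fmt.Println(changedResources(lastApplied, newCode)) // [aws_instance.web]
}
```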



The problem: plan duration

If you’ve used Terraform to manage enough resources, you’ve surely hit a problem with plan duration.

Depending on the number of resources, the providers and their APIs, planning can take hours.

After some quick digging, you’ve probably realized that this is not a problem with planning per se, but rather with the state refresh that runs right before planning.

The actual problem: refresh duration

Terraform state is nothing but a picture of reality, taken the last time you performed an operation on it. However, what if reality changed? What if a resource was modified outside of Terraform?

A Terraform state refresh updates state with the current status of your resources according to the providers.

The pre-plan refresh improves your chances of a consistent plan because it closes the gap between the last time state was updated and right now. It does not guarantee it, however, because out-of-band changes could still happen between end-of-refresh and plan time.
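
You can also run that reconciliation on its own with a refresh-only plan, which shows drift without proposing any configuration changes:

```shell
terraform plan -refresh-only
```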

With enough resources, or slow enough providers, refreshes can take a long time. I’ve waited for up to 7 hours for a one-change plan!

The partial solutions

Solution A: increase parallelism

You can think of parallelism as the number of concurrent threads Terraform spawns to perform resource operations.

When refreshing, the -parallelism value roughly caps how many resources get refreshed concurrently.

High -parallelism helps until you get rate-limited by the provider. There’s little else you can do when that happens.
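
Until you hit those limits, though, it's a one-flag change (the default is 10):

```shell
terraform plan -parallelism=50
```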

Solution B: disable pre-plan refresh

You can disable the pre-plan refresh with the -refresh=false flag, but then you risk your plan’s consistency.

If a resource changed outside of Terraform, and you plan without refreshing, Terraform won’t know about those changes and will issue a plan that might cause unintended consequences.

You can somewhat make up for that by building tooling to refresh state whenever locking state for hours is least annoying, like nightly or weekends, but it’s still a consistency risk. Been there, done that.
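
In practice, that tooling boils down to something like:

```shell
# Day-to-day: skip the pre-plan refresh and accept some staleness.
terraform plan -refresh=false

# Nightly or over the weekend: reconcile state with reality, applying no config changes.
terraform apply -refresh-only -auto-approve
```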

Solution C: break down state

Refresh duration is a function of the number of resources in state, so you can break down your one big state into multiple smaller states.

This is so common that every few issues of weekly.tf, one features a new way or a new tool to do it. I've even written one myself, called Stacks, which we open-sourced at SREcon in 2023!

For example, you may spin off your “production” and “development” environments into separate Terraform states: plan times cut in half, lower blast radius in the event of a mistake, sweet!

This is all good as long as you don't decouple two sets of dependent resources. But when your "production" state gets big enough, you start thinking about breaking it up again, and at some point you're forced to split interdependent resources off into their own states.

For example, you might decouple your “network stuff” from your “platform stuff” (using generic terms here purposefully), but your “platform” cluster needs to know what network to live in, and that network is now defined in your “network” state.

You either hard-code it (bad) or bridge the two states (for example, with a terraform_remote_state data source). This feels good, but your dependent resources, which were previously connected by in-state dependency relationships, are now disconnected.
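
A typical bridge looks something like this (the backend type, bucket, AMI and output names here are made up for illustration):

```hcl
# In the "platform" project: read the "network" project's outputs from its remote state.
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "example-terraform-states"
    key    = "network/terraform.tfstate"
    region = "us-east-1"
  }
}

# The instance now learns its network from another state's outputs,
# not from an in-graph dependency on the subnet resource itself.
resource "aws_instance" "platform" {
  ami           = "ami-0123456789abcdef0"
  instance_type = "t3.large"
  subnet_id     = data.terraform_remote_state.network.outputs.private_subnet_id
}
```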

So you hit another consistency risk: your "network" and "platform" projects don't necessarily get planned and applied together. If the network you defined earlier, on which the "platform" cluster depends, gets recreated, the cluster won't be.

You can somewhat make up for this by extending your tooling to trigger a plan on “platform” right after “network” applies, but not only is this significantly more complex, you haven’t fixed anything!

You are still going to refresh, plan and apply the same number of resources that you would have, had you kept everything together in the same state. If anything, it'll take longer because you're now doing it sequentially. Furthermore, you lose the flexibility to make a resource in "network" depend on a resource in "platform", because that would create a dependency loop.

Spinning interdependent resources off into separate states is likely only worth the extra complexity if the parts being split have different change frequencies. If your "network" project changes once a month but your "platform" project receives changes daily, the delta in plan times is probably worth the complexity of spinning "network" off into its own state.

And while considering all of this, remember that anything that has to do with structuring Terraform is pretty much a one-way door because of how hard it is to refactor into a new structure.

The proposed solution

Add a terraform plan -light flag such that only resources modified in code are targeted for planning.

If you've read this far, you're probably familiar with the -target flag: it lets you filter which resources are in scope for the plan. If you -target just a few of your state resources, only those (and their dependents, recursively) will get refreshed and planned.
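
For example (resource addresses made up):

```shell
terraform plan -target=aws_instance.web -target=aws_s3_bucket.logs
```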

However, it’s not convenient to tell Terraform its exact scope every time you make a change. What if Terraform could figure out what resources most likely changed, based on code changes?

That’s what the terraform plan -light flag would be for.

Breakdown

This gives you plans that are more likely to be consistent (more so than with -refresh=false, at least) over just the resources that (likely) changed, reducing refresh duration to the smallest possible span.

It avoids the need to split state off into separate projects, keeping all of your resources together in the same dependency graph, if you want to.

In exchange, you omit the part where Terraform reconciles drifted resources, but you can have that happen nightly or on weekends when locking state for hours is less annoying, and if the nightly “heavy” plan is non-zero (i.e. has changes), notify the owner and act accordingly.
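
One way to wire up that nightly check is to lean on -detailed-exitcode, which makes the plan's exit code distinguish "has changes" from "no changes". The alerting step is a placeholder:

```shell
# -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes present.
terraform plan -detailed-exitcode
if [ $? -eq 2 ]; then
  echo "drift or pending changes detected, notify the owner"  # replace with real alerting
fi
```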

What now?

I’ve filed issues both in Terraform and OpenTofu to get this implemented. Upvote them if you like the idea.

Thanks for dropping by!
