A random failure: Problems upgrading Ansible and Python


The story of a data loss bug I caused at work, how I fixed it, and some surprising behavior in Python and Ansible






2020-05-20

Technologies: ansible, python
Tags: work

What follows is the story of a data loss bug I caused at work, how I fixed it, and my journey to discovering surprising behavior in Python and Ansible.

- Start from the top to read the full story.
- Read the end of the story to find the strategy I will use to prevent similar problems in the future.
- Skip to the Epilogue for in-depth technical testing and findings, as well as the bugs I reported and the responses from the dev teams.

Chapter 1: The glorious upgrade

Wherein we undertake major Ansible and Python upgrades at work, and in so doing, discover an unfortunate, undocumented behavior change.

Ominous foreshadowing

It's a Friday. I am "on triage", meaning I am the point person for all new tickets and requests. In the mid-afternoon, a coworker lets me know that some of our machines have wiped important data after a reboot.

I don't take this too seriously. I'm aware that these machines have been recently moved around, and that wiping data is expected in such circumstances. I ask him to gather more data, but I don't investigate on my own. This will prove to be a mistake.

When my coworker gets back to me with additional information, it's clear that machines which had rebooted since Tuesday are losing data incorrectly.

What happened on Tuesday? Tuesday is the day I upgraded Ansible from 2.5 to 2.9, and - it turns out, most critically - Python from 2.7 to 3.6, for all of our Ansible provisioners globally.

Some of my best work…

…is implicated in this failure. Last year, I wrote and deployed a system that encrypted certain critical data on every machine running our customized Linux distribution within the firm, worldwide. My team collaborated with me on the design, but the implementation and deployment were completely on my shoulders.
In any other circumstance, I would be really proud of that.

We actually intended for it to wipe this data on boot - under certain conditions. Namely, if the machine were repurposed, or if the underlying hardware were swapped. This would protect us in at least two scenarios:

- If a recycler gets careless and liquidates machines without destroying disks, potentially leaking proprietary data belonging to us or our customers.
- If a machine gets repurposed and contains data that new users are not allowed to see. In our world, changing the purpose of a machine to this extent would always come with a hostname change.

I implemented this design by mixing a randomly generated secret (stored securely) with the hostname and serial number of the machine to form an encryption password. The system has worked in production for at least 5 months, and has been deployed to hundreds of machines so far. Any machine that is new, moved, or renamed will have this system applied to it automatically. (Machines already in production when I deployed this system will continue to run without it after rebooting, until they are moved, renamed, or retired.) We are on our way to rolling it out to several thousand machines around the world.

I certainly haven't forgotten that the Ansible/Python upgrade was also my project. I will soon discover that my creations are conspiring against me. Because today, data is getting wiped from machines which have NOT changed hostnames or serial numbers.

Chapter 2: Here be dragons

Wherein we come face to face with certain death.

I am actually pretty lucky

The machines which my coworker found to have lost data are not yet in production. They didn't have any critical data on them when they were rebooted, and there are about ten of them in total, which means I have a pool to test with. I'm also lucky that no other machines running this code have been rebooted since Tuesday.

So, you know. It could be worse.

I'm sweating pretty hard at this point

It's about 4PM.
Did I mention it's Friday? It's also A's birthday, and things haven't gone… perfectly. A few gifts have arrived late or wrong, and we're under the damned COVID19 lockdown, so the best party we can throw is margarita delivery and a FaceTime call.

Two of the most impactful changes I've made at the company are colliding at a really inconvenient time.

REBOOTS MAY CAUSE DATA LOSS

The first thing to do is write a strongly worded post on the internal wiki:

IT Notables - REBOOTS MAY CAUSE DATA LOSS, investigation ongoing

Not the kind of thing you generally want to have your name on. Fortunately, we have no critical reboots scheduled for this weekend, and after the wiki post, emails, chat rooms, and a lot of swearing in the privacy of my home office, I get to work as quickly as possible.

Debugging

I can't really describe the initial panic. It's a drowning feeling. I know this is a tractable problem, but the timing, the birthday, the attention I'd called to it, and the fact that these are the two biggest projects of my time at the firm put some pretty ugly visions in my head.

My job is not a calm introvert's paradise on the best day and, well, whenever that best day was, we'd started a global pandemic since then. I'm breathing shallowly and my eyes are darting around. I'm not sure what to look at first.

Right. I've got to fix this.

Chapter 3: The plot thickens

Wherein we press on against imminent destruction amidst thick swamps of doubt, to find ourselves at an intriguing discovery.

Narrowing down the problem

I add some elegant printf debugging to the Ansible role responsible for the encryption. Between that and a few shells open in different Python virtual environments, I am able to determine that yes, under our new Ansible 2.9/Python 3.6 system, the passwords I'm generating are different than under the old Ansible 2.5/Python 2.7 system.
To narrow down the problem, I test Ansible 2.9 under Python 2.7, and verify that the Python change alone will generate different passwords for the same input.

But why?

Some more specifics on the algorithm

Here are the relevant tasks from my Ansible role, in pseudocode:

```yaml
- name: Get the secret
  get_secret: ...
  register: secret_result

- name: Calculate encryption passphrase
  vars:
    salt: "{{ 65534 | random(seed=inventory_hostname + ansible_product_serial) | string }}"
  set_fact:
    passphrase: "{{ secret_result + ... | password_hash('sha512', salt) }}"
```

With a little more debugging, I can see that the salt is different under Python 3 than it was under Python 2.

Incidentally, I copied this magic 65534 number out of the Ansible filter documentation without thinking too much about it. Those docs use it in an example very close to what I am trying to do in my Ansible role:

> An idempotent method to generate unique hashes per system is to use a salt that is consistent between runs:
>
> {{ 'secretpassword' | password_hash('sha512', 65534 | random(seed=inventory_hostname) | string) }}

Is that supposed to work?

I remember when I wrote this implementation, I had wanted to make sure the password would be the same every time, given the same secret material, hostname, and serial number. I know I had checked the Ansible documentation:

> As of Ansible version 2.3, it's also possible to initialize the random number generator from a seed. This way, you can create random-but-idempotent numbers…

Hmm. Ansible says it should be the same since version 2.3. There's no mention of any breakage from new Ansible or Python versions here.

Did Ansible's implementation change?

In the stable 2.9 branch, I find the code in question. I do see changes in the history, but nothing seems to have really changed the algorithm in the Ansible code.
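As I understand it, the seeded path of that filter boils down to roughly the following plain Python. This is a simplified sketch of my reading of the code, not Ansible's exact implementation, and the hostname and serial number here are made up for illustration:

```python
import random

def seeded_filter_salt(seed, end=65534):
    # Simplified sketch of Ansible's seeded `random` Jinja2 filter:
    # build a Random instance from the seed string and draw one integer.
    # (Exactly how `end` is adjusted varied between Ansible versions.)
    return str(random.Random(seed).randrange(0, end + 1))

# Within a single Python version, the same seed always yields the same
# salt, which is what makes the documented password_hash recipe idempotent.
salt_1 = seeded_filter_salt("host01" + "SERIAL123")  # hypothetical host/serial
salt_2 = seeded_filter_salt("host01" + "SERIAL123")
assert salt_1 == salt_2
```

The idempotence only holds per interpreter, though, which is exactly the assumption about to break down.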
In every version from 2.5 to 2.9 (and probably before and after), calling the random filter as I do results in a call to Python's random.Random() with the optional seed parameter passed.

I do find one curious thing: in the 2.5 code for the filter, it increments the end argument (my magic 65534 number) by one. Uhh. But wait. In the 2.9 code for the filter, this doesn't seem to happen? I don't understand why this is. I try copying the Ansible 2.9 version of the filter and modifying it to increment the end argument by 1, as the Ansible 2.5 version does. This produces results dissimilar to both Ansible 2.5/Python 2.7 and Ansible 2.9/Python 3.6. I decide to leave this question unanswered: my previous test shows that Ansible 2.5 under Python 3.6 produces the same wrong result as the Ansible 2.9/Python 3.6 production deployment, so I think that line of inquiry will be more fruitful.

Did Python's implementation change?

It turns out, this code returns a different result under Python 2 and Python 3:

```python
import random
print(random.Random("example test seed").randrange(0, 65535, 1))
```

On my personal Mac, at least, Python 2 gives me a result of 35879, while Python 3 gives me a result of 13619.

Chapter 4: Intermission

Wherein we enjoy the comforts of the world, and sleep the sleep, not of the dead, but of the dying.

A good decision

I stop here. By this point, I have what I believe is a fix, but it isn't tested yet. I'm not sure that there isn't also something else wrong, hidden behind the change in random.Random() behavior.

I've already spoken with my team, letting them know that I would need to spend some time with A, and that I might not have a fix pushed until the next day. I leave for a while to join the FaceTime party. (I look sadly at the freshly made margarita pitcher.) And I go back to my desk to try to finish up that night. Ultimately, though, I decide not to attempt a production fix before bed.
Rather than exhaustedly deploying at the end of the night, I would get a good night's sleep and tackle the problem when there was more runway.

Chapter 5: An ignoble solution

Wherein we commit evil in pursuit of good; wherein we sin, yet achieve salvation.

Implementation

The most dumbfuck dipshit fix possible is to shell out to the python2 binary from within Python 3 Ansible and run the filter code as a one-liner. You had better fucking believe that's exactly what I do.

It is true that this is kicking the can down the road. At some point, we'll install machines without python2, right? I figure we have plenty of runway until then, and an obvious failure during testing when we call python2 in the Ansible role. Plus, I was pretty anxious to let people reboot machines again.

Testing

What comes next is testing, which is tricky. I have a limited number of machines that were booted before my Python 3 upgrade and contain data no one would miss - fewer than 10 of them easily available. Even so, my implementation works perfectly on every one of them.

I feel comfortable pushing the change, and immediately ~~remind everyone I once made a horrible mistake~~ communicate to everyone that it is now safe to reboot machines. I close my laptop and am able to enjoy the rest...
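For the curious, the shell-out workaround can be sketched in plain Python like this. This is a hedged illustration only: the real fix lives inside the Ansible role, the function names here are hypothetical, and it assumes a `python2` binary on the PATH:

```python
import subprocess

def py2_salt_command(seed, end=65534):
    # Build the argv for a python2 one-liner that reproduces the old
    # (Python 2) seeded randrange result - i.e. the salt the role used
    # to generate before the interpreter upgrade.
    one_liner = (
        "import random, sys; "
        "print(random.Random(sys.argv[1]).randrange(0, {} + 1))".format(end)
    )
    return ["python2", "-c", one_liner, seed]

def py2_seeded_salt(seed, end=65534):
    # Shell out to the python2 binary. On a machine without python2 this
    # fails loudly during testing, which is the obvious canary mentioned
    # above for when the can finally reaches the end of the road.
    return subprocess.check_output(py2_salt_command(seed, end)).strip().decode()
```

A usage sketch, with a hypothetical hostname and serial: `py2_seeded_salt("host01" + "SERIAL123")` returns the same salt string the Python 2 era deployment would have derived.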