Practical system administration, agility, and DevOps are rooted in successful enterprises. In this interview we have a golden opportunity to obtain meaningful guidance from one of the world’s top experts, Thomas A. Limoncelli, who draws on his deep experience authoring pioneering books and working with Stack Overflow, Bell Labs, Google, Usenix, ACM, and Queue.
Who is Thomas A. Limoncelli?
Thomas is an internationally recognized author, speaker, and system administrator. He is best known for his books The Practice of System and Network Administration, Time Management for System Administrators, and The Practice of Cloud System Administration. His first book is cited as inspiring a generation of system administrators. In 2005, he received the Usenix LISA Outstanding Achievement Award. He holds a BA in CS from Drew University and has worked at companies such as Bell Labs, Google and Stack Overflow.
To listen to the interview, go to the non-profit ACM Learning Center podcasts, where the MP3 file is available.
Here are extracts from the full interview:
Ibaraki:
You’ve had some interesting roles (such as with Bell Labs, Google and others), can you share some of the lessons from those roles?
Thomas:
Two of the big lessons I learned at the Labs were how to learn and the importance of always learning. I also learned that one of the most important questions you can ask is: why are we doing this? Often we don’t pause and ask, what problem are we trying to solve? The other thing I learned at Bell Labs is to stop doing things that don’t work. It sounds like an obvious thing, but it’s amazing how much we do and how much is done in the world because it’s what we’ve always done. After the Labs, I was at a couple of small companies and I guess what I learned there is that I’d rather be at a big company. Then I was at Google for seven years and that was really amazing. I learned a lot, including how to manage a project in ways that are much more aggressive. Another thing that Google did was assume success. I’ve been at organizations where success was not assumed, so it became okay to not succeed, but if you are not assuming that you are going to succeed, then you are going to plan for five different outcomes. You are going to plan the success outcome, but you are going to have a backup plan in case things don’t succeed and you are going to do five times as much work. The other lesson I learned at Google was that the future is 24/7. I kind of knew that intuitively, but Google is really, really 24/7. The other thing I learned, about distributed computing, is that at Google the economics of distributed computing just seem so much better. It was such a gift to be there when a lot of the early operational aspects around how to keep systems running were being developed, iterated on and improved. For the last two and a half years, I’ve been at stackoverflow.com. It’s a question and answer website and even though it’s very large in its impact, it’s a relatively small shop.
What I learned there: I have a UNIX background and Stack Overflow is a mixture of UNIX and Windows, so applying all this stuff that I’ve learned to a Windows environment has been a unique challenge, as has working at such a smaller scale.
Ibaraki:
What tips can you share (updated to reflect current and future problems/solutions), from your notable books: “Practice of System and Network Administration”, “Time Management for System Administrators” and “The Practice of Cloud System Administration”?
Thomas:
My first book was “The Practice of System and Network Administration”, which I co-wrote with Christina Hogan (in later editions we added another co-author, Strata Chalup). The point of that book was that we wanted it to be a general system administration book, but we didn’t want it to be technical, we wanted it to be strategic. The themes that run throughout the book are really about people skills, automation and simplicity. I think it’s the first book on system administration that is not technology specific. It’s vendor agnostic and I think that’s what made it special. The next book that I wrote was “Time Management for System Administrators”. It dawned on me that the biggest problem in IT is time management. The book is basically all my coping mechanisms and strategies, and it’s also a very short book because I think people who need a book like this can’t read a long book. The take-home from that book is essentially that you need two things to be successful in IT from a personal management point of view: one is a way to manage the interruptions that are coming in, and the other is a way to organize your work. The third book is “The Practice of Cloud System Administration”, which I wrote with Christina Hogan and Strata Chalup. That was an attempt to take all the lessons that I learned from Google and from the DevOps company and try to capture that. Even though we had Cloud in the title, it’s not really just a Cloud book. The first half of the book is about the design of large distributed systems and even if it might not be designed yet, you need the vocabulary so that you can talk to the designers. The second half is about operating large complex systems and this was very much what we learned working at Google and other large companies.
Ibaraki:
Can you share some key takeaways from the Usenix LISA conference?
Thomas:
If you are looking for a way to network and at the same time help the world, get involved in conference planning in and out of tech. I’ve mostly been involved in Usenix-related conferences and I attended my first Usenix conference in 1989 on a student grant. What have I learned through conferences? I learned it’s not just what you know, but who you know. I believe that everyone in the world has some kind of super power and my super power is my network. Someone can ask me a technical question about some technology I’ve never heard of and I feel like because I’ve attended enough conferences I can find someone who does know about that technology and reach out to that person. The LISA conference is not an academic conference but it has academic origins. The most important thing that I’ve learned there is to front-load the talk. A much better format is that you begin with the surprise and say, we achieved this with this algorithm and then explain the details.
Ibaraki:
Let’s drill further into your work at Stack Overflow. Can you describe some scenarios you encountered and their solutions and maybe some notable results from your work?
Thomas:
Doing those kinds of failovers, either data center failovers or little database master/slave failovers, is risky, and what I have learned is that if something is risky, you should do it a lot (which is quite radical because a lot of people think that if it’s risky, we should avoid it). In technology, a risky process is inherently risky, but a risky procedure is only risky if you don’t do it often enough, so we forced ourselves to do it a lot. We started doing fire drills every other month until we got a lot better at doing these things, and each time we did it we found some bugs and learned about lots of problems. One thing we learned was that there was only one person who knew how to do a certain procedure, so that person couldn’t be on vacation if we had a real emergency. That was an important lesson: we tightened things up and put more people on the team who knew that particular task. Certain things that were very cumbersome were automated, and things that were error prone were better documented.
Ibaraki:
In your current role, can you comment on some of the useful resources (including all of your current and future books)?
Thomas:
One thing I want to say about the books is that the most gratifying results of the book have come from the fans. There are some companies that practice system administration and there are some companies where all new employees get a copy which is quite flattering. I got an email from a woman in Japan who thought that she was the only person in IT that was struggling. And reading this book and the various anecdotes in it she realized that she wasn’t the only one, and she said that she cried and that she was so happy to know that IT can be stressful, but can be so much better.
Ibaraki:
I know you do work for non-profits such as the Association for Computing Machinery (ACM); can you talk about some of that work?
Thomas:
I love working with ACM. I’ve been an ACM member since 1988, when I joined as a freshman in college at the great student discount. I’ve been to computing competitions and later, after graduation, got involved in some committees. I find that the ACM publications are always very enlightening. What I’ve gotten out of volunteering with them is more networking contacts and also a lot of useful stories that I’ve been able to use in my writing.
Ibaraki:
Can you talk more about some of the continuing value of Queue, perhaps what we might look for in future Queue?
Thomas:
What I really like about Queue is that it’s trying to bridge the academic world and the practitioner world. There are so many jewels inside the ACM, and if they are not shared around the world, what good are they? I think ACM Queue is a good way of getting that message out to practitioners.
Ibaraki:
Do you have a sense of some of the megatrends that are out there that you think will shape our lives, world, destiny and work?
Thomas:
I think the biggest trend happening right now is DevOps, which I realize means a lot of different things to different people, but let me give you this definition. It’s rapid release: the ability to go from the developer who is writing code to getting it into production very quickly and, instead of one big release once a year, doing monthly, weekly or daily releases so that new features get to users faster and there are more opportunities to learn. When you are vertically integrated you can be more dynamic. To manage that dynamically and faster than your competition, you have to be vertically integrated. That’s mega-trend number two, and a lot of that is enabled by DevOps philosophies and methodologies. I think the third mega-trend is having management recognize that every company is now a software company. The only way to beat the competition is to have the best people, and to hire the best people you need to have the best, most attractive environment. That doesn’t just mean having a nice cafeteria, it means having the best managers, having an entire management team that is technical and understands software management as well as their own industry.
Ibaraki:
Agility is key today with start-ups and launching new products or services within a larger enterprise, and it’s pervasive even in larger enterprise, this start-up mentality. Do you have any recommendations as far as doing innovations within an enterprise or some of the key steps that lead to a successful start-up?
Thomas:
Being successful in a start-up is often different than innovating within a big company. I think that successful start-ups have to be willing to try new things and be willing to pivot a lot, but not too much. I think a lot of what makes start-ups successful is being first to market, and that means being incredibly agile. I think the difference in innovating at a big company is that you need an environment where you won’t get punished for failures. How do you encourage experimentation and a willingness to try new things?
Ibaraki:
You’ve worked in a variety of environments, in your opinion what are the key attributes or characteristics in individuals and teams that produce winning products and services?
Thomas:
I think what makes an operations team successful is that they can’t have an oral tradition, they have to have a written tradition. You need to write things down because not everyone on the team can know everything. The second thing is to work in small iterations. Don’t try to launch something big. The third thing is infrastructure as code and code as policy. Running code is better than a document: rather than saying we shouldn’t do x, y, z, you should have code that enforces it.
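To make the “code as policy” idea concrete, here is a minimal sketch in Python. It is an illustration only and not something described in the interview: the rules, the field names (`ssh_root_login`, `backups_enabled`) and the functions `violations` and `enforce` are all hypothetical. The point is that a written rule such as “root login over SSH must be disabled” becomes a check that runs automatically.

```python
from typing import Callable

# Hypothetical server description; the field names are assumptions for illustration.
Server = dict

# Each policy is a (description, check) pair; the check returns True when compliant.
POLICIES: list[tuple[str, Callable[[Server], bool]]] = [
    ("root login over SSH must be disabled",
     lambda s: not s.get("ssh_root_login", False)),
    ("backups must be enabled",
     lambda s: bool(s.get("backups_enabled", False))),
]


def violations(server: Server) -> list[str]:
    """Return the description of every policy the server breaks."""
    return [desc for desc, check in POLICIES if not check(server)]


def enforce(server: Server) -> None:
    """Fail loudly instead of relying on people to remember a written rule."""
    broken = violations(server)
    if broken:
        name = server.get("name", "unknown")
        raise ValueError(f"{name}: " + "; ".join(broken))
```

A provisioning or deployment script could call `enforce(server)` before making any change, so a non-compliant configuration is stopped by the code rather than by someone recalling a document.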
Ibaraki:
We are down to our last question and I’m going to give you a couple of choices here. Maybe you can share some stories from all of your travels, speaking and work or maybe there’s a question that you wished I’d asked and then what would be your answer? You choose.
Thomas:
I was working at a place and we had an incredibly complicated process, which I’ll summarize by saying it was the process involved every time we opened a new office. My team was involved in just one aspect of that process. We were often the team that was late, didn’t get things done as expected or would run into strange problems. We took a step back and we said, why are we having problems? Did we actually know what the process was? The team spent six months documenting just what the actual process was, and what we discovered was that our part of the process had 24 steps that involved 15 teams, and that’s crazy... I’m very proud of the team because it went from everything based on lore, to data, to code as documentation and policy. It made the whole process work so much better. It was a big effort and it typifies a lot of the projects that I’ve been on. I know that this isn’t a technical system administration technology solution. We didn’t buy a product, we didn’t adopt some new algorithm. This is all about making it work better through tenacious documentation and improvement.