Building Real Software - Jim Bird
Building real software: developing secure software at the extreme limits of performance and reliability. Software project management for small teams.

Developers just don’t go to security conferences

Tue, 07/13/2010 - 21:59
OWASP has just announced their 2010 US Appsec conference. It looks like an interesting opportunity to explore the state of the art in software security. Last year, some of the leaders of this conference were concerned about how few software developers showed up for the sessions. I expect the same will happen this year: the audience will be made up of a self-referencing group of security specialists and consultants, and a handful of developers and managers who are looking to understand more about software security. And that, I think, is as it should be.

Travel and education budgets are tight. Developers and managers need to choose carefully where to spend their company’s money and time – or their own. Where can they get the most information for their own work? Where can they meet people who will help them solve problems or move their careers forward?

Security experts like Jeremiah Grossman are right: developers don’t understand security, they aren’t taking ownership for building secure software, it’s not important to them.

What’s important to software developers? Delivery: if we don’t deliver, we fail. Check out the major software development conferences, the events that attract senior developers, architects, development managers, test managers, project managers. They are all about delivery, how to deliver faster, better, cheaper: Agile methods, understanding Lean/Kanban principles applied to software development, leadership and collaboration and communication and effecting organizational change, managing distributed teams and global development, getting requirements right, getting design right, tracking whether the project is on target, metrics, continuous integration, continuous delivery, continuous deployment, TDD and BDD, refactoring and improving code quality, improving the user experience, newer and better development platforms and languages and tooling.

With the notable exception of the recent NFJS UberConf, there isn’t any serious coverage of secure software development, secure SDLCs, software security problems at software development conferences. The question shouldn’t be why there are so few developers attending a software security conference. The question should be why there is so little coverage of software security at software development conferences and in the other places where developers and managers get their information: in the development-focused books and blogs and seminars.

Building Bridges

I’ve posted before about my concern about the gap between the software development and software security communities. But there is a way forward. Around 10 years ago, the development and testing communities were far apart in values and goals; testing was inefficient and was seen as “somebody else’s problem”. But Agile development made it important for developers to test early and often, made it important for developers to understand testing and code quality, to find better and more efficient ways to test. Testing is cool now – and more than that, it’s expected. Developers look to professional testers for help in improving their own testing, and to find bugs that they don’t understand how to find, through exploratory testing and integration testing and more advanced testing techniques. The development community is taking responsibility for testing their own work, for the quality of their work. And I believe that software development, and software, are both better for it.

But software security is still “somebody else’s problem”. This needs to change. The solution isn’t to try to entice developers to attend security conferences. It's not to force certification in secure development through SANS or (ISC)². It’s not passive attempts to “infect” development managers with vague ideas about being "Rugged" that are supposed to somehow change how developers build software. And holding software producers liable for their mistakes, while clearly showing the frustration of security specialists, and while making for a provocative sound bite, is not likely to happen either.

The solution is to make secure software development a problem for software developers, a problem that we need to solve ourselves. Engage leaders in the wider software development community: the people who spend their lives thinking about and writing about better ways to build better software; the people who help shape the values and priorities of the development community; the people who help developers and managers decide where to spend time and effort and money. And help them to understand the problems and how serious these problems are, convince them that they need to be involved, convince them that we need to include security in software development, and ask them to help convince the rest of us.

Engage people like Steve McConnell, and Martin Fowler, and Kent Beck, Uncle Bob Martin, Michael Feathers, the NFJS uber geeks, Joel Spolsky and Jeff Atwood, Scrum advocates like Mike Cohn and Ken Schwaber, Lean evangelists like Tom and Mary Poppendieck, David Anderson on Kanban, agile project managers like Johanna Rothman, and leaders from the software testing community like James Bach and Jonathan Kohl. And ask them who else they think should be engaged because they’ll know better who can make a difference.

Invite them to come in and work with the best of the software security community, to understand the challenges and issues in building secure software, ask them to consider how to “Build Security In” to software development. And with their help, maybe we will see software security problems owned by the people responsible for building software, and those problems solved with the help and guidance of experts in the security community in a supportive and collaborative way. Otherwise I am afraid that we will continue to see the communities drift apart, the gap between our priorities growing ever wider. And software, and software development, will not be the better for it.
Categories: Blogs

Still getting my head around Continuous Deployment

Wed, 06/30/2010 - 22:45
I attended a long webinar earlier today, sponsored by SD Times: Kent Beck’s Principles of Agility. The other speakers were Jez Humble from ThoughtWorks, a proponent of Continuous Delivery; and Timothy Fitz at IMVU, the leading evangelist for Continuous Deployment.

The arguments in support of Continuous Deployment

Kent Beck explored a fundamental mismatch between rapid cycling in design and construction, and then getting stuck when we are ready to deploy. He argues that queuing theory and experience show that there is more value in a system when all of the pipes are the same size, and follow the same cycle times. Ideally, there should be a smooth flow from ideas to design and development and to deployment, and then information from real use fed back as soon as possible to ideas. Instead we have a choke point at deployment.

Then there is the ROI argument that we can get faster return on money spent if we deploy something that we have done as soon as it is ready.

Kent Beck also explained that based on his experience at one company the constraints of deploying immediately make people more careful and thoughtful: that the practice becomes self-reinforcing, that developers stop taking risks because they don’t have time to. Essentially problems become simpler because they have to be.

Timothy Fitz presented a Deployment Equation:

If Information Value + Direct Value > Deployment Risk then Deploy

The idea is that Continuous Deployment increases information value by giving us information earlier. He talked about ways to reduce risk:

- Rolling out larger changes slowly to customers, through dark launching (hiding the changes from the front-end until ready: not exactly a new idea) and enabling features for different sets of users.
- Extensive automated testing, supplemented with manual exploratory testing before exposing dark-launched features.
- Ensuring that you can detect problems quickly and correct them through production monitoring, looking for leading indicators of problems, and instant production roll back.
- An architecture that supports stability through isolation. Follow the patterns in Release It! to minimize the chance of “stupid take the cluster out” errors.
- Locking down core infrastructure, preventing changes from certain parts of the system without additional checks.
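
To make the Deployment Equation and the gradual-rollout idea above a little more concrete, here is a minimal Python sketch. It is not how IMVU or anyone else at the webinar implements this: the function names, the feature name, and the choice to hash a user id into a bucket from 0 to 99 are all assumptions made for illustration.

```python
import hashlib


def should_deploy(information_value: float, direct_value: float,
                  deployment_risk: float) -> bool:
    """The Deployment Equation as stated: deploy when value outweighs risk.

    How to put numbers on these three quantities is left open here.
    """
    return information_value + direct_value > deployment_risk


def rollout_bucket(user_id: str, feature: str) -> int:
    """Map a (feature, user) pair to a stable bucket from 0 to 99.

    Hashing keeps the assignment deterministic without storing any state.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def feature_enabled(user_id: str, feature: str, percent_enabled: int) -> bool:
    """Dark-launch style switch: expose the feature to a percentage of users."""
    return rollout_bucket(user_id, feature) < percent_enabled
```

Because the bucket depends only on the feature name and the user id, raising percent_enabled from 20 to 50 keeps the first group of users enabled and adds more, and setting it to 0 acts as an immediate off switch.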

Jez Humble at ThoughtWorks presented on Continuous Delivery: building on top of Continuous Integration to automate and optimize further downstream packaging and deployment activities. Continuous Deployment is effectively an extension of Continuous Delivery. It was mostly a re-hash of another presentation that I had already seen from ThoughtWorks, and of course there will be a book coming out soon on all of this.

Some questions on Continuous Delivery and Continuous Deployment

Me: Continuous Delivery is based on the assumption that you can get immediate feedback: from automated tests, from post-deployment checks, from customers. How do you account for problems that don't show up immediately, by which time you have deployed 50 or 100 or more changes?

Answer from Timothy Fitz: The first time, you revert and re-push. Then you post-mortem and figure out how to catch faster by looking for a leading indicator. Performance issues can be caught by dark launching, in which case turning off or reverting the functionality will have 0 visible effect. Frontend issues are usually caught by A/B tests, where you can mitigate risk by not running them at 100% of all traffic (have 80% control, 20% hypothesis, etc)

Me: Follow-up on my question about handling problems that show up after 50 or 100 changes. The answer was to revert and re-push - but revert what? A problem may not show itself immediately. How do you know which change or changes to roll back?

Answer from Timothy Fitz: If it took 50-100 changes, then you'll be finding the change manually. It turns out to be fairly easy even if it's been 48-96 hours, you're only looking through a few hundred very small commits most of which are in isolated areas unrelated to your problem.
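
Timothy Fitz described finding the change manually. As an alternative that he did not propose, the search can be narrowed with a binary search over the commit history, the same idea behind git bisect. The sketch below is only an illustration under strong assumptions: the changes are ordered oldest to newest, the newest one is known to be bad, there is a reliable check for the problem, and once the problem appears it stays. The function name and the is_bad callback are hypothetical.

```python
from typing import Callable, Sequence


def first_bad_change(changes: Sequence[str],
                     is_bad: Callable[[str], bool]) -> str:
    """Return the earliest change at which is_bad() reports the problem.

    Assumes the last change is bad and that the problem persists once
    introduced. If that does not hold, the result is meaningless.
    """
    if not changes:
        raise ValueError("no changes to search")
    lo, hi = 0, len(changes) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(changes[mid]):
            hi = mid          # problem already present: look earlier
        else:
            lo = mid + 1      # still healthy here: look later
    return changes[lo]
```

With a hundred candidate changes this needs at most seven checks, but it only helps when the problem can be reproduced on demand, which is exactly what slow-burning problems tend to resist.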

Me: How do you handle changes to data (contents and/or schema) on a continuous basis?

Answer: not answered. Jez Humble talked about writing code that could work with multiple different database versions (which would make design and testing nasty of course), and how to automate some database migration tasks with tools like DBDeploy, but admitted that “databases were not optimized for Continuous Delivery”. There were no good answers on how to handle expensive data conversions.
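
To show what “code that could work with multiple different database versions” might look like, and why it makes design and testing harder, here is a small Python sketch. The column names (full_name, first_name, last_name) and the function are invented for illustration; nothing here comes from the webinar or from DBDeploy.

```python
def customer_display_name(row: dict) -> str:
    """Read a customer row that may come from the old or the new schema.

    Hypothetical schemas:
      old: {"full_name": "Ada Lovelace"}
      new: {"first_name": "Ada", "last_name": "Lovelace"}
    """
    if "first_name" in row and "last_name" in row:
        # New schema: build the display name from its parts.
        return f"{row['first_name']} {row['last_name']}".strip()
    # Old schema: fall back to the single column.
    return row["full_name"]
```

Every branch like this has to be tested against both schema versions and later removed, which is the extra cost that comes with deploying code and schema changes independently.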

Me: My team has obligations to ensure that the software we deliver is secure, so we follow secure SDLC checks and controls before we release. In Continuous Delivery I can see how this can be done along the pipeline. But secure Continuous Deployment?

Answer from Jez Humble: Ideally you'd want to run those checks against every version. If you can't do that, do it as often as you can.
[I didn’t expect a meaningful answer on this one, and I didn’t get one]

Somebody else’s question: Do you find users struggling to keep up and adapt to the constant changes?

Answer from Kent Beck: In practice it doesn't seem to be a problem usually because each change is small--a new widget, a new menu item, a new property page that's similar to existing pages. A wholesale change to the UI would be a different story. I would try to use social processes to support such a change--have a few leaders try the new UI first, then teach others.

Somebody else’s question: Without solid continuous testing in place, CD is [a] fast track to continuous complaints from end users

Answer from Timothy Fitz: Not always, but usually. For the cases where it makes sense (small startup, or isolated segment that opts-in to alpha) you can find user segments who value features 100% over stability, and will gladly sign up for Continuous Deployment.

So what do I really think about Continuous Deployment

OK I can see how Continuous Deployment can work,

If: your architecture supports isolation, that it is horizontal and shallow, offering features that are clearly independent;

If: you don’t follow the all-or-none approach – that you recognize that some kinds of changes can be deployed continuously and some parts of the system are too important and require additional checks, tests, reviews, and more time;

If: you build up enough trust across the company;

If: your customers are willing to put up with more mistakes in return for faster delivery, if at least some of them are willing to help you do your testing for you;

If: you invest enough in tools and technology for automated layered testing and deployment and post-deployment checking and roll-back capabilities.

Continuous Deployment is still an immature approach and there are too many holes in it. And as Kent Beck has pointed out, there aren’t enough tools yet to support a lot of the ideas and requirements: you have to roll your own, which comes with its own costs and risks.

And finally, I have to question the fundamental importance of immediate feedback to a company. I can see that waiting a year, or even a month, for feedback can be too long. I fully understand and agree that sometimes changes need to be made quickly, that sometimes the windows of opportunity are small and we need to be ready immediately. And there’s first mover advantage, of course. But I have a hard time believing that any kind of changes need to be continuously made 50 times per day: that there are any changes that can be made that quickly that will have any real difference to customers or to the business. And I will go further and say that such rapid changes are not in the interests of customers, that they don’t need or even want this much change this fast. And that I don’t believe that it’s really about reducing waste, or maximizing velocity or increasing information value.

No, I suspect it is more about a need for immediate satisfaction – for programmers, and the people who drive them. Their desire to see what they’ve done get into production, and to see it right away, to get that little rush. The simple inability to delay gratification. And that’s not a good reason to adopt a model for change.

Velocity 2010 Conference Take-Aways

Mon, 06/28/2010 - 20:05
I spent an interesting few days last week at the Velocity 2010 conference in Santa Clara. The focus of the conference was on performance and application operations for large-scale web apps. Here are my take-aways:

Performance

Fundamentally a problem of scale-out, of handling online communities of millions of users and the massive amounts of information that they want at hand. As Theo Schlossnagle pointed out in an excellent workshop on Scalable Internet Architectures (or you can read the book), the players in this space approach performance problems with similar technologies (LAMP or something similar like Ruby on Rails as the principal stack, and commodity servers) and architectural strategies:

1. Data partitioning – sharding datasets across commodity servers, required because MySQL does not scale vertically. Theo’s advice on sharding: “Avoid sharding as long as possible, it is painful. If you have to shard, follow these steps. Step 1: Shard your data. Step 2: Shoot yourself”. Consider duplicating data if you need the same information available in different partitioning schemes.

2. Non-ACID key-value data stores and NoSQL distributed data managers like Cassandra, MongoDB, Voldemort, Redis or CouchDB for handling high volumes of write-intensive data. Fast and simple, but these technologies are still immature, they are not hardened or reliable, and they lack the kinds of management capabilities and tools that Oracle DBAs have been accustomed to for years.

3. Strategies for effective caching of high-volume data, basically ways of extending and optimizing the use of memcached, and different schemes for effective cache consistency and cache coherency.
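
As a rough illustration of the first strategy, here is a minimal Python sketch of hash-based shard routing. It is not taken from Theo’s workshop; the function names and the simple hash-modulo scheme are assumptions, and real systems often use directory lookups or consistent hashing instead. The second function illustrates one reason sharding can be painful: with this naive scheme, adding a shard changes where most keys live.

```python
import hashlib
from typing import Iterable


def shard_for(key: str, shard_count: int) -> int:
    """Pick a shard for a key with a stable hash (naive hash-modulo scheme)."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def keys_moved(keys: Iterable[str], old_count: int, new_count: int) -> int:
    """Count keys whose shard changes when the number of shards changes."""
    return sum(
        1 for k in keys if shard_for(k, old_count) != shard_for(k, new_count)
    )
```

Running keys_moved over a sample of keys when going from 4 to 5 shards shows that the large majority of rows would have to be migrated, which is part of why the advice is to avoid sharding for as long as possible.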

Some other advice from Theo: Planning for more than a 10-fold increase in workload is a waste of time – you won’t understand the type of problems that you are going to face until you get closer. On architecture and design: don’t simplify simple problems.

Coming from a financial trading background, I was surprised to see that the argument still needed to be made that performance was an important business factor: that speed could improve business opportunities. Seems obvious.

According to one of the keynote speakers, Urs Hölzle at Google, the average time for a page to load is 4.9 seconds, while the goal should be around 100 ms – the time that it takes a reader to turn a page in a book. Google presented some interesting research work that they are leading to improve the front-end response time of the web experience, including proposals to improve DNS and TCP, work done in Chrome to improve browser performance, and advanced performance profiling tools made available to the community.

Operations and DevOps

Provisioning and deployment (a real management problem when you need to deploy to thousands or tens of thousands of servers); change management and the rate of change; version control and other disciplines; instrumentation and logging; metrics and more metrics; and failure handling and incident management.

Log and measure as much as you can about the application and infrastructure – establish baselines, understand Normal, understand how the system looks when it is working correctly and is healthy.

Configuration management and deployment. Advice from Theo: version control everything – not just code and application configuration, and server configs, but also the configs for firewalls and load balancers and switches/routers and the database schemas and


Several companies were using Chef or Puppet for managing configuration changes. Facebook and Twitter were both using BitTorrent to stream code updates across thousands of servers.

Change management. The consensus is that ITIL is very uncool – it is all about being slow and bureaucratic. This is a shame – I think that everyone in an operations role could learn from the basics of ITIL and Visible Ops, the disciplines and frameworks.

The emphasis was on how to effect rapid change, how to get feedback as quickly as possible, time to market, continuous prototyping, A/B split testing to understand customer needs, the need to make decisions quickly and see results quickly. At the same time, different speakers stressed the need for discipline and responsibility and accountability: that the person who is responsible for making a change should make sure it gets deployed properly, and that it works.

Continuous Deployment came up several times, although “Continuous” means different things to different people. For Facebook this means pushing out small changes and patches every day and features once per week.

You can’t make changes without taking on the risk of failure. This was especially clear to an audience where so many people had experience in startups.

Lenny Rachitsky’s session, The Upside of Downtime, covered the need for transparency in the event of failure, and showed how being transparent and honest in the event of a failure can help build customer confidence. His blog, Transparent Uptime, includes an interesting collection of Public Health Dashboards for web communities.

To succeed you need to learn from failures of course – use postmortems and Root Cause Analysis to understand what happened and implement changes so that you don’t keep making the same mistakes. Another quote from Theo: “Good judgment comes from experience. Experience comes from bad judgment. Allow people to make mistakes – but limit the liability. Measure the poise and integrity with which someone deals with the problem and its remediation.”

So failure can scorch you, make you afraid, and this fear can affect your decision making, slow you down, stop you from taking on necessary and manageable risks. You need to know how much risk you can take on, whether you are going too slow or too fast, and how to move forward.

John Allspaw at Etsy, one of the rock stars of the devops community, made a clear and compelling (and entertaining) case for meta-metrics, data to provide confidence in your operational decisions: “How do we get confidence in [what we are doing]? We make f*&^ing graphs!”

First track all changes: who made the change, when, what type, and how much was changed. Track all incidents: severity, time started, time to detect, time to recover/resolve, and the cause (determined by RCA). Then correlate changes with incidents: by type, size, frequency. With this you can answer questions like: What type of incidents have high Time to Recover? What types of changes have high / low success rates?
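
The correlation step can be sketched in a few lines of Python. This is not John Allspaw’s tooling, which I have not seen; the record fields, the change types and the two summary functions are assumptions that simply follow the list above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Change:
    change_id: str
    change_type: str      # hypothetical categories, e.g. "code" or "config"
    lines_changed: int


@dataclass
class Incident:
    severity: int
    minutes_to_detect: float
    minutes_to_recover: float
    cause_change_id: Optional[str]   # from RCA; None if no change was at fault


def success_rate_by_type(changes: List[Change],
                         incidents: List[Incident]) -> Dict[str, float]:
    """Fraction of changes of each type that were not blamed for an incident."""
    failed = {i.cause_change_id for i in incidents if i.cause_change_id}
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for c in changes:
        totals[c.change_type] += 1
        if c.change_id in failed:
            failures[c.change_type] += 1
    return {t: 1 - failures[t] / totals[t] for t in totals}


def mean_time_to_recover_by_type(changes: List[Change],
                                 incidents: List[Incident]) -> Dict[str, float]:
    """Average recovery time of incidents, grouped by the causing change type."""
    type_of = {c.change_id: c.change_type for c in changes}
    buckets: Dict[str, List[float]] = defaultdict(list)
    for i in incidents:
        change_type = type_of.get(i.cause_change_id)
        if change_type is not None:
            buckets[change_type].append(i.minutes_to_recover)
    return {t: sum(v) / len(v) for t, v in buckets.items()}
```

The output of functions like these is what you would graph over time; the value comes from the trend, and it is only as good as the root cause analysis that links each incident to a change.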

Unfortunately the video and slide deck for this presentation are not available on the Velocity site yet.

There was some macho bullshit from one of the speakers about “failing forward” – that essentially rolling back was for cowards. I think this statement was made tongue-in-cheek and I hope that it was taken as such by the audience.

The Rest

I also followed up some more on Cloud Computing. Sure, the Cloud gives you cheap access to massive resources but the consensus at the conference was that it is still not reliable and it is definitely not safe, and it doesn’t look like it will get that way soon. Any data that you need to be safe or confidential needs to be kept out of the Cloud or at minimum encrypted and signed with the keys and other secrets stored out of the Cloud, following a public/private data architecture.

The conference was fun and thought-provoking, and I met a lot of smart and thoughtful people. The crowd was mostly young and attention-deficit: iPhones, iPads, notebooks and laptops in constant use throughout the sessions.

Maybe it was the California sunshine, but the atmosphere was more open, more sharing, and less proprietary than I am used to – there was a refreshing amount of transparency into the technology and operations at many of the companies. The vendor representation was small and low key, but recruitment was blatant and pervasive: everyone was hunting for talent.

I am an uptight enterprise guy. It would be fun to work on large-scale consumer problems, with more freedom to make changes. I regret missing the follow-on DevOps Days event last Friday but I had to get home. And finally, I am looking forward to getting my copy of the new WebOps book, which was premiered at the conference, and to next year’s Velocity conference.

Fast or Secure Software Development - but you can't have both

Mon, 05/24/2010 - 23:24
There is a lot of excitement in the software development community around Lean Software Development: applying Japanese manufacturing principles to software development to reduce waste and eliminate overhead, to streamline production, to get software out to customers and get their feedback as quickly as possible.

Some people are going so far as to eliminate review gates and release overhead as waste: “iterations are muda”. The idea of Continuous Deployment takes this to the extreme: developers push software changes and fixes out immediately to production, with all the good and bad that you would expect from doing this.

Continuous Deployment has been followed to success at IMVU and WordPress and Facebook and other Web 2.0 companies. The CEO of Automattic, the team behind WordPress, recently bragged about their success in following Continuous Deployment:
“The other day we passed product release number 25,000 for WordPress.com. That means we’ve averaged about 16 product releases a day, every day for the last four and a half years!”
I am sure that he is not proud of their history of security problems, however, which has been written up in a number of places.

And Facebook? You can read about how they use Continuous Deployment practices to push code out to production several times a day. As for their security posture, Facebook has "faced" a series of severe security and privacy problems and continues to run into them, as recently as last week.

I’ve ranted before about the risks that Continuous Deployment forces on customers. Continuous Deployment is based on the rather naïve assumption that if something is broken, if you did something wrong, you will know right away: either through your automated tests or by monitoring the state of production, errors and changes in usage patterns, or from direct customer feedback. If it doesn’t look like it’s working, you roll it back as soon as you can, before the next change is pushed out. It’s all tied to a direct feedback loop.

Of course it’s not always that simple. Security problems don’t show up like that, they show up later as successful exploits and attacks and bad press and a damaged brand and upset customers and the kind of mess that Facebook is in again. I can’t believe that the CEO of Facebook appreciates getting this kind of feedback on his company's latest security and privacy problems.

Now, maybe the rules about how to build secure and reliable software don’t, and shouldn’t, all apply to Web 2.0, as Dr. Boaz Gelbord proposes:
“Facebook are not fools of course. You don't build a business that engages every tenth adult on the planet without honing a pretty good sense for which way the wind is blowing. The company realizes that it is under no obligation to provide any real security controls to its users.”
Maybe to be truly open and collaborative, you are obliged to make compromises on security and data integrity and confidentiality. Some of these Web 2.0 sites like Facebook are phenomenally successful, and it seems that most of their customers don’t care that much about security and privacy, and as long as you haven’t been foolish enough to use tools like Facebook to support your business in a major way, maybe that’s fine.

And I also don’t care how a startup manages to get software out the door. If Continuous Deployment helps you get software out faster to your customers, and your customers are willing to help you test and put up with whatever problems they find, if it gives you a higher chance of getting your business launched, then by all means consider giving it a try.

Just keep in mind that some day you may need to grow up and take a serious look at how you build and release software – that the approach that served you well as a startup may not cut it any more.

But let’s not pretend that this approach can be used for online mission-critical or business-critical enterprise or B2B systems, where your system may be hooked up to dozens or hundreds of other systems, where you are managing critical business transactions. Enterprise systems are not a game:
“I understand why people would think that a consumer internet service like IMVU isn't really mission critical. I would posit that those same people have never been on the receiving end of a phone call from a sixteen-year-old girl complaining that your new release ruined their birthday party. That's where I learned a whole new appreciation for the idea that mission critical is in the eye of the beholder.”
This is a joke, right?

But seriously, I get concerned when thoughtful people in the development community, people like Kent Beck and Michael Feathers start to explore Continuous Deployment Immersion and zero-length iterations. These aren’t kids looking to launch a Web 2.0 site, they are leaders who the development community looks to for insight, for what is important and right in building software.

There is a clear risk here of widening the already wide disconnect between the software development community and the security community.

On one side we have Lean and Continuous Deployment evangelists pushing us to get software out faster and cheaper, reducing the batch size, eliminating overhead, optimizing for speed, optimizing the feedback loop.

On the other side we have the security community pleading with us to do more upfront, to be more careful and disciplined and thoughtful, to invest more in training and tools and design and reviews and testing and good engineering, all of which adds to the cost and time of building software.

Our job in software development is to balance these two opposing pressures: to find a way to build software securely and efficiently, to take the good ideas from Lean, and from Continuous Deployment (yes, there are some good ideas there in how to make deployment more automated and streamlined and reliable), and marry them with disciplined secure development and engineering practices. There is an answer to be found, but we need to start working on it together.

Code quality, refactoring and the risk of change

Thu, 05/20/2010 - 22:41
When you are working on a live, business-critical production system, deciding what work needs to be done, and how to do it, you need to consider different factors:
  1. business value: the demand and urgency for new business features and changes requested by customers or needed to get new customers onboard.

  2. compliance: ensuring that you stay onside of regulatory and security requirements.

  3. operational risk and safety: the risks of injecting bugs into the live system as a side-effect of your change, the likelihood and impact of errors on the stability or usability of the system.

  4. cost: immediate development costs, longer-term maintenance and support costs, operational costs and other downstream costs, and opportunity costs of choosing one work item over another or one design approach over another.

  5. technical: investments needed to upgrade the technology stack, managing the complexity and quality of the design and code.
These factors also come into play in refactoring: deciding how much code to clean up when making a change or fix. The decision of what and how much to refactor isn’t simple. It isn’t about a developer’s idea of what is beautiful and their need for personal satisfaction. It isn’t about being forced to compromise between doing the job the right way vs. putting in a hack. It’s much more difficult than that. It’s about balancing technical and operational risks, technical debt, cost factors, and trading off short-term advantages for longer-term costs and risks.

Let’s look at the longer term considerations first. A live system is going to see a lot of changes. There will be regulatory changes, fixes and new features, upgrades to the technology platform. There will also be false starts and back-tracking as you iterate, and changes in direction, and short-sighted design and implementation decisions made with insufficient time or information. Sometimes you will need to put in a quick fix, or cut-and-paste a solution. You will need to code-in exceptions, and exceptions to the exceptions, especially if you are working on an enterprise system integrated with tens or hundreds of other systems. People will leave the team and new people will join, and everyone’s understanding of the domain and the design and the technology will change over time. People will learn newer and better ways of solving problems and how to use the power of the language and their other tools; they will learn more about how the business works; or they might forget or misunderstand the intentions of the design and wander off course.

These factors, the accumulation of decisions made over time will impact the quality, the complexity, the clarity of the system design and code. This is system entropy, as described by Fred Brooks back in the Mythical Man Month:
“All repairs tend to destroy the structure, to increase the entropy and disorder of the system. Less and less effort is spent on fixing the original design flaws: more and more is spent on fixing flaws introduced by earlier fixes ... Sooner or later the fixing ceases to gain any ground. Each forward step is matched by a backward one.”
So, the system will become more difficult and expensive to operate and maintain, and you will end up with more bugs and security vulnerabilities – and these bugs and security holes will be harder to find and fix. At the same time you will have a harder time keeping together a good team because nobody wants to wade knee deep in garbage if they don’t have to. At some point you will be forced to throw away everything that you have learned and all the money that you have spent, and build a new system. And start the cycle all over again.

The solution to this, of course, is to be proactive: to maintain the integrity of the design by continuously refactoring and improving the code as you learn, filling in short-cuts, eliminating duplication, cleaning up dead-ends, and simplifying as much as you can. In doing this, you need to balance the technical risks and costs of change in the short term with the longer-term costs and risks of letting the system slowly go to hell.

In the short term, we need to understand and overcome the risk of making changes. Michael Feathers, in his excellent book Working Effectively with Legacy Code, talks about the fear and risk of change that some teams face:
“Most of the teams that I’ve worked with have tried to manage risk in a very conservative way. They minimize the number of changes they make to the code base. Sometimes this is a team policy: ‘if it’s not broke, don’t fix it’ … ‘What? Create another method for that? No, I’ll just put the lines of code right here in the method, where I can see them and the rest of the code. It involves less editing, and it’s safer.’

It’s tempting to think we can minimize software problems by avoiding them, but, unfortunately, it always catches up with us. When we avoid creating new classes and methods, the existing ones grow larger and harder to understand. When you make changes in any large system, you can expect to take a little time to get familiar with the area you are working with. The difference between good systems and bad ones is that, in the good ones, you feel pretty calm after you’ve done that learning, and you are confident in the change you are about to make. In poorly structured code, the move from figuring things out to making changes feels like jumping off a cliff to avoid a tiger. You hesitate and hesitate.

Avoiding change has other bad consequences. When people don’t make changes often they get rusty at it … The last consequence of avoiding change is fear. Unfortunately, many teams live with incredible fear of change and it gets worse every day. Often they aren’t aware of how much fear they have until they learn better techniques and the fear starts to fade away.”
It’s clear that avoiding changes won’t work. We need to get and keep control over the situation, we need to make careful and disciplined changes. And we need to protect ourselves from making mistakes.

Back to Mr. Feathers:
“Most of the fear involved in making changes to large code bases is fear of introducing subtle bugs; fear of changing things inadvertently”.
The answer is to ensure that you have a good testing safety net in place (from Michael Feathers one more time):
“With tests, you can make things better with impunity … With tests, you can make things better. Without them, you just don’t know whether things are getting better or worse.”
You need enough tests to ensure that you understand what the code does, and those tests need to be targeted so that they will detect changes in behavior in the area that you want to change.

Put in a good set of tests. Refactor. Review and verify your refactoring work. Then make your changes and review and verify again. Don’t change implementation and behavior at the same time.
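The sequence above can be sketched in a few lines of Python. This is only an illustration, not code from any real system: `legacy_fee`, `refactored_fee`, the rates and the sample inputs are all made up for the example. The idea is to pin down what the existing code does today (what Feathers calls characterization tests), restructure it, and confirm that nothing observable has changed before attempting any change in behavior.

```python
def legacy_fee(amount, customer_type):
    """Hypothetical legacy code: it works, but it is hard to read."""
    if customer_type == "gold":
        if amount > 1000:
            return amount * 0.01
        else:
            return amount * 0.02
    else:
        if amount > 1000:
            return amount * 0.03
        else:
            return amount * 0.05


# Refactored version: same behavior, table-driven structure.
# Each entry maps a customer type to
# (rate for amounts over 1000, rate for all other amounts).
_RATES = {"gold": (0.01, 0.02)}
_DEFAULT_RATES = (0.03, 0.05)


def refactored_fee(amount, customer_type):
    high_rate, low_rate = _RATES.get(customer_type, _DEFAULT_RATES)
    return amount * (high_rate if amount > 1000 else low_rate)


# Sample inputs chosen around the branch boundaries of the legacy code.
CASES = [(amount, kind)
         for amount in (0, 999, 1000, 1001, 50000)
         for kind in ("gold", "standard", "unknown")]


def behavior_differences():
    """Return every sample input where the refactored code disagrees
    with the legacy code. An empty list means behavior was preserved
    for these samples (not a proof that it is preserved everywhere)."""
    return [(amount, kind)
            for amount, kind in CASES
            if refactored_fee(amount, kind) != legacy_fee(amount, kind)]


print("differences:", behavior_differences())
```

Only once the list of differences is empty would you go on to make the actual fix or feature change, as a separate step with its own tests and review.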

But there are still risks – you are making changes, and there are limits to how much protection you can get from developer testing, even if you have a high level of coverage in your automated tests and checks. Ryan Lowe reinforces this in “Be Mindful of Code Entropy”:
“The reason you don’t want to refactor established code: this ugly working code has stood the test of time. It’s been beaten down by your users, it’s been tested to hell in the UI by manual testers and QA people determined to break it. A lot of work has gone into stabilizing that mess of code you hate.

As much as it might pain you to admit it, if you refactor it you’ll throw away all of the manual testing effort you put into it except the unit tests. The unit tests … can be buggy and aren’t nearly as comprehensive as dozens/hundreds/thousands of real man hours bashing against the application.”
As with any change, code reviews will help find mistakes, and so will static analysis. We also ask our developers to make sure that the QA team understands the scope of their refactoring work so that they can include additional functional and system regression testing work.

Beyond the engineering discipline, there is the decision of how much to refactor. So as a developer, how do you make this decision, how do you decide how much is right, how much is necessary? There is a big difference between minor, in-phase restructuring of code and just plain good coding; and fundamental re-design work, what Martin Fowler and Kent Beck call “Big Refactorings” which clearly need to be broken out and done as separate pieces of work. The answer lies somewhere in between these points.

I recently returned from a trip to the Buddhist kingdom of Bhutan, where I was reminded of the value of finding balance in what we do, the value of following the Middle Way. It seems to me that the answer is to do “just enough”. To refactor only as much as you need to make the problem clear, to understand better how the code works, to simplify the change or fix … and no more.

By doing this, we still abide by Bob Martin’s Boy Scout Rule and leave the code cleaner than when we checked it out. We help protect the future value of the software. At the same time we minimize the risk of change by being careful and disciplined and patient and humble. Just enough.

Why isn't Risk Management included in Scrum?

Fri, 05/07/2010 - 17:17
There are a lot of fundamentally good ideas in Scrum, and a lot of teams seem to be having success with it, on more than trivial projects. I’m still concerned that there are some serious weaknesses in Scrum, especially “out of the box”, where Scrum leaves you to come up with your own engineering framework, asks you to come up with your own way to build good software.

I am happy to see that recent definitions (or reinterpretations) of Scrum, for example in Mike Cohn’s book Succeeding with Agile: Software Development Using Scrum, incorporate many of the ideas from Extreme Programming and other basic good management and engineering practices to fill in some of the gaps.

But I am not happy to see that risk management is still missing in Scrum. There is no discussion of risk management in the original definition of Scrum, or in the CSM training course (at least when I took it), the latest Scrum Guide (although with all of the political infighting in the Scrum community, I am not sure where to find the definitive definition if such a thing exists any more), not even in Mike Cohn’s new book.

I just don’t understand this.

I manage an experienced development and delivery team, and we follow an incremental, iterative development approach based on Scrum and XP. These are smart, senior people who understand what they are doing, they are talented and careful and disciplined, and we have good tools and we have a lot of controls built in to our SDLC and release processes. We work closely with our customer, we deliver software often (every 2-3 weeks to production, sometimes faster), we continuously review and learn and get better. But I still spend a lot of my time watching out for risks, trying to contain and plan for business and technical and operational risks and uncertainties.

Maybe I worry too much (some of my colleagues think so). But not worrying about the risks won’t make them go away. And neither will following Scrum by itself. As Ken Schwaber, one of the creators of Scrum, admits:
“Scrum purposefully has many gaps, holes, and bare spots where you are required to use best practices – such as risk management.”
My concern is that this should be made much more clear to people trying to follow the method. The contrast between Extreme Programming and Scrum is stark: XP confronts risk from the beginning, and the idea and importance of managing project and technical risks is carried throughout XP. Kent Beck, and later others including Martin Fowler, examine the problems of managing risk in building software, prioritizing work to take into account both business value and technical risk (dealing with the hard problems and most important features as early as possible), and outlining a set of disciplines and controls that need to be followed.

In Scrum, some management of risks is implicit, essentially burned-in to the approach of using short time-boxed sprints, regular meetings and reviews, more efficient communications, and a self-managing team working closely with the customer, inspecting and adapting to changes and new information. I definitely buy into this, and I’ve seen how much risk can be managed intrinsically by a good team following incremental, iterative development.

As I pointed out earlier, Scrum helps to minimize scope, schedule, quality, customer and personnel risks in software development through its driving principles and management practices.

But this is not enough, not even after you include good engineering practices from XP and Microsoft’s SDL and other sources. Some leaders in the Agile community have begun to recognize this, but there is an alarming lack of consensus within the community on how risks should be managed:

- by the team, intrinsically: like any other requirement or issue, members of the self-managing team will naturally recognize and take care of risks as part of their work. I’m not convinced that a development team is prepared to naturally manage risks, even an experienced team – software development is essentially an optimistic effort, you need to believe that you can create something out of nothing, and most of the team’s attention will naturally be on what they need to do and how to best get it done rather than trying to forecast what could go wrong and how to prepare for it;

- by the team, explicitly: the team owns risk management, they are responsible for identifying and managing risks using a risk management framework (risk register, risk assessment and impact analysis, and so on);

- by the ScrumMaster, as another set of issues to consider when coaching and supporting the team, although how the ScrumMaster then ensures that risks are actually addressed is not clear, since they don’t control what work gets done and when;

- by the Product Owner: this idea has a lot of backing, supposedly because the Product Owner, acting for the customer, in the end understands the risk of failure to the business best. The real reason is that the team can absolve themselves of the responsibility for risk management and hand it off to the customer. This makes me angry. It is unacceptable to suggest that someone representing the customer should be made responsible for assuming the risk for the project. The customer is paying your organization to do a job, and naturally expecting you as the expert to do this job professionally and competently. And yet you are asking this same customer to take responsibility for the risk that the project won’t be delivered successfully? Try selling that to any of the enterprise customers that I have worked with over the past 15 years.


Risk management for a Scrum project needs to take this into account, and take into account all of the risks and issues that are external to the project team:

- the political risks involved in the team’s relationship with the Product Owner, and the Product Owner’s position and influence within their own organization. Scrum places too much responsibility on the Product Owner and requires that this person is not only competent and committed to the project and deeply understands the needs of the business, but that they also always put the interests of the business ahead of their own, and of course this assumes that they are not in conflict with other important customer stakeholders over important issues. They also need to understand the importance of technical risk, and be willing to trade off technical risk against business value in planning and scheduling work in the backlog. I am lucky in that I have the opportunity to work with a Product Owner who actually meets this profile. But I know how rare and precious this is;

- the political risks of managing executive sponsors and other key stakeholders within the customer and within your own organization - it's naive to think that delivering software regularly and keeping the Product Owner happy is enough to keep the support of key sponsors;

- financial and legal risks – making sure that the project satisfies the legal and business conditions of the contract, dealing with intellectual property and confidentiality issues, and financial governance;

- subcontractors and partners – making sure that they are not going to disappoint or surprise you, and that you are meeting your commitments to them;

- infrastructure, tooling, environmental factors – making sure that the team has everything that they need to get the job done properly;

- integration risks with other projects - requirements for program management, coordinating and communicating with other teams, managing interdependencies and resource conflicts, handling delays and changes between projects;

- rollout and implementation, packaging, training, marketing – managing the downstream requirements for distribution and deployment of the final product.

So don’t throw out that PMP.

It’s clear that you need real and active project management and risk management on Scrum projects, like any project: focused not so much inwards, but outwards. So much of what you read about Scrum, and Agile project management generally, is naive in that it assumes that the development work is being done in a vacuum, that there is nothing that concerns the project outside the immediate requirements of the development team. Even assuming that the team is dealing effectively with technical risks, and that other immediate delivery risks are contained by being burned-in to doing the work incrementally and following engineering best practices, you still have to monitor and manage the issues outside of the development work, the larger problems and issues that extend beyond the team and the next sprint, the issues that can make the difference between success and failure.
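For teams that want to make this monitoring explicit, a risk register does not need to be heavyweight. The sketch below is one possible shape for it in Python, not a prescribed Scrum or PMP artifact: the `Risk` fields, the 1 to 5 impact scale, the probability-times-impact scoring and the sample entries are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    description: str
    category: str       # e.g. "political", "legal", "integration"
    probability: float  # estimated likelihood, 0.0 to 1.0
    impact: int         # estimated severity, 1 (minor) to 5 (critical)
    owner: str          # the person who watches this risk and acts on it

    @property
    def exposure(self) -> float:
        # A simple, commonly used score: likelihood times severity.
        return self.probability * self.impact


def top_risks(register, limit=3):
    """Return the highest-exposure risks, for review at each planning meeting."""
    return sorted(register, key=lambda r: r.exposure, reverse=True)[:limit]


# Illustrative entries only; real estimates come from the team and stakeholders.
register = [
    Risk("Executive sponsor loses interest", "political", 0.3, 5, "project manager"),
    Risk("Partner delivers interface late", "subcontractors", 0.5, 4, "tech lead"),
    Risk("Test environment not ready", "infrastructure", 0.2, 2, "ops lead"),
]

for risk in top_risks(register, limit=2):
    print(f"{risk.exposure:.1f}  {risk.category}: {risk.description} ({risk.owner})")
```

The scoring matters less than the habit: every risk has a named owner, and the top of the list gets looked at regularly alongside the backlog.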

XP and the Art of Software Maintenance

Fri, 04/23/2010 - 00:34
Most of us in software development will spend most of our careers maintaining and supporting software, not just working on new, greenfield projects. Maintenance is brownfield work, it’s hard work under difficult and sometimes unfair conditions and strict constraints. It’s not glamorous, and it takes discipline and skill and commitment to succeed over the long term.

There are a number of challenges in successfully maintaining a system, especially a big business-critical system, a system that has been around for a while, that represents the work of many people over many years.

There are the technical challenges in safely managing change, minimizing technical and operational risk, recognizing and containing technical debt, understanding and working with code that you did not have a hand in writing (and whose author has long left the company), testing and installing upgrades to ensure that the technology stack does not become obsolete, and keeping up with the changing security threat landscape. There is also the challenge of setting and maintaining a high level of quality, continuously reviewing and refactoring the design and implementation, and ensuring that you don’t let entropy set in and fall into the trap of having to re-write the system and lose all of the work that has already been done – what Joel Spolsky calls “the single worst strategic mistake that any software company can make”.

And there are challenges in managing the team, keeping the strongest possible team together for as long as possible, sustaining momentum and commitment and engagement over time, making the work interesting and worthwhile and important and fun.

As I described in a previous post, Everything I needed to know about Maintenance, I learned how a company could be successful over a long period of time maintaining and supporting the same software, continuously focusing on delivering new value to customers and pushing for technical excellence, all done by a small senior team following an incremental development approach. I have tried to apply these ideas at my current firm, where we have been supporting and maintaining a business-critical financial application for more than 3 years now.

Many of these same ideas and practices, and more, are captured in today’s agile development methods: XP, in particular, is an engineering and management framework that is especially well-suited for software maintenance. XP provides a foundation for maintenance through:

Frequent, small releases to production: always be delivering, creating a sense of accomplishment for the team and continuously delivering value to the customer, responding to change, and learning from feedback.

A constant focus on quality: “No defect is acceptable, each is an opportunity for the team to learn and improve”.

Making change safe through automated developer testing, building a regression safety net.

Continuous integration: always knowing that the code works.

Close contact with the customer (the business side and operations both).

Lightweight, continuous communications, with just enough documentation.

Refactoring: constantly improving code and design – critical in recognizing and reducing technical debt.

Slack and sustainable pace: giving people time to do their work, preventing burnout, and ensuring that you can fit in other work demands, especially important in maintenance because you can’t control requirements for support and firefighting.

Collective code ownership, skilling-up and skilling-out the team: in maintenance, by necessity sooner or later just about everyone is going to touch just about every piece of code.

Just enough design, breaking problems down and finding the simplest possible solution – this is a controversial aspect of XP when building enterprise systems from scratch, but it is the right approach in maintenance where you are dealing with incremental problems and the constraints of existing architecture and technology.

Transparency and respect – creating and maintaining an open, trusting and respectful environment within the team, with operations and with the customer.

Now, in our case, I don’t mean 100%, full-on, hardcore, literally by-the-book XP: I mean a dialed-down implementation of Extreme Programming, a less-Extreme Extreme Programming. As Kent Beck states in Extreme Programming Explained: Embrace Change:
“The values, principles and practices are there to provide guidance, challenge and accountability … The goal is successful and satisfying relationships and projects, not membership in the ‘XP Club’.”
We have followed his guidance, to
“Experiment with XP using these practices as your hypotheses”
and we have adapted the ideas and principles of XP to our situation, our experience and way of working.

XP, by Kent Beck’s admission, is an integrated set of good engineering and management practices, dialed up to 10. In adapting and integrating the practices in XP for maintenance, we have dialed back in specific areas:

Pair Programming

All of our code changes are reviewed before being released. The idea behind pairing is that if code reviews are good, continuous code reviews are better. Like my friend Pete McBreen in his post Still Questioning Extreme Programming, I think that pairing makes sense in specific cases, especially in troubleshooting and helping people new to the team, but it is not necessary all the time – our developers pair up when it makes sense.

Test First Development

We rely extensively on automated testing, testing early and testing often. Our developers choose to follow TFD or TDD practices, or write tests after the code is written, as they see fit, following the principle that
“code and testing can be written in either order … write tests in advance when possible”.
We put a lot of emphasis on testing, on automating developer testing – this is another area where XP is in agreement:
“in XP testing is as important as programming”.
This is a critical idea in maintenance, where even small changes to an existing system can have significant consequences.

Unlike some “pure” XP teams, we don’t rely only on the automated test suites: we have a senior team of testing specialists who conduct exploratory and destructive testing, system and integration testing, operational testing; we schedule regular application penetration tests; we run system trials and simulations with the team involved in interactive, loosely structured “war games”; and we take advantage of technologies such as static analysis in our Continuous Integration environment. Our automated regression testing safety net is an important asset, but we find valuable, higher-risk problems in exploratory testing, reviews and in the war games simulations.

Incremental Delivery

We follow a model closer to the first edition description of XP, releasing to production every 2-3 weeks rather than holding to a 1-week cycle – extremely short, rapid cycling isn’t sustainable, and doesn’t allow enough time for the reviews and other checkpoints that we have found necessary and valuable.

The Development/Maintenance Problem

In many organizations there are “developers” and “maintainers”: a team of hotshots is hired to design and build the initial system, and once the “hard work” is done, they hand it off to the maintenance and support or “sustained engineering” crew: kids and old-timers and other misfits who don’t have what it takes (yet, or any more) to do “real software development”; or the work of maintenance and support is offshored to a team in India or Eastern Europe to save costs.

But this is not the case in XP, as Pete McBreen points out in Questioning Extreme Programming:
“The interesting thing about XP, however, is that it assumes that applications are never really going to be handed off to a separate maintenance team. The assumption is that after each incremental release, the customer will want more functionality and keep funding the development team … As such, there is never a need to hand the application over to a maintenance team; the original development team can continue to support the application indefinitely”.
This idea of keeping the team together, preserving the knowledge that has been built up over time, the team’s deep and shared understanding of the domain, their proven ability to deliver, is critical and fundamental. This is the real intellectual property, the real value that you have created. Eric Brechner at Microsoft explores this in his post on Sustained engineering idiocy, where he shows that keeping the development team engaged in maintenance and support builds accountability, creates a deeper understanding of the system and of the customer’s needs, and informs future development, using the feedback from support to improve the quality and reliability of future releases or future products. Eric provides some useful ideas on how to balance the requirements of maintenance and support against future development: structuring your team around a core with an evaluation team that investigates and triages issues, and using backlog management to feed fixes into the incremental development schedule.

Most of the work that will be done on a piece of software, up to 70% in some studies, is done during maintenance. It just makes sense to have your best people working on what is most important: protecting the investment that you and your customer made in building the software in the first place; supporting the ongoing business operations of your customers; and ensuring that you and your customers will continue to succeed in the future.