How to survive a system crash at your firm

  • Independent.ie
    https://www.independent.ie/business/technology/how-to-survive-a-system-crash-at-your-firm-37242259.html

Early last Saturday, I was woken by an urgent message from a team member: a new user at one of our clients was unable to log in and was blocked from doing some urgent work. In the events industry, timing is absolutely rigid and cannot be pushed out: events have to happen at the advertised time and place. Our vision is to make life easier for event professionals, not harder by locking them out of our system.

The reality of running a business on the internet is that your website and services will inevitably crash. It happens to the best in the business. Amazon, Google, Apple, Facebook and others have all had their moments. You can’t plan for every eventuality: an Amazon data center was hit by a lightning strike in 2012. No amount of disaster planning is going to solve that problem.

However, you should plan to fail. You must figure out how you are going to deal with failure. I have put outage management systems in place before, but we have only just started this work at Voxgig. When it happens to you, there are three things you need to do, over the following hours, days and weeks.

First, when a problem occurs, do not lose your head. This is the biggest mistake you can make as a leader. If you start running around and screaming at your staff, you will make the problem much harder to solve. People make mistakes under pressure, and you will just add to it. Worse, you push people to hide things so that they can stay out of trouble, and those are often the very things that could help resolve the situation faster.

Canadian astronaut Chris Hadfield, famous for playing the guitar in outer space and tweeting fantastic pictures of Ireland from aboard the International Space Station, wrote a book after his adventure in which he talks about crisis management. In space, a crisis can mean that you die in 60 seconds or less, so you need the best possible approach to troubleshooting technical problems. Astronauts are not chosen for a careless attitude to risk. Instead, they are selected for their ability to work together as a team, put aside their egos and “work the problem”.

Forget about any axes you have to grind, and any “told you so”, and focus on fixing the problem in front of you. It takes teamwork to get the best result. If you have seen the movie ‘Apollo 13’, think of the scene on the ground in Houston where they say, “you have eight hours to make one of these (a carbon dioxide filter) from these (a random selection of pipes and materials)”. The team on the ground has to figure out how to save the astronauts using only what is available on the spacecraft.

For a remote team, this is where online chat tools such as WhatsApp and Slack really come into their own. Not only does typing take the heat out of your problem-solving discussion, it also makes it much easier for new people to get up to speed quickly on the issue.

In our case, last Saturday, it quickly became clear that a ‘clever’ shortcut I had taken on Friday had led to a series of unfortunate events that blocked the user. I called one of our developers to help, and we ‘worked the problem’ in real time over Slack. It took about three hours, but we survived. Here’s another tip: as CEO of a startup, you make so many decisions in so many parts of the business that it will almost always be your fault, which is another reason not to yell at your people.

The second thing you need to do happens in the first days after the incident, assuming it hasn’t killed your business (which does happen, unfortunately). You have to hold a blameless post-mortem. This means writing an analysis of what happened, with a timeline and commentary. It is important that this is safe for everyone, so that you can collect the maximum amount of data to understand why the problem occurred.

This is where online chat comes into its own again: you have a record of what happened and how you tried to fix it, so the timeline is easy to write and easy to learn from.
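
As a rough illustration of turning a chat record into a post-mortem timeline, here is a minimal Python sketch. The export format ("timestamp | author | text"), the names and the messages are all made up for the example; real chat tools export in their own formats, so the parsing step would need to be adapted.

```python
from datetime import datetime

# Made-up sample in a hypothetical "ISO timestamp | author | text" format.
# These are placeholder messages, not a record of any real incident.
CHAT_EXPORT = """\
2000-01-01T09:05:00 | alice | Hypothesis: the account setup step failed
2000-01-01T09:00:00 | bob | Customer reports a user cannot log in
2000-01-01T09:20:00 | alice | Test confirms it, applying a fix now
"""


def parse_timeline(export):
    """Parse exported chat lines and return them in chronological order."""
    entries = []
    for line in export.splitlines():
        if not line.strip():
            continue  # skip blank lines in the export
        stamp, author, text = (part.strip() for part in line.split("|", 2))
        entries.append((datetime.fromisoformat(stamp), author, text))
    return sorted(entries, key=lambda entry: entry[0])


for when, author, text in parse_timeline(CHAT_EXPORT):
    print(f"{when:%H:%M} {author}: {text}")
```

The sorted list is only the skeleton of the timeline; the commentary on each step still has to be written by the people who were there.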

What you’ll see in these chats is a process of deduction: hypothesis, test and measurement. Unfortunately, this usually happens on your live system, so it is high risk. That’s why people should be left alone to focus. If you are a non-technical founder, you should resist the urge to keep asking how things are going. Instead, spend your time communicating with the client. If you are a technical founder, you need to get someone on your team to stay in contact with the client while you solve the problem.

In your analysis, you should examine this process of deduction. How easy was it to test hypotheses about the error? Could you find the internal technical information you needed quickly? How safe was it to make changes? Did people have the access they needed? Did people have the information they needed?

In our case, we learned that we have an implicit, informal dependency on WhatsApp for customers to report problems. It actually works pretty well, but it is accidental infrastructure, and we need to think about how to use this app deliberately, and about backups and alternatives.

It also became clear that our technical knowledge is too specialized. Each member of the team is very focused on their own part of the system and does not know enough about the system as a whole to solve a problem like this. This is not surprising: it is the most efficient way to build a system in a startup context, as you get maximum productivity. But as you move from a minimum viable product to a situation where you have live customers, the trade-offs become more expensive. Startups are full of these kinds of transitions. Just when you have something working well, you need to change it again to handle the next phase of growth.

The last thing our post-mortem found is that we are not handling errors in the system properly. Again, this is the usual startup story. It is something we need to start building.

Finally, the third thing you should do is make decisions based on what you have learned. This is also where you have to make hard choices. Most often, you know exactly what the problems are, but you do not have enough resources to solve all of them. Being comfortable letting some fires burn is part of the startup mindset. The trick is choosing how to spread your risk. At this stage that is more art than science, and very dependent on context.

Here are a few of the actions we are going to take. We really need a status page. This is a website that shows our customers the current state of our system. It can also be used for updates in a crisis. Here is an example from Google: status.cloud.google.com. Ours will be much simpler of course, but the idea is the same. One small tip: unlike Google, you are not directly connected to the major network backbones, so your own domain may go down during an outage. It may be better to use a hosted status page service such as statuspage.io for full redundancy.
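
To make the idea concrete, here is a minimal Python sketch of the data behind a simple status page. The component names, the three state labels and the document layout are assumptions made for illustration; they are not taken from Google's status page or from any hosted service.

```python
import json
from datetime import datetime, timezone

# Assumed state labels, ordered from best to worst.
STATES = ("operational", "degraded", "outage")


def overall_status(components):
    """The page-level status is the worst state of any component."""
    worst = max((STATES.index(state) for state in components.values()), default=0)
    return STATES[worst]


def build_status_document(components, message=""):
    """Build the document a status page could render or serve as JSON."""
    return {
        "updated": datetime.now(timezone.utc).isoformat(),
        "overall": overall_status(components),
        "components": dict(components),
        "message": message,
    }


# Hypothetical components for a small web product.
doc = build_status_document(
    {"website": "operational", "login": "outage"},
    message="We are investigating login failures.",
)
print(json.dumps(doc, indent=2))
```

Whatever produces this document should run somewhere that does not share a fate with the main system, which is the point of the redundancy tip above.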

We should have a system checklist. This is a technical checklist that establishes the technical status of the system, and it is the first thing to run through in an emergency. It is the same idea as the checklists that airline pilots use before takeoff. It provides a basic level of safety. I highly recommend reading the book ‘The Checklist Manifesto’ by Atul Gawande for more on this topic. He is the surgeon who introduced checklists into the operating room to improve patient safety.
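
Parts of such a checklist can be automated. The Python sketch below shows one possible shape, assuming each item can be expressed as a function that returns True when healthy. The three checks are placeholders; real ones would probe the database, the login flow and so on.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Check:
    name: str
    run: Callable[[], bool]  # returns True when the item is healthy


def run_checklist(checks: List[Check]) -> List[Tuple[str, bool]]:
    """Run every item in order; a check that crashes counts as a failure."""
    results = []
    for check in checks:
        try:
            ok = bool(check.run())
        except Exception:
            ok = False
        results.append((check.name, ok))
    return results


# Placeholder checks for illustration only.
checklist = [
    Check("database reachable", lambda: True),
    Check("login page responds", lambda: False),
    Check("disk space available", lambda: 1 / 0 > 0),  # simulates a crashing check
]

for name, ok in run_checklist(checklist):
    print(f"[{'OK' if ok else 'FAIL'}] {name}")
```

Every item is run even after a failure, so the team sees the whole picture of the system at once rather than stopping at the first problem.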

We must do better with our system monitoring. It was hard to tell what some parts of the system were doing.

On a technical level, you use logs to record the activity of live software. A log is literally like the logbook that a ship’s captain keeps, but much more detailed. We don’t have the ability to fine-tune the level of detail. Usually you do not want logs to be too verbose, as they take up too much space. In an emergency, however, you want all the fine detail. We have no easy way to turn this on or off, so that is a specific technical action item for the team.
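
As one illustration of switching log detail on and off, here is a minimal sketch using Python’s standard logging module. The logger name and the set_emergency_mode helper are hypothetical, chosen for this example; the sketch does not describe how any particular system is built.

```python
import logging

# Hypothetical logger name, for illustration only.
logger = logging.getLogger("app")


def configure_logging(level=logging.WARNING):
    """Attach one console handler and set the normal, quiet level."""
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    logger.setLevel(level)


def set_emergency_mode(enabled):
    """Switch between full detail (DEBUG) and quiet operation (WARNING)."""
    logger.setLevel(logging.DEBUG if enabled else logging.WARNING)
    return logging.getLevelName(logger.level)


configure_logging()
logger.debug("login attempt details")  # suppressed in normal operation
set_emergency_mode(True)
logger.debug("login attempt details")  # recorded while the switch is on
set_emergency_mode(False)
```

In a running service the switch would need a trigger that does not require a redeploy, such as an admin command or a configuration value that is re-read periodically.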

After the incident, write up your post-mortem, and then keep it safe. There will be many more, and by adopting the attitude of an accident investigator you will be able to make your system more reliable over time. Complex systems are not inherently stable: the ones that survive are those that are actively steered towards stability.

Newsletter update: 3,313 subscribers, with an open rate of 12.2pc.

Richard Rodger is the founder of Voxgig. He is a former co-founder of Nearform, a consulting firm based in Waterford.
