Hey, You, Get off of my Cloud!

The Cloud. It's amazing. Isn't it wonderful that they have now managed to link all of the computers in the world together so that, no matter how much computing power you need, it is all there, ready to turn on like a tap.

That's the truth of The Cloud, isn't it?

That's certainly the way it has been sold to the world. The Cloud is a limitless supply of storage/power/systems* [*choose whichever you want, because you can be certain that someone is selling that limitless supply concept to you].

But in reality, behind the scenes, like the Wizard of Oz behind the curtain... it's the same old computing hardware that we've always had. Sure, the management tools and processes make it easier for the Cloud provider to move your workload and files from their existing home to a bigger home without you noticing... but it's still fundamentally your applications running on a processor and your files being stored on a disk. The Cloud provider still needs to do some intelligent Capacity Management to make sure that there is always "somewhere bigger" onto which client workloads can be placed. It's no good selling the concept of "Capacity on Demand" from the Cloud if, the moment that a client asks to place that demand, it takes a few days/weeks/months to make it available. Providers need to maintain a level of "headroom": unused capacity held in reserve, ready for the moment that clients ask for it. Exactly how much headroom is the subject of a later blog. For now, I want to look at the other element of this perfect Cloud that often escapes the attention of clients: Performance.

Let's consider processing power. The provider may have a total Cloud requirement, across all of the individual client systems, of 1000 processors. A decision that all Cloud providers make is how much "over-provisioning" they will deliver. That is, if the total client requirement is 1000 processors and they only have 500 processors, then they're over-provisioning by a factor of 2. The Cloud provider is basically assuming that at any one time only half of the clients' requirements will actually be in use. Consider it this way: if you have a physical server with 4 processors and your CPU utilisation only ever peaked at 50%, then in reality you're only using 2 processors.
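
To make the arithmetic concrete, here is a minimal sketch in Python of that over-provisioning sum. The 1000/500 figures come from the example above; nothing here is any real provider's data.

```python
# A minimal sketch of the over-provisioning sum described above.
# The figures are illustrative, not any real provider's numbers.

def overprovision_factor(processors_sold: int, processors_installed: int) -> float:
    """Ratio of processors promised to clients versus processors actually installed."""
    return processors_sold / processors_installed

sold = 1000        # total processors required across all client systems
installed = 500    # processors the provider actually has in the Cloud

factor = overprovision_factor(sold, installed)
print(f"Over-provisioning factor: {factor:.1f}x")                          # 2.0x
print(f"Assumed average use of what clients bought: {100 / factor:.0f}%")  # 50%
```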

This is the economics of the Cloud. It is priced as being cheaper than physical infrastructure, not only because you do not need to host the infrastructure yourself, but because the Cloud provider knows that the chances of you using ALL of the resources that you are paying for are pretty slim.

But what if they're not slim? What if you're paying for 4 processors and you actually WANT to use all 4 processors? That shouldn't be a problem... after all... the other people on that same Cloud aren't all going to want to use their full allocation at the same time, are they?

But what if they do?

This is the point of this post. Hey you... get off of MY Cloud!!! Sometimes you may notice that your Cloud server is running a little slower than normal. You can't quite put your finger on it, and all the monitoring stats within the Guest server show normal activity, but the user response times don't match up. "Something" is definitely slowing you down. And that "something" is going to be another client on the same Cloud who just so happens to be hitting their peak demand at the same time as you. The better Cloud providers will be watching for this, monitoring activity at the Cloud level to see when this type of contention is being approached. The better Cloud providers will proactively move client servers around the Cloud to avoid this problem ever becoming a reality.
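
For illustration, the kind of host-level check a provider might run looks something like the sketch below. The host size, the 85% threshold and the guest demand figures are all assumptions of mine, not anyone's real monitoring tooling; live numbers would come from the hypervisor rather than being hard-coded.

```python
# A hedged sketch of spotting contention building up on a shared host.
# All numbers here are hypothetical assumptions for illustration only.

HOST_CAPACITY_CORES = 32         # physical cores on one host (assumed)
CONTENTION_THRESHOLD = 0.85      # flag hosts running hotter than 85% (assumed)

# current CPU demand (in cores) of each guest on the host
guest_demand = {"client-a": 10.5, "client-b": 9.0, "client-c": 8.8}

total_demand = sum(guest_demand.values())
utilisation = total_demand / HOST_CAPACITY_CORES

if utilisation >= CONTENTION_THRESHOLD:
    # candidates to migrate elsewhere: biggest consumers first
    candidates = sorted(guest_demand, key=guest_demand.get, reverse=True)
    print(f"Host at {utilisation:.0%}: consider moving {candidates[0]}")
else:
    print(f"Host at {utilisation:.0%}: no action needed")
```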

Just make sure that your provider is doing this, or if you are running the Cloud (public OR private) that you are doing it... otherwise you'll hear a lot of your clients singing the Rolling Stones classic song!


One way, or another....

 

There is a well-known pyramid-based model that attempts to describe the relationship between Data – Information – Knowledge – Wisdom.  I have reproduced the pyramid below.

 

Convention suggests that one starts with Data, lots of it.  You need to collect a lot of data before you can do anything else.  Some of that data will be useless to you, but you need to collect it all before you know which is the ‘good stuff’ and which is the ‘rubbish’.  The pyramid is known as the DIKW pyramid (don't try and say this out loud in the office!!!): https://en.wikipedia.org/wiki/DIKW_pyramid

 

Once you have a lot of data, you can sift through the raw numbers and start to make assessments of it.  We haven’t discussed what this data might be, because that is really irrelevant to the model.  The data might cover processor utilisation, it might cover datacentre occupancy, it might even cover the number of call centre staff on duty at any time.  The key thing is that they are data values.  By analysing this large volume of data, you can do the assessment that turns the Data into Information.  For many people, this is the final step in their journey.  They gain some level of information from their data and make bold pronouncements.  It might be something along the lines of ‘most call centre people turn up to work 5 minutes late’ or ‘our datacentre is 70% full’.  The reason that this is the WRONG stage to stop at is that while you have information, you don’t really have any knowledge about what is going on.  Ask yourself the ‘so what?’ question.  In the case of the datacentre being 70% occupied, the ‘so what?’ question would lead me to ask how much space we actually have, and whether that space is in contiguous lumps so I can actually use it for a new rack or storage array.  You have Information, but no Knowledge.

 

Knowledge is gleaned from Information when you give it some context.  If you can give the above statements meaningful context, then you are gaining Knowledge:

- Most call centre people turn up for work 5 minutes late, and this means that we are missing our call-handling SLAs during that initial 5-minute period.

- Our datacentre is 70% full and we have 200 U of contiguous space.

 

But we must still ask the question, ‘so what?’

If you finish your journey at Knowledge you will be the cleverest person in your organisation, and certainly the most informed; however, you won’t have done anything of benefit.  That is where the Wisdom comes in.  The goal is knowing what should be done to improve the business, using the Wisdom that has been gleaned from that Knowledge.

- Ensuring that all call centre staff arrive promptly means that the business does not incur penalties from missed SLAs.  You have the Data to prove the issue, and the analysed Information to show how the SLAs are currently being missed.

- Mapping the future demand for datacentre space against the known remaining contiguous space allows you to predict when an expansion to the datacentre facility will be required (a toy example of that sum follows this list).
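
As promised, here is how small that "Wisdom" sum can be. Only the 200 U figure comes from the example above; the monthly demand is an assumption of mine purely for illustration.

```python
# A toy sketch of the Wisdom step for datacentre space: given the remaining
# contiguous space and an assumed rate of demand, when is an expansion needed?

remaining_contiguous_u = 200   # rack units of contiguous space left (from the example above)
demand_u_per_month = 12        # forecast take-up of rack space per month (assumed)

months_until_full = remaining_contiguous_u / demand_u_per_month
print(f"The contiguous space is exhausted in roughly {months_until_full:.0f} months,")
print("which is when the datacentre expansion needs to be ready.")
```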

 

But my contention with the above model is its assumption that the journey MUST start with the Data.  I suggest that this is inefficient.  If you don’t already have a rough idea of what problem you are looking to resolve, then you will waste a lot of time collecting irrelevant Data and doing unnecessary analysis to turn it into pointless Information.  You wouldn’t collect data pertaining to call centre seat occupancy if you were hoping to end up with a Capacity Plan for datacentre space.  That’s obvious.

 

So why collect meaningless performance stats from your IT systems that you don’t even know how to analyse, or worse still that have no place in your forecasting?  As an example, while per-process stats are really useful for diagnosing performance issues, it is the overall processor (and memory and storage) stats that are going to be most beneficial for predicting when an upgrade to the infrastructure might be required.  The process-level information will be useful for building a capacity model, but there is little benefit in retaining 5+ years of per-minute stats.  You’re never going to get around to analysing each and every minute, and there would be little substantive benefit in doing so.
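
As a rough illustration of what is worth keeping, here is a hedged sketch, assuming the per-minute CPU samples are sitting in a pandas DataFrame, of rolling them up into the hourly figures that are actually useful for long-term trending and forecasting:

```python
# A minimal sketch: roll per-minute CPU samples up to hourly mean and peak.
# The data below is randomly generated purely for illustration.
import numpy as np
import pandas as pd

# Illustrative data: one week of per-minute CPU utilisation samples
idx = pd.date_range("2016-01-04", periods=7 * 24 * 60, freq="min")
per_minute = pd.DataFrame(
    {"cpu_pct": np.random.uniform(20, 90, size=len(idx))}, index=idx
)

# Keep the hourly mean and peak for capacity trending; the raw minutes
# can be aged out once any short-term diagnostic need has passed.
hourly = per_minute["cpu_pct"].resample("60min").agg(["mean", "max"])
print(hourly.head())
```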

 

Understanding what data to collect in the first place IS Wisdom.  Having that Wisdom allows you to share your Knowledge of what analysis needs to be done within the Capacity team.  If the team know what to analyse, they have the Information to decide what Data to collect in the first place.

 

I contend that the pyramid makes more sense in an efficient business environment like this:

It looks very similar, but thanks to the benefit of having an expert on hand, you aren't wasting time and effort looking at the vast reams of Data that aren't within the value pathway to Wisdom.

I refer to this as the WKID pyramid.  Now that is something that you CAN say in the office out loud without shame!

 

 

So I think you have to have someone in your organisation who can start the journey from the top (Wisdom) so that you are as efficient as possible.  As this blog post is titled... one way, or another... and as Debbie Harry knew, I'm gonna getcha!

 


Just can't get enough

I just can’t get enough!   It is a familiar cry from infrastructure managers the world over.  No matter how many resources you throw at an application or service, the infrastructure manager will claim that they still don’t have enough to get a good quality of service from the application.

 

We Capacity Managers know that the solution isn’t as simple as just throwing more CPU or more Memory at the problem (although that’ll be what is being asked for).  The client might equally well be asking for more disk space, or maybe more disk spindles, or maybe more network bandwidth.  Either way… whatever they have, it won’t be enough.

 

Recently, this perpetual, insatiable demand for “more” has been observed in a subtly different way.

 

I have been managing a client’s private cloud as referred to many times in past blogs.  The client allowed users to select the configuration of their guests from a pre-defined pick-list of sizes.

Guest size option   vCore   vMemory (GB)
A                       1              4
B                       2              8
C                       4             16
D                       8             32
E                      16             64
F                      24             96
G                      32            128

 

As you can see, there are only 7 size options, and there is a fixed ratio of 4 GB of vMemory for each vCore.  This type of tightly defined shopping cart is very useful to the Capacity Manager.  We know that there can only be a limited number of building blocks for allocation within the Cloud.  Furthermore, we can monitor existing client behaviour and work out the probability of each option (A-G) being chosen.  This gives us a typical distribution of choices, as shown in the chart below.  We can take this a step further and work out how many guests from a typical option distribution we can fit onto a typical cluster.
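
To show the sort of sum involved, here is a rough sketch with made-up option probabilities and an assumed cluster size; neither are the client's real figures.

```python
# A rough sketch: given the observed take-up of each option, how many
# "typical" guests fit onto a cluster before its allocation is exhausted?

options = {            # option: (vCore, vMemory GB), from the pick-list above
    "A": (1, 4), "B": (2, 8), "C": (4, 16), "D": (8, 32),
    "E": (16, 64), "F": (24, 96), "G": (32, 128),
}
observed_share = {     # illustrative take-up of each option, not real client data
    "A": 0.05, "B": 0.10, "C": 0.15, "D": 0.20,
    "E": 0.20, "F": 0.15, "G": 0.15,
}

# Assumed allocatable resources of one cluster
CLUSTER_VCORES = 512
CLUSTER_VMEM_GB = 2048

avg_vcores = sum(observed_share[o] * options[o][0] for o in options)
avg_vmem = sum(observed_share[o] * options[o][1] for o in options)

# Allocation-based limits: whichever resource is exhausted first sets the ceiling
guests_per_cluster = min(CLUSTER_VCORES / avg_vcores, CLUSTER_VMEM_GB / avg_vmem)
print(f"Average guest: {avg_vcores:.1f} vCores, {avg_vmem:.0f} GB vMemory")
print(f"Roughly {int(guests_per_cluster)} typical guests per cluster")
```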

 

Any high school student knows about the Normal distribution.  They would expect that the majority of users would select guests of size C, D, and E, and that only a very few users would choose option A (the smallest) or option G (the largest).  However, the chart shows something different.

 

More people than would otherwise be expected are choosing the larger guest sizes.  Even though these options cost significantly more than the smaller options, human nature comes into play.  People choose the larger sizes simply because they can.  This was brought into sharp focus when the shopping cart of options above was changed.

 

The client decided that, rather than fixing the ratio of vMemory to vCore at 4:1, they would release the shackles and allow users a freer rein of choices.

 

Guest size option   vCore (any of)            vMemory GB (any of)
A                   1, 2, 4, 8, 16, 24, 32    4, 8, 16, 24, 32, 64, 96, 128, 256

As you can see, while the choices of vCores remained unchanged, the client introduced two new options for vMemory (24 GB and 256 GB).  The impact on the distribution of selected vCores and vMemory is shown in the two charts below.  Again, more users than would normally be expected are selecting the maximum vCore and vMemory possible.  Although I won't go into the detailed calculation here, readers with a mathematical eye will immediately notice that the increase in vMemory requested is more pronounced than the increase in vCores, which means that the traditional 1:4 ratio of vCore to vMemory has shifted to closer to 1:5.

[Charts: distribution of selected vCore and vMemory options after the change]
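
For the curious, the blended ratio calculation alluded to above works out along these lines; the option shares below are illustrative assumptions rather than the client's actual distributions.

```python
# A hedged sketch of the blended vCore:vMemory ratio calculation.
# The shares are made up for illustration; only the option lists match the post.

vcore_options = [1, 2, 4, 8, 16, 24, 32]
vmem_options = [4, 8, 16, 24, 32, 64, 96, 128, 256]

# assumed shares of each option observed after the shopping cart was opened up
vcore_share = [0.05, 0.10, 0.15, 0.20, 0.20, 0.15, 0.15]
vmem_share = [0.05, 0.08, 0.10, 0.07, 0.18, 0.15, 0.15, 0.15, 0.07]

avg_vcores = sum(c * s for c, s in zip(vcore_options, vcore_share))
avg_vmem = sum(m * s for m, s in zip(vmem_options, vmem_share))

print(f"Blended ratio: 1 vCore to {avg_vmem / avg_vcores:.1f} GB of vMemory")
```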

In the client’s private cloud architecture (where resource limits are based upon Allocation rather than Utilisation, as blogged about earlier) this further reduced the number of guests that could be accommodated within each cluster.  When users were interviewed about the choices they had made, they explained that they had ordered the larger sizes simply “because we can”.  Even when the impact of this choice on their fellow users was explained (i.e. that by ordering the largest guest sizes, fewer users would be able to build their guests), user behaviour was unchanged.

 

What this showed me is that part of the Capacity Manager’s role has to include considering the psychology of users.  It is human nature to a) be selfish, and b) want the best you can get.  It is possible to teach sharing and consideration for others... however, in a competitive business environment you will be fighting a losing battle.  You cannot consider technical decisions in isolation, without considering how human psychology will turn them into an actual outcome.

 

Users…. They just can’t get enough!!!

 
