Server room temperature myth busting – energy savings and disk failure rates

Saving energy in server rooms is often the last area tackled in large office environments. This may be due to a lack of understanding about what can be done to save energy, and also to barriers created by out-of-date information on server temperature requirements. On this point, a particular incident comes to mind. Whilst conducting an energy assessment of a large civic centre office, I watched the IT manager turn the server room cooling set-point down from 18 degrees C to 16 degrees C in response to a recent server drive failure. From witnessing actions like this, and from several discussions on server room temperature, it appears standard practice in many IT departments to over-cool servers out of fear of disk failure. I suspect this IT manager might not have turned down the thermostat so quickly had he read some of the studies mentioned here.

In this post, I am going to look at some surprising studies and the recent change in thinking around server temperature, drive failure rates, and energy savings.

In 2008, the American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) changed its recommended server room temperature and humidity ranges. The table of changes, taken from ‘2008 ASHRAE Environmental Guidelines for Datacom Equipment’ [1], is displayed below; in short, the recommended temperature range was widened from 20 to 25 degrees C out to 18 to 27 degrees C:

The changes to the recommended temperature and humidity ranges have been driven by the demand to reduce the energy consumed by server room cooling. ASHRAE’s changes mean that server rooms fitted with an economiser can run on free cooling more often, displacing energy-hungry chillers and achieving greater energy savings. Free cooling limits the dependence on electricity-driven chilling, which, according to a recent paper [2], accounts for around 33% of typical data centre electricity consumption.
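To put that 33% figure into perspective, here is a minimal back-of-the-envelope sketch. The annual consumption and the fraction of chiller energy displaced by free cooling are assumed values for illustration only; they are not taken from the cited papers.

```python
# Back-of-the-envelope illustration only: the facility consumption and the
# fraction of chiller energy displaced by free cooling are assumed values.
annual_kwh = 500_000              # assumed total annual electricity use of the facility
chiller_share = 0.33              # share of consumption attributed to chilling [2]
free_cooling_displacement = 0.5   # assumed fraction of chiller energy displaced by an economiser

chiller_kwh = annual_kwh * chiller_share
saved_kwh = chiller_kwh * free_cooling_displacement

print(f"Chiller consumption: {chiller_kwh:,.0f} kWh/year")
print(f"Estimated saving from free cooling: {saved_kwh:,.0f} kWh/year")
```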

But many server rooms don’t have free cooling, so are there savings in these cases?

A 2009 study published by Dell, titled ‘Energy impact of increased server inlet temperature’ [3], addresses this question through controlled testing of a variety of server configurations, loads, and cooling arrangements. The study highlights that there are three types of inbuilt server fan arrangement:

1. Fixed-speed inbuilt cooling fans, whose speed does not depend on inlet temperature.
2. Stepped-speed inbuilt cooling fans, which change speed at server inlet temperature thresholds.
3. Variable-speed inbuilt cooling fans, which vary their speed smoothly as inlet temperature increases.

For the second and third fan types listed above, this means additional fan power is drawn as inlet temperature increases. There is therefore a trade-off between the reduction in chiller cooling load and the increased power required by the servers’ inbuilt fans. The findings of the study show that, depending on the server mix, cooling arrangement and load conditions, there is a sweet spot at which total energy consumption is minimised. This spot lies around the 24 to 27 degrees C inlet temperature range. A diagram of one of the tests is displayed below:
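Before that diagram, the shape of the trade-off can be illustrated with a minimal sketch. The power curves and coefficients below are invented purely for illustration; they are not taken from the Dell study, which measured real hardware rather than fitting a model.

```python
# Illustrative model of the chiller vs. inbuilt-fan trade-off.
# The coefficients are invented for illustration only; they are not from [3].

def chiller_power(inlet_c):
    """Chiller power falls as the allowed inlet temperature rises
    (more economiser hours, better chiller efficiency)."""
    return 10.0 - 0.25 * inlet_c  # kW, assumed linear decline

def server_fan_power(inlet_c):
    """Inbuilt fan power rises steeply once fans speed up at higher
    inlet temperatures (fan power scales roughly with speed cubed)."""
    return 1.0 + 0.004 * max(0.0, inlet_c - 20.0) ** 3  # kW, assumed

# Scan candidate set-points from 18 to 30 degrees C in 0.5 degree steps.
candidates = [t / 2 for t in range(36, 61)]
total = {t: chiller_power(t) + server_fan_power(t) for t in candidates}
best = min(total, key=total.get)
print(f"Minimum total cooling power in this toy model: ~{best} degrees C inlet temperature")
```

In this toy model the minimum falls at roughly 24 to 25 degrees C, which is the same qualitative behaviour the Dell tests show: push the inlet temperature too high and the servers’ own fans claw back the chiller savings.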

So raising server inlet temperature to between 24 and 27 degrees C can result in energy savings, but is it safe?

Yes, a good deal of evidence suggests that it is. Google’s 2007 publication ‘Failure Trends in Large Disk Drive Populations’ [4] presents a study of the largest disk drive population examined at its date of publication. The authors note that ‘Contrary to previously reported results, we found very little correlation between failure rates and either elevated temperature or activity levels’. In fact, the study shows a clear trend of lower temperatures correlating with higher failure rates, with only a slight reversal of this trend at very high temperatures.

The following two graphs show key results of the study. The first shows the annualized failure rate (AFR) as a function of temperature (depicted as dots with error bars). The second shows AFR as a function of temperature and drive age.

These graphs suggest that the IT manager’s decision to turn the set-point down from 18 to 16 degrees C may actually increase the likelihood of disk drive failure. It would be interesting to have measured the inlet temperature of the drive that failed.
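If you log drive inlet temperatures alongside failures, the same kind of analysis can be reproduced in miniature on your own fleet. Below is a minimal sketch; the record format (drive id, average inlet temperature, failed flag, drive-years observed) is an assumption for illustration and not the format used in the Google study.

```python
# Minimal sketch: bin drive records by inlet temperature and compute an
# annualized failure rate (AFR) per bin, in the spirit of [4].
# The record format is assumed for illustration.
from collections import defaultdict

# (drive id, average inlet temperature in degrees C, failed?, drive-years observed)
records = [
    ("drive-001", 22.5, False, 1.0),
    ("drive-002", 27.0, False, 1.0),
    ("drive-003", 19.0, True,  0.5),
    # ... one entry per drive; meaningful results need a large population
]

bins = defaultdict(lambda: {"failures": 0, "drive_years": 0.0})
for _, temp_c, failed, years in records:
    key = int(temp_c // 2) * 2  # 2 degrees C wide temperature bins
    bins[key]["failures"] += int(failed)
    bins[key]["drive_years"] += years

for temp in sorted(bins):
    b = bins[temp]
    afr = 100.0 * b["failures"] / b["drive_years"] if b["drive_years"] else 0.0
    print(f"{temp}-{temp + 2} degrees C: AFR {afr:.1f}% "
          f"({b['failures']} failures over {b['drive_years']:.1f} drive-years)")
```

With only a handful of drives the error bars are huge, which is exactly why the Google study, with its population of over one hundred thousand drives, carries so much weight.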

In this post I have tried to present a clear argument for re-evaluating server room temperatures. I would suggest thoroughly profiling the temperatures in a server room before deciding to increase temperature set-points. My recommendation is to conduct temperature logging at several server air inlets; do not rely on the room thermostat to tell you what is going on at the inlet of a server. It may be that whilst an air conditioning unit is set to 18 degrees C, it is incapable of actually reaching this set-point, making its thermostat a poor measure of actual server room temperature. A server room can also exhibit hot spots due to poor air mixing, which could leave some servers being supplied air at inlet temperatures well above those recommended in this post.

It would also be worthwhile to conduct power logging to verify the savings before and after any alterations. More case studies are needed to debunk the myths about server room temperature.
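As a starting point for that profiling, here is a minimal sketch of summarising inlet temperature logs per sensor. The CSV layout (columns sensor, timestamp, temp_c) and the file name are assumptions; adapt them to whatever your loggers export.

```python
# Minimal sketch: summarise per-sensor inlet temperatures from logger data.
# Assumed CSV layout: sensor,timestamp,temp_c  -- adapt to your logger's export.
import csv
from collections import defaultdict

readings = defaultdict(list)
with open("inlet_temps.csv", newline="") as f:
    for row in csv.DictReader(f):
        readings[row["sensor"]].append(float(row["temp_c"]))

RECOMMENDED_MAX_C = 27.0  # upper end of the 2008 ASHRAE recommended range [1]

for sensor, temps in sorted(readings.items()):
    temps.sort()
    p95 = temps[int(0.95 * (len(temps) - 1))]
    flag = "  <-- possible hot spot" if temps[-1] > RECOMMENDED_MAX_C else ""
    print(f"{sensor}: min {temps[0]:.1f}, 95th pct {p95:.1f}, "
          f"max {temps[-1]:.1f} degrees C{flag}")
```

A week or two of logging at each inlet, captured before and after any set-point change, gives a much better picture than a single thermostat reading.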

Please feel free to comment below.

[1] 2008 ASHRAE Environmental Guidelines for Datacom Equipment (Expanding the Recommended Environmental Envelope) – http://tc99.ashraetcs.org/documents/ASHRAE_Extended_Environmental_Envelope_Final_Aug_1_2008.pdf
[2] Improving Data Center PUE Through Airflow Management – http://www.coolsimsoftware.com/LinkClick.aspx?fileticket=KE7C0jwcFYA%3D&tabid=65
[3] Energy impact of increased server inlet temperature – http://i.dell.com/sites/content/business/solutions/whitepapers/en/Documents/dci-energy-impact-of-increased-inlet-temp.pdf
[4] Failure Trends in Large Disk Drive Populations – http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en//papers/disk_failures.pdf
