What Makes a Good Data Center? Part 3 - Monoblogue of a security engineer

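This time I would like to introduce one of the most thoroughly worked-out comments of them all.
It is not just a serious effort: it appears to have been written by someone with real expertise in data center operations, and it is full of useful insights.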

It is a little long, but here is the post:

– Raised floor is certainly important, and a given. Check
– Cable management above AND below the floor. This is not an either-or… Check
– Cooling capacity is hard to judge, should be scalable. Redundancy is often overlooked but is often even more important than capacity… Check
– Power quality: never seen a big datacenter without a Liebert, or at least UPS in every rack. Power does not have to be conditioned except between the UPS and the machines/devices. A whole data center power conditioner is often more efficient, but unnecessary for the little guys. Either way – check.
– Age is irrelevant as long as it’s under support. If it’s not, replace it. Generators need to be run several times a year to validate their condition, and also to grease the innards… Seen too many good generators get kicked on and fail an hour later because the oil hadn’t been changed in 3 years…
– Outages should be tracked, by system, rack row, and power distro. When systems seem to be going down more frequently in one area, there’s usually an underlying reason… As Google recently proved as well for us all, do not ASSUME all is well, routine diagnostics including memory scans should be performed on ALL hardware. Even ECC RAM deteriorates with age (rapidly) and needs to be part of a maintenance testing and replacement policy – Check.
– Fire suppression is usually part of your building codes, and a given, as are the routine checks (at least annually) by law.

In addition, we deploy:
– Man traps on all entrances to data centers. You go in one door, it closes, then you authenticate to a second door. A pressure plate ensures only one person goes in/out at a time (and if it’s tripped, a security guy looking at a screen has to override).
– Full 24×7 video surveillance of the data centers.
– in/out logs for all equipment. To take a device in/out of a datacenter requires it being logged in a book (by a designated person). This is for anything the size of a disk/tape and larger. All drive bays are audited nightly by security and if drives go missing, security reviews the access logs and server room security footage to see who might have taken them.
– clear and consistent labeling systems for rack, shelves, cables and systems.
– pre-cable just about everything to row level redundant switches, and have no cabling from server to other servers not passed through a rack/row switch first. Row switches connect to distro switches. This ensures cabling is simple, and predictable.
– Color-coded cabling: we use 1 color for redundant cabling (indicating there should be 2 of these connected to the server at all times, and to separate cards in the backplane and separate switches to boot), a separate color for generic gigabit connections, another color for DS View, another color for the out-of-band management network(s), another color for heartbeat cables, and yet another for non-ethernet (T1/PRI/etc). Other colors are used in some areas to designate 100m connections, special connectivity, or security enclave barriers, and non-fiber switch-to-switch connections. Every cable is labeled at both ends and every 6-8 feet in between.
– FULLY REDUNDANT POWER. It’s not enough to have clean power, and good UPS and a generator. In a large datacenter (more than a few rows, or anything truly mission critical), you should have 2 separate power companies, 2 separate generators, and 2 fully segregated power systems at the datacenter, room, row, and rack levels. In each datacenter we use 2 Liebert mains, each row has a separate distribution unit connected to a different main, and each rack has 4 PDUs (2 to each distro). Every server is connected to 2 separate PDUs, run all the way back to 2 completely independent power grids. For a deployment of 50 servers or so this is big time overkill. We have over 3500 servers, we need this… We cannot have a PSU failure taking out racks at a time, which may serve dozens of other systems each.

It is very long, but let's give the translation a try.

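– Raised floors are certainly important and should be taken for granted. Check.
– Cable management both above and below the floor. One or the other is not enough… Check.
– Cooling capacity is hard to judge, but it should be scalable. Redundancy is often overlooked, yet it is often even more important than capacity. Check.
– Power quality: you will hardly ever see a large data center without a Liebert, or at least a UPS in every rack. Power does not need to be conditioned except between the UPS and the equipment. Conditioning power for the whole data center is often more efficient, but it is unnecessary for small operations. Either way, check.
– Age does not matter as long as the equipment is still under support. If it is not, replace it. Generators need to be run several times a year to verify their condition and to keep their internals lubricated… Far too many good generators fail an hour after being started because the oil had not been changed in three years…
– Outages should be tracked by system, by rack row, and by power distribution unit. When systems in one area seem to go down more often, there is usually an underlying cause… As Google recently demonstrated for all of us, never assume that everything is fine: routine diagnostics, including memory scans, should be run on all hardware. Even ECC RAM deteriorates (rapidly) with age, so it needs to be covered by a maintenance testing and replacement policy. Check.
– Fire suppression is usually covered by building codes and can be taken for granted, as can the routine inspections (at least annually) that the law requires.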
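
The outage-tracking point is easy to picture in code. The following is only a minimal sketch, not anything the commenter described: the `Outage` record, its field names, and the fixed `threshold` are all assumptions made for illustration. It tallies outages along one dimension (system, rack row, or power distribution unit) and reports the locations that keep coming up.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Outage:
    """One outage event (hypothetical record layout)."""
    system: str        # hostname of the affected system
    rack_row: str      # row the system sits in
    power_distro: str  # distribution unit feeding the rack


def hotspots(outages, dimension, threshold=3):
    """Count outages per value of `dimension` and keep those
    that reach `threshold` (an arbitrary cut-off for this sketch)."""
    counts = Counter(getattr(o, dimension) for o in outages)
    return {key: n for key, n in counts.items() if n >= threshold}


outages = [
    Outage("web01", "row-A", "pdu-1"),
    Outage("web02", "row-A", "pdu-1"),
    Outage("db01", "row-A", "pdu-2"),
    Outage("app01", "row-B", "pdu-3"),
]

# Row A accounts for three of the four outages, so it is worth a closer look.
print(hotspots(outages, "rack_row"))
```

A fixed count is the crudest possible rule. A real log would probably be compared against the number of systems per row and a time window, but the idea of grouping by location before drawing conclusions stays the same.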

In addition, we deploy:
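– Mantraps at every entrance to the data center. You go through the first door, it closes behind you, and then you authenticate at the second door. A pressure plate ensures that only one person passes at a time (and if it is tripped, a security guard watching a monitor has to override it).
– Round-the-clock (24×7) video surveillance of the data centers.
– In/out logs for all equipment. Bringing a device into or out of the data center requires an entry in a logbook (made by a designated person). This applies to anything the size of a disk or tape and larger. Security audits every drive bay nightly, and if drives are missing, security reviews the access logs and the server room security footage to work out who might have taken them.
– Clear and consistent labeling for racks, shelves, cables, and systems.
– Nearly everything is pre-cabled to redundant row-level switches, and no cable runs from one server to another without first passing through a rack or row switch. Row switches connect to distribution switches. This keeps the cabling simple and predictable.
– Color-coded cabling. We use one color for redundant cabling (meaning two cables of that color should be connected to the server at all times, to separate cards and to separate switches), a different color for generic gigabit connections, another for DS View (integrated monitoring software), another for the management network(s), another for heartbeat cables, and yet another for non-Ethernet links (T1/PRI, etc.). In some areas further colors mark 100m connections, special connectivity, security enclave boundaries, and non-fiber switch-to-switch connections. Every cable is labeled at both ends and every 6-8 feet in between.
– Fully redundant power. Clean power (meaning a clean waveform? or green electricity?), a good UPS, and a generator are not enough. In a large data center (more than a few rows, or anything truly mission-critical), you should have two separate power companies, two separate generators, and two fully segregated power systems at the data center, room, row, and rack levels. In each of our data centers we use two Liebert mains (UPS); each row has a separate distribution unit connected to a different main, and each rack has four PDUs (two per distribution unit). Every server is connected to two separate PDUs that run all the way back to two completely independent power grids. For a deployment of around 50 servers this would be serious overkill. We run more than 3,500 servers, so we need it… We cannot have a PSU failure take out whole racks at a time, each of which may serve dozens of other systems.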
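
The rule that every server must reach two independent power paths can also be checked mechanically. The sketch below assumes two hypothetical mappings that are not in the comment: `server_pdus` (which PDUs each server is plugged into) and `pdu_feed` (which main each PDU ultimately hangs off). It lists the servers that would go dark if a single main failed.

```python
def single_feed_servers(server_pdus, pdu_feed):
    """Return, sorted, the servers whose PDUs trace back to fewer
    than two distinct feeds. Every PDU named in `server_pdus` must
    appear in `pdu_feed`, otherwise a KeyError is raised."""
    at_risk = []
    for server, pdus in server_pdus.items():
        feeds = {pdu_feed[pdu] for pdu in pdus}
        if len(feeds) < 2:
            at_risk.append(server)
    return sorted(at_risk)


# Hypothetical inventory: two PDUs per main, as in the comment's rack layout.
pdu_feed = {
    "pdu-a1": "main-A",
    "pdu-a2": "main-A",
    "pdu-b1": "main-B",
    "pdu-b2": "main-B",
}
server_pdus = {
    "web01": ["pdu-a1", "pdu-b1"],  # correctly split across both mains
    "web02": ["pdu-a1", "pdu-a2"],  # two PDUs, but both on main-A
    "db01": ["pdu-b2"],             # only one power connection
}

print(single_feed_servers(server_pdus, pdu_feed))
```

Note that `web02` looks redundant if you only count cables. Tracing each PDU back to its main is what exposes the shared point of failure.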

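Certainly, if you look hard enough, you can find data centers that operate at this level. Among the data centers I have visited, some came close. The specifications of facilities such as power and cooling are essentially fixed when the data center is built, so if you are a prospective customer you should be able to find out how they are configured.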

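What is hard to find out is how the facility is operated. Whether maintenance is carried out properly, for example, is something you can probably learn if you rent a large number of racks, but some data centers will not tell you if you rent only a few (I have run into this). As for cabling, I have actually seen facilities, particularly data centers in Southeast Asia, where it was in a catastrophic state, so you should inspect it in advance if you can.
You can always sort out the area around your own racks yourself, but if the shared areas are a mess, failure analysis may drag on as a result.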

Still, this translation really did take a long time…
