April 4, 2008

Explaining RAID Disks

I hate acronyms.

Of course they're impossible to avoid if you work in IT, but that doesn't mean I have to like them.

RAID is one of those acronyms that really wrecks my head.

Not RAID itself, which simply means Redundant Array of Inexpensive Disks, but the different types of RAID array that can exist.

RAID, in case you didn't know, is a way of improving redundancy and performance in servers by using multiple disks. Of course you don't have to use multiple disks, but if you don't, you run the risk of losing data.

A few months ago someone posted a very nice simple graphic that explained the differences between the various types of RAID arrays.
raid-explained.jpg

Taken from: http://www.epidauros.be/raid.jpg

One of our technical staff sent me a link earlier today to a bash.org quote which sums up the potential issues with servers that have no disk redundancy very nicely:

sterano: Whats the difference between Raid_0 and Raid_1?
Steve: In Raid_0 the zero stands for how many files you are going to get back if something goes wrong.

Moral of the story - use more than one physical disk :)
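For anyone who prefers numbers to jokes, here is a rough sketch of the trade-off between the two levels in the quote. It only covers RAID 0 (striping) and RAID 1 (mirroring), and the function name and the 500 GB disk size are made up for illustration. It is a simplification, not a sizing tool.

```python
def raid_summary(level, disks, disk_size_gb):
    """Return (usable_gb, disk_failures_survived) for RAID 0 or RAID 1.

    Hypothetical helper for illustration only. Assumes all disks are
    the same size and ignores controller and filesystem overhead.
    """
    if disks < 2:
        raise ValueError("both RAID 0 and RAID 1 need at least two disks")
    if level == 0:
        # Striping: all capacity is usable, but losing any one disk
        # loses the whole array.
        return disks * disk_size_gb, 0
    if level == 1:
        # Mirroring: every disk holds a full copy, so only one disk's
        # worth of space is usable, but all but one disk can fail.
        return disk_size_gb, disks - 1
    raise ValueError("only RAID 0 and RAID 1 are covered in this sketch")


if __name__ == "__main__":
    for level in (0, 1):
        usable, survivable = raid_summary(level, disks=2, disk_size_gb=500)
        print(f"RAID {level}: {usable} GB usable, survives {survivable} disk failure(s)")
```

With two disks, RAID 0 gives you all the space and none of the safety, while RAID 1 gives you half the space and lets you lose a disk without losing data.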

March 25, 2008

Blacknight Technical Blog Now Live

If you want to know about any service-affecting maintenance, technical updates or anything else of a technical nature, we recommend that you check out our new Technical Blog.

The site is hosted outside our core network (we don't even use our own nameservers just to be 100% safe!) and is part of our backup / contingency plans for emergency situations.

While our network uptime has been, and hopefully will continue to be, exemplary, there's no reason to be lazy. We need to make sure that we have a system in place in case there is an issue, NOT after the issue arises.

You can subscribe to the site's RSS feed OR to the email alerts.

Your choice :)

March 20, 2008

Blacknight On WebmasterRadio.fm

Journalists call from time to time asking me to talk about various internet-related topics. Most of the time the publications or shows are "general interest", so you can only talk about very general things.

Last night, however, was quite different, as I was one of the guests on "Domain Masters", which is broadcast and streamed weekly at 7pm EST (11pm in Ireland, midnight CET).

The show's host last night was my good friend Jothan Frakes who is one of the domain name industry's gurus.

Although I was very nervous (which probably showed!) we had a nice chat about Blacknight, domains and the internet industry.

If anyone wants to hear the show there should be an MP3 version available on the WebmasterRadio site at some time over the next couple of days.

UPDATE: The MP3 from last night is now available on the site http://www.webmasterradio.fm/Internet-Marketing/Domain-Masters/Geo-Domain-Expo-and-BlackKnight.htm

UPDATE 2: Of course if I provided proper hyperlinks people might be actually able to use them!
So here you go: Show details including podcast

March 6, 2008

INEX connectivity Upgrade

When: INEX LAN#1 connection being upgraded @ 23:00 on Monday 10th of March.

What: We currently have 2 x 100M connections to INEX. Our LAN#1 connection carries a lot of our INEX traffic and as such we're upgrading it to 1000M to prevent it becoming a bottleneck for traffic originating in Ireland. This is a simple software configuration change for the port speed on one of our routers and requires INEX operations staff to do the same on their end.

There'll be a brief hit as all our INEX peerings on LAN#1 go down and traffic re-routes over LAN#2 and transit. This should only be temporary and peerings should re-establish automatically after a few minutes.

Summary: @ 23:00 on March 10th we're upgrading our primary connection to INEX to 1000M. Traffic reaching us via INEX peerings on LAN#1 should be re-routed via LAN#2 and transit with minimal downtime, just the time it takes for BGP to reconverge.

Update: 23:25 March 10th 2008

This upgrade went ahead without a hitch. We're now running at GE (Gigabit Ethernet) @ INEX on LAN#1. We'll upgrade LAN#2 later in the year as necessary.

Inter DataCentre connectivity testing

When: Monday 10th of March @ 22:00 hours

What: Firstly we've recently lit our own protected wavelength between DEG and InterXion. It has been in place and in testing for a few weeks now. We need to test the failover on both the long and short legs of this new connectivity and also check the failover to the backup layer 2 paths in the event of both the short and long legs getting damaged at the same time.

We don't expect any downtime during this testing as our layer 2 network normally fails over within a few milliseconds.

Secondly we're moving the InterXion firewalls to the new Distribution routers in this location. This change should take 30 seconds or so to propagate within our network as it's a logical Layer 3 change.

This will mean the firewalled networks in DEG and InterXion will be separated from each other, and traffic originating in or destined for each data centre won't need to traverse our metro network.

For complete testing we'll allocate 2 hours to perform these tests. We don't envisage anything more than a few hits of 10-30 seconds on metro traffic. These won't affect everyone, and should only cause slow loading times for some websites and not others.

Summary: Works begin @ 22:00 hours on Monday 10th and end at midnight on Monday 10th. There should only be a few short hits on our metro links as they fail over while we simulate fibre cuts, switch failures, port failures etc.

Update: 23:58 March 10th:

These tests have been completed. The inter DC links have been tested in several scenarios and we're happy that the setup is quite resilient now. We also moved the InterXion firewalls from the DEG routers to a new distribution router pair in InterXion. This took a little longer than expected, around 3 minutes and 40 seconds or so, due to a slight OSPF glitch in the config which took a minute or two to find. Everything went to plan except the initial stages of that firewall move.

February 6, 2008

Metro-ethernet ring outage - Non service affecting

Overview:

At 10:04am on 4/2/2008 an ethernet card failed in one of our metro-e provider's Layer 2 connectivity devices in DEG. Immediately (within 50ms) our kit failed over to our backup route into DEG. There was no service disruption during this window due to our resilient network design. At 12:00 the card was replaced, the link came back up and we flipped our traffic back over to our primary link. Again service was unaffected.

We received the RFO from our metro-e provider yesterday afternoon that basically said what I've described above. A card failed and it was replaced within 2 hours.

January 3, 2008

Scheduled Network Maintenance Wednesday/Thursday 9th/10th of January 2008

When: Starting Wednesday 9th @ 22:00 and ending Thursday 10th @ 01:00

What: Migration of Dedicated, Colocation and IP transit customers
to new Juniper network layer.

In December we bought a bunch of new Juniper routers to upgrade
our core network with. The ones that were there were almost 2 years
old and were due an upgrade.

We'll have the new Juniper router pair pre-configured with all prefixes
and BGP sessions. We'll slot it into place, clear the ARP cache
on all affected layer 2 devices and shut down the old device. There will
be approx 10-30 minutes where routes to certain parts of our network
are unavailable.

This will also remove the need for our old IPv6 configuration. We'll now
have an end-to-end native IPv6 core running on the Juniper platform. We're
the first hosting company in Ireland to build a native IPv4 and IPv6 network
core on the Juniper platform and we're very proud of this fact.

Who will be affected:

Customers on our unfirewalled network (who have their own routers or
firewalls) or IP Transit customers.

This affects both customer groups in InterXion and DEG locations. If you
are unsure if this affects you or not, give us a call or drop an e-mail
to support@blacknight.com

Summary:

On Wednesday 9th starting @ 22:00 hours we'll be performing maintenance
on the routers that run our un-firewalled and IP Transit networks.

October 18, 2007

Firewall Upgrade Completed Successfully

The scheduled maintenance for last night went ahead on time.

According to our engineering team most people would have been affected very briefly (less than one minute).

If anyone is experiencing issues please let us know ASAP. While everything has been tested thoroughly and we have not had any reports of issues to date there is always a possibility that someone was affected - let us know if you were.


Personally I'm overjoyed that the upgrade was finally completed, as it means that our network is a lot more resilient than previously, which means I get to sleep more soundly at night!

October 11, 2007

Scheduled Network Maintenance - Wednesday 17th of October @ 22:30 hours

When: Wednesday 17th of October @ 22:30 hours

What:

Firewall Upgrade. We're moving our colocation and dedicated server
customers out from behind the current HA pair of firewalls. We've
indicated recently on our blog that we bought 4 new Cisco ASA firewalls
and the time has arrived to install them.

Who will be affected:

Both of our firewalled networks will be affected by this. Firstly
our shared hosting firewall will be moved to a new IP address on the
WAN side to facilitate VPN configurations for our colocation and dedicated customers.

Secondly the new ASAs will be put in place and they will replace the current
firewalls and access routers for these customers.

We estimate around 30 minutes to an hour to move the shared hosting firewall
and around another 30 minutes to an hour to facilitate the new firewall
install. This includes all the cabling work etc that will need to be done.

We will also allow a further hour for testing of both networks, so we're looking
at a maximum of 3 hours for this work to be completed.
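
For anyone who wants to check our sums, the window estimate above can be sketched as a simple tally of the three phases. The phase labels and the function name are made up for illustration; the minute ranges are the estimates given in this post.

```python
# (min_minutes, max_minutes) per phase, using the estimates from this post.
# The phase labels are illustrative only.
PHASES = {
    "move shared hosting firewall": (30, 60),
    "install new ASA firewalls": (30, 60),
    "testing of both networks": (60, 60),
}


def window_bounds(phases):
    """Return (best_case, worst_case) total duration in minutes."""
    best = sum(low for low, _ in phases.values())
    worst = sum(high for _, high in phases.values())
    return best, worst


if __name__ == "__main__":
    best, worst = window_bounds(PHASES)
    print(f"Best case: {best / 60:g} hours, worst case: {worst / 60:g} hours")
```

The worst case is the figure we plan the window around.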

Summary:

All colocation, dedicated and shared hosting customers will be affected by this outage.

September 17, 2007

New Cisco Firewalls

Following on from last Tuesday's incident we are following through on our promises.

Our technical team had been discussing the finer points of various firewalls for some time. When it comes to choosing equipment they always spend quite a bit of time evaluating the options. They have to take into account a lot of different factors.
How well will it work with existing equipment?
Will it scale?
How long before we have to replace it?
How much does it cost?
Do we have staff who know how to use it?
Does it support IPv6?
How much traffic can it handle?
How many concurrent connections can it handle?
How much RAM does it need?

The list goes on and on...

In the end we decided to go with the Cisco ASA 5500 series.

And since we love our camera phones here are a couple of snaps of the new firewalls. Before anyone asks - I'm not 100% sure when they'll be installed.

cisco-asa-firewall-frontview.jpg


And from behind:

cisco-asa-firewall-rearview.jpg

And a slightly further away shot:

cisco-firewalls-longview.jpg

Unscheduled Network Outage - Sunday 16th 18:40 - 19:05

Summary:

An internal routing issue developed in our network between our edge routers and our core distribution routers.

Diagnosis and Resolution:

During regular network maintenance Blacknight staff were moving a customer from the shared VLAN to their own VLAN. During this move we were forwarding IP packets from their old IPs to their new IPs. Normally this should not be a problem. However, in this case the forwarding rules caused OSPF on the primary distribution router to flap. As a result of the session flapping the secondary router was not able to take over correctly. We manually failed over to the secondary router and at this point the network stabilised.

We are still investigating this issue, but we believe a router upgrade that is planned for later in the year will fix this issue permanently.

September 12, 2007

Incident Report - Tuesday September 11 2007

Yesterday at lunchtime there were some issues on our network.

I'll try to explain what happened in simple terms and also explain what we are going to do to avoid this type of issue arising in the future.

If anyone has any queries about the explanation please feel free to ask via comments or email us directly.

Timeline: 13:55 - 14:18

Affected Customers: Any customer on the shared firewall that has a dedicated server or has colo with us was affected during this incident. This also included our shared hosting clients.

What happened?

At around 2pm yesterday afternoon a segment of our main network was sluggish and people would have experienced latency and packet loss.

Why?

As you may know our main network is firewalled. We have a pair of firewalls set up in HA (high availability) to protect the bulk of our clients, which includes all our shared hosting clients on both Windows and Linux, as well as a large number of clients on dedicated servers or with colocated machines.

Firewalls are basically computers. Depending on how much money you want to spend on them you get different capabilities. While our firewalls are perfectly adequate under most conditions they have limits.

When a server behind the firewall was compromised and started pumping out large amounts of traffic the firewalls were pushed to capacity. While the network was up at all times it would have been slow and unresponsive until our engineering team were able to take action.

What action was taken?

The server that had been compromised was disconnected from the network until the issue had been resolved / removed.

How can we avoid this in the future?

We had been planning to upgrade the firewalls in any case, and this is now being moved forward. The new firewalls will be able to carry larger amounts of traffic, so this kind of issue will have a lower impact should it arise again.

For the last few months we have also been actively encouraging clients to opt for their own firewall(s).

And now for the more detailed breakdown:

Outage Information with Timeline of Events

13:53 C program downloaded onto a customer's machine via a hole in their programming code.
13:55 Code compiled and executed. As a result, 80 Mbit/s of additional traffic headed towards the shared firewall service during peak lunchtime traffic.
14:05 Our engineering team notice that SSH and terminal services connections to machines on the network behind the firewall are laggy or intermittent.
14:06 Senior onsite engineers begin to investigate the issue.
14:08 One of our external traffic links is found to be carrying approx 50 Mbit/s more traffic than normal (some traffic from the affected host never made it past the firewalls), and the engineers begin to check access switches to find which equipment cabinet has the infected host.
14:15 The host responsible for this increase in traffic is identified and their switch port is shut down by a network engineer.
14:16 Services begin to return to normal and the load on the firewalls' CPUs drops back to acceptable limits.
14:18 All services are back to normal.
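
For anyone who likes to see the intervals spelled out, here is a small sketch that works out how long each stage of the incident took from the timestamps above. The event labels are shortened paraphrases and the function name is made up for illustration.

```python
from datetime import datetime

# Timestamps taken from the timeline above; the labels are shortened paraphrases.
TIMELINE = {
    "traffic_starts": "13:55",
    "issue_noticed": "14:05",
    "port_shut_down": "14:15",
    "all_back_to_normal": "14:18",
}


def minutes_between(start, end):
    """Whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    t = TIMELINE
    print("Time to notice:", minutes_between(t["traffic_starts"], t["issue_noticed"]), "min")
    print("Time to isolate host:", minutes_between(t["issue_noticed"], t["port_shut_down"]), "min")
    print("Total impact:", minutes_between(t["traffic_starts"], t["all_back_to_normal"]), "min")
```

The total matches the 13:55 - 14:18 window given at the top of this report.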

September 5, 2007

What kind of services do people want? Give us your feedback!

There are several reasons why this blog exists and one of them is to get feedback from clients.

It may come as a surprise to people, but we actually do pay attention to what they say to us and about us.
I'd love to think that we do a good job all of the time, but there may be aspects of our service that fail to meet your expectations and if that's the case I'd like to know about it. (If you don't want to comment in public you can always email me directly: michele@blacknight.eu ). It might be something as simple as the way we worded our product or service offering ... If people don't let us know we have no way of knowing!

We are currently working on rolling out a new suite of websites and we will be unveiling a whole range of new products and services over the coming months. I'll be teasing you all with little details as we finalise things, but now is also the ideal time for us to take your feedback. If you want us to offer something that is feasible then we might just do that. Of course we might think your idea is crazy ... but if you don't talk to us we will never know.

What kind of services would you like to see hosting providers like us providing in the future?

What elements of our current hosting plans would you like us to change? (I'm not saying we will change them, but I am more than willing to listen)

Which technologies would you like to see us offering in the future?

August 16, 2007

Network Maintenance - Cisco Switch Upgrade

Our team of networking engineers like to keep our network running smoothly and I am happy to say that they do a very good job of it overall.

Of course this means that from time to time they have to upgrade and patch things.

So between 27th and 28th of August we will be doing maintenance on our Cisco switches, which involves upgrading the IOS on some of the devices.

The affected switches will have to be rebooted, so there could be a loss of connectivity for up to a minute as the devices reboot.

Since we're doing this in the middle of the night it should not affect many clients as it is set to happen between 2200 and 0200 GMT. If you are located in the US for example that would be 1600 to 2000...
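
The local time depends on which part of the US you are in, so here is a rough sketch for converting the window yourself. It takes the 2200 to 0200 GMT window at face value and assumes it starts on the evening of the 27th. The zone list uses the fixed US daylight-time offsets that apply in late August, and the names in the code are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Maintenance window, taken literally as GMT (UTC+0).
# Assumption: the window starts on the evening of 27 August 2007.
WINDOW_START = datetime(2007, 8, 27, 22, 0, tzinfo=timezone.utc)
WINDOW_END = datetime(2007, 8, 28, 2, 0, tzinfo=timezone.utc)

# Fixed US daylight-time offsets in effect in late August (illustrative list).
US_ZONES = {
    "Eastern (EDT)": timezone(timedelta(hours=-4)),
    "Central (CDT)": timezone(timedelta(hours=-5)),
    "Mountain (MDT)": timezone(timedelta(hours=-6)),
    "Pacific (PDT)": timezone(timedelta(hours=-7)),
}


def local_window(start, end, tz):
    """Return the window as (start, end) HHMM strings in the given zone."""
    return start.astimezone(tz).strftime("%H%M"), end.astimezone(tz).strftime("%H%M")


def window_table():
    """Map each zone name to its local (start, end) pair."""
    return {name: local_window(WINDOW_START, WINDOW_END, tz) for name, tz in US_ZONES.items()}


if __name__ == "__main__":
    for name, (start, end) in window_table().items():
        print(f"{name}: {start} to {end}")
```

On those assumptions the 1600 to 2000 figure lines up with Mountain time, and the window is 1800 to 2200 on the east coast.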

In any case there's more info on the forum

August 10, 2007

Fibre Issue Update

We have received a detailed report from our fibre provider regarding last Friday night's outage.

As the report is very long and highly technical I won't be publishing it here.

If anyone affected by last Friday night's issue would like more information about the steps that both ourselves and our fibre providers are taking to avoid future issues please let us know.

July 2, 2007

Network Upgrade (again!)

Just to let people know that we are doing yet another network upgrade next weekend.

When?
Friday 6th, Saturday 7th of July
Time: 22:00 - 02:00

For full gory details see this post on our forum

June 16, 2007

Network Maintenance Followup

Just to follow up on the network maintenance from the other evening.

The work went ahead as scheduled and we have not had any reports of issues from clients.

We will be announcing the next phase in our network upgrades and maintenance plans in the coming days.

To keep abreast of these changes I'd recommend you subscribe to our RSS feeds :)