Web hosting applications often require extreme uptime. Server hardware comprises one part of how to obtain this level of uptime.

Transparen's Experience in Web Hosting - Yukon


How Transparen Got Started in Web Hosting

Let us discuss how Transparen got involved in web hosting in the first place. As a consulting company that started out offering project management and PHP/MySQL programming services, we hosted our first few website projects on other people's servers. This had several advantages:

Advantages in Not Doing Web Hosting

  • If something went wrong, we could blame the hosting provider.
  • If we couldn't do something special, we could blame the hosting provider.
  • We could save the day because generally we were much more on top of the client's project and needs than the hosting provider.
  • The monthly costs were borne by the client and we did not need to assume any interest in these costs.

Also, when a hardware problem occurred (a hard drive crash), we discovered that our hosting provider did not maintain good backups. We were able to recover gracefully from the server crash only because we had been diligently taking backups of our clients' software. The provider dealt with the hardware and operating system side of the crash, which thankfully was not too stressful for our team: we were providing only the software and database development, not the operating system or hardware-level support. And we had very good backups, so we rested pretty easily, and perhaps a bit smugly, whenever bad things happened.

Started Managing Operating System to Provide Value-Added Services

However, we soon learned that Transparen's clients needed more specialized software to run on their server, so we created a private hosting environment by leasing a computer from a hardware-managed hosting provider (i.e. they handle the hardware, we manage the software). This is often confused with colocation. Colocation is purely rental of space, electricity, and bandwidth. Hardware-managed hosting often bills itself as a 'Managed Server'. Essentially it is a colocation contract, plus a leasing contract, plus a management contract that probably does not allow you to take over the duties of solving whatever the hardware problem may be. The experience with hardware-managed hosting went well for a year or so, but then went badly when there were two back-to-back hard drive crashes, two weeks apart. The second crash required costly and time-consuming data recovery, while we and our clients fumed at the hardware-managed hosting provider and talked about upcoming lawsuits and our mounting losses due to their handling of the situation.

Hardware-Managed Hosting Gave Us Crap Hardware, and No Control

In the process of dealing with two back-to-back hard drive crashes through our hardware-managed hosting provider, we learned something about our "brand new" computer that we had leased: the first hard drive was crap - and not only was it crap, but it was refurbished crap that had crashed before and been sent back into use. And the second hard drive, the one from recovering the server after the first crash - yup, that was refurbished crap too. The other thing we had learned was that if something went wrong with the hardware, there was absolutely nothing we were allowed to do about it. If the crash happened at night, we could not even take a look at the machine. We just had to wait. Outside. Essentially, we had to sit in a 24h internet cafe, ping the server in hopes that the machine would soon come back up as a result of the hardware-managed hosting provider's efforts, and be ready to take over the software recovery part as soon as SSH access was restored.
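The waiting game described above (watch the server until SSH answers again, then start the software recovery) can be automated. The following is only a minimal sketch of that idea, not what we ran at the time: the function names, the 30-second interval, and the choice of checking TCP port 22 are all assumptions made for illustration.

```python
import socket
import time


def port_is_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_ssh(host, port=22, interval=30.0, max_attempts=None,
                 check=port_is_open, sleep=time.sleep):
    """Poll until the port answers.

    Returns the number of attempts it took, or None if max_attempts
    was reached without a successful connection.
    """
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        if check(host, port):
            return attempt
        sleep(interval)
    return None
```

A call such as `wait_for_ssh("server.example.com")` (a placeholder hostname) blocks until the SSH port accepts connections. The `check` and `sleep` parameters exist only so the loop can be exercised without a real network.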

Now, by the time these two hard drive crashes had happened, we had already developed quite an expertise in data recovery and what to do about hard drive crashes, having become the de-facto go-to company in Vancouver and Surrey and the GVRD for data recovery, and having emergency services available - so for both crashes, an on-call technician was ready to help and handled the hard drives to recover the data. But that's not the point.

Software Control Was Not Enough - So Transparen Bought a Server and Colocated It

Transparen had taken responsibility for the software and had it under control and working to the clients' liking, but we still had the issue of hardware. It was true that we could blame the hardware-managed hosting provider for mishandling our server hardware, but that was small comfort: in reality, people don't sue each other much in Canada for stuff like this, and we were only too aware that this incident would affect our clients and how people viewed our services.

And besides, while it was a great idea to blame the hardware-managed hosting provider for triggering the problem (because it deflected attention away from our part in the problem), Transparen had been bad too - we needed to rely on data recovery to get the data back, because the most recent backups were too old. As a company that provides data recovery, and regularly spends time talking about how important it is to do regular backups, we therefore have great sympathy for the many people out there who, for whatever reason, lose data because they do not have adequate backups. The true test of a backup does not come frequently, and everything else (i.e. projects, taxes, etc.) may seem to have higher priority until it is too late to do backups.

No More Blaming the "Managed Server"

But after two back-to-back hard drive crashes caused by faulty refurbished parts, when it would really not have been too expensive to have new hard drives or even a simple mirrored RAID array, it was time for the managed hosting provider to get fired. We also took our negative feelings out on them by not paying their bills and by telling them that, through their negligent and improper handling of our situation, they had cost us far more than they were invoicing us. Later, as the laws of tit-for-tat would have it, the situation inevitably repeated itself (roles reversed) with one of our clients doing that to us - but let us continue with our story.

As we alluded to earlier, the advantage of having a hardware-managed hosting provider was that we could blame them for the hardware failure. So, in part because we needed to begin to take responsibility for the hardware, and in part to save face, we "fired" them by resuming all service from a colocated server that we purchased from a smaller business downtown. There we could speak directly with the president and establish a better relationship with more trust, so that when the next hardware malfunction occurred, we would have access to the server and could fix the hardware ourselves.

Refreshing the Context

Now, it's important to remember that we started by using other people's web hosting services, then we purchased a hardware-managed server so we could have root access, and after the hardware-managed hosting provider failed to keep up their end of the bargain (to keep the hardware in good working order), we graduated to owning our own IBM Netfinity M10 and paying to colocate it with a small company downtown where we had the assurance that we could access and fix the server in the event of problems.

Relative Quiet Period

And, indeed, 6 months of relative quiet went by (aside from minor glitches such as people unplugging our server by mistake) before we had another hardware problem, but another one did occur, and here's how it went.

First Major Incident After Moving to Own Server

On a quiet Sunday afternoon, at 2pm, a call came in that the website could not be accessed, and could we check into it. We called our colocation provider, and sure enough they were on-site installing an air conditioner, and our server had been unplugged by accident. All of these 'unplugging' incidents were starting to sound like an old story, but at least the server could be plugged back in and it should boot up just fine. But a few minutes later, the server was still not back up, and a call to the colocation provider confirmed that we were dealing with a problem. How are your backups, they asked.

Thankfully, the Backups Were Recent

The backups were in good shape. We had learned our lesson from previous experiences and the most recent backup was from the night before. We also had older backups. We picked up the backups and went onsite within 40 minutes, and then began troubleshooting.

But Troubleshooting Hardware Took Too Long - Days, Not Hours

But, as could be expected, the troubleshooting took some time. And it involved the 3 drive (+ 1 hot spare) RAID5 array that months earlier we had talked up as meaning that a drive could crash and the server could keep running, automatically rebuild the RAID using the hot spare, and then we could replace the faulty drive without taking down the server. This was the server that was supposed to never go down, and unfortunately, due to power cords being unplugged and now due to this additional problem, clients were getting the experience of a machine that would go down with a certain regularity. Not good. Definitely not good.

Rebuild the RAID

We rebuilt the RAID5, and it seemed to be fine, so we started the machine again. It booted!

FSCK

But there were file system problems - no problem, just run FSCK.

There were a lot of file system problems - but they did not seem to be affecting data areas very much.

The colocation provider was advising that if the file system problems were never-ending, or if they exceeded a certain cut-off point in time with no signs of slowing down, then we should give up and rebuild the system from a backup.

FSCK Seemed To Finish

But the file system problems eventually subsided. A few more passes, and the file system seemed fine....

And then, the RAID fell offline again. First one drive, then all of them. RAID5 is not supposed to collapse like that - it should be one drive that fails, and the others should stay online, so you replace the failed drive, and move on.
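To see why a RAID5 set is expected to survive the loss of one drive, and only one, it helps to look at the parity arithmetic. The following is a toy sketch, not the ServeRAID firmware's actual layout: it assumes a single stripe made of two data blocks plus one XOR parity block, and the block contents are made up.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(stripe, missing_index):
    """Recompute one missing block from the surviving blocks of the stripe."""
    survivors = [block for i, block in enumerate(stripe) if i != missing_index]
    return xor_blocks(survivors)


# Two data blocks of equal length, plus the parity block computed from them.
data_a = b"web files"
data_b = b"databases"
parity = xor_blocks([data_a, data_b])
stripe = [data_a, data_b, parity]
```

Any single block in `stripe` can be recomputed from the other two with `rebuild_missing`. If two blocks are missing at once, the one survivor is not enough to recover either, which is why an array dropping several drives in a row is so alarming.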

RAID Offline Again

One by one, we replaced the drives, rebooted into the RAID manager, and initiated rebuilds. Rebuilds would go for a while, and then collapse, and then we would try a different combination. The whole process was painfully slow. During waking hours, calls came in regularly asking for progress.

Swap the ServeRAID 3L card with a known working card. Same results. Swap cables. Same results. Swap the ServeRAID 3L card with one that's in an unused computer across the room. Same results. Swap it back. A technician suggests we replace the power supplies, which we hadn't done yet. Replace the power supplies. Same results.

Losing Hope - Rebuild From Scratch

Finally, there was no point in continuing the process - it seemed as if there was no hope of rebuilding this RAID5 array, so the 4 drives were removed, and a 10 drive RAID enclosure was attached. The 4 drives were set aside rather than reused because there was still hope that we could recover the data from them, but that hope could not stand in the way of a server operating system rebuild. The show must go on, with or without that day of data.

The simplest way to configure the new setup would be just like the old setup, but the drives in the 10 drive RAID enclosure were not the right size to fit right into the Netfinity M10's 4 drive enclosure. So, a 3 drive (plus 2 hot spare) RAID5 array was configured.

It seemed fine for a few moments, but then it collapsed.

External Enclosure RAID Collapsing Too

Try again.

Remove some of the 10 drives - perhaps the scsi2 ServeRAID 3L can only handle 7 or so, because of extra spots taken up by the backplanes. Seems to work, for a while, but then the set collapses again. These are new sets - they should not collapse like this, but they do.

Try again, and it seems there's something stable, so our operating system technician begins the rebuild.

Server OS is Rebuilding - Meanwhile...

Meanwhile, there's an identical-looking computer at the other side of the room, sitting unplugged, not in use. For kicks, that computer is attached to a keyboard, monitor, and mouse, and the four SCSI drives attached to its 4-drive RAID enclosure are removed and placed in order on a table. The four "defective" drives are placed in order into this machine, and the machine is turned on.

The machine boots up, seems to notice that there are strange drives in it, and asks what to do:

  1. Ignore the error
  2. Use the ServeRAID card's configuration
  3. Import the configuration from the drives

We picked option 3, and damn, it seemed to work!

3 hours later, the RAID5 set had completely rebuilt itself!

Reinstall Rebuilt RAID5 in Original Server (to Avoid Buying New Server)

So, we interrupted the technician's work, and tried putting the drives back into the old machine, and after 4 attempts, concluded that the old machine must have a SCSI backplane problem because it could not accept the drives, no matter what.

Buy New Server

So, put the drives back into the new computer. We began to use that one instead. The only problem was its location. It was physically located far away from the other server, and these Netfinity M10s are very very heavy. Two technicians were working on-site, and another was working remotely on the operating system rebuild. Since the SCSI cable we were using was long enough to go across the room, and the network cable could be made to stretch as well, temporarily at least, the new server would be left in place and we would rebuild it.

Resolve to Never Ever Ever Use RAID 5 For Mission-Critical Servers!

The RAID was allowed to rebuild.  Once it had rebuilt, not taking any more chances, we booted back into the operating system and started sshd to allow our technician to begin the server operating system install.  Meanwhile, time to take a look at what's left of the glorious RAID5 that we had so confidently trumpeted to our clients, only months earlier, as the availability solution that would prevent exactly this sort of downtime.  And... nothing.  It would not mount.

FSCKing RAID 5 Array, Once Rebuilt, Would Not Mount

At a time like this, calm is required.  Calm, rational thought.  One of the steps that we had been obliged to take was to remove all RAID arrays and create a new RAID array.  Of course the RAID5 might have been overwritten entirely... but it was more likely that the beginning portion may have been reinitialized.  So we ran a data recovery utility, and sure enough, it was possible to see the old partition.  And, after writing a new partition table, we were able to mount this partition and see the files. Yay.

Resolve to Never Ever Ever Use RAID5 For Mission-Critical Servers!

So the new server RAID rebuilt again, and the 10 drive RAID enclosure was attached, and configured as 5 sets of RAID 1 (mirrored RAID), because then we could always attach the enclosure straight to a SCSI port (not a RAID card) and have access to the data with no further complications. RAID 5 has a lot of marketing gloss, but in reality it is much less redundant than RAID 1: in a RAID 5 set, only one drive can fail, whereas with mirrored pairs, up to half the drives could fail, as long as no pair loses both of its drives. Also, recovering from a failure in RAID 1 is simpler, because each surviving drive holds a complete copy of the data that can be read without first rebuilding an array. The operating system rebuild resumed, and we decided to use ReiserFS instead of EXT3 for stability, and to build Gentoo for speed.
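The redundancy comparison above can be made concrete with a small model. This is only a sketch under stated assumptions: drives are numbered 0 to 9, the pairing of drives into mirrors is invented for illustration, and the model ignores hot spares and rebuild windows.

```python
def raid5_survives(failed_drives):
    """A single RAID 5 set tolerates at most one failed member."""
    return len(set(failed_drives)) <= 1


def raid1_sets_survive(failed_drives, pairs):
    """Independent mirrored pairs survive unless some pair loses both drives."""
    failed = set(failed_drives)
    return all(not (a in failed and b in failed) for a, b in pairs)


# Ten drives arranged as five mirrored pairs (hypothetical numbering).
pairs = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]
```

With this model, `raid1_sets_survive([0, 2, 4, 6, 8], pairs)` is true (five failures, one per pair), while `raid1_sets_survive([0, 1], pairs)` is false (two failures in the same pair). The "half the drives" figure is therefore the best case, and the guaranteed tolerance is one drive per mirror.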

About Transparen's OS Software Setup

Transparen's server uses in-house kernel patches and Apache patches to support the security infrastructure that we favour (kind of a mix between what Windows does for permissions in NTFS and what Linux does), so those patches were ported, and we upgraded to the latest kernel, the latest Apache (we were on Apache 1, and moved to Apache 2), and the latest PHP (we were using PHP4, but now we use PHP5). To a software system administrator, these kinds of changes can seem quite risky, but they were surprisingly low-impact, and the upgrades have significant advantages.

Crash Took A Lot of Time

But the whole process took two days, and 3 days if you count basic email, and 4 days if you count things like IMAP and POP and all the details like that. So it was quite a significant amount of downtime. Two clients in particular were quite unhappy about this. One decided to stick with Transparen for hosting, because Transparen is continually improving its hosting service, now has a second server with which to implement a failover redundancy plan, and has a better understanding of what sorts of backups are needed to improve the best-case downtime after a catastrophic hardware failure. The other one (with significant in-house experience, including a system administrator who used to work for Transparen) decided to purchase and set up their own server and manage it themselves.

Clients Were Upset

Transparen sent them a small invoice to cover their share of improving the reliability and availability of the service (which was a pretty small amount to invoice considering that they had paid nothing for their mission-critical and special-access service for more than 9 months), and, saying "do you want to see our invoice?", they did not pay Transparen's invoice. Just like we did to our previous hardware-managed hosting provider after they had a hardware problem and we fired them, and probably for many of the same reasons.

Analysis of Problems and Solutions

Third Party Backup Was Not Done

One problem was that we were supposed to have backups serviced by the colocation provider, but these were never done. We received a full refund for them, however, and we were very happy with the colocation provider for allowing us to stay with the server and doctor it back to life using spare parts that they had pre-purchased. That would never have happened with a machine purchased from a big company in a bigger data center.

Off-Site Backups Were Data-Only (Did Not Include Full System Backup) - But Easily Can Include Full-System Backup From Now On

Another problem was that the off-site backups, although they did cover user data, email, web folders, and databases, were not sufficient for a complete instantaneous recovery. It would have saved a lot of time if a complete system backup had been available and ready to use. And, as it turns out, it is not very hard to create and maintain such backups remotely, including snapshots of specific dates, using rsync and hard links. So that's one part of our automated plan to make sure that the next catastrophic failure is not catastrophic for our clients.
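One common way to get dated snapshots from rsync and hard links is rsync's `--link-dest` option, which hard-links files that are unchanged since a previous snapshot instead of copying them again. The following is a minimal sketch of that approach, not our production script: the directory layout (one folder per ISO date under a snapshot root), the function names, and the example paths are assumptions.

```python
import datetime
import subprocess
from pathlib import PurePosixPath


def build_snapshot_command(source, snapshot_root, today, previous=None, excludes=()):
    """Build an rsync command that writes a dated snapshot directory.

    If `previous` names an earlier snapshot folder, unchanged files are
    hard-linked to it via --link-dest, so each snapshot looks complete
    but only changed files take up new space.
    """
    root = PurePosixPath(snapshot_root)
    cmd = ["rsync", "-a", "--delete"]
    cmd += [f"--exclude={pattern}" for pattern in excludes]
    if previous is not None:
        cmd.append(f"--link-dest={root / previous}")
    # Trailing slash on the source: copy its contents, not the folder itself.
    cmd += [source.rstrip("/") + "/", str(root / today.isoformat())]
    return cmd


def run_snapshot(cmd):
    """Run the rsync command, raising CalledProcessError if rsync fails."""
    subprocess.run(cmd, check=True)
```

For a full-system backup, pseudo-filesystems would normally be left out, for example `excludes=("/proc/*", "/sys/*")`. The snapshot root should be an absolute path, because rsync resolves a relative `--link-dest` against the destination directory.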

RAID5 Backplane Issue Was Hard to Diagnose

Another problem was that pesky RAID5 backplane issue. It was far too complicated to diagnose, and the only fix was to use another server. So, we have the drives in an external RAID enclosure, and all of the data is stored on RAID 1 arrays (two drives with the same data). Writing data writes to both drives at the same time, but reading data reads from either drive, so writing is the same speed as one drive, but random-access reading can be faster. The data will not be stored on RAID5 arrays, except that we still have the old RAID5 set, which is currently unused, and that will be step 2 of our quick-recovery plan. Although it will not have the full system, the RAID 5 will have an active system backup of all critical system functions, so that in the event of a problem with the RAID enclosure, there is a whole other drive set that is ready, or almost ready, to take over. Also, nicely, it's on a different ServeRAID card, and has a completely separate backplane.

Power Was Unplugged Too Often

Another problem was the "power unplugged" issue. That was particularly embarrassing, because we really like this colocation provider, except for that habit of unplugging our power. When we went on-site to fix the server, we saw why unplugged power cords were such a problem. The server had been in a very high-traffic area, near the air conditioners and other servers with long power cords. The device we used to control our power remotely was sitting on the floor. The whole setup was too complicated and there were too many ways it could fail. So, we have moved the server to a much less trafficked area, its own corner, away from air conditioners, with its own dedicated backup UPS power supply, and made it abundantly clear to the colocation provider that unplugging incidents could not be tolerated.

Only One Server - We Need To Create Redundant Servers with Failover

Finally, there was the problem of having only one server. Unfortunately, there is no foolproof piece of hardware, and when servers go down, it can take time to get them to come back up. Sometimes, it can take days. Meanwhile, it would be great to make it so clients would not suffer. To accomplish this goal, and to make hosting more profitable, we are configuring the server we replaced as a virtual server host using operating system virtualization. This will be located in our office building (away from the colocated server) and it will keep a hot backup virtual machine running. The other piece of the puzzle is to have a load balancer directing traffic to the correct place in the event of one server having an outage, and DNS performing that function in case the load balancer fails.

One of the reasons that we had not done that until now was that the services we provide are almost always attached to databases, so it's never as simple as having a mirror copy of the files - the database must be synchronized differently, in a special way, and there is the issue of which database is correct - in other words, it takes a great deal of careful attention to set up the backup server correctly. However, the benefit is greater availability, and it's worth it, so we're moving forward with that.
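The traffic-directing half of the plan above boils down to "send requests to the first server that answers a health check". The following is a rough sketch of that decision only; it says nothing about the harder database synchronization problem. The hostnames are placeholders, and a plain TCP connect is assumed to be an adequate health check, which a real load balancer would likely refine.

```python
import socket


def tcp_health_check(address, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    host, port = address
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def choose_backend(backends, is_healthy=tcp_health_check):
    """Return the first healthy backend in priority order, or None if all are down."""
    for backend in backends:
        if is_healthy(backend):
            return backend
    return None


# Primary (colocated) server first, hot backup (office) second. Placeholder names.
BACKENDS = [("colo.example.invalid", 80), ("office.example.invalid", 80)]
```

Listing the colocated server first means the office backup only receives traffic while the primary fails its check. Failing back automatically is a design choice that should wait until the databases on both sides are known to agree.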

No Control Panel

Well - this isn't actually an availability-related issue, but hosting resellers have told us that they really really need and like to have control over their services, using a control panel. They probably have in mind cPanel or Plesk or some other control panel that takes over the server, which runs against our philosophy of what it takes to have a well-managed server.

Nevertheless it's an issue, so we'll either include control panels in one of the virtualized machines, or we'll have an in-house control panel that provides all of the features that resellers need to support their clients. In any event, this issue will be solved too, because it is one of the reasons why our server has been used mainly by our own clients, and not as much by resellers.

© 2003-2007 Transparen Corp.