Take This Cron Job And Shove It
Earlier this week, I began migrating all of the sites hosted by Synfibers that were using the PHP-based CMS called Drupal from version 4.2 (and one site that was hosted on a modified 4.0 installation) to the latest and greatest edition, 4.3.2.
For the most part, the process was pretty painless. Unlike some previous upgrades, there were very few times I had to manually alter a site's database to get the new code tree to work, and most of the time these were mentioned explicitly by the upgrade script. The first day I did one site, and let it sit overnight. When everything seemed to be fine, I did five more sites the next day.
That's when everything seemed to slow down to a crawl, about every fifteen minutes or so. I thought it was the scheduled cron job that updates the database with XML newsfeeds and so on; I figured some of the newsfeed URLs were bad or had malformed XML, so I started weeding out the feeds, removing sites that had invalid XML or sites that didn't seem to have active newsfeeds anymore.
Then Verio tech support emailed me, warning of dangerously high CPU usage in some Apache webserver threads and suggesting that it might be due to Googlebot. I thought at first that it was the cron jobs again, and that the tech just wasn't familiar with my usage of Drupal and the need for these cron scripts to be run frequently. As it turned out, he was familiar with it, and he had already dismissed the cron jobs as a potential cause because the times when they ran didn't match up with the times when these high-CPU-usage Apache processes were spawning.
So that meant that either the search bots were the culprit, or somebody was DDoSing Synfibers. The latter seemed unlikely, so I started changing some things. I used robots.txt to exclude cron.php from the bots' view, figuring that perhaps they were hitting that page and causing multiple cron script instances, which might bog things down. I took a look at top on the server and noticed some Apache processes going as high as 95% CPU utilization, in effect bringing the server to its knees.
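For anyone curious how a well-behaved bot would read that kind of exclusion, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt contents below are an assumption for illustration; the actual file isn't reproduced here, and the helper name is_allowed is made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents (an assumption, not the real file).
ROBOTS_TXT = """\
User-agent: *
Disallow: /cron.php
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant bot may fetch `url` under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # cron.php is excluded; ordinary pages are still crawlable.
    print(is_allowed(ROBOTS_TXT, "Googlebot", "http://synfibers.com/cron.php"))
    print(is_allowed(ROBOTS_TXT, "Googlebot", "http://synfibers.com/?q=node/view/17"))
```

This only models bots that honor robots.txt; a bot that ignores the file is unaffected by any of it.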
That change seemed to have no effect. So I blocked all the Drupal folders with robots.txt files. That had no impact either, but I know some bots ignore this file anyway, and I have no idea how often the rest even check for it.
Verio tech support suggested I temporarily disable all the scheduled cron jobs and watch for CPU hog Apache processes, which I did. Even with all the crons disabled, they still showed up, seemingly at random.
I ran the cron script for each site manually, watching to see how long they took to run and watching the CPU usage. They all ran flawlessly save one, which I fixed shortly. None took more than 30 seconds to run, allowing me to drop the max_execution_time in php.ini from the outrageously high value I'd been forced to use before (3000 seconds) to a much more sane 45 seconds.
None of the cron jobs was the culprit; CPU utilization never got above 5%.
Then the tech support guy suggested using the server-status Apache module to get more detailed information on exactly what pages were being loaded and from what IP address by these errant Apaches, and added that he hadn't been able to get the module to work on my server-- could I please disable a few PHP extensions to free up some memory?
That's where it all came together.
I had no idea what this module was or what it did, but thinking about Apache modules that weren't working for no apparent reason at all made me think about a note on the Google pages about GoogleAds, regarding things that their search engine might not like, that might lead to insufficiently targeted GoogleAds for a given site.
See, search spiders like Googlebot that try to index entire sites by following all the links on every page don't like session IDs, don't like session cookies, don't like dynamically generated sites, and don't like URLs with characters like '?' or '=' in them. That pretty much defined Drupal to a T.
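As a rough illustration of that checklist, here is a small Python sketch that flags the URL traits spiders reportedly dislike. The session parameter names are assumptions (PHPSESSID is PHP's default session name; the others are guesses), and crawler_concerns is a made-up helper, not anything Google publishes.

```python
from urllib.parse import parse_qs, urlsplit

# Assumed query parameter names that typically carry a session ID.
SESSION_PARAMS = {"phpsessid", "sid", "sessionid"}


def crawler_concerns(url: str) -> list[str]:
    """List the traits of `url` that a search spider might dislike."""
    concerns = []
    if "?" in url:
        concerns.append("contains '?'")
    if "=" in url:
        concerns.append("contains '='")
    query_params = parse_qs(urlsplit(url).query)
    if any(name.lower() in SESSION_PARAMS for name in query_params):
        concerns.append("carries a session ID")
    return concerns


if __name__ == "__main__":
    for candidate in (
        "http://synfibers.com/?q=node/view/17",
        "http://synfibers.com/design",
    ):
        print(candidate, crawler_concerns(candidate))
```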
In version 4.2, two new features were added to Drupal called Clean URLs and URL aliasing. Clean URLs was added to the Drupal core with 4.2; URL aliasing was done by a contributed module in 4.2 and came into the core in 4.3. In version 4.1, Drupal URLs looked like this:
http://synfibers.com/node.php?id=17
In 4.2 without Clean URLs on, they looked like this:
http://synfibers.com/?q=node/view/17
An improvement, as the script name was removed, though the '?' and '=' characters were still there. However, with Clean URLs on, the same URL looked like this:
http://synfibers.com/node/view/17
Now the '?' and '=' have both been removed, making this look more or less like a URL on a server that isn't generating dynamic content at all, except for the missing ".html" extension.
And with URL aliasing set for that page, you can have a URL like this:
http://synfibers.com/design
Now, not only does the URL not have any special characters in it anymore, but it's human-comprehensible the same way that static .html-based sites are.
The rub, of course, was that all this cool functionality depended on an Apache module that I hadn't used before called mod_rewrite. This module allows the server to take a URL requested by a client and "rewrite" it to a different URL and serve up a different page, all without the client's knowledge. So the internal URLs can have special characters and node id numbers, while the client just sees a nice 'clean' URL-- hence the feature's name.
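To make the idea concrete, here is a minimal Python sketch of the kind of mapping mod_rewrite performs for Clean URLs. It is not Drupal's actual rewrite rule: the function name, the existing_files parameter, and the choice of "/?q=" as the internal form (borrowed from the non-clean 4.2 URLs above) are all assumptions for illustration.

```python
import re

# Assumed rule shape: capture everything after the leading slash.
CLEAN_PATH = re.compile(r"^/(.*)$")


def rewrite_clean_url(path: str, existing_files=frozenset()) -> str:
    """Map a clean path to the internal '?q=' form the CMS understands.

    Paths naming real files (images, stylesheets) pass through untouched,
    the way a rewrite condition would skip them. `existing_files` is a
    hypothetical stand-in for checking the filesystem.
    """
    if path in existing_files:
        return path
    match = CLEAN_PATH.match(path)
    if match is None or not match.group(1):
        return path  # site root or unexpected input: nothing to rewrite
    return "/?q=" + match.group(1)


if __name__ == "__main__":
    print(rewrite_clean_url("/node/view/17"))
    print(rewrite_clean_url("/misc/logo.png", {"/misc/logo.png"}))
```

The client only ever sees "/node/view/17"; the translation happens entirely on the server side.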
I had tried playing with Clean URLs when Drupal 4.2 came out, mostly because of the drastic change in the way URLs functioned: 4.2 removed the names of individual scripts like "node.php" and "index.php" and "module.php" (which you still see often around the web on installs of other CMSes, like Movable Type or PHP-Nuke) and instead just used the generic "?q=" followed by slash-delimited terms to identify which module was required and which function was being performed.
But that meant that every external link to your site, from another site or from a search engine, was horribly broken. For that reason a third feature that used mod_rewrite was introduced, a filter that would automatically redirect the old-style "node.php?id=X" URLs to the new style URLs or to a clean URL.
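A sketch of what such a redirect filter has to do, written in Python rather than as mod_rewrite directives. The mapping from "node.php?id=X" to "node/view/X" is an assumption based on the example URLs in this post, and redirect_legacy_url is a made-up name.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def redirect_legacy_url(url: str, clean_urls: bool = True) -> Optional[str]:
    """Return the new-style URL for an old 'node.php?id=X' link, or None.

    Assumes node IDs are plain integers and that old node links map to
    'node/view/X'; both are illustrative guesses.
    """
    parts = urlsplit(url)
    if parts.path != "/node.php":
        return None  # not an old-style node link
    ids = parse_qs(parts.query).get("id")
    if not ids or not ids[0].isdigit():
        return None  # missing or malformed id: leave the request alone
    base = f"{parts.scheme}://{parts.netloc}/"
    target = f"node/view/{ids[0]}"
    return base + target if clean_urls else base + "?q=" + target


if __name__ == "__main__":
    print(redirect_legacy_url("http://synfibers.com/node.php?id=17"))
    print(redirect_legacy_url("http://synfibers.com/node.php?id=17", clean_urls=False))
```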
I wanted to use this feature so I wouldn't lose traffic from external sites and search engines. Unfortunately, I couldn't get mod_rewrite to work at all. It was enabled in httpd.conf, it showed as being active in phpinfo(), and seemed to be configured as required by Drupal in .htaccess. I tried moving those configuration directives around from the .htaccess files to the httpd.conf file. Didn't seem to matter. I emailed Verio tech support, and while they said they didn't support mod_rewrite explicitly, it seemed that I had set everything up correctly and they didn't see why it didn't work-- but it clearly didn't. Enabling Clean URLs on any of the Drupal sites caused the site to stop working completely; nothing would load at all at any URL.
I stopped trying and that's where it was left, until this other Verio tech support guy suggested I try turning off a few of the PHP extensions I had loaded in the hopes that it would free up memory, supposing that this was the reason why server-status wasn't working. I disabled a few unneeded extensions and waited for him to explain to me what he was going to do with server-status. But in the meantime, I figured this must be why mod_rewrite wasn't working as well. So I went back and tried turning on Clean URLs.
They worked.
I turned Clean URLs on for the three sites that Verio had indicated were being accessed when the Apache processes started hogging the CPU. Those processes just went away. Pretty soon I had switched the feature on for all the Drupal sites using 4.2 or 4.3, and had begun adding URL aliases to commonly-used pages and editing custom block code to link to clean URLs instead of the old ones.
I continued to watch the CPU usage-- no Apache process ever got above 25% after that, and even that only lasted a couple of seconds.
I emailed tech support and thanked them-- in return for which the tech thanked me for helping to track down the problem and apologized, since he still hadn't been able to get the server-status module to work. But then again, I can't disable everything.



