Mitigating PHP attacks with noexec file system mounts
Recently, I investigated a compromised Apache web server. This particular web server hosted 500+ web sites and had fallen victim to script kiddies / attackers. An investigation revealed that the attackers exploited an old Joomla CMS installation that a customer had failed to update. The attackers used the Joomla vulnerability to upload a binary to /var/tmp and execute it.
An analysis of this nasty executable revealed that it was apparently capable of performing SYN floods and UDP floods as well as other nasty attacks. When launched, the executable connected to a private IRC server (active-sound.de:6667) where the bot received its orders. The Joomla installation was subsequently brought up to date, and the binary was removed. The owner of the web server realises that these problems are typical of an unmanaged hosting business, but nevertheless he would like to take steps to mitigate and/or prevent this happening again.
A simple and effective solution to this problem is to remount the /tmp and /var/tmp file systems with the noexec,nosuid options. If it is not possible to repartition the disk, you can mount these two directories as tmpfs. This stops backdoors and IRC DDoS bots dropped into those directories from being executed directly, which makes it much harder for the server to fall into the hands of attackers. This protection can mean the difference between a failed attack and your server being rooted and/or turned into a DDoS / spam drone.
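A minimal sketch of how the mount options could be applied and checked. The remount commands and the fstab line are shown as comments because they need root; the tmpfs size and the helper name has_mount_option are my own assumptions, not part of the original investigation.

```shell
#!/bin/sh
# Commands an administrator might run as root (comments only, so the
# snippet is safe to run as an ordinary user):
#   mount -o remount,noexec,nosuid /tmp
#   mount -o remount,noexec,nosuid /var/tmp
# or, in /etc/fstab, a tmpfs-backed /tmp (size=512m is an assumed value):
#   tmpfs  /tmp  tmpfs  defaults,noexec,nosuid,size=512m  0  0

# has_mount_option MOUNTPOINT OPTION [TABLE]
# Succeeds if MOUNTPOINT carries OPTION in a mounts-style table
# (device mountpoint fstype options ...). TABLE defaults to /proc/mounts.
has_mount_option() {
    mountpoint=$1
    option=$2
    table=${3:-/proc/mounts}
    awk -v mp="$mountpoint" -v opt="$option" '
        $2 == mp {
            # The last matching entry wins, as it is the most recent mount.
            found = 0
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] == opt) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$table"
}

for dir in /tmp /var/tmp; do
    if has_mount_option "$dir" noexec; then
        echo "$dir: noexec is set"
    else
        echo "$dir: noexec is NOT set (or $dir is not a separate mount)"
    fi
done
```

Bear in mind that noexec only blocks direct execution. A script placed in /tmp can still be run by passing it to an interpreter (for example sh or php), so this is a mitigation rather than a cure.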
Highly Available Mail Cluster – v2
It has been some time since I last blogged about my quest to build a Highly Available Mail Cluster. If you recall, the last HA Mail Cluster architecture that I designed involved four identically configured servers, spread between two data centres utilizing DNS round-robin load balancing. The Maildirs were rsynced from the ‘master node’ to the three slaves every 5 minutes. MySQL master-slave replication was used for user authentication. This worked “well enough” but it wasn’t real High Availability and it meant that, at any given time, three out of four Mail nodes were passive (idle) – wasting resources.
I decided to design a brand new platform to deliver a Highly Available Mail Cluster. The platform consists of two domU nodes in active/active configuration. Each node utilizes a DRBD block device in Primary/Primary mode. OCFS-2 is the clustered file system which sits on top of DRBD. This allows both nodes shared-concurrent access to the Maildir directories.
Both nodes run dovecot for pop3/imap and postfix for smtp. Each node has an equally weighted A/MX record for SMTP / POP3 / IMAP load balancing. The load balancing is still performed by DNS, utilizing two IP addresses in the A record. The beauty of this active/active heartbeat setup is that in the event of a server failure, the IP resource of the failed server will be taken over (via heartbeat) by the other mail server. This means there is virtually zero chance of a user hitting a stale IP address in the DNS A record. I have noticed that when users are constantly checking their inbox (pop3/imap) every 5 minutes or so, Outlook caches the DNS entry indefinitely, regardless of TTL.
The above solution is working very well. I did have a couple of initial concerns regarding the stability of DRBD/OCFS-2 within a Xen domU – but I have had no problems to date. Overall the entire solution appears to be very stable.
The diagram below (servers on the right) shows the architecture. The full size image can also be found at: http://napta2k.googlepages.com/linode-v2.png
The drawback of this solution is that it does not scale past 16 nodes. The scaling limitations are due to how many cluster members can be part of Heartbeat, DRBD and OCFS-2. Thankfully I only host 400 or so mailboxes and cannot see the need to scale any further. One domU handles my current load just fine. The two node active/active could easily be active/passive and work just as well.
If I were going to implement a large single-data centre Highly Available Mail Cluster I would use the following architecture:
1. CARP IP address layer to distribute the load to the TCP load balancers (probably only needed for the largest setups). You may have 16 load balancers on your front line but you definitely do not want to have 16 IP addresses in your DNS A record. CARP hides this.
2. TCP load balancer layer such as HAProxy to load balance the IMAP/POP3 traffic to the IMAP/POP3 farm
3. A farm of dovecot servers all sharing a _resilient_ NFS backend of Maildirs
4. A resilient NFS architecture. This could be part of a SAN (e.g. EMC Celerra) or Linux iSCSI/DRBD.
x. SMTP can be load balanced via MX.
Optimizing VBulletin for a VPS – part 1.5
I have modified my VBulletin config file and enabled the use of APC as a VBulletin datastore. Smokeping is now reporting latency of 90-95ms. Not an immediately noticeable improvement but the average load on the server is 0.00 0.00 0.00 even with 100,000 hits per day. The performance improvements should be more measurable as the load increases.
To configure APC as a VBulletin datastore I simply uncommented the following line from includes/config.php:
$config['Datastore']['class'] = 'vB_Datastore_APC';
Optimizing VBulletin for a VPS – part 1
I run a small-medium VBulletin based web forum that receives a modest 110,000 hits / 2,000 – 3,000 unique visitors per day. I run this forum from an even more modest Xen-based VPS from Linode. Originally the forum started out on a big dedicated machine with 2GB of ram and a beefy processor, running VBulletin / Apache 2 mod_php / FreeBSD. I wasn’t happy with this solution: Apache 2 / mod_php could easily consume 2GB of ram due to the prefork MPM and the notion of running a complete PHP interpreter in each Apache process. I was convinced that I could create a much more efficient platform to host the VBulletin forum.
I decided to move the forum to a more modern, efficient platform consisting of a Xen-based VPS from Linode, Debian GNU/Linux, and lighttpd. Not only is lighttpd measurably faster than Apache, using a VPS allows me to attain higher availability by default since most VPS servers (at least at Linode) are usually of better specification (quad-core, RAID-1, dual PSU) than your low-end single disk dedicated server. Linode also allows me to rapidly deploy VPS instances in different data centres, and create HA/failover solutions. A VPS is also substantially cheaper than a dedicated server. The average dedicated server costs $150-$250 per month – the average VPS costs $20 per month. Win!
Whilst lighttpd and its FastCGI-based PHP happily serve out over 110,000 hits per day, running ApacheBench against the forum revealed that the server would max out at 12 requests per second. Due to the nature of HTTP and Web surfing, however, users do not notice any problems. To gather a better understanding of how long the page took to load I installed smokeping and echoping. Smokeping reported that the forum took 150ms to load. This accounted for the 12 requests per second that ApacheBench reported. Not good… but at least I had a clearer picture of how long things were taking to load.
In an attempt to further optimize the forum I installed the XCache PHP Accelerator. Smokeping showed a measurable improvement of 50ms, taking the forum load time from 150ms to 100ms. Although XCache was working perfectly out of the box I decided to compare it with PHP-APC. After I installed PHP-APC smokeping reported the same drop from 150ms to 100ms, and the APC admin URL reported a 98% cache hit rate within 20 minutes of running, and with a default setting of 30MB of cache. Overall, a satisfactory performance improvement.
Below you can see the two ‘dips’ in the graph where adding a PHP accelerator improved VBulletin performance. The first dip is XCache, the second is APC:

Debian: /var/backups
Deleted your /etc/{passwd,shadow,group} files? No problem! Debian keeps a backup copy of your gshadow, shadow, group and passwd files in /var/backups!
Hurray!
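Here is a rough sketch of what a restore could look like. It assumes the backups carry Debian's usual .bak suffix (passwd.bak, shadow.bak, group.bak, gshadow.bak); the function name restore_account_file and the .pre-restore suffix are made up for this example. The backups are only as fresh as the last run of the daily job that writes them, so accounts created since then would need re-adding.

```shell
#!/bin/sh
# restore_account_file NAME [BACKUP_DIR] [ETC_DIR]
# Copies BACKUP_DIR/NAME.bak over ETC_DIR/NAME, preserving mode and owner.
# The directory arguments exist so the function can be tried on scratch
# directories before touching the real /etc.
restore_account_file() {
    name=$1
    backup_dir=${2:-/var/backups}
    etc_dir=${3:-/etc}
    src="$backup_dir/$name.bak"
    dst="$etc_dir/$name"

    if [ ! -f "$src" ]; then
        echo "no backup found: $src" >&2
        return 1
    fi
    # Keep whatever is currently in place, in case it is worth comparing later.
    if [ -e "$dst" ]; then
        cp -p "$dst" "$dst.pre-restore" || return 1
    fi
    cp -p "$src" "$dst"
}

# Typical use, as root:
#   restore_account_file passwd
#   restore_account_file shadow
#   restore_account_file group
```

Comparing the backup with the live file first (diff /var/backups/passwd.bak /etc/passwd) is a cheap sanity check when the live file still exists.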
De-Toxing day!
Today I decided to go out and purchase a digital photo frame as a parting gift for my friend. I would usually only make such a purchase on the World Wide Web (e.g. Amazon) because of the bigger range and lower prices but I needed this gift for today as my friends are leaving for London tomorrow.
I offered to lend my friends a hand with carrying their luggage to their new apartment in London tomorrow, and as it turns out they booked me a ticket to stay with them all weekend – returning on Sunday. So it looks like I am spending the weekend in London.
For reference, the digital photo frame is a Sony DPF-V700B. It even has a remote and an HDMI connection, woohoo!
” …”
One year finishes as another year readies. Looking back, 2008 has certainly been one of the most important years of my life. I have loved and lost a person who was very important to me, but I have also learned a lot about my character, and who I am.
For me 2008 started with a call (on Skype, no less) from my then girlfriend explaining that she was leaving me for another person. This hit me harder than Thor’s hammer. I then had to go to London for work reasons and spend a week in a hotel alone. Dark times indeed.
As the saying goes, “Time is a (the?) great healer”, and indeed things did get better as time went on. I had been with a partner for such a long time that I had forgotten what it feels like to be single and free. It was such a release to make decisions which are entirely for your own interests, not having to consider your partner.
I realised that Thor’s hammer is indeed all powerful and I was still somewhat filled with darkness and rage. I decided that I needed a way to channel and release my anger and decided to take up running and fitness.
Up until this point I was somewhat unfit, working in IT and never really doing any cardio in this millennium. This was going to be a challenge.
The idea that my running was fueled by rage gave me confidence in what I was doing. Being a sysadmin, and therefore logical, I decided it would be a good idea to first come up with a training plan and a way to monitor my goals and progress. I developed a fitness training schedule and purchased a Nike+ sport wristband to monitor my running. God this felt good. Retaking charge of my own life felt extremely satisfying.
I started out with simple running goals of one mile per session, then two miles, then 5k, then 10k, then 20k, and so on. I decided to focus on consistent mile times rather than all out distance. It was one thing to run a mile in 5:59 flat, but how about running 10 miles at 5:59? A challenge!
I decided that I wanted to incorporate weight training into my exercise plan. This meant that I had to structure my diet to take onboard as much protein as possible. Not wanting to eat a ton of meat every day I decided to purchase Whey Protein powder from Maximuscle. I also decided to lower my intake of carbs (which arguably curbed my ability to run) and instead focus on burning excess body fat.
Within several weeks I felt several times stronger (with visible biceps, rawr) and had a lot less body fat – in preparation for that six pack.
My work was going well too. The company that I work for build and refurbish power turbines. They have done for a very long time. They have standard well-defined design tools and processes. Unfortunately these tools insist on a legacy Unix environment. The engineering departments currently use Solaris 2.6 for production work. As the Unix system administrator it is my job to support this environment, or at least the server side.
At first glance this is a very bad thing from an IT perspective. A reliance on proprietary legacy applications means keeping a legacy operating system which means keeping legacy hardware to run said operating system – eventually all of this will fall down and the company will suffer.
After 1.5 years of arguing with the IT managers, the business, and engineers, trying to make everyone understand the risks and implications of a legacy environment like ours, nobody is budging. I have decided to accept that the people I am lobbying have no interest in shaping the future, or moving forward. I secretly think their strategy is to retire and it won’t be their problem any more.
With all that in mind I did some reasoning and thinking and I decided to do what I can to help preserve this legacy environment for many years to come. First of all, the environment is stable and frozen. The only data to change is the user application data (e.g. CAD files).
Secondly, all hardware, although legacy, is under maintenance so it can be kept running almost indefinitely. Even if the server caught fire, it isn’t impossible to buy legacy Sun servers to reload Solaris 2.6 on.
Thirdly, the entire Unix environment (NIS) is rather cleverly served from the NAS (EMC Celerra). This means that to rebuild the Solaris 2.6 production servers all that is required is an OS reload and the server to be bound to the engineering NIS domain – all data is served automagically via NIS automounts.
Fourthly, I decided to rip out any complicated legacy infrastructure that may hinder rebuilding this legacy environment. Risk notifications were sent out, and disaster recovery plans created.
At this point people will probably be thinking why Solaris Containers or technologies like Transitive weren’t employed. The truth is, they were, and they were deemed unsuitable for the job. The Solaris 2.6 applications did not work entirely in Solaris 8 BrandZ or Solaris 10.
Transitive is basically Solaris 10 anyways, and although fine for running single applications, it proved to be too much of a niche in my absence – why should I be a single point of failure for my company? Although it would keep me in a job I guess.
Fast forward to the end of 2008 and things are great. I have met a special person, and am planning an exciting 2009 in a new location.
Merry Christmas && Happy New Year!
Analyzing Solaris 8 and Virtual Memory behavior
Recently a concerned application administrator asked me why our Hobbit monitoring program was reporting that our Solaris 8 Broadvision portal servers were using a high amount of swap space. I decided to perform a little analysis to get a better understanding of what was happening and, more importantly, ascertain if the swap usage was causing any performance problems.
The server in question is a Sun Fire V440 with 8GB of physmem, 8GB swap and two UltraSPARC III 1.2GHz processors.
The swap(1M) command shows that there is over 6.4GB of swap used. The df(1M) command shows that /tmp (which uses swap for storage) is virtually empty too.
# swap -s
total: 4633816k bytes allocated + 2120280k reserved = 6754096k used, 8108600k available
# df -k /tmp
Filesystem kbytes used avail capacity Mounted on
swap 8110168 3592 8106576 1% /tmp
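To sanity-check the arithmetic in the swap -s line, here is a small sketch that pulls the figures out and converts them to GB. It assumes the exact field layout shown above; swap_summary is just a name I picked.

```shell
#!/bin/sh
# swap_summary: read one line of `swap -s` output on stdin and print the
# allocated + reserved total alongside the used/available figures in GB.
swap_summary() {
    awk '{
        # Fields look like "4633816k"; adding 0 makes awk keep only the
        # leading number and drop the trailing k.
        alloc = $2 + 0
        resv  = $6 + 0
        used  = $9 + 0
        avail = $11 + 0
        printf "allocated + reserved = %d KB (reported used: %d KB)\n", alloc + resv, used
        printf "used: %.2f GB, available: %.2f GB\n", used / 1048576, avail / 1048576
    }'
}

# On the live box this would be:  swap -s | swap_summary
echo "total: 4633816k bytes allocated + 2120280k reserved = 6754096k used, 8108600k available" | swap_summary
```

With the figures above, allocated plus reserved adds up to the reported 6754096 KB, which is roughly 6.44 GB used.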
The prstat(1M) command shows some interesting things:
- All processes are in a sleep state. This means that they are likely to have their Least Recently Used (LRU) memory pages “paged out” to swap. This paging is done by the Solaris page scanner (the pageout daemon); the swapper (sched) only swaps out entire processes when memory stays short.
- Combined, the processes are requesting more memory (13GB, the sum of the SIZE column in the per-user summary) than is physically available (8GB) to the system. Thus pages of the processes’ memory are ‘paged out’ to swap. This would bring the swap usage to around 5-6GB – as evidenced by swap(1M). Remember that in Solaris “virtual memory” is composed of both real RAM and swap, which means this box has 16GB of virtual memory (8GB physmem + 8GB swap).
# prstat -a -s rss
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
14808 bvstg 569M 427M sleep 58 0 0:00.00 0.7% bvsmgr/18
2090 bvqa 822M 357M sleep 58 0 0:25.47 0.0% contentdistribu/5
14734 bvstg 370M 269M sleep 50 0 0:37.24 0.0% bvsmgr/16
14802 bvstg 355M 259M sleep 58 0 0:55.39 0.0% bvsmgr/16
14626 bvstg 346M 252M sleep 58 0 0:00.02 0.0% bvsmgr/16
2123 bvqa 280M 249M sleep 58 0 0:13.25 0.0% procserver/6
328 bvqa 330M 222M sleep 38 0 0:00.37 0.0% bvsmgr/16
317 bvqa 309M 190M sleep 58 0 0:00.01 0.0% bvsmgr/15
12755 bvqaprod 322M 169M sleep 58 0 0:00.02 0.0% bvsmgr/14
28147 bvqa3 315M 133M sleep 59 0 0:00.13 0.0% bvsmgr/14
2016 tomusr 338M 124M sleep 10 0 0:00.03 0.0% java/27
2047 tomusr 332M 117M sleep 58 0 0:00.03 0.0% java/27
2061 tomusr 326M 111M sleep 0 0 0:00.03 0.0% java/26
2029 tomusr 326M 111M sleep 21 0 0:00.03 0.0% java/27
15358 bvstg 148M 98M sleep 58 0 0:00.08 0.8% java/11
28072 bvqa3 288M 70M sleep 58 0 0:00.00 0.0% bvsmgr/13
14350 bvqaprod 83M 65M sleep 58 0 0:00.01 0.0% genericdb/6
15372 bvstg 109M 59M sleep 58 0 0:00.02 0.1% java/12
14433 bvqaprod 285M 58M sleep 58 0 0:00.03 0.0% sched_srv/12
4017 bvqa 281M 54M sleep 58 0 0:00.02 0.0% sched_srv/12
20370 bvstg 281M 52M sleep 58 0 0:00.01 0.0% sched_srv/12
26662 bvstg 92M 48M sleep 53 2 0:00.00 0.0% java/12
2104 bvqa 56M 47M sleep 59 0 0:00.11 0.0% frtsobj/22
14767 bvstg 55M 39M sleep 58 0 0:01.03 0.0% cntdb/7
14641 bvstg 57M 39M sleep 58 0 0:01.02 0.0% cntdb/7
28085 bvqa3 54M 38M sleep 58 0 0:01.20 0.0% cntdb/7
233 bvqa 55M 37M sleep 58 0 0:00.20 0.0% cntdb/7
12767 bvqaprod 54M 36M sleep 58 0 0:00.08 0.0% cntdb/7
477 root 37M 32M sleep 58 0 0:00.00 0.7% aex-pluginmanag/16
20498 bvqa 78M 28M sleep 58 0 0:00.00 0.0% java/12
NPROC USERNAME SIZE RSS MEMORY TIME CPU
71 bvstg 4107M 2015M 25% 1:52.28 0.1%
53 bvqa 2980M 1449M 18% 5:24.40 0.0%
94 bvqa3 4013M 1022M 13% 0:12.24 0.0%
4 tomusr 1323M 463M 5.9% 0:00.12 0.0%
17 bvqaprod 994M 400M 5.0% 0:13.41 0.0%
44 root 194M 122M 1.5% 2:10.51 0.2%
12 bb 13M 11M 0.1% 0:05.20 0.0%
1 daemon 2584K 1776K 0.0% 0:00.00 0.0%
To make sure that these processes were not all sharing a large amount of shared memory (shmsys) the pmap(1M) command was used. The results show that most of the address space is Private.
# pmap -x 14808
14808: bvsmgr p_1221_32 -install_name bvsmgr/sarug31532/ccwstg/BV_SessionMana
Address Kbytes Resident Shared Private Permissions Mapped File
00010000 48 40 40 - read/exec bvsmgr
0002A000 24 24 - 24 read/write/exec bvsmgr
... <snip> ...
FF3E2000 8 8 - 8 read/write/exec ld.so.1
FFBE6000 40 40 - 40 read/write [ stack ]
-------- ------ ------ ------ ------
total Kb 558488 530304 111344 418960
To get a sample of freemem vs. freeswap, sar(1M) was used. The sar(1M) output shows that the number of free memory pages is increasing (pages are being reclaimed).
# sar -r 5 5
21:04:04 freemem freeswap
21:04:09 293980 16213779
21:04:14 294135 16216864
21:04:19 294136 16216864
21:04:24 294136 16216864
21:04:29 294725 16249363
Average 294222 16222736
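The sar figures are not in kilobytes: freemem is counted in pages and freeswap in 512-byte disk blocks. A rough conversion sketch follows, assuming an 8192-byte page (what pagesize(1) normally reports on UltraSPARC; it is worth confirming on the box). The function name sar_r_to_mb is my own.

```shell
#!/bin/sh
# sar_r_to_mb [PAGESIZE_BYTES]
# Read `sar -r` rows ("label freemem freeswap") on stdin and print both
# columns in MB. Header rows are skipped because their columns are not numeric.
sar_r_to_mb() {
    pagesize=${1:-8192}   # assumed page size in bytes
    awk -v ps="$pagesize" '
        $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ {
            # freemem: pages * page size; freeswap: 512-byte blocks
            printf "%s freemem=%.1f MB freeswap=%.1f MB\n", $1, $2 * ps / 1048576, $3 * 512 / 1048576
        }'
}

# On the live box this would be:  sar -r 5 5 | sar_r_to_mb "$(pagesize)"
printf '%s\n' "21:04:04 freemem freeswap" "Average 294222 16222736" | sar_r_to_mb 8192
```

For the Average row this works out to about 2.2GB of free memory and about 7.7GB of free swap, which is in the same region as the free and swap columns that vmstat reports in kilobytes.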
The output of the vmstat(1M) command shows a couple of interesting things:
- No processes are in a RUN/BLOCK/SWAP state (evidenced by prstat(1M) above)
- The reclaim column shows that free memory is being reclaimed
- The scan rate column shows no activity. Presumably because all processes are in sleep state.
- The disk/md3 column shows that no I/O activity is happening on the swap slice
- Low activity is occurring within the minor faults column. A minor fault is one that is satisfied from pages still in memory, without disk I/O, so this suggests that the application is touching pages that had been unmapped but not yet discarded – and this is occurring at a very low rate, probably due to the low overall load of the box.
# vmstat 5 5
procs memory page disk faults cpu
r b w swap free re mf pi po fr de sr m0 m1 m3 m5 in sy cs us sy id
0 0 0 9647152 3462488 161 513 246 3 2 0 0 0 0 0 4 482 482 559 5 3 91
0 0 0 8108784 2353456 37 104 0 0 0 0 0 0 0 0 1 345 5054 6750 0 2 98
0 0 0 8108784 2353456 32 79 0 0 0 0 0 0 0 0 0 357 5050 6793 0 1 99
0 0 0 8108784 2353456 32 79 0 0 0 0 0 5 0 0 0 385 5052 6848 0 1 99
0 0 0 8108784 2353472 93 247 0 0 0 0 0 0 0 0 0 349 6593 6975 1 2 98
Note: vmstat -S was not used.
Summary
Inactive pages are being swapped out because the box does not have enough physical memory to store them all in RAM. Even if there were enough physical memory, inactive pages would still be swapped out as soon as freemem fell below lotsfree. The paging is encouraged by the fact that all the processes are in a sleep state and the load is low.
As long as the load does not suddenly increase, it is not necessary to add physical memory.
How to add a SCSI disk ‘on the fly’ with RedHat Linux
Today I provisioned a new RedHat VMware ESX 2.5 guest at the request of the web hosting department. They requested a standard build which consists of a 9GB disk for the Operating System. They also requested an additional 80GB disk for data. The standard build consists of (only) OS on local disk, with application data on the SAN. This build is platform agnostic and thus applies to both physical servers and VMware guests.
The normal procedure is to attach all of the storage while the VMware guest is powered off, but today I used this as an opportunity to get RedHat to add the 80GB SCSI data disk on the fly.
I powered up the VM and installed the OS according to the build specification. I then attached the 80GB SCSI data disk while the VMware guest was running. VMware reported this operation a success. The next task was to get Red Hat to “re-scan” the SCSI bus and detect the new disk.
The command: echo "- - -" > /sys/class/scsi_host/host0/scan worked without error. I then ran “fdisk -l” and immediately I was presented with a new /dev/sdb disk ready to be configured.
NOTE: The Red Hat Knowledge Base (RHKB) states that the recommended way to add a new SCSI disk under Red Hat Linux is to switch off the server before attaching the new disk. The above procedure was tested on RedHat EL ES 4.6 (Nahant).
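On a box with more than one SCSI controller it is not always obvious which hostN the new disk hangs off. A sketch that rescans every host follows; the three dashes are wildcards for channel, target and LUN. The function name and the optional directory argument are my own additions, and the real thing needs root.

```shell
#!/bin/sh
# rescan_scsi_hosts [SYSFS_DIR]
# Write the "- - -" wildcard (all channels, targets and LUNs) to the scan
# file of every SCSI host found under SYSFS_DIR.
rescan_scsi_hosts() {
    base=${1:-/sys/class/scsi_host}
    for scan in "$base"/host*/scan; do
        [ -e "$scan" ] || continue      # the glob matched nothing
        if echo "- - -" > "$scan"; then
            echo "rescanned ${scan%/scan}"
        else
            echo "could not rescan ${scan%/scan}" >&2
        fi
    done
}

# Typical use, as root:
#   rescan_scsi_hosts
#   fdisk -l
```

The same caveat as in the note above applies: this was only tried on the one release, so treat it as a convenience rather than a supported procedure.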
Building a Highly Available Mail Cluster
Update: I have updated my Highly Available Mail Cluster Architecture.
Check out: http://ajclark.wordpress.com/2009/03/05/highly-available-mail-cluster-v2/
In my spare time I look after a mail cluster for a small-ish hosting / consulting company. The mail infrastructure used to consist of one physical dedicated server which was rented by the consulting company. It was my job to manage this server. The server was basically a FreeBSD box running Plesk 7.3 which worked very well for a year or so.
Unfortunately with the rising costs of energy most dedicated / colocated server ISPs are raising their prices. I know Layered Tech are one such company that did this and angered many of their customers. The consulting company were faced with a decision, either pay the increased ISP prices or migrate to a virtual Xen based platform which was a fraction of the cost and could be made Highly Available.
Design goals
Since this is a new project I felt it was important to have design goals.
- Keep it simple.
- Keep it consistent.
- Use stock software packages – All system builds should adhere to Debian release configurations.
- Platform must scale.
- KEEP IT SIMPLE!
The advantages of a Xen platform
The plan was to purchase four Xen domU “linodes” from the ISP linode.com. Linode are a great ISP who provide Xen (and previously UML) domUs across four different data centres in the US. Their ingenious control panel lets you quickly provision a Xen domU in a data centre of your choosing, and/or clone/migrate existing domUs. The advantages of using a Xen platform:
- Cost saving – Able to purchase four domU linodes for less than the cost of one dedicated server.
- High Availability – Linode assist in HA by placing your domUs on different physical servers. They also allow IP takeover.
- Simple management – Using Debian as opposed to FreeBSD allows us to quickly patch and update the OS.
- Efficient resource usage – A single “dom0” physical server hosting eight domUs is more efficient than a single dedicated server hosting Web / Mail. This also gives us that ‘Green Energy’ feel good factor.
The Mail Cluster Architecture
The Mail Cluster design was based on Linux-HA & Heartbeat. There would be a total of four Xen domUs in two different data centres. In each data centre there would be a primary (active) and secondary (inactive) domU. All domUs would have identical configurations. There would only be one Xen node serving Mail services at any one time. This means that three domUs are effectively inactive until they are required.
Failover Architecture
Should the primary server DC1-SVR1 fail, DC1-SVR2 will instantly take over using Linux Heartbeat. This provides near sub-second failover. Should the entire DC1 go offline, DNS will be manually switched over to DC2-SVR1. Primary DNS is hosted on DC1-SVR1 and slave DNS on DC2-SVR1. All DNS TTLs are 300 seconds.
- DNS is kept in sync using standard master / slave configuration.
- MySQL is kept in sync with master / slave replication (1 master, 3 slaves)
- Web htdocs are kept in sync with rsync at regular intervals.
This architecture gives us protection against both a physical server failure and a complete data centre failure.
The failover architecture was designed on the principle that it is not acceptable to have a multi-hour server outage (e.g. physical server failure) but it is acceptable to have a 10 minute outage while DNS fails over (e.g. complete DC failure)
Mail Cluster Software Components
One of the key goals of the fail over design is to make as many applications as possible use MySQL to store data. This simplifies data synchronization.
- Debian stable
- Apache
- Postfix – Using MySQL for all maps
- Dovecot – Using MySQL Postfix database
- MySQL – Replication enabled. Master: DC1-SVR1, Slaves: DC1-SVR2, DC2-SVR1, DC2-SVR2.
- BIND
- Roundcube mail
- Postfix Admin
- Bindgraph
- Mailgraph
Future improvements to the Mail Cluster
The main thing that bugs me is that there are three domUs that are largely inactive, except for MySQL and htdocs replication. This complies with the number one design goal of keep it simple; however, I can’t help but feel this is wasteful.
I would like to replace the rsync data syncs with iSCSI. Each set of servers could fail over to an iSCSI target within their respective data centres. This would be more efficient than periodic rsyncs from cron.
