Years ago, faced with the need to build and manage a number of content websites, I developed a novel content management system akin to what Joomla or Drupal are today. I've always run my own servers, so I could have chosen any development language. I briefly considered Java. However, suffering from delusions of grandeur, I wanted to make certain that whatever I built was as widely installable as possible. I envisioned building a successful open source platform that I could then build a services company around. Low-cost hosting featuring Apache, PHP and MySQL was abundant. Many people getting into programming were learning PHP. My feeling was PHP would take off, so I decided to adopt that language despite thinking it smelled funny.
I second-guessed myself, as I am wont to do, and feared putting in a huge amount of work only to have the platform it was built upon become non-viable. I knew the majority of the work would actually be in building the forms and views that made up the system. To hedge my bets, I created a pedestrian XML "web component description" language that allowed me to describe the forms, views, validation, and business logic. This isolated me from relying on PHP too much. I called this language the formVista Markup Language or FVML.
Honestly, I always thought the PHP implementation would just be a prototype. My thinking was that eventually I would rewrite the parser in some other, better language because obviously only a madman implements a web development language in PHP.
Reimplementing the parser never happened because, for all the projects that I needed, this codebase ended up being "good enough". I still prefer it to WordPress, Joomla or Drupal, which is why I decided to try to build Miles By Motorcycle on it. (I really tried the alternatives. I even installed WordPress. I did a project in Drupal and even evaluated Diaspora.) In retrospect, I probably should have chosen other technologies and approaches since M-BY-MC doesn't need to be "installable" in the way the code needed to be before. Unfortunately, M-BY-MC highlights just how badly my legacy platform performs when stressed.
Pages are loading wickedly slowly, and most of the time goes to evaluating expressions in my FVML language because PHP's string handling is so slow. I really didn't want to rewrite the expression parser in C++. As a fallback I was going to write a "compiler" for FVML to translate it into static PHP instead of building the parse trees on the fly.
Facebook had an analogous problem. Programmers who only know PHP are relatively inexpensive to hire. Facebook hires bunches of them. But there's a cost in terms of hardware. Facebook has to run more hardware as a result of the inefficient PHP stack. So Facebook embarked on the insanity of building their own complete PHP stack in-house, from the ground up. Originally this took the form of HipHop, a PHP to C++ translator which yielded around a factor of 5 performance improvement. The problem with HipHop was that it was a different development process. You couldn't just edit your files and immediately see the result. You had to compile and re-run.
So, expanding on their insanity because we all know it's simply not possible for any mortal to accomplish, Facebook boldly took the additional step of turning their PHP to C++ translator into a full-fledged PHP to native binary Just In Time (JIT) compiler. This is nuts. It even features an integrated web server. Now the development process is the same as with the original PHP stack. Just edit a file and load the page in your browser.
The original PHP in Apache uses up a bunch of RAM. A separate fork of Apache handles each page request. On each request the PHP interpreter is constructed, the PHP files are loaded and compiled into bytecode, the bytecode is interpreted, output is generated, and PHP is torn down again. You can get it to skip the compiling into bytecode step for some modest performance gain.
HipHopVM is vastly better designed in that it's a multi-threaded long lived single instance. It loads everything once and then just checks to see if it's changed. After running a given PHP file a few times, it then compiles it to native binary code which is directly executable. This code is cached. It's only updated if the source PHP file changes. This alone represents a huge performance gain.
One of the developers told me that because they run the code a few times first, they can do profiling of a sort and, in some if not many instances, generate more optimized binary code than a static compiler can. There were some cases where PHP code run in HipHopVM ran faster than the equivalent C++ code.
We know this is not possible, but they have done it anyway.
Of course, I knew there was simply no way in hell that HipHopVM was going to be able to run my codebase. But I was curious to give it a try to see how badly things broke.
It's a herculean effort to create something with parity to the massive number of extensions written for PHP. However, I was impressed by how many extensions they already supported. Luckily for me, because I was suspicious of PHP, I limited my reliance on extensions whenever possible. This served me well, as HHVM already supports all the extensions I need: mysql, gd, session, pcre, posix, xml, hash, imap, openssl, exif.
I pulled down the binary distribution. I couldn't figure out how to get the rewrite rules working, so I just decided to see if it could run the setup script, parts of which are themselves written in FVML, which means the entire parser and all that support code would get tested. There was no way this was going to work.
Imagine my shock and horror when HipHopVM executed all the pages of my setup script including setting up the databases flawlessly!
Needless to say I was incredibly impressed.
I searched around GitHub and on the web to get more information about the weird configuration file format HipHopVM uses. There wasn't much out there, but with some trial and error I managed to get the public homepage of the site to load and was able to navigate around the site. Impressive.
It wasn't all trouble free. The first problem I ran into was a HipHopVM crash on a certain page. I joined the IRC channel where it was said the developers tend to hang out. I lurked for a while to see how the conversations went. The instructions there are to type hhvm-help: with a question to get their attention.
While I didn't really expect a response, I did give it a try and to my great surprise I was immediately chatting with someone from Facebook who was taking my crash report very seriously.
I have to say everyone I have chatted with over there has been extremely nice, professional, and helpful. Despite being busy and deluged with lots of help requests, they were always pleasant and seemed honestly interested in helping. I appreciate it.
They gave me a few tests to run which required me to pull down the source and go through building the environment. This took quite a bit of time. Then came loading hhvm in GDB and giving them dumps. Interestingly, there are some options which you can provide when building which will enable GDB to display the PHP source line an HHVM crash happens on. You have to configure for debugging using cmake -DCMAKE_BUILD_TYPE=Debug . Then when you run the server add the options -vEval.SyncGdbChunks=1 -vEval.JitNoGdb=false
Dutifully, I went back and reported my findings. We went through a few more iterations and they found the problem. It turns out I was doing something unanticipated but valid in one method, although they did not tell me what, and this was causing the JIT compiler to generate some bad code. Instead of telling me to change my code, they fixed the compiler and gave me a patch.
The second problem I ran into was another crash bug. This one would manifest itself within 3 to 5 page loads, regardless of whether it was a reload of the same page or navigating through the site. Knowing a bit more this time, I was able to figure out that it was happening in my session handler. PHP allows you to override the default session handler to add in your own, which I do.
This bug had already been reported. It was a while ago now, but if I remember correctly it was a repeat of my previous experience. Run some tests and report GDB dumps back. Try a patch. Again, everyone was so nice and helpful. After a few days, this bug was also fixed.
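For anyone who hasn't overridden the session handler before, here is a minimal sketch of the mechanism using PHP's standard SessionHandlerInterface. It is not the formVista handler: the class name ArraySessionHandler is made up, it keeps sessions in a plain array purely for illustration, and the return type declarations assume PHP 7 or later.

```php
<?php
// Illustrative only: a hypothetical in-memory session handler.
// A real handler would persist data to a database, cache, or files.
class ArraySessionHandler implements SessionHandlerInterface
{
    // Session payloads keyed by session id.
    private $store = array();

    public function open($savePath, $sessionName): bool
    {
        return true; // nothing to open for an in-memory store
    }

    public function close(): bool
    {
        return true;
    }

    public function read($id): string
    {
        // PHP expects an empty string when the session does not exist yet.
        return isset($this->store[$id]) ? $this->store[$id] : '';
    }

    public function write($id, $data): bool
    {
        $this->store[$id] = $data;
        return true;
    }

    public function destroy($id): bool
    {
        unset($this->store[$id]);
        return true;
    }

    public function gc($maxLifetime): int
    {
        return 0; // this sketch never expires anything
    }
}
```

Registering a handler like this is a single call to session_set_save_handler($handler, true), made before session_start(). After that, PHP calls these methods itself on every request, which is why a bug in this path shows up after only a handful of page loads.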
HipHopVM currently does not implement this function correctly. It returns garbage results. For this one, since it was clearly a bug (whereas the two others before could easily have been something I was doing wrong), I filed a bug report. https://github.com/facebook/hhvm/issues/1100
One of the guys I had been chatting with in the IRC channel replied saying that this was a low priority for them but they would be grateful if I took a crack at it.
It's been 10 years since I've done any C++ and I haven't worked on a body of code this involved in a long time, but I dove in to see if I could get a handle on the codebase. While it's extremely lacking in comments or documentation, it is pretty well organized and the source files are clean. GDB, of course, doesn't know about the data structures in the code and can only give you basic types back. I asked if there was a way to inspect some of the larger data types and they quickly sent me some Python scripts to source into GDB that allow you to view some of them. GDB has come a long, long way. I eventually found the code that's misbehaving, but not being that familiar with Unicode implementations I figured I'd best leave it alone and just work around it in my code. I may go back to it and see if I can fix it in hhvm.
Note: HipHopVM defines HPHP_VERSION which you can test for in your PHP code for those rare instances where you need to work around some hhvm limitation or difference.
Interestingly, hhvm defines imap_rfc822_parse_adrlist so that if you check for its existence it will say it's there, i.e. function_exists("imap_rfc822_parse_adrlist") returns true. However, if you then call the function it raises a nasty exception because the function is just a stub that has not been filled in yet. So in an email class, I had to check for this and use a PHP implementation of the function instead.
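A rough sketch of that kind of workaround follows. The helper names (running_on_hhvm, fallback_parse_adrlist, parse_address_list) are hypothetical and this is not the code from my email class. The fallback parser is deliberately simplistic: it splits on commas, so a quoted display name containing a comma will trip it up.

```php
<?php
// Hypothetical helper: HHVM defines HPHP_VERSION, stock PHP does not.
function running_on_hhvm()
{
    return defined('HPHP_VERSION');
}

// Simplified pure-PHP stand-in for imap_rfc822_parse_adrlist().
// Returns objects with mailbox, host and (optionally) personal properties.
function fallback_parse_adrlist($list, $defaultHost)
{
    $result = array();
    foreach (explode(',', $list) as $entry) {
        $entry = trim($entry);
        if ($entry === '') {
            continue;
        }
        $personal = null;
        if (preg_match('/^(.*)<([^<>]+)>$/', $entry, $m)) {
            // "Display Name <user@host>" form
            $personal = trim($m[1], " \t\"");
            $address = trim($m[2]);
        } else {
            $address = $entry;
        }
        $parts = explode('@', $address, 2);
        $obj = new stdClass();
        $obj->mailbox = $parts[0];
        $obj->host = isset($parts[1]) ? $parts[1] : $defaultHost;
        if ($personal !== null && $personal !== '') {
            $obj->personal = $personal;
        }
        $result[] = $obj;
    }
    return $result;
}

// Use the native function only where it is known to work.
function parse_address_list($list, $defaultHost)
{
    // function_exists() alone is not enough: under HHVM it reports true
    // even though the function is only a stub.
    if (!running_on_hhvm() && function_exists('imap_rfc822_parse_adrlist')) {
        return imap_rfc822_parse_adrlist($list, $defaultHost);
    }
    return fallback_parse_adrlist($list, $defaultHost);
}
```

Checking HPHP_VERSION before function_exists() is the important bit; the order of the two checks is what keeps the stub from ever being called.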
This one took by far the longest to track down.
The Miles By Motorcycle codebase relies heavily on URL rewriting and this was working perfectly. However, I want to keep the option of running the same site on the old Apache and Zend PHP stack in case I run into problems with HHVM. This means I want to leave all my .htaccess files lying around but don't want hhvm to serve them.
So I thought I could simply write a rewrite rule to present a 404 page when a request is made for a .htaccess file.
This did not work, no matter what I tried. After some investigation, I realized rewriting only works if the source URL does not represent a file in the filesystem. For example, you can write a rewrite rule from image1.gif to image2.gif, but it only takes effect if image1.gif does not exist. If image1.gif exists in the filesystem, the rewriting rules get bypassed and image1.gif is served immediately.
This seemed very strange so I once again joined the developers in IRC and posted the question. They agreed it was likely a bug so I filed an issue. https://github.com/facebook/hhvm/issues/1283
Unfortunately for me, the team is currently focusing on getting FastCGI support ready so they won't have time to focus on this issue but mentioned that if I wanted to submit a patch they'd be happy to review it.
I didn't want to wait several months for FastCGI to work well and I thought if I could just get this issue resolved I'd be much closer to having a fully functional HHVM install. The code is so well organized it took me probably less than 15 minutes to find the offending function and to verify, thanks to a helpful comment, that there was in fact a "fast path" optimization that just automatically serves the file if it's present, bypassing rewriting rules. The developers had asked that any changes I make be governed by an additional configuration option and that the current behavior remain the default. The config file handling is not as straightforward as I would like it to be, but I found that I could easily add an option at the VirtualHost level. I submitted my first pull request. I had called the option StaticFastPath but they pointed out the same problem existed with PHP files, which I had missed. So the option was renamed CheckExistenceBeforeRewrite and it is true by default to maintain the original behavior. The pull request was accepted, so a few lines of code I wrote are in HHVM and I'm on the list of contributors.
Sometimes it's the little things in life, right? If this makes it onto Facebook's servers, the one if statement I wrote, which is invoked in every request, will be the most widely used code fragment I have ever written. Weird and somewhat depressing.
Setting CheckExistenceBeforeRewrite to false gets you the behavior you would expect coming from Apache mod_rewrite.
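To make the difference concrete, here is a toy model of the lookup order as I understand it. This is not HHVM's actual C++ code. The function name, the file list and the rule table are all made up for illustration; only the option name comes from HHVM.

```php
<?php
// Toy model of request resolution, NOT HHVM source.
// $existingFiles stands in for the filesystem, $rewriteRules maps a
// regex pattern to a target path.
function resolve_request($path, array $existingFiles, array $rewriteRules,
                         $checkExistenceBeforeRewrite = true)
{
    // Fast path (the default): an existing file is served as-is and the
    // rewrite rules are never consulted.
    if ($checkExistenceBeforeRewrite && in_array($path, $existingFiles, true)) {
        return $path;
    }

    // Otherwise the first matching rewrite rule wins.
    foreach ($rewriteRules as $pattern => $target) {
        if (preg_match($pattern, $path)) {
            return $target;
        }
    }

    // No rule matched: fall through to the requested path.
    return $path;
}

$files = array('/.htaccess', '/index.php');
$rules = array('#/\.htaccess$#' => '/404.php');

// Default behavior: the .htaccess file exists, so it is served directly.
$served = resolve_request('/.htaccess', $files, $rules);

// With the option off, the rule gets a chance to hide the file.
$hidden = resolve_request('/.htaccess', $files, $rules, false);
```

In this model $served ends up as the .htaccess file itself while $hidden is the 404 page, which is exactly the gap between the default fast path and what someone used to mod_rewrite would expect.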
Once CheckExistenceBeforeRewrite was added to the codebase I could get my rewriting rules working. HHVM's configuration file is in an odd format called hdf, which looks like badly formed JSON. The documentation is horrible and, judging by what I found searching online, quite a few people are running into problems working with it. I suspect it was, for many people, the odd rewrite bypass that was tripping them up. I could be mistaken.
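For readers who have never seen hdf, a minimal fragment showing the general shape might look like the following. This is an illustration only, not my working config: the port, source root, rule name, pattern and target are invented, and I have not verified this exact fragment against a running server. The option is placed at the VirtualHost level as described above.

```
Server {
  Port = 8080
  SourceRoot = /var/www/example
}

VirtualHost {
  default {
    Pattern = .*
    # let rewrite rules run even when the requested file exists
    CheckExistenceBeforeRewrite = false
    RewriteRules {
      hide_htaccess {
        pattern = /\.htaccess$
        to = /404.php
        qsa = false
      }
    }
  }
}
```

Nested braces take the place of sections, and each rule is a named block with its own pattern and target.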
Some things about the configuration file that confused me were:
Here's a sanitized version of the config that I have working on my development machine with some comments:
It is in fact as fast as they say it is. In my testing, pages which load in 5 seconds (Yeah, I know, I said it was slow) now load in 0.6 seconds, roughly eight times faster, without any modifications to my codebase except for the couple of places where I had to work around the issues I've listed above. All told I only modified two functions in my entire codebase. Out of over 500 files filled with classes processing 1200 FVML files, I'd say that's not bad at all.
hhvm also uses significantly less memory than PHP. Based on what memory_get_usage() reports I'd say it's about half as much, but under hhvm memory_get_usage() seems a bit unreliable.
I am not yet running it on this site, so my users will suffer with this slow performance for a bit longer. I'm putting together a new dedicated server to transition the site to. Once I take it live I'll report back but so far absolutely everything on the site works perfectly under hhvm on my development box and it's so much faster.
For more information on HipHopVM check the following links:
If you have any questions about what I've done, would like to point out mistakes or inaccuracies or other issues I ran into, you can use the Contact link above to get in touch with me or follow me on twitter at: https://twitter.com/yermolamers