Objects as keys

I’m going to put to vote soon another of my RFCs, namely one about “objects as keys“. So, I want to outline the case for it here and address some criticisms and questions raised while discussing it.

Why we may want it it?

Traditionally, in PHP array keys could be strings or numbers, and this is deeply linked to how PHP hash tables (which store mostly everything in PHP that is a collection) work. And this was mostly enough until recently, when special type objects started popping up – such as GMP numbers. These work very closely to native numbers – i.e. you can add them, multiply or subtract them, etc. However, though they look like numbers, there are things you can do with numbers that you can not do with them. One of these things is using it as an array index.

There’s more – we have proposals to make UString class to represent Unicode strings. There may be more to represent special types of quantities and strings. It would be only natural for these to be able to do something that numbers and strings can do – namely, be array keys.

How to use it right

This idea is not without dangers, and certainly can be abused (channeling my internal Yogi Berra – especially if you’re using it wrong).

For starters, not every object is good for using as an array key. For example, mutable objects almost never are, since if you put something under key X and then it mutates to represent something different from X – where are you going to find it? PHP also doesn’t have very good means to control mutability, at least not just yet. Also, it may not be good to use complex objects as keys, e.g. a tree having 1000 elements rarely makes a good key since it’s not clear what exactly it even means it being a key, and how to really make it so same trees would produce same key but different ones would produce different ones – scanning every element would be quite expensive.

So this feature is mostly good for value objects – i.e. objects that express some simple and usually immutable value. Of course, there may be scenarios where other objects find it useful, but those would be exceptions from a general rule.

Why not just use __toString()?

Of course, it would be very easy to say “if the object has __toString, just convert it to string and we know how to deal with those”. However, there are two problems with this:

  1. Human-readable representation of an object and value used for keys may not be the same one. As I mention in the RFC, most languages that allow object keys, have separate functions, some for technical reasons (i.e. needing number instead of string), some for semantical. I think both of these reasons are valid – some values that objects represent may be more efficiently represented as numbers, and some the developer may want to be different when expecting the human and the engine to look at it.
  2. Using __toString has the reverse side – if we use __toString, every object that has __toString becomes hash-able. But, as I noted above, we may not want every object to work this way – for some, it just makes no sense and may be even dangerous, but __toString may still make sense. It would be better if whoever designs the class explicitly allows to do this. Of course, they can choose to go back to __toString – this door is always open – but they have an option not to.

Why not just use spl_object_hash()?

This function provides the identity of the specific object – but for value objects, its identity and what it represents may not be the same. I.e. do we want two distinct objects both representing number 42 considered different? Sometimes we may want it, but if those are true value objects, we may not. It may be even more complicated if the objects have some data that changes but still represent same values. The best solution here would be to give control to the developer, since the engine can not know what which part of the object means.

And, of course, the point of not being able to control which object can be used this way is the same as above.

Inherent implementation problems and disadvantages

One point that was raised when discussing it is that the objects do not really become the array keys – instead, we “using” them as the keys, but the value derived from them in developer-specified manner is used. So, if you do a foreach loop over such array later, you do not get the original object back. You could maybe reconstitute it from the derived value, if you chose the key representation cleverly, but it won’t be the original one. It is true, and the reason for that is what I said at the beginning – the implementation of PHP hashes. So if we really want that, we’d need different – potentially slower and more complex, though maybe not – hash tables. For myself, this is a much bigger task that I want to take now, and I am not sure if it will ever happen in PHP at all, as the percentage of cases where we need object support is not big, and messing with the mechanism that is used in literally everything in the engine for that may be not smart. I don’t say it can not be done, I just don’t believe it will be actually done in PHP within any reasonable time.

This is why, while I recognize my solution is not the ideal “objects as keys” solution, and the criticism pointing in this direction is valid, I think it still would be a useful feature and better than not providing any support for it while waiting for something that may never happen.

The name

The proposal names the proposed new magic function as __hash. In this name, __ is a given, but the rest is not. __toKey was proposed, and various others. I personally do not have a strong preference, and would be fine with any logical name. As both __hash and __toKey have its merits, I think the best may be to just have it as a voting option and see which one is the most appealing to the majority (provided, of course, the majority would support the proposal as a whole).

unserialize() and being practical

I have recently revived my “filtered unserialize()” RFC and I plan to put it to vote today. Before I do that, I’d like to outline the arguments on why I think it is a good thing and put it in a somewhat larger context.

It is known that using unserialize() on outside data can lead to trouble unless you are very careful. Which in projects large enough usually means “always”, since practically you rarely can predict all interactions amongst a million lines of code. So, what can we do?

Of course, the first thing would be to never use unserialize() in this context, and this means no problem, right? However, this approach has the following issues:

  1. It goes against what is natural for people (using PHP native serialization mechanisms) to do and what is widely done in the field. Usually when you try to work against what is natural for people to do, it is an uphill battle where losses are much more frequent than wins. Doing the right thing should be easy, and if it is not so, then the chance that right thing is not done raises accordingly. From that perspective, anything that makes doing the right thing easier is a benefit.
  2. There is no other mechanism which matches serialize() by capability but does not have its issues. Yes, I know in many cases data being serialized is simple enough so JSON or something akin to it would suffice. But sometimes it may not, and in that case we need some solution too. Let’s say we said using JSON is a best practice. However, let’s say one finds a rare corner case where it is not enough. What would we offer in that case? If we do not provide any solution, people would do homebrew solutions, and many of these will be done wrong.
  3. Contexts change, and what were internal context before may suddenly become exposed, and then may be in for an expensive refactoring effort if no other solution is available.

So that is why I think we should have a middle ground between “never use unserialize() on external data and if you do, you’re going to hell and we’re not going to talk to a sinner like you until you repent and rewrite all your code” and “let’s rewrite PHP library functions in PHP because that’s what it takes for our code to work”. I think it is a practical solution which allows your code to be more predictable (i.e, less prone to security issues) while allowing you to work with your code as it is and not requiring extensive rewrites.

Is this a security measure? I removed the reference “security” from the RFC title because I think it has lead the discussion in a wrong direction. Yes, it does not provide perfect security, and yes, you should not rely only on that for security. Security, much like ogres and onions, has layers. So this is trying to provide one more layer – in case that is what you need. I think it improves security but I’d much rather concentrate on the useful options that it adds to the programmer’s toolkit than on semantics of the term “security” and its implications.

Static typing

There is some renewed discussion about introducing static typing in PHP. I just read one very interesting post: The Safyness of Static Typing which I suggest everybody that is interested in this topic should read (and the links there). You may agree or disagree, but it is worth reading and even if you disagree it is worth ensuring you know the answers to the questions raised there, otherwise your disagreement lacks substance. I must admit I liked that post because it agreed with my feelings (not substantiated prior to that by any experimental data besides general experience I’ve acquired in the field) that type safety is not as close to silver bullet as some put it.

Within the context of PHP, I’m not sure if more strict typing (coercive typing is something in between and would require a bit different treatment) would be beneficial. I can see where it could be useful – i.e., for making JIT it probably would be very nice. On the other hand, Javascript has excellent JIT engines, as I have heard, without any additions of strict typing, so it’s not absolutely necessary. With PHP code living in runtime and static analysis tools not being routine part of mainstream development, at least as far as I have seen, I’m not sure addition of strict typing would help in any substantial way. Facebook guys, obviously, disagree – I wonder if they have some data to back it up, i.e. how that worked in practice and especially how “hybrid” model – i.e. having typed and untyped code coexist (that as I understand is what is happening, may be I am wrong here) works out and if it indeed provides better safety and reduced development time?

P.S. oh, and if you want a surefire way to annoy me, please call strict typing “type hinting”. I’m sure in the history of PHP there were examples of worse terminology (“safe mode” comes to mind as one) but that does not excuse this most unfortunate decision to name strictly typed arguments “hinting”.

PHP Spec – a dream come true

Almost 8 years ago, I wrote “What is PHP anyway?“. This blog is supposed to be about some long-term dreams, and in this case it was the dream come true – Sara Golemon and the excellent Facebook team made a draft PHP spec and with some paint and polish it can become a real spec pretty soon. Not sure if it can be ready by the 8th anniversary of that post, but it probably will be out by the 9th 🙂

Talking to people, I recently discovered not everybody knows this thing exists. So here it goes – it exists right here. It is still a draft. If you see something wrong, submit a pull request. If you feel you can contribute more by working on it or refining some points, “standards” mailing list was re-purposed to be the working group list.

PHP 5.4 – looking back

With 5.6.0 having been released and 5.4 branch nearing its well-earned retirement in security-fixes-only status I decided to try and revive this blog. As the last post before the long hiatus was about the release of the 5.4, I think it makes sense to look back and see how 5.4 has been doing so far.

Great

Release process. Combined with RFCs and git. It’s hard to believe we used not to have it. RFC process is working great, git makes all the processes tick and we have scheduled releases, working CI setup and much better predictability and management of releases overall. It’s no big deal unless you remember how it was before.

Built-in webserver. It really helps when you can just set up something browseable (is this a word? now it is) with PHP alone, without bothering with Apache setup and other moving parts. This is again a case of something that you don’t realize how much you missed it until you start using it.

$this support in closures. Having to write 5.3-compatible code for the last couple of years, I can’t emphasize enough how sorely it was missing in 5.3. I really regret the fact we could not get it into 5.3.

Syntax sugar like [] and <?= working everywhere. It’s a small thing but it adds up. I usually do not give much weight to saving couple of keystrokes and so on, but these to me really improve coder’s quality of life.

Removal of old “features” (since they ended up in the dump, is it right to call them features anymore?). Nobody is missing the safe mode or magic quotes or register_globals. Good riddance. Wish we parted ways sooner.

Meh

Traits. I must say I haven’t seen big adoption of the traits feature. Yes, of course people use it, there are tutorials, there are articles, etc. But at the same time compared to how much namespaces were needed or how much closures proved to be a great help, traits adoption, IMHO, remains lukewarm at best. To me, it has not lived yet up to its promise. Maybe I’m missing something, tell me if you have great examples there.

Adoption. This is a problem for new PHP versions and for developers of distributable PHP software – PHP versions are becoming “good enough” and people are reluctant to move forward, which also delays adoption of new features by library & packaged software writers. Look at the numbers: almost 3/4 of the PHP developers are using EOLed versions! 5.4 adoption is low at 22% and 5.5 adoption is abysmal. I hope that more streamlined release cycles and heightened attention to BC matters would bend this tendency. But so far it is not encouraging. WordPress numbers look even worse.

Don’t know

callable type. How wide is the usage? How useful it is in practice? Is it being used in major projects? I really don’t know.

Performance. It feels weird to put an obviously great improvement in this category. I would expect performance be a major driver for people to move forward, but the numbers suggest otherwise. As much as I love the performance improvements (a lot!), I really have no idea on how much it influenced the community and made them go to 5.4 (or beyond). Are there any surveys, links, studies, etc. in this regard? I see a lot of talk about performance but how many people also walk the walk? Are the numbers quoted above misleading?

mysqlnd. 5.4 is the first version where mysqlnd is the default mode of doing mysql. It has better performance, great features and I’ve been using it for years without a problem. But how widely it is adopted – do people still prefer the old way or love the new one? Do they use the plugin API widely?

Anything else?

Did I forget to mention something that really made your life better in 5.4? Did I conceal some flop that you wish we didn’t do? Please tell.

Namespaces FAQ

We now have an implementation of namespaces in PHP 6 HEAD, so here’s a short FAQ about how they work for those that are too laz^H^H^Hbusy to read the whole README.namespaces.

Q. Why PHP needs namespaces?
A. Because long names like PEAR_Form_Loader_Validate_Table_Element_Validator_Exception are really tiresome.

Q. What is the main goal of the namespace implementation?
A. To solve the problem above.

Q. What “namespace X::Y::Z” means?
A: 1. All class/function/method names are prefixed with X::Y::Z.
2. All class/function/method names are resolved first against X::Y::Z.

Q. What “import X::Y::Z as Foo” means?
A. Every time there’s Foo as a class/function name or prefix to the name, it really means X::Y::Z

Q. What “import X::Y::Z” means?
A. “import X::Y::Z as Z”, then see above.

Q. What “import Foo” means?
A. Nothing.

Q. What is the scope of namespace and import?
A. Current file.

Q. Can same namespace be used in multiple files?
A. Yes.

Q. Is there any relation between namespaces X::Y::Z and X::Y?
A. Only in programmer’s mind.

Q. How do I import all classes from namespace X::Y::Z into global space?
A. You don’t, since it brings back the global space pollution problem.
Instead, you import X::Y::Z and then prefix your classes with Z::.

Q. But doesn’t it mean I will still have long names?
A. Not longer then three elements: Namespace::Class::Element.

Q. Why it is not implemented like in <insert your favorite language here>?
A. Because PHP is not <insert your favorite language here> 😉

Also we are considering to add one more feature to namespaces – ability to declare a namespaced constant – i.e. constant named Name::Space::NAME – with same resolution rules like classes – with const operator. Consequently it may be also possible to have const NAME = ‘value’ in global context, meaning the same as define(‘NAME’, ‘value’).

Also note namespaces are still work in progress, so it may happen it would be changed a lot when it’s released.