Objects as keys

I’m going to put another of my RFCs to a vote soon, namely the one about “objects as keys”. So I want to outline the case for it here and address some criticisms and questions raised while discussing it.

Why may we want it?

Traditionally, array keys in PHP could be only strings or integers, and this is deeply linked to how PHP hash tables (which store almost everything in PHP that is a collection) work. This was mostly enough until recently, when special-purpose objects started popping up, such as GMP numbers. These behave very much like native numbers – you can add, subtract or multiply them, etc. However, though they look like numbers, there are things you can do with numbers that you can not do with them. One of these things is using them as array keys.

There’s more – we have proposals to add a UString class to represent Unicode strings. There may be more classes to represent special types of quantities and strings. It would be only natural for these to be able to do something that numbers and strings can do – namely, be array keys.
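To make the limitation concrete, here is a minimal sketch. The Quantity class is a hypothetical stand-in for something like a GMP number, used so the example needs no extension. As far as I know, the exact failure mode depends on the PHP version: older versions emit an "Illegal offset type" warning and skip the write, newer ones throw a TypeError.

```php
<?php
// Hypothetical stand-in for a number-like object such as a GMP number.
final class Quantity
{
    public $value;

    public function __construct($value)
    {
        $this->value = $value;
    }
}

$answer = new Quantity(42);
$table = array();

try {
    // Older PHP versions warn and skip the write (hence the @),
    // newer ones throw a TypeError instead.
    @$table[$answer] = 'the answer';
} catch (TypeError $e) {
    // Nothing to do: the object was rejected as a key.
}

// Either way, nothing was stored under the object.
$stored = count($table);
echo "entries stored: $stored\n";
```

Either way the object never makes it into the array, which is the gap the RFC is trying to close.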

How to use it right

This idea is not without dangers, and it certainly can be abused (channeling my inner Yogi Berra – especially if you’re using it wrong).

For starters, not every object is good for using as an array key. For example, mutable objects almost never are: if you put something under key X and then the object mutates to represent something different from X, where are you going to find it? PHP also doesn’t have very good means to control mutability, at least not just yet. Also, it may not be good to use complex objects as keys. A tree with 1000 elements rarely makes a good key, since it’s not clear what it would even mean for it to be a key, or how to ensure that identical trees produce the same key while different trees produce different ones – scanning every element would be quite expensive.

So this feature is mostly good for value objects – i.e. objects that express some simple and usually immutable value. Of course, there may be scenarios where other objects find it useful, but those would be exceptions to the general rule.
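The mutability problem is easy to reproduce even without the RFC. In this sketch, Point is a hypothetical mutable class and toKey() is an illustrative helper called by hand, not an existing PHP API; under the proposal the engine would derive the key for you.

```php
<?php
// Hypothetical mutable class; toKey() is an illustrative helper,
// not an existing PHP API.
class Point
{
    public $x;
    public $y;

    public function __construct($x, $y)
    {
        $this->x = $x;
        $this->y = $y;
    }

    public function toKey()
    {
        return $this->x . ',' . $this->y;
    }
}

$p = new Point(1, 2);
$labels = array();
$labels[$p->toKey()] = 'start';

// Same state, same key: the entry is found.
$foundBefore = isset($labels[$p->toKey()]);

// The object mutates after having been used as a key.
$p->x = 10;

// The lookup now goes to "10,2", while the entry still sits under "1,2".
$foundAfter = isset($labels[$p->toKey()]);

var_dump($foundBefore, $foundAfter);
```

The entry is still in the array, but the object that put it there can no longer find it. An immutable value object can not get into this state.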

Why not just use __toString()?

Of course, it would be very easy to say “if the object has __toString, just convert it to string and we know how to deal with those”. However, there are two problems with this:

  1. The human-readable representation of an object and the value used for keys may not be the same. As I mention in the RFC, most languages that allow object keys have separate functions for the two, some for technical reasons (i.e. needing a number instead of a string), some for semantic ones. I think both of these reasons are valid – some values that objects represent may be more efficiently represented as numbers, and sometimes the developer may want the representation shown to a human to differ from the one used by the engine.
  2. Using __toString has a flip side – if we use __toString, every object that has __toString becomes hashable. But, as I noted above, we may not want every object to work this way – for some, being a key makes no sense and may even be dangerous, while __toString still makes sense. It would be better if whoever designs the class explicitly allows this. Of course, they can choose to fall back to __toString – this door is always open – but they have the option not to.
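A small sketch of the first point, with a hypothetical Money class. keyValue() stands in for the proposed magic method and is called by hand here, since no released PHP version does this automatically.

```php
<?php
// Hypothetical value object. keyValue() stands in for the proposed
// magic method; here it is an ordinary method called by hand.
final class Money
{
    private $cents;

    public function __construct($cents)
    {
        $this->cents = (int) $cents;
    }

    // For humans: formatted, and free to change with presentation needs.
    public function __toString()
    {
        return '$' . number_format($this->cents / 100, 2);
    }

    // For the engine: a compact and stable integer.
    public function keyValue()
    {
        return $this->cents;
    }
}

$price = new Money(4250);

$stock = array();
$stock[$price->keyValue()] = 'widget';

$label = (string) $price;
$keys = array_keys($stock);

echo $label, " is stored under key ", $keys[0], "\n";
```

The display string can later gain a currency code or a locale-specific format without silently changing every key that was derived from it.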

Why not just use spl_object_hash()?

This function provides the identity of a specific object – but for value objects, identity and what the object represents may not be the same. I.e. do we want two distinct objects both representing the number 42 to be considered different? Sometimes we may, but if those are true value objects, we may not. It may be even more complicated if the objects have some data that changes while they still represent the same value. The best solution here would be to give control to the developer, since the engine can not know what each part of the object means.

And, of course, the point about not being able to control which objects can be used this way is the same as above.
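The difference between identity and value can be sketched like this. IntValue and its toKey() method are hypothetical; spl_object_hash() is the real function discussed above.

```php
<?php
// Hypothetical value object; toKey() is an illustrative helper.
final class IntValue
{
    private $value;

    public function __construct($value)
    {
        $this->value = (int) $value;
    }

    public function toKey()
    {
        return $this->value;
    }
}

$a = new IntValue(42);
$b = new IntValue(42);

// Identity: two objects that are alive at the same time get different hashes.
$sameIdentity = spl_object_hash($a) === spl_object_hash($b);

// Value: both represent 42, so a developer-defined key can match.
$sameValue = $a->toKey() === $b->toKey();

$byIdentity = array();
$byIdentity[spl_object_hash($a)] = 'first';
$byIdentity[spl_object_hash($b)] = 'second';

$byValue = array();
$byValue[$a->toKey()] = 'first';
$byValue[$b->toKey()] = 'second'; // overwrites the entry above

var_dump($sameIdentity, $sameValue, count($byIdentity), count($byValue));
```

Which of the two behaviors is right depends on the class, which is why the choice belongs to whoever designs it.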

Inherent implementation problems and disadvantages

One point that was raised in the discussion is that the objects do not really become the array keys – we are “using” them as keys, but what is actually stored is the value derived from them in a developer-specified manner. So, if you do a foreach loop over such an array later, you do not get the original object back. You could maybe reconstitute it from the derived value, if you chose the key representation cleverly, but it won’t be the original one. This is true, and the reason for it is what I said at the beginning – the implementation of PHP hashes. So if we really want that, we’d need different – potentially slower and more complex, though maybe not – hash tables. For me, this is a much bigger task than I want to take on now, and I am not sure it will ever happen in PHP at all, as the percentage of cases where we need object support is not big, and messing with a mechanism that is used in literally everything in the engine may not be smart. I don’t say it can not be done, I just don’t believe it will actually be done in PHP within any reasonable time.
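Here is a sketch of what “reconstituting” could look like. Version, toKey() and fromKey() are hypothetical names, and the key is derived by hand rather than by the engine.

```php
<?php
// Hypothetical value object with a key representation chosen so that
// the object can be rebuilt from it. toKey()/fromKey() are illustrative.
final class Version
{
    public $major;
    public $minor;

    public function __construct($major, $minor)
    {
        $this->major = (int) $major;
        $this->minor = (int) $minor;
    }

    public function toKey()
    {
        return $this->major . '.' . $this->minor;
    }

    public static function fromKey($key)
    {
        list($major, $minor) = explode('.', $key);
        return new self($major, $minor);
    }
}

$original = new Version(5, 6);
$notes = array($original->toKey() => 'variadics, constant expressions');

foreach ($notes as $key => $text) {
    // The array holds the derived value, not the object.
    $keyIsString = is_string($key);
    $rebuilt = Version::fromKey($key);
}

$equalValue = ($rebuilt == $original);  // same class, same property values
$sameObject = ($rebuilt === $original); // but not the original instance

var_dump($keyIsString, $equalValue, $sameObject);
```

For a true value object an equal copy is usually all you need; for anything where identity matters, this is exactly the limitation described above.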

This is why, while I recognize my solution is not the ideal “objects as keys” solution, and the criticism pointing in this direction is valid, I think it still would be a useful feature and better than not providing any support for it while waiting for something that may never happen.

The name

The proposal names the new magic function __hash. In this name, __ is a given, but the rest is not. __toKey was proposed, along with various others. I personally do not have a strong preference, and would be fine with any logical name. As both __hash and __toKey have their merits, I think the best may be to just have it as a voting option and see which one is the most appealing to the majority (provided, of course, the majority supports the proposal as a whole).

Default constructors

Consider the following code:

class Animal {
    protected $what = "nothing";
    function sound() {
        echo get_class($this)." says {$this->what}"; 
    }
}

class Cow extends Animal {
    protected $what = "moo";
    protected $owner;
    public function __construct($owner) {
        $this->owner = $owner;
        // parent::__construct(); (?)
    }
}

$a = new Cow("Old McDonald");
$a->sound();

This code represents a simple class hierarchy. Now let us consider the line marked by (?). Of course we can not call the parent ctor there since we do not have one. But let’s say we refactored the base class and added the parent ctor which does some stuff:

class Animal {
   protected $born;
   public function __construct() {
      $this->born = time();
   }
}

Seemingly, we didn’t do anything wrong here, right? But now our code is broken, since Cow::__construct does not call Animal::__construct. So we have to go to every class extending Animal and fix it. The problem here is that we could not have avoided this – unless we stick an empty ctor into Animal when it doesn’t need one, we can not call it from Animal’s child classes. Sticking an empty ctor into every class in case we ever want to extend it does not sound like a nice idea. Not being able to add a default ctor (i.e. one not needing any parameters) to a base class is also not good.

So what if we make the default ctor always exist? If it’s not defined, calling parent::__construct() would just do exactly nothing. But if we ever implement it, all the child classes will be ready.

In fact, in Java, for example, it is mandatory to call the parent ctor, and if the class has none, the default one is supplied by the language.
PHP does not enforce it, but it is very rarely a good idea not to call it. Right now, PHP does not allow you to do the right thing here, but it should.
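Until something like this exists, a child class that wants to be ready for a future parent ctor has to guard the call itself. This is only a sketch of one possible workaround, reusing the Animal and Cow classes from above; the owner() accessor is added purely for illustration.

```php
<?php
class Animal
{
    protected $what = "nothing";
    // No constructor (yet).
}

class Cow extends Animal
{
    protected $what = "moo";
    protected $owner;

    public function __construct($owner)
    {
        $this->owner = $owner;
        // Guard the call: calling parent::__construct() on a parent
        // that has no constructor is an error at run time.
        if (method_exists(get_parent_class(__CLASS__), '__construct')) {
            parent::__construct();
        }
    }

    // Accessor added only so the example has something to show.
    public function owner()
    {
        return $this->owner;
    }
}

$cow = new Cow("Old McDonald");
echo $cow->owner(), "\n";
```

It works, and it keeps working once Animal grows a ctor, but having to write this boilerplate in every child class is exactly what an always-present default ctor would remove.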

unserialize() and being practical

I have recently revived my “filtered unserialize()” RFC and I plan to put it to vote today. Before I do that, I’d like to outline the arguments on why I think it is a good thing and put it in a somewhat larger context.

It is known that using unserialize() on outside data can lead to trouble unless you are very careful. Which in projects large enough usually means “always”, since practically you rarely can predict all interactions amongst a million lines of code. So, what can we do?

Of course, the first thing would be to never use unserialize() in this context, and this means no problem, right? However, this approach has the following issues:

  1. It goes against what is natural for people to do (using PHP’s native serialization mechanisms) and what is widely done in the field. Usually when you try to work against what is natural for people to do, it is an uphill battle where losses are much more frequent than wins. Doing the right thing should be easy, and if it is not, the chance that the right thing is not done rises accordingly. From that perspective, anything that makes doing the right thing easier is a benefit.
  2. There is no other mechanism which matches serialize() by capability but does not have its issues. Yes, I know in many cases data being serialized is simple enough so JSON or something akin to it would suffice. But sometimes it may not, and in that case we need some solution too. Let’s say we said using JSON is a best practice. However, let’s say one finds a rare corner case where it is not enough. What would we offer in that case? If we do not provide any solution, people would do homebrew solutions, and many of these will be done wrong.
  3. Contexts change, and what was an internal context before may suddenly become exposed, and then you may be in for an expensive refactoring effort if no other solution is available.

So that is why I think we should have a middle ground between “never use unserialize() on external data and if you do, you’re going to hell and we’re not going to talk to a sinner like you until you repent and rewrite all your code” and “let’s rewrite PHP library functions in PHP because that’s what it takes for our code to work”. I think filtered unserialize() is a practical solution that makes your code more predictable (i.e., less prone to security issues) while allowing you to work with your code as it is, without extensive rewrites.
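For illustration, here is roughly what filtering looks like with the options-array form of unserialize() that is available in PHP 7.0 and later, where an allowed_classes entry lists the classes that may be instantiated. The Session and Logger classes are hypothetical.

```php
<?php
// Requires PHP 7.0+, where unserialize() accepts an options array.
// Session and Logger are hypothetical application classes.
class Session
{
    public $user = 'guest';
}

class Logger
{
    public $file = '/tmp/app.log';
}

$payload = serialize(array(new Session(), new Logger()));

// Only Session may be instantiated; objects of any other class are
// left as inert __PHP_Incomplete_Class placeholders.
$data = unserialize($payload, array('allowed_classes' => array('Session')));

$sessionOk = $data[0] instanceof Session;
$loggerBlocked = $data[1] instanceof __PHP_Incomplete_Class;

var_dump($sessionOk, $loggerBlocked);
```

The existing serialized data and the calling code stay as they are; the only change is one extra argument that narrows what the payload is allowed to turn into.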

Is this a security measure? I removed the reference to “security” from the RFC title because I think it has led the discussion in the wrong direction. Yes, it does not provide perfect security, and yes, you should not rely only on that for security. Security, much like ogres and onions, has layers. So this is trying to provide one more layer – in case that is what you need. I think it improves security but I’d much rather concentrate on the useful options that it adds to the programmer’s toolkit than on semantics of the term “security” and its implications.

Static typing

There is some renewed discussion about introducing static typing in PHP. I just read one very interesting post, The Safyness of Static Typing, which I suggest everybody interested in this topic should read (along with the links there). You may agree or disagree, but it is worth reading, and even if you disagree it is worth ensuring you know the answers to the questions raised there, otherwise your disagreement lacks substance. I must admit I liked that post because it agreed with my feelings (not substantiated prior to that by any experimental data besides general experience I’ve acquired in the field) that type safety is not as close to a silver bullet as some put it.

Within the context of PHP, I’m not sure if stricter typing (coercive typing is something in between and would require a somewhat different treatment) would be beneficial. I can see where it could be useful – e.g., for making a JIT it probably would be very nice. On the other hand, JavaScript has excellent JIT engines, as I have heard, without any additions of strict typing, so it’s not absolutely necessary. With PHP code living in runtime and static analysis tools not being a routine part of mainstream development, at least as far as I have seen, I’m not sure the addition of strict typing would help in any substantial way. The Facebook guys, obviously, disagree – I wonder if they have some data to back it up, i.e. how that worked in practice and especially how the “hybrid” model – i.e. having typed and untyped code coexist (which, as I understand, is what is happening; maybe I am wrong here) – works out, and whether it indeed provides better safety and reduced development time.

P.S. oh, and if you want a surefire way to annoy me, please call strict typing “type hinting”. I’m sure in the history of PHP there were examples of worse terminology (“safe mode” comes to mind as one) but that does not excuse this most unfortunate decision to name strictly typed arguments “hinting”.

LinkedInSecurity

This is an uncharacteristically non-PHP post, but I thought it may interest the audience anyway, and this is as good a place as any to have it. So the TLDR of this post is that I’ve recently had an interaction with a certain security issue in LinkedIn, this issue is still there, LinkedIn is not inclined to fix it and you may be affected.

The Story

All names (except, obviously, LinkedIn) in the story have been changed to protect privacy, but refer to real people, entities and events.
The story of discovering this issue began when one morning I woke up and found in my mailbox a message from LinkedIn saying “here’s the link to reset your password”. As I had not reset my password on LinkedIn, I was somewhat surprised, but thought – OK, maybe somebody is trying to play tricks with my account, I’m pretty sure this would go nowhere. Then, as my brain was waking up, I looked at the email closely and discovered two things:

  1. The email did not have my name on it, but the name of my colleague, let’s call him B., at the company, let’s call it Westeros Inc.
  2. The email was not sent to me directly, nor was it sent to B. directly; instead it was sent to an internal company mailing list, goldcloaks@westerosinc.com.

I didn’t know what to make of it but decided that maybe B. had copy-pasted the wrong address into some field in LinkedIn.
Later the same day, talking to B. and other coworkers, I mentioned this event. B. said that he had indeed reset his LI password recently, but he had never added the goldcloaks list to LinkedIn. I started to get suspicious and asked how, then, I had got his password reset email. He didn’t know. So we (myself and B.) did an experiment:

We went to LinkedIn, logged out and clicked “forgot password” on the login form. Then we entered the address goldcloaks@westerosinc.com and in a couple of seconds I got the password reset link, with B.’s name on it. Clicking on that link, I got a form to reset the password (no additional questions like what’s my favorite pokemon) and after another click I got the email saying “B., your password was successfully reset”. I used the new password to log in, but then I was stopped by the two-factor verification. Which means two things: 1) the password change worked, since 2-factor kicks in only when the password is right and 2) B. is a smart man and has protected his account against password thieves. I had to ask him for the code – now that I had his account’s password, this was the only way to give him the control back. After getting the code, I could successfully log into his account and could see all his deepest secrets (which I didn’t) or return the control back to him (which I did). Before that, we verified that goldcloaks@westerosinc.com is indeed in B.’s list of account emails.

Then I decided to see how the goldcloaks list had ended up in B.’s email list. I went to my own email list and, surprisingly, discovered that among my regular emails there was another mailing list, maesters@westerosinc.com, which I definitely had never added there and had zero reason to. I asked other people sitting around in the office to check their lists and they too discovered a couple of extra emails in their profiles, added in some mysterious way.

The Analysis

Based on these discoveries, I have arrived at the following conclusions:

1. There is a way, currently unknown to me, to add a group mailing list to one’s profile on LinkedIn, without their explicit consent (at least without them knowing that this is what they consented to).
2. LinkedIn accepts this group list email, and any non-primary email, as an address to send password reset requests to.
3. Reading emails from this address is the only thing needed to reset the password – even if 2-factor auth is enabled. With 2-factor auth, you will not be able to access the account after the password has been reset (unless you find a way to cheat there, I did not try) but you will be able to reset the password.
4. For the majority of people asked, LinkedIn password emails to goldcloaks@westerosinc.com ended up in a spam folder, which means the victim of the shenanigans may not even notice what happened.

This looked like a security issue, so I have written up the whole story (in a bit less colorful words than here) to security@linkedin.com and went back to work, expecting the email from LinkedIn with heartfelt thanks and promises of speedy fix implementation.

The Security Response

Of course, that is not what happened. Instead, I got an answer from some very helpful individual from frontline support, asking me for “detailed information about your problem and if you think it might help, attach a screenshot, too”. As I had just spent significant time composing a big encrypted email full of details, I was a bit confused as to which details I was missing and where screenshots might be useful, but I did not relent at first and wrote a second explanation of the issue. The response was:

1. LinkedIn support took the extraordinary security measure of logging me out of all my current sessions with LinkedIn.
2. They advised me not to write down my password in publicly accessible places and suggested that if I continue to leave my computer sitting around in public places without logging out, bad things may happen to my account. My sincerest pleas that no such thing ever happened and that the problem I am talking about is not because I forgot my laptop in a pub while being drunk (and so, apparently, did my coworkers) were met with utter disbelief. They also instructed me not to use my LinkedIn password on other sites and gave me a full page of very useful boilerplate password security advice, as prudent as it was unrelated to the case being discussed.
3. They assured me that my account was not compromised (which I never implied) and my password is safe.
4. They assured me that “The only way to add an email into an account is via the settings after logging in.”

By that time I was sure nobody at LinkedIn is going to believe me there’s a problem (beyond my implied propensity to leaving my laptop around and thus letting strangers add emails to my LinkedIn account) so I decided I’ve done my responsible disclosure part and should not spend more time on it. However, then I’ve got another email from LinkedIn stating this:

Sometimes, when a member accepts an invitation to connect that was sent to an email distribution list, that list becomes associated with the member’s account.

Please be assured that no one on the distribution list would be able to use the password reset link to access your account unless they knew both your email address and your password.

The first part, of course, completely belies the claim that “The only way to add an email into an account is via the settings after logging in.”, as apparently the other way is to send an invitation via the email list and have it accepted. The second part, however, can not be true, as a password reset link can not require anybody to know the password – such a link would be completely useless – and they do not even need to know my email, only the list email. But this provided the confirmation and brings us to the conclusion.

The Conclusion

  1. There is, indeed, a way to inject a group email address into your LinkedIn account; LinkedIn knows about it and they don’t see any problem with it. Most probably, this can be done by sending a person an invitation to connect via a mailing list address. You can imagine the social engineering possibilities.
  2. While you can see the target email in the email connect invite from LinkedIn, you can not see it, AFAIK, in the LinkedIn web interface, which makes “group” invite indistinguishable from a regular one.
  3. There is, and probably will be for the foreseeable future, a way for anybody who has access to the group’s emails to use that group email address to reset your password.
  4. LinkedIn knows about the issues outlined above but they do not perceive it as a security issue.

The Advice

So here’s some advice if you have a LinkedIn account:

  1. Enable two-factor on your LinkedIn account NOW.
  2. Check your email list (go to Settings, click on “Account” and then “Add & change email addresses”) and see if you don’t have any unknown emails there. Do that at regular intervals, especially after accepting connections.
  3. Do not accept connections from strangers that you do not recognize. 
  4. Do not expect big companies to have a meaningful way to report a security problem.

And a wishlist for LinkedIn:

  1. Make password reset requests work only with the primary email.
  2. Make associating an email with the account always an explicit action.
  3. Have some way to escalate security issues. 

If you have any additional info or ideas on this topic, please feel free to comment.

PHP Spec – a dream come true

Almost 8 years ago, I wrote “What is PHP anyway?”. This blog is supposed to be about some long-term dreams, and in this case it is a dream come true – Sara Golemon and the excellent Facebook team made a draft PHP spec and with some paint and polish it can become a real spec pretty soon. Not sure if it can be ready by the 8th anniversary of that post, but it probably will be out by the 9th :)

Talking to people, I recently discovered that not everybody knows this thing exists. So here it goes – it exists right here. It is still a draft. If you see something wrong, submit a pull request. If you feel you can contribute more by working on it or refining some points, the “standards” mailing list was re-purposed to be the working group list.

PHP 5.6 – looking forward

Having taken a look at the past, now it’s time to look into the future, namely 5.6 (PHP 7 is the future future, we’ll get there eventually). So I’d like to make some predictions of what would work well and not so well and then see if they make sense in two years or turn out completely wrong.

High impact

I expect those things to be really helpful for people going to PHP 5.6:

Constant expressions – the fact that you could not define const FOO = BAR + 1; was annoying for some for a long time. Now that this is allowed I expect people to start using it with gusto.
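A quick sketch of what this looks like in practice. BAR and FOO echo the names used above; the Limits class is a hypothetical addition to show that class constants get the same treatment.

```php
<?php
// Requires PHP 5.6+. BAR and FOO echo the names used above;
// the Limits class is hypothetical.
const BAR = 1;
const FOO = BAR + 1;

class Limits
{
    const KB = 1024;
    // Constants can now be built from other constants...
    const MB = self::KB * 1024;
    // ...including string concatenation.
    const LABEL = 'max ' . self::MB . ' bytes';
}

echo FOO, "\n";
echo Limits::LABEL, "\n";
```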

Variadics – while one can argue variadics are not strictly necessary, as PHP can already accept variable number of args for every function, if you’re going to 5.6 the added value would be enough so you’d probably end up using them instead of func_get_args and friends.
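To show the added value, here is the same function written both ways. joinOld() and joinNew() are hypothetical examples, not anything from the standard library.

```php
<?php
// Requires PHP 5.6+. Both functions are hypothetical examples.

// Before: the signature hides that extra arguments are accepted.
function joinOld($separator)
{
    $parts = array_slice(func_get_args(), 1);
    return implode($separator, $parts);
}

// With variadics: the signature itself documents the extra arguments.
function joinNew($separator, ...$parts)
{
    return implode($separator, $parts);
}

$old = joinOld('-', 'a', 'b', 'c');
$new = joinNew('-', 'a', 'b', 'c');

// Argument unpacking, also new in 5.6, is the call-side counterpart.
$args = array('x', 'y');
$unpacked = joinNew('+', ...$args);

echo $old, " ", $new, " ", $unpacked, "\n";
```

Both versions return the same result; the difference is that the variadic one needs no slicing and tells the reader (and the tools) what it accepts.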

Operator overloading for extensions – the fact that you can sum GMP numbers with + is great, and I think more extensions like this will show up. E.g., for business apps dealing with money, the ability to work with fractions without precision loss is a must, and right now one has to invent elaborate wrappers to handle it. Having an extension for this would be very nice. Finding a way to transition from integer to GMP when a number becomes too big would be a great thing too.
Still not convinced having it in userspace is a great idea, what C++ did to it is kind of scary.

phpdbg – not having gdb for PHP was for a long time one of the major annoyances. I expect to use it a lot.

Low impact

Function and constant importing – this has been asked for for a long time, but I still have a hard time believing a lot of people will use it, since people who need imports usually are doing things the OO way anyway.

Hurdles

OpenSSL becoming strict with regard to peer verification by default may be a problem, especially for intranet apps running on self-signed certs. While this problem is easily fixable and the argument can be made that it should have been like this from the start – too many migrations go on very different paths depending on if it requires changing code/configs or not.

Adoption – again, with 5.5 adoption being still in single digits, I foresee a very slow adoption for 5.6. I don’t know a cure for “good enough” problem and I can understand people that do not want to move from something that already works, but look at the features! Look at the performance! I really hope people would move forward on this quicker.

While 5.4 will always have a special place in my heart, I hope people now staying on 5.2 and 5.3 would jump directly to 5.6 or at least 5.5. The BC delta in 5.5 and 5.6 is much smaller – I think 5.3->5.4 was the highest hurdle recently, and 5.4 to 5.5 or 5.6 should go much smoother.

Anything you like in PHP 5.6 that I forgot to mention? Anything that you foresee may be a problem for migration? Please add in comments.