2017-10-02

On base knowledge surveys

When I was in college, one of my Research Assistant jobs was to do clerical and basic number-crunching work for base knowledge surveys. It takes scrupulous and expensive controls to keep roughly a third of the responses from being answered randomly for the lulz. Even with the most scrupulous controls and careful interview technique, there is still roughly 5% noise.
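
To get a feel for what even 5% noise does to a headline number, here is a back-of-the-envelope sketch (the 2% "true" rate is a number I made up for illustration):

    # Back-of-the-envelope: why even 5% random noise wrecks headlines.
    # Suppose 2% of people truly hold some fringe belief (invented
    # figure), and 5% of respondents answer a yes/no question at random.
    true_rate = 0.02
    noise = 0.05
    measured = (1 - noise) * true_rate + noise * 0.5
    print(f"measured rate: {measured:.1%}")  # 4.4%, more than double

The survey "finds" more than twice the true rate, and that is the well-behaved case.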

In other words, any newspaper article of the form "ONE THIRD OF LILLIPUTIANS ARE STUPID, ACCORDING TO SURVEY", where the article is just a restatement of the abstract of some random survey paper from some random academic journal (especially if it's a preprint or in an open access journal), is rank bullshit. It is anti-knowledge: anyone who reads it is less informed afterwards than before.

Show me the survey questions, the interview technique, the respondent selection process, the population size, the population demographics, the pre-survey stats oversight board approval, the post-survey stats oversight board signoff, and the raw data, and THEN we will talk.

(My lead researcher when I was an RA sat on several of those Stats Oversight Approval Boards. I got well schooled in several of the ways that a researcher can lie to themselves, knowingly and unknowingly.)

2017-04-17

A theory: Bank of Apple

I have a theory, about Apple.

Apple has a quarter of a trillion dollars.  In cash.

That is a ludicrous amount of money.  That is so much money that it is too much money.  It is too much to deposit as passive cash, because a bank can no longer be a neutral, unbiased third party when there is an account that big.  Not even a large nation's central bank.  When you have that much cash, risks like counterparty risk, fiat currency problems, and government confiscation start to dominate the risk profile.

The way that most very large companies waste very large amounts of cash is to buy other large companies.  This is almost invariably a terrible idea.  The buying company almost always overpays, especially once it starts bidding against someone else.  And companies on sale for a discount are for sale for a discount for a reason.  The claimed "synergies" are almost never realized.  Costs are always higher than expected.  Big mergers and acquisitions are almost always a mechanism by which senior executives burn the investors' money in order to make said executives seem or feel more important.

Steve Jobs never felt the need to do M&A to seem or feel more important, so Apple only did acquisitions to obtain specific skilled teams or specific technologies.  And since his passing, Apple has generally continued this pattern.  (I think the acquisition of Beats was very non-Apple, and was a mistake, and is probably seen as an expensive mistake and an expensive lesson by Apple's current leadership.)

But still, Apple has Too Much Cash.  What to do with it?

When you have a pile of cash that is so large that it in itself starts turning into a local economic distortion, there really is only one profitable thing to do with it:  Wrap a banking license around it, and open a bank.

Think about it.  Apple could run a bank, a very different kind of bank, with a much lower risk of fraud and loss.  They already have secure cryptoprocessors... everywhere!  They can use iOS devices as the secure terminals, both for customers and for merchants.  They can use ApplePay for retail transactions.  They can use what they know about users from their iOS devices and their AppleID for KYC.  They can push secure financial messages around via the iMessage framework.

With all this in place they could undercut all of the existing payment networks and still make, well, bank.

2016-05-24

notes from an opinionated talk about running IPv6 in production

A few years ago, I was at SCaLE, and attended an excellent talk by someone who operated several campus-wide internetworks, about their hard-won experience with IPv6.  They were very opinionated.  I loved it.  Here are some of the notes from that talk:


QoS is a bad word.
Control freaks love QoS.
They can debug it themselves.
People who are held to SLAs while operating production networks have better things to waste their time on, and better ways to crash their switches.

"But I'm not running IPv6!"  That means you actually are, and are no longer in control of your network.
"I will block IPv6!"  Say goodbye to all the grants that pay your salary.  And everyone's desktops and devices will just make tunnels anyway.

Say NAT one more time, I dare you.

If you think that NAT is protecting you, let me know who you are, so I can blackhole your address range and your ISP.

Turning off v4 ICMP is just stupid.
There are lots of stupid people.

You cannot turn off icmp6.
Routers do not fragment packets in v6.
Thus path MTU discovery must work.
Thus icmp6 must be on.
Live with it.

dhcp6 is ports 546/547, not 67/68
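
To make that last note concrete, here is a minimal Python sketch of a DHCPv6 Solicit probe. The ports and the multicast group are straight out of RFC 8415; the interface name is an assumption for your box, and this is a throwaway probe, not a real client:

    import os
    import socket
    import struct

    ALL_DHCP_AGENTS = "ff02::1:2"  # all-DHCP-agents multicast group
    SERVER_PORT = 547              # servers and relays listen here
    CLIENT_PORT = 546              # clients listen here; nowhere near 67/68

    # DHCPv6 Solicit: msg-type 1 + 3-byte transaction id, then a
    # Client Identifier option (code 1) carrying a DUID-LL.
    xid = os.urandom(3)
    duid = struct.pack("!HH", 3, 1) + os.urandom(6)   # DUID-LL, fake MAC
    solicit = b"\x01" + xid + struct.pack("!HH", 1, len(duid)) + duid

    ifindex = socket.if_nametoindex("eth0")  # assumption: adjust for your box
    sock = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
    sock.bind(("::", CLIENT_PORT))           # 546 is privileged; needs root
    sock.sendto(solicit, (ALL_DHCP_AGENTS, SERVER_PORT, 0, ifindex))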

2016-02-13

Regarding that article about gender bias in GitHub Pull Requests

Regarding "Gender Bias In Open Source: Pull Request Acceptance Of Women Vs. Men", or even worse, regarding all the uncritical and breathless articles by the BBC, Vice, HuffPost, and so forth:

First of all, anyone who names their project "DeveloperLiberationFront" and uses an icon of a raised fist in woodcut style, has already predeclared their bias away from objective truth.

Second, the authors of the paper exhibit little knowledge of the large differences in workflow between different projects, and no knowledge of all the different ways that PRs are used or all the different meanings of an "abandoned PR". Their definition of "project insider" is also broken: on many projects, an "insider" has write access and may never use PRs at all.

Third, despite GitHub's growing influence, just grabbing tens of thousands of GH PRs is not in the slightest bit representative.

Fourth, their process for computing the gender of PR authors is laughably bad, for reasons that went on for 3 paragraphs before I edited down this text.

Fifth, how many of you have heard of "p-hacking"? Or have ever actually computed a p-value since that really annoying stats class in college?  Did you even notice that this paper both obviously did p-hacking, and then didn't even report the p-values?

Finally, allow me to present the following disruption to the breathless and self-reinforcing narrative:

"So, let’s review. A non-peer-reviewed paper shows that women get more requests accepted than men. In one subgroup, unblinding gender gives women a bigger advantage; in another subgroup, unblinding gender gives men a bigger advantage. When gender is unblinded, both men and women do worse; it’s unclear if there are statistically significant differences in this regard. Only one of the study’s subgroups showed lower acceptance for women than men, and the size of the difference was 63% vs. 64%, which may or may not be statistically significant. This may or may not be related to the fact, demonstrated in the study, that women propose bigger and less useful changes on average; no attempt was made to control for this. This tiny amount of discrimination against women seems to be mostly from other women, not from men."
// ScottAlexander

If this were a real paper, submitted for real peer review, a good peer review would be:

"1. Report gender-unblinding results for the entire population before you get into the insiders-vs.-outsiders dichotomy.
2. Give all numbers represented on graphs as actual numbers too.
3. Declare how many different subgroup groupings you tried, and do appropriate Bonferroni corrections.
4. Report the magnitude of the male drop vs. the female drop after gender-unblinding, test if they’re different, and report the test results.
5. Add the part about men being harder on men and vice versa, give numbers, and do significance tests.
6. Try to find an explanation for why both groups’ rates dropped with gender-unblinding. If you can’t, at least say so in the Discussion and propose some possibilities.
7. Fix the way you present “Women’s acceptance rates are 71.8% when they use gender neutral profiles, but drop to 62.5% when their gender is identifiable”, at the very least by adding the comparable numbers about the similar drop for men in the same sentence. Otherwise this will be the heading for every single news article about the study and nobody will acknowledge that the drop for men exists at all. This will happen anyway no matter what you do, but at least it won’t be your fault.
8. If possible, control for your finding that women’s changes are larger and less-needed and see how that affects results. If this sounds complicated, I bet you could find people here who are willing to help you.
9. Please release an anonymized version of the data."
// ScottAlexander
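
For a sense of scale on items 3 and 4, here is a quick sketch of the kind of test the reviewer is asking for: a two-proportion z-test on the quoted 63% vs 64% subgroup. The counts are invented, since the paper did not publish them:

    from math import erf, sqrt

    def two_proportion_z(hits_a, n_a, hits_b, n_b):
        """Two-sided two-proportion z-test; returns (z, p-value)."""
        p_a, p_b = hits_a / n_a, hits_b / n_b
        pooled = (hits_a + hits_b) / (n_a + n_b)
        se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        z = (p_a - p_b) / se
        p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
        return z, p_value

    # 63% vs 64%, with a made-up n of 2000 per group:
    z, p = two_proportion_z(1260, 2000, 1280, 2000)
    print(f"z = {z:.2f}, p = {p:.2f}")  # p is ~0.5, nowhere near 0.05

    # Item 3: with k subgroup tests, Bonferroni demands p < alpha/k.
    k, alpha = 8, 0.05   # k is invented; the paper never declared it
    print(f"corrected threshold: {alpha / k:.4f}")

At those (invented but plausible) sample sizes, a 63% vs 64% gap is pure noise, and the multiple-comparisons correction only raises the bar further.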

I am willing to bet money that doing real honest academic statistical analysis of their raw data will invalidate their implications and their claims.

2016-01-22

Why SSH keys don't have metadata

Another tech rant. The following was recently asked in a forum that I read: "Why is it that SSH public keys don’t have an embedded expiration date, anyway? PKI certificates have them."

My response:

Because as soon as you start adding all sorts of metadata to a key, everyone will start adding all sorts of metadata to keys, with all sorts of obscure rules about how the metadata interacts with the environment and with the various implementations to decide whether a key works or not.

And then the lawyers will show up and insist that you embed 30-page PDFs of Word docs of someone’s T&Cs and their contracts of adhesion and their “don't hold anyone with money responsible for anything” disclaimers into the metadata (you think I joke; I do not, at all; this literally regularly happens with “standards based” PKI certs).

And then your keys are going to be huge weirdly encoded binary blobs of shit that you don’t have good tools to manipulate. And you will need to keep special indexes of them, and “bundles” of them, in multiple conflicting filesystem paths and “key stores”.

Part of why SSH took off at all in the first place is that it doesn’t have this complex garbage wankery. An SSH public key is a SINGLE LINE of printable 7-bit ASCII. You can edit and clean up your ~/.ssh/authorized_keys file with a textmode text editor.
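
To make the point concrete, here is a minimal sketch of parsing such a line (ignoring the optional leading options field that authorized_keys also allows). The blob even self-describes: per the SSH wire format, its first field is a length-prefixed copy of the key type:

    import base64
    import struct

    def parse_pubkey_line(line: str):
        """Split an authorized_keys line into (key type, raw blob, comment)."""
        key_type, b64, *rest = line.split(None, 2)
        raw = base64.b64decode(b64)
        # The blob's first field is a length-prefixed copy of the key
        # type, so the text column and the binary blob must agree.
        (n,) = struct.unpack(">I", raw[:4])
        assert raw[4:4 + n].decode("ascii") == key_type
        return key_type, raw, rest[0] if rest else ""

That is the whole "parser". Try writing that in ten lines for an X.509 certificate.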

The lack of metadata in SSH is a feature, not a problem.

2016-01-20

This is how to do it, or waving my cane.

1. Design a data abstraction that solves a class of problems.

2. Design a good wire protocol for that abstraction. (A toy framing sketch follows this list.)

3. Better yet, design 2 protocols: one server-to-server and one client-to-server. Federation is the only model that has ever scaled large enough.

4. Implement an as-simple-as-possible server. Do not try too hard to make it performant; just make it very easy to install and very easy to understand. This is the protocol reference implementation.

5. Implement an open source client library, that completely covers the entire data model and the entire wire protocol.

6. Implement another open source client library, in a very different programming language. If this is difficult, you have let your knowledge of your favorite language overconstrain the wire protocol. Go back to step 2 and fix it.

7. Implement a command line client on top of one of those libraries. Again, it must completely cover the entire data model.

8. Implement an ok GUI app.

9. Implement a very high performance highly scalable server. If you are tempted to change the wire protocol to do this, you screwed up.

10. Now, and only now, do you implement a very nice, easy-to-use GUI. At this point, and at this point only, do you bring in any "designers", "UX" people, or anyone who uses Photoshop as a working tool.
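
As a toy illustration of step 2 (every name here is invented): length-prefixed JSON frames, dumb enough that step 6's second-language client library stays trivial:

    import json
    import socket
    import struct

    def send_frame(sock: socket.socket, msg: dict) -> None:
        # One frame: 4-byte big-endian length, then a JSON body.
        body = json.dumps(msg).encode("utf-8")
        sock.sendall(struct.pack(">I", len(body)) + body)

    def _recv_exact(sock: socket.socket, n: int) -> bytes:
        buf = b""
        while len(buf) < n:
            chunk = sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("peer closed mid-frame")
            buf += chunk
        return buf

    def recv_frame(sock: socket.socket) -> dict:
        (length,) = struct.unpack(">I", _recv_exact(sock, 4))
        return json.loads(_recv_exact(sock, length))

If your protocol needs more cleverness than that to pass step 6, go back to step 2.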


Of course, for the past 15 years, everyone has been doing this backwards, with disastrous results. It takes huge amounts of wasted CPU, and wasted money by the millions and billions, to make all the resulting garbage work at all.

2015-12-11

Idea: RedFish aggregators, and running them on OpenSwitch

Once upon a time, when you needed to "do stuff" to take care of a computer, you had to go there in person.  By "do stuff", I mean things like: turning it off and on, looking to see if the AC was working, whether the tape or disk motors were broken, whether any of the red warning lights were on, whether the UPS had tripped, and so forth.  But, for many and obvious reasons, it was useful to do all this kind of stuff from a distance.

This led to the creation of "IPMI", which was built into most computers that were designed to be used in racks and datacenters.  With IPMI, a team of sysadmins could remotely turn computers on and off, check temperature, fans, power, network carrier, and installed cards and devices, and read off model numbers, part numbers, and serial numbers.

IPMI is currently being improved/replaced with a thing called "RedFish".  RedFish does all the same sorts of things, but it is designed in a way that is called "RESTful", which means it works the same way that web applications work, which makes it a lot easier to write tools that speak it.  Another cool thing about RedFish is that it accidentally also looks like a complete database of a "computer-like thing", and does it in a way that lets "things" be inside "things" and connected to other "things", all within how the protocol works.

And then I had an idea...

Write a web application that scans the local network looking for RedFish servers, and then itself acts like a RedFish server that integrates all these other smaller RedFish servers.
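
Here is a minimal sketch of the discovery half, assuming plain HTTPS probing (the /redfish/v1/ service root path is standard; the subnet sweep, the timeout, and the skipped authentication are all simplifications):

    import ipaddress
    import requests

    def find_redfish(subnet: str, timeout: float = 1.0) -> dict:
        """Probe every host on a subnet for a RedFish service root."""
        found = {}
        for addr in ipaddress.ip_network(subnet).hosts():
            url = f"https://{addr}/redfish/v1/"
            try:
                resp = requests.get(url, timeout=timeout, verify=False)
                doc = resp.json()
                if resp.ok and "RedfishVersion" in doc:
                    found[str(addr)] = doc
            except (requests.RequestException, ValueError):
                continue
        return found

    roots = find_redfish("10.0.0.0/24")  # hypothetical management subnet

The aggregator half would then republish everything it found under its own /redfish/v1/ tree, with each discovered system nested as a member.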

You can even stack this, so that at a higher level, one of these "RedFish aggregators" discovers and integrates the lower-level ones, and so on up.   Eventually you would have a top-level one that would give you all the data and all the control over an entire datacenter, or an even larger set of datacenters.

It wouldn't even be that terribly hard to write a small demonstration implementation.  It would be a challenge to make it fast and efficient, and to properly handle caches and avoid accidental recursion loops, but it doesn't look like a really difficult one.


To use something like this for real, the logical place to put it would be in the network switches.   But that used to be difficult, because production-level network switches have been very closed and proprietary.  However, that's changing.   There is a new open source project spinning up right now, called "OpenSwitch".  If I were to push this RedFish aggregator to be real-world useful, I would make it a module that runs on the reference OpenSwitch box.


How hard could it be?