2012-08-04

Mark's Stories: "The carpets are so clean, we don't need janitors!"

(edit: Hello to the folks who are coming here from YCombinator Hacker News.  Feel free to comment here as well as on YCHN.  I've posted a few more details about this story at YCHN.)

At one company I worked at, one of the problems it didn't have was IT.

When someone was hired, by the time they got to their new desk, there was a computer on it with the correct image on it, their desk phone worked, their email worked, the calendaring and scheduling worked, and all necessary passwords and ACLs were configured.  The internal ethernet networks all worked, were fast, and were properly isolated from each other.  The wall ports were all correctly labeled, and there where the right kinds of wall ports in each cubical and conference room. The presentation projectors and conference room speaker phones all worked. The printers all worked, printed cleanly, were kept stocked, and were consistently named. The internet connections were fast and well managed.  Internal and external security incidents were quickly recognized and dealt with. Broken machines were immediately replaced with working and newly imaged replacements. If someone accidentally deleted a file, getting it back from backup typically took less than an hour. Software updates were announced ahead of time, and usually happened without issue.

The IT staff did not seem noticeably bitter, angry, harried, or otherwise suffering from the emotional costs traditionally endemic to that job role.  In fact, they were almost invisible in their skill and competence.

So, of course, came the day when the senior executives said "the carpets are just naturally clean all the time, we don't need all these janitors!". IT was "reorganized" into a smaller staff of younger and much less experienced (and probably cheaper) people.

Of course, it all went to shit.  New employees would go a week before they had machines, phones, passwords, and ACLs.  Printers ran out of paper, projectors ran out of lightbulbs, servers ran out of storage, networks got misconfigured, and so forth.  The total time lost and wasted across the whole company was most certainly greater than the savings of laying off the expensive and skilled IT staff.

This is not to say that the reorganized IT staff were stupid or lazy.  They worked very hard and ran themselves ragged trying to keep up with the cycle of operations, while trying to skill themselves up in their "spare time" and with a slashed training budget.

The lessons I learned from this experience speak for themselves.

What lessons that may have been learned by any of the other people involved, especially the executives who made these decisions, I cannot say.

2012-08-03

Mark's Stories: The Closed Source Drivers


I was working on a highly constrained consumer electronics device, a little "satellite device" that spoke to the main device over a CATV RF coax cable and also received commands from an IR remote control.  My code was failing in bizarre ways.  I adopted an extremely paranoid defensive programming stance, filling my code with asserts and doing paranoid cross checking of all inputs.  This didn't make the device work.  Instead it now consistently did not work, instead of inconsistently, because the cross checks and asserts would usually (but not always) trip before it would crash. It also started to run out of memory because of the all the paranoia code I had added.

I asked for the source code for the driver for the IR receiver, and for the driver for the  CATV RF digital transceiver, and for the peer code that was driving the cable digital that ran on the main device.

The driver for the CATF RF digital transceiver was handed to me the first time I asked.  And by "handed to me" I mean that I was pointed to where it was sitting in the source repo.


The business partner / hardware supplier who was supplying the IR glue and drivers, after giving me a runaround, finally just flat out refused, citing trade secrets, confidentiality, secret sauce, and similar bullshit.

So, I finally "stole" the source code with a disassembler.  And found the sources of many of my problems.  It was complete shit.  "Unexpected" input from the silicon would cause wild random pointer writes.  And random sunlight on the receiver optics would cause it.  "Expected" input of undefined remote commands wasn't much better, generating and handing back blocks of garbage with incorrect block length headers.

I ended up writing, nearly from scratch, a replacement IR receiver driver.


The peer device driver code was written by a developer in a different group in my same company.  I finally got the P4 ACLs to read it after loudly escalating, over the objections of it's developer and his group manager.  It was also complete shit.  I cannot even begin to remember everything that was wrong with it, but I not only figured out may of the sources of my own pain, I also found a significant source of crash and lockup bugs that afflicted the main device.

I was not allowed to rewrite the peer code, as it was not in my remit.  However, I was able to sneak in and check in a large number of asserts using the excuse that they were "inline documentation".


On, and the device driver for the CATF RF digital transceiver?  The source code I got for the asking, without a fight?  When I reviewed it, it was easy to understand, efficient, elegant, and as far as I could tell, bug free.


In the end, I made my part work.  It just took over two months instead of the original guesstimate of less than two weeks.  This caused a schedule slip in the release of the satellite box.  Which would have been a more serious problem, except…


Except there was also major schedule slip for the main box.  A significant reason for that slip was because the peer code that I had filled with asserts was now itself crashing with assertion failures instead of emitting garbage to crash my code.)  I was lucky that I was not more officially "blamed" for that.  The reason why I wasn't, was mainly because the people who understood what I did understood the problem, and the executives who didn't understand what the problem was were also too clueless to blame anyone, let alone me.


My lesson learned from this experience is: if someone is refusing to show the source to suspect driver code, citing trade secrets, confidentiality, secret sauce, partnership agreements, or similar excuses, it's not because they are protecting their magic.  It's because they have screwed up, and they are trying to hide it.

A second rule of thumb I have is: source control systems that have complex read ACLs, e.g. don't allow any arbitrary developer to check out and review any arbitrary source code file, are expressions of moral failure.