2010-03-31

Just pump IP datagrams

Whenever someone designs a fancy complex network stack, the first thing any buyer/user wants to do is run the internet over it.

Said designer will then, grudgingly, write an appendix standard, usually calling it something like "legacy protocol support".

And that is the only part that anyone actually uses.

And then the people paying the bill for the network notice all the processing and bits-on-the-wire overhead they are paying for but never use, and demand that layers of complexity and cost be peeled off and thrown away, so that carrying IP datagrams becomes cheaper and faster.

This gets iterated down to the underlying real physical wire.

For example:

  1. "IP over PPP over V.42bis over dialup POTS" and "IP over ISDN", to
  2. "IP over SS7 T1/E1", to
  3. "IP over ATM", to
  4. "IP over SONET", and now
  5. "IP over WDM" and "IP over lightwave"

Quit trying. Just give up, and cut to the chase. Whenever you have an idea for pumping bits over any distance, whether as a technology startup, an industry consortium, a standards body, or even a maker/hacker playing around with open source hardware and unlicensed spectrum, just set it up to carry IPv4 and IPv6 datagrams as close as possible to the real underlying carrier (using PPP if really necessary), and then stop.

The only possible sane exception would be adding FEC (forward error correction), plus detecting garbling of the carrier or the bits and retransmitting (like what 802.11 does).

But, don't waste your time and anyone else's money on anything more.
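To put a rough number on the bits-on-the-wire overhead mentioned above, here is a small sketch of the "cell tax" for the "IP over ATM" step. It assumes AAL5 with VC multiplexing (no LLC/SNAP header), 53-byte cells carrying 48 bytes of payload, and an 8-byte AAL5 trailer. It ignores whatever framing sits underneath the cells, and the function names are mine, just for illustration.

```python
ATM_CELL_BYTES = 53      # 5-byte cell header + 48-byte payload
ATM_PAYLOAD_BYTES = 48
AAL5_TRAILER_BYTES = 8   # appended to the datagram before padding to whole cells


def atm_wire_bytes(datagram_bytes):
    """Bytes on the wire for one IP datagram over AAL5 (assumed VC-mux)."""
    needed = datagram_bytes + AAL5_TRAILER_BYTES
    cells = -(-needed // ATM_PAYLOAD_BYTES)  # integer ceiling division
    return cells * ATM_CELL_BYTES


def efficiency(datagram_bytes):
    """Fraction of the wire bytes that are the IP datagram itself."""
    return datagram_bytes / atm_wire_bytes(datagram_bytes)


if __name__ == "__main__":
    for size in (40, 576, 1500):
        print(size, atm_wire_bytes(size), round(efficiency(size), 3))
```

Under these assumptions a 40-byte datagram just fits in one cell, while a 41-byte one needs two. That is the kind of overhead the people paying the bill eventually notice.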

2010-03-16

Reacting to "Memcached is not a store"

I keep seeing "Memcached is not a key value store. It is a cache. Hence the name." This is strongly reinforced by statements made in the memcached mailing list itself.

This is short sighted.

Memcached is a number of things. It is an idea (fast key value store, with distributed hash function scaling), it is a network protocol (two of them, in fact), it is a selection of client libraries and APIs (most based on libmemcached), and it is a server implementation. In fact, it is now a number of server implementations, because there are now a number of different things that implement the memcached protocol.

Only one of which is the open source community edition of the memcached server, version 1.4, downloadable from http://memcached.org/
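The "distributed hash function scaling" part of the idea lives in the client: it hashes each key to decide which server owns it. A minimal sketch, assuming plain modulo hashing and a made-up server list (the addresses and the `pick_server` name are hypothetical, not any particular library's API):

```python
import hashlib
from typing import Sequence


def pick_server(key: str, servers: Sequence[str]) -> str:
    """Map a key to one server by hashing the key (simple modulo scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(servers)
    return servers[index]


# Hypothetical server pool, for illustration only.
SERVERS = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]

if __name__ == "__main__":
    for key in ("user:42", "session:abc", "page:/home"):
        print(key, "->", pick_server(key, SERVERS))
```

Modulo hashing remaps most keys when the server list changes, which is why real clients often offer a consistent hashing mode instead. The sketch only shows the basic idea.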

Despite what you may get told, especially on the memcached mailing list, you can in fact use memcached as a store, not just as a cache.

Yes, if you fill it up, it may either error or evict something, and you have to live with that. But if you fill up a MySQL tablespace disk, it crashes or errors on you too. Yes, it's not persistent. But, frankly, there is a lot of data that doesn't HAVE to be persistent, or at least doesn't have to persist in that format. MySQL, Postgres, Oracle, and MSSQL have temp tables and memory tables, and people are ok with that.
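The "error or evict" choice is easy to model. Below is a toy sketch, not how any real memcached server is implemented: a bounded in-memory store that, when full, either drops the least recently used item or raises an error. The class and exception names are made up. As far as I know, the stock server's `-M` switch selects the "return an error instead of evicting" behavior.

```python
from collections import OrderedDict


class StoreFull(Exception):
    """Raised when the store is full and eviction is turned off."""


class BoundedStore:
    """Toy key value store with a fixed item limit (illustration only)."""

    def __init__(self, max_items, evict=True):
        self.max_items = max_items
        self.evict = evict
        self._items = OrderedDict()  # oldest access first, newest last

    def set(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        elif len(self._items) >= self.max_items:
            if not self.evict:
                raise StoreFull(key)
            self._items.popitem(last=False)  # drop the least recently used item
        self._items[key] = value

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # a read counts as a use
        return self._items[key]
```

Either policy is something an application can plan around, which is the point: know which one you are running with.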

But it's not even "not persistent". You can get various kinds of "persistent memcached" by using Gear6 which can persist to flash, or by using Tokyo Cabinet, which has a memcached interface. There are a number of other stores that are adding memcached interfaces. I have seen people working to use libmemcached to add memcache "memcapable" interfaces to NDB cluster, to "embedded" InnoDB, and even to the filesystem. And more are coming.

I've also seen work to do crazy stuff like put a full relational SQL interface on top of memcached, or a simple "cut down" SQL that just does primary key inserts and lookups for the "e-blob pattern", or the NDB Cluster API on top of it. It's all good.

So when someone sneers "memcache is a cache, not a store", don't be discouraged. They have reasons to say that, but IMO, they are not very good reasons.

Use it as a store, just know what it does and doesn't do. Just like when you choose any other data store.

Attending MySQLcon, round 4

I will be at the MySQL Conference in Santa Clara in a few weeks in the middle of April. This will be my fourth time there. Gear6 is going to be there as a sponsor, and some of my coworkers will be presenting, and I will probably share some stage time with them as they do.

My first time, was my first time. I was still working solo, as an independent. I met many MySQL-sphere people who are now personal friends, as well as being industry peers and contacts. I gave the talk that "put me on the map", for a project that was my opportunity to learn about distributed version control, cloud computing, Amazon Web Services, MySQL internals, and MySQL storage engines. I would look at the presentation schedule, and usually not even understand the titles.

My second year, I was working as part of MySQL Professional Services. My presentation was about MySQL User Defined Functions. The major buzz was "What will Sun do?". The technology arising that I was the most aware of was Memcached. One of the first memcached hackathons happened. The business thing I became aware of was Rackspace's gearing up to enter the cloud computing space. I learned later that the seeds of Drizzle DB germinated there. I would look at the presentation titles, and wish I could attend more presentations, because I could learn from almost all of them.

My third year, last year, I was working on Sun Cloud. The major buzz was "What will Oracle do?" and about the separate Percona conference and similar political/personal stuff. The tech buzz was about Drizzle DB. I would look at the presentation titles, and realize I could teach most of the presentations myself, and so focused mainly on networking and on the "hallway track".

It will be fun to see what this year brings.

So tired of MySQL's old "GPLed network protocol" FUD.

Over in the blog post High Availability MySQL: Can a Protocol be GPL? Mark Callaghan found the following comment in the source file sql/net_serv.c:

This file is the net layer API for the MySQL client/server protocol, which is a tightly coupled, proprietary protocol owned by MySQL AB. Any re-implementations of this protocol must also be under GPL, unless one has got an license from MySQL AB stating otherwise.

I am second to few in being a fan and a proponent of the GPL.  However, the claim in this comment is utterly bogus. It dates back to willful misunderstanding and FUD spreading and the desire to strong-arm the sale of licenses on the part of some people at the old MySQL AB company.

If the license of source code followed the network protocol of a running implementation (which stretches "derived work" to an extent that even an RIAA flack would not dare dream of), then it would be illegal to use Firefox to view a web page served from IIS.

And besides, the MySQL client server protocol is not even "owned" by MySQL AB.  It is a direct, and not very large, extension of the old mSQL client server protocol.

To moot the entire issue, the new libdrizzle client protocol library not only speaks the new different and superior Drizzle client/server protocol, it also speaks the old MySQL protocol. The license for libdrizzle is BSD.

This whole thing makes me annoyed, because this kind of overreach and FUD just makes the GPL itself look bad.

Why companies need to care about "social media"

Part of my day job now is to keep up with the "social media" stuff for my employer and for the technologies we play in.  And while I know it's important, and that it's part of my job, I've often been really annoyed by the cohorts of nearly useless "social media experts" who say very little and charge huge bills for knowing how to search twitter.

But today, on Chris Messina's Buzz about "The Browser as Social Agent, part 3", I saw a passage that states simply and directly why corporations and organizations, especially non-tech companies, should care about social media stuff:

I noticed a sign on the wall that I’d not seen before, providing links to that local Whole Foods’ Twitter and Facebook pages. It struck me as rather strange that a company like Whole Foods would promote their profiles on networks owned by other companies until I got out of my tech bubble mindset for a moment and realized how irrelevant Whole Foods’ homepage must seem to people who are now used to following friends’ and celebrities’ activities on sites like Twitter and Facebook. What are you supposed to do with a link to a homepage these days? Bookmark it? — only to lose it among the thousands of other bookmarks you already forgot about?

2010-03-13

Snuggies, Netbooks, and NoSQL

A few days ago, a friend of mine sent me an Instructables link on modifying  a Snuggie to make it more usable with a netbook.  I don't like their approach, which consists of cutting a hole in the Snuggie so the netbook can rest on your legs and not slide down.  A better approach would be to sew or glue a grippy material on the right spot for the netbook.

And then we started rapping on how to modify an array of Snuggies and Netbooks to provide a NoSQL service:

First, get a bunch of netbooks, then glue them to the Snuggies.  Add an extra hole for each netbook so you don't have to get your hands cold while you attempt to turn them all into servers. Then you cut a hole for the distributed hash table to get through so your arms don't get cold, but conveniently, you don't have to build an extra tube for every table's relational guarantees because nosql does away with that silliness. Gone are the hours spent hand-sewing each tube on for each netbook, which makes this system way more scalable for your average snuggie, though results may vary, and you can sew the extra tubes on if you want.  Then embroider on your associative arrays or key-value pairs, whichever you want to use. Oh, and remember to sew with scant quarter-inch seams - don't waste extra fabric, quarter-inch seams scale better anyway...

Revisiting and defending my Tweets from NoSQL Live in Boston

Some people have been confused by my Twitter stream from last week's NoSQL Live in Boston conference.  I've never been very good at staying "on message", or adhering to a set of "talking points".  Some people thought I was there to "defend the status quo", or to defend and promote memcached, or memcached+mysql.

True, I was there in part to teach about and promote memcached, and especially Gear6 Memcached.  I have a great employer, but they are not sending me to conferences because they are a charity.  I taught the memcached  breakout session, and also worked the "hallway track", giving a number of people a crash course in what memcache is and what makes it useful, and handed out a pile of business cards.

As for my tweets and pithy statements... Well, some over-simplification has to happen to reduce a concept to 140 characters.

My statement, "NoSQL as cheaper to deploy, manage, and maintain is a myth. it costs just as much, if not more", which I said into the mike at the start of the scaling panel, was very popular to tweet and retweet.

It's something that is a "Jedi Truth", it's true from a "certain point of view".  When you add together the costs of the much shallower "bench" of hirable operational experience, the increased "minimum useful hardware footprint"  (which seems to be about at least 5 full storage nodes), and the evolving maturity of the client libraries, and such, NoSQL is not going to save you money.  Until an important threshold is reached.  When you scale and/or your data representation "impedance mismatch" hits that nasty inflection point, where the "buck for bang" curve suddenly starts to rise hard, and it looks like you will need to start spending infinite amounts of money to keep growing.  Then the NoSQL approach does become cheaper, because it's actually doable.  And it's probably wise to start considering, researching, and then migrating to NoSQL before you hit that wall.

My statement "people have been wanting DBA-less databases about as long as they have wanted programmer-less programming languages" was also popular.  I stand by it.  NoSQL doesn't crack this nut, and nothing ever will.  Some NoSQL solutions look like they are "DBA-less", such as AWS SDB, AWS RDS, FluidDB.  Those systems are not DBA-less. They have DBAs, it's just that the cost of the DBA is "hidden" in the per-drink rental cost of those systems, instead of sitting on your balance sheet as a salary.

The statement "Twitter is using Cassandra because bursty writes are cheap, compared to others" is something I said not because I knew it, but because I just learned it, and I was a bit surprised by it.  I think that the original statement was by Ryan King of Twitter, who was also on the scaling panel.

My statement "Memcached should be integrated into all NoSQL stores" is something I also firmly believe.  The very-high-performance in-memory distributed key value store is a very useful building block for larger systems, and I think that whatever larger NoSQL systems we end up with will use it as a component of their internal implementations.

The statement "being able to drop nodes as important as being able to add, because scalability is pointless w/o reliability" was also by Ryan King.  I tweeted it because it is very much something worth broadcasting and remembering.  It has a little more context in his next statement "The first day we stood up our #cassandra cluster, 2 nodes had hdd die... Clients never noticed."  Machines fail.  And as they get faster and cheaper, and as clusters get bigger, machine failure must stop being any sort of emergency.

My statement "open source means folks dont need a standards body" was an extreme simplification of part of Sandro Hawke's talk.  I tweeted it because it was something I've felt to be mostly true enough for a long time, and it was nice to see someone else recognize it.  As Sandro stated later in twitter, "I think I added an important "sometimes"!".  And he is correct.  It's not true as an absolute statement.



All in all, NoSQL Live was a very good conference.  I felt that the speakers taught and learned, all the other attendees taught and learned, and the networking and hallway track was first rate.  Thanks to 10gen for organizing it, and being entirely fair to their "competition" in the rapidly growing and evolving NoSQL space.

2010-03-09

Social location services need to have an XMPP IM interface

If there is any social application that needs "the real time web", it is location sharing, and yet nobody is getting it right.


Let's look at Foursquare.

If I have it write to Twitter, then all my followers, independent of their location or interest, get spammed with my location.  (Which is why I've stopped having it do so by default, and instead just push to Twitter for specific cases).


And yet if I don't have it write to Twitter, then people have to poll the website or API, to see where I have been, but they don't get alerted to where I actually am.

Neither of these is good or useful, especially if one has a lot of microblog followers, and more than a very few followers and followees on 4sq.

Spam is bad, and polling is bad.

A better way would be for me to add "updates@foursquare.com" to my chat roster.

And then when someone I follow checks in, and they are near me, I would get an IM from that address, saying "Bob Smith is at Joe's Bar".

And for me to check in, I send a message to updates@foursquare.com, saying either "checkin" or "checkin Remedy" or "checkin 32417", and it will use geo IP lookups, XEP-0080 data, FireEagle data, Google Latitude data, my past checkin history, the checkin string hint, or the actual venue id, to figure out that I am at http://foursquare.com/venue/32417
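The three message forms above are simple enough to parse in a few lines. A minimal sketch of what the bot side might do with an incoming IM body; the function name and the returned dict shape are my own invention, and the actual venue lookup (geo IP, XEP-0080, history, and so on) is left out entirely:

```python
def parse_checkin(message):
    """Parse 'checkin', 'checkin <hint>' or 'checkin <venue id>'.

    Returns None if the message is not a checkin command, otherwise a dict
    with an optional numeric venue_id and an optional free-text hint.
    """
    words = message.strip().split(None, 1)
    if not words or words[0].lower() != "checkin":
        return None
    if len(words) == 1:
        # Bare "checkin": the service has to guess from location data alone.
        return {"venue_id": None, "hint": None}
    arg = words[1].strip()
    if arg.isdigit():
        return {"venue_id": int(arg), "hint": None}
    return {"venue_id": None, "hint": arg}


if __name__ == "__main__":
    for text in ("checkin", "checkin Remedy", "checkin 32417"):
        print(repr(text), "->", parse_checkin(text))
```

Everything interesting happens after the parse, when the service combines the hint with whatever location data it has. But the user-facing protocol really can be this small.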

People who want to see where I have been can just poll the RSS/Atom feed of my checkin history.  If that Atom feed implements PubSubHubbub, they don't even need to poll it.

A desire for an even more open and even more public geolocation database

The world needs a better well-known geolocation place database.


The idea/vision I'm having right now looks like this:

The raw data is sitting in FluidDB or AWS S3 or something like it, i.e. a publicly readable and updatable key value store.

In it there could be stored the wardriving results of wireless network AP MAC addresses, tagged with lat, lon, alt, signal strength, time-seen, etc.

In it there could be stored every cellphone tower everywhere, along with lat, lon, alt, tower id, carrier, technology, addressing, etc

In it there could be every possible geographic outline cluster of interest, by some sort of unique id, and tagged with its name, kind, outline, list of parent containers, etc.

Kinds would be things like "nation", "state", "province", "region", "metro area", "city", "neighborhood", "park", etc.

It would have both "official" things like nations and states, and less official things like "metro areas" and "neighborhoods".  Flickr/Yahoo has been doing some really cool stuff with tag+geo analysis to crowdsource things like "what is the area that people on the ground actually call 'Paris', which is rather different than the official border of the City Of Paris".

In addition to all this raw data, a data service that speaks REST and speaks XMPP (XEP-0080 etc etc) can sit on top of it.

A client could say "I have this raw data from my GPS, I see the following list of APs with the following strengths, I am associated with this one, I see the following list of cellphone towers at the following signal strengths, and I'm associated with this one, and the last time I saw these APs and towers, I turned out to be at one of the following location IDs".

The service would send back "You are at [place Remedy Tea; neighborhood Capitol Hill; city Seattle; county King; state WA; nation US], your exact lat/lon/alt is foo, and the id for that specific location is foo".
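A minimal sketch of how the place records and that kind of answer could fit together. It assumes each record has an id, a name, a kind, and a single parent container; the idea above allows a list of parents, which I've simplified here. The ids and the `describe` helper are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Place:
    place_id: str
    name: str
    kind: str
    parent_id: Optional[str] = None  # simplification: one parent, not a list


# Hypothetical records; a real store would hold these as key value entries.
PLACES = {p.place_id: p for p in [
    Place("n1", "US", "nation"),
    Place("s1", "WA", "state", "n1"),
    Place("c1", "King", "county", "s1"),
    Place("t1", "Seattle", "city", "c1"),
    Place("h1", "Capitol Hill", "neighborhood", "t1"),
    Place("p1", "Remedy Tea", "place", "h1"),
]}


def describe(place_id, places=PLACES):
    """Walk up the parent chain and build a 'kind name; kind name' string."""
    parts = []
    current = places.get(place_id)
    while current is not None:
        parts.append(f"{current.kind} {current.name}")
        current = places.get(current.parent_id) if current.parent_id else None
    return "; ".join(parts)


if __name__ == "__main__":
    print(describe("p1"))
```

The hard part, of course, is going from raw GPS, AP, and tower observations to the right place id in the first place. The sketch starts after that step.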

And then the client could say "tell me more about location id foo" and the service could send back stuff like the OSM data, FireEagle data, the Google Places info, the Yelp id, the Foursquare id, the Brightkite id, the postal address, the phone number, etc etc.  Whatever it thinks it knows about that location.

What all this needs is to be as open and public as possible.  Which is why I suggested something like FluidDB.

Twitter geoloc ideas



The Twitter microblogging service has the ability for clients to "stamp" updates with a geolocation.  If they would extend that just a little bit, some really actually useful stuff would become possible.

Some of this would need features in the client, some would need features in the server, and some could be done with work on both sides.

I would like my handheld client to give me a special alert when someone I follow does an update that is geographically near me.  As it is now, my client in my phone only alerts me when someone DMs me or mentions me, as it would be entirely too noisy if it alerted me every time anyone I follow did anything.

It would be useful to be able to subscribe to the public timeline of some space.  For example, a conference could then put up on public monitors all the local tweets, not just the ones that match a hashtag.

Another neat thing would be if I could set some metadata in an update that means "this is probably only interesting to friends near me".  The service then could either filter it, and/or pass it out to my followers, but their client could choose to suppress it if it's not near enough.  How near is "near enough" could be set as a hint in the metadata, and/or a default in my account settings, and/or by settings in my followers' clients.
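Here is a rough sketch of that "near enough" filter on the client side. The distance is the standard haversine great-circle formula. Everything else is an assumption of mine: the update is a plain dict with hypothetical `local_only`, `lat`, `lon`, and `radius_hint_km` fields, the follower's client setting wins over the hint in the update, which wins over the author's account default, and a 5 km fallback applies when nobody set anything.

```python
import math

EARTH_RADIUS_KM = 6371.0


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points (haversine)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def pick_radius_km(client_km, hint_km, account_km, fallback_km=5.0):
    """Assumed precedence: client setting, then update hint, then account default."""
    for candidate in (client_km, hint_km, account_km):
        if candidate is not None:
            return candidate
    return fallback_km


def should_show(update, viewer_lat, viewer_lon, client_km=None, account_km=None):
    """Decide whether a follower's client displays an update."""
    if not update.get("local_only") or "lat" not in update or "lon" not in update:
        return True  # not marked local, or no location to judge by
    radius = pick_radius_km(client_km, update.get("radius_hint_km"), account_km)
    return distance_km(viewer_lat, viewer_lon, update["lat"], update["lon"]) <= radius
```

The same distance check would also drive the "alert me when someone I follow is near me" feature. The only difference is that it raises the priority of an update instead of suppressing it.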

Labeling a location would be neat.  For example, it would be neat if I could right now tell Twitter "loc Remedy", and instead of generating a public update, the service says "hmm, from geoloc data and past history and past updates, I am pretty sure that he means 'Remedy Tea; Capitol Hill; Seattle; WA; US'".  So now my location has a name. Which can be fed back into the public latlon-to-name database as well.

Another useful inline command could be "checkin".  If I just say "checkin", then a geolocation tagged update would go out saying something like "is at $PLACENAME", where the placename is pulled from a latlon-to-name database, and/or from my recent or past "loc" history. The "loc" and "checkin" commands could even be combined, so if I send "checkin Remedy", then my local placename gets set and refined, and the announcement goes out as well.

If Twitter and/or the other microblogging services get this right, they could very well "eat" the specialized location/social services such as foursquare and brightkite.

2010-03-03

MySQL+Memcached is still the workhorse

(originally posted at the Gear6 corporate blog: MySQL+Memcached is still the workhorse.  Please comment there.)

Because I'm becoming known as someone who knows something about "this whole NoSQL thing", people have started asking me to take a look at some of their systems or ideas, and tell them which NoSQL technology they should use.

To be fair, it is a confusing space right now. There are a LOT of NoSQL technologies showing up, and there is a lot of buzz from the tech press, and in blogs and on twitter.  Most of that buzz is, frankly, ignorant and uninformed, and is being written by people who do not have enough experience running live systems.

A couple of times already, someone has described an application or concept to me, and asked "So, should I use Cassandra, or CouchDB, or what?"

And I look at what they want to do, and say "Well, for this use case, probably the best storage pattern is normalized relational tables, with a properly tuned database server.  Put some memcached in the hot spots in your application. Maybe as you get really big, you can add in replication to some read slaves."
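"Put some memcached in the hot spots" usually means something like the read-through pattern sketched here: look in the cache first, fall back to the database on a miss, and drop the cached entry when the row changes. To keep the sketch self-contained and runnable, an in-memory SQLite table stands in for the tuned relational server and a plain dict stands in for a memcached client. The table, key format, and function names are all hypothetical.

```python
import sqlite3

# Stand-in for the relational database (assumption: a simple users table).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
db.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")
db.commit()

# Stand-in for a memcached client; a real one would offer get/set/delete.
cache = {}


def get_user_name(user_id):
    """Read path: try the cache, fall back to the database, then fill the cache."""
    key = f"user:{user_id}:name"
    if key in cache:
        return cache[key]
    row = db.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()
    if row is None:
        return None
    cache[key] = row[0]
    return row[0]


def rename_user(user_id, new_name):
    """Write path: update the database, then invalidate the cached copy."""
    db.execute("UPDATE users SET name = ? WHERE id = ?", (new_name, user_id))
    db.commit()
    cache.pop(f"user:{user_id}:name", None)
```

The normalized tables stay the source of truth, and the cache only absorbs the repeated reads. That is why it can be bolted onto just the hot spots.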

The relational table model is not dead, at all. It will never die, nor should we try to kill it.  We no longer have to be as religious about normal form, and we don't HAVE to fit everything everywhere into this form, but there is no reason to avoid it just because it's not "sexy and exotic".

Running a real live system is not a junior high prom.  You don't "win" by showing up with a sexy exotic date, and by wearing the prettiest outfit.

Running a real live system is running a farm, ploughing a field.  You want a workhorse that you know how to use, that you know you can get gear for at the blacksmith and the tackle shop, and that you know you can hire field hands for who know how to use it well and take care of it properly.

MySQL+Memcached is, today, still, that workhorse.

2010-03-02