Russ’ 10 Ingredient Recipe For Making 1 Million TPS On $5K Hardware

I wrote a blog on highscalability.com on how to get to 1 million Database TPS on $5000 worth of hardware (single machine). Hopefully there are some performance tips in the blog that people can use themselves.

http://highscalability.com/blog/2012/9/10/russ-10-ingredient-recipe-for-making-1-million-tps-on-5k-har.html

I also wrote a brag number post for Aerospike’s blog, and they let me quote the movie “Ricky Bobby”

http://www.aerospike.com/blog/all-about-speed/

I will update this blog every time I write blogs in other places, just seems like a  good idea

- Russ

Posted in Uncategorized | Leave a comment

AlchemyDB going to top of Aerospike

Aerospike is the former Citrusleaf: http://www.dbms2.com/2012/08/27/aerospike-the-former-citrusleaf/

Citrusleaf acquired AlchemyDB and we are now incrementally porting AlchemyDB functionality to run on top of Citrusleaf’s proven distributed high-availability linearly-scalable key-value store. First functionalities planned are: Lua with DocumentStore functionality, Secondary Indexes, and Real-time map-reduce. Further down the road Pregel like Distributed GraphDB like functionality or some next generation StreamingDB may be integrated in. Incrementally building AlchemyDB on top of Citrusleaf will create a distributed computing fabric, functions can be shipped to data (that lies on a horizontally scalable low latency storage layer), and the functions can propagate their results across the fabric, calling other functions on them.

Full info at: http://www.aerospike.com/blog/alchemydb/

Posted in Uncategorized | Leave a comment

3 blog kings of data

I get my information on databases, datastores, big-data, etc… primarily from 3 bloggers: Todd Hoff of High Scalability, Alex Popescu of myNOSQL, and Curt Monash of DBMS2. Each of them specialises in different areas and each has their own style and purpose, taken as a whole, they cover a lot of ground w/o a lot of dilution.

In 2-10 years you will look back at Todd Hoff’s blog High Scalability and it will contain everything that is happening right now. He has a keen insight into which technologies are fads and which technologies may lead to big changes, and he is not at all full of shit, which is next to impossible given this task. He dabbles in reporting on truly innovative technologies and he actually understands them. His style is strictly-facts and he rarely bad mouths people/ideas/stuff. His blog is the remedy for the hardened cynic that thinks we are making no important advances. His weekly “Stuff the Internet says on Scalability”, is ALWAYS good for at least 3 links to stuff I find interesting (which is 3 links higher than every other summarized email I get).

Alex Popescu is the go to guy for NOSQL. Alex Popescu’s blog myNSOSQL covers the ENTIRE NOSQL gambit (GraphDB’s, DocumentStores, KeyValueStores, Hadoop/Mapreduce, etc…). He quickly sees thru most of the bullshit in the NOSQL world, and clearly explains the differences in a movement that is full of confusion. He likes to tear people’s points apart, his points are valid and he also blasts NOSQL ideas/approaches/etc… when they have it coming. His tweet stream (@al3xandru) is a raging river of NOSQL information, it will keep you on top of the NOSQL game (once you learn how to wade thru it), plug into it, and you can be up on most of NOSQL in probably a month.

Curt Monash knows the RDBMS market, especially the analytics market, better than you know anything :) He has been in the game forever, and he earns money w/ his blog DBMS2, so it cant be called 100% objective, but he has the type of abrasive personality that is only comfortable telling mostly the truth, so IMO the info in his blog is basically objective. RDBMS technologies are relatively very mature, advanced, and widespread, so having a good summary of what is going on in that market is a MUST for any fan of data. Monash is a definitions junky, which can be boring/tiring to read, but it represents a mature approach and does help make sense of such a large, complicated, and polluted-by-enterprise-generated-bullshit market. Having a guy w/ such experience who is still very up to date, reporting on one of the oldest (yet most active) fields in computing is of great value to all of us.

In conclusion:
Real smart people are reading these 3 blogs, learning what other real smart people are doing, and forming new even smarter ideas. For this to happen a medium of information exchange that does not waste smart peoples’ time, doesn’t insult smart peoples’ intelligence w/ obvious marketing ploys, and is written in a style that sparks their imagination, is required. So go read their blogs, you will learn stuff :)

Posted in Uncategorized | 1 Comment

AlchemyDB – The world’s first integrated GraphDB + RDBMS + KV Store + Document Store

I recently added a fairly feature rich Graph Database to AlchemyDB (called it LuaGraphDB) and it took roughly 10 days to prototype. I implemented the graph traversal logic in Lua (embedded in AlchemyDB) and used AlchemyDB’s RDBMS to index the data. The API for the GraphDB is modeled after the very advanced GraphDB Neo4j. Another recently added functionality in AlchemyDB, a column type that stores a Lua Table (called it LuaTable), led me to mix Lua-function-call-syntax into every part of SQL I could fit it into (effectively tacking on Document-Store functionality to AlchemyDB). Being able to call lua functions from any place in SQL and being able to call lua functions (that can call into the data-store) directly from the client, made building a GraphDB on top of AlchemyDB possible as a library, i.e. it didn’t require any new core functionality. This level of extensibility is unique and I am gonna refer to AlchemyDB as a “Data Platform”. This is the best term I can come up with, I am great at writing cache invalidation algorithms, but I suck at naming things :)

To elaborate what I mean when I say I mixed Lua-function-call-syntax into SQL (because it’s not clear and maybe not even good english:), here is an example AlchemyDB SQL call, w/ a bunch of LUA mixed in:
SELECT func1(colX), func2(nested.obj.x.y.z), col5 FROM table WHERE index = 33 AND func3(col3) ORDER BY func4(colX)

This request does the following:
1.) finds all rows from table w/ index = 33 (RDBMS secondary index traversal)
2.) filter rows out that don’t match lua function call: func3(col3) { func3() returns [true,false]}
3.) for every row that passes #2, run the functions [func1(colX), func2(nested.obj.x.y.z)] where ‘func1′ & ‘func2′ are previously defined lua functions, and ‘colX’ is a normal column and ‘nested.obj.x.y.z’ is a nested element in a LuaTable column, and combine the return values from the two function calls with col5′s contents to form a response row.
4.) these response rows are then sorted using the return value of the function call ‘func4(colX)’ as the cmp() function.

Besides SELECT calls, Lua-function-call-syntax has also been mixed into the SQL commands: INSERT, UPDATE, & DELETE, covering the 4 horseman of SQL. Full syntax found here.

In and of itself, this mixing of Lua-function-call-syntax into SQL could be viewed as a very developer friendly User-Defined-Functions mechanism and it is that. But the integration of Lua in AlchemyDB is much deeper, and it is the deepness of the integration that opens up new ways of programming within a datastore.

Alchemy’s GraphDB implemented its graph traversal logic in Lua (found here). Any global Lua function that was turned into byte-code (or possibly LuaJIT’ed to machine code) during interpretation is callable from the client via the LUAFUNC command. These Lua functions can call into the data-store (e.g. to get/set data) w/ the Lua alchemy() function. SQL Insert/Update/Delete triggers can be created via the LUATRIGGER command, enabling SQL commands to pass row data to Lua. The ability for both languages to call each other, done in a highly efficient manner using Lua’s virtual stack calling lua byte-code (meaning no interpretation in these cross language calls), allows the developer to treat AlchemyDB as if it were an AppServer w/ an embedded RDBMS (which BTW is another Alchemy Experiment).

A simple example of the Alchemy commands “LUAFUNC” and the Lua function “alchemy(,,,)” would be the function addSqlUserRowAndNode which creates a SQL row, and adds a GraphNode (via the createNamedNode() call) inside this row’s LUATABLE column.

function addSqlUserRowAndNode(pk, citypk, nodename)
alchemy('INSERT', 'INTO', 'users', 'VALUES', "(" .. pk .. ", " .. citypk .. ", {})");
alchemy('SELECT',"createNamedNode('users', 'lo', pk, '" .. nodename .."')", 'FROM', 'users', 'WHERE', 'pk = ' .. pk);
return "OK";
end

The lua function addSqlUserRowAndNode() can be called from the frontend w/ the following command
./alchemy-cli LUAFUNC addSqlUserRowAndNode 1 10 'A'

It is worth noting (at some point, why not here:) that as long as you keep requests relatively simple, meaning they dont look at 1000 table-rows or traverse 1000 graph-nodes, your performance will range between 10K-100K TPS on a single core w/ single millisecond latencies, these are the types of numbers people should demand for OLTP.

The most challenging feature of the GraphDB was the indexing of arbitrary relationships, which does not lend itself well to SQL’s CREATE INDEX syntax. In the example on the LuaGraphDB documentation page, the indexed relationship is “Users who HAS_VISITED CityX”, but the essence of indexing arbitrary relationships means indexing “Anything w/ SOME_RELATIONSHIP to SomethingElse (in a given or both directions) (possibly at a certain depth)”. And these indexed relationships can be nested graph relationships (i.e. Dude who KNOWS people who KNOW people who HAVE lotsOfCash), it is a bitch to frame in SQL’s syntax, so it made sense to keep the logic of indexing in Lua, for this use-case.

AlchemyDB’s GraphDB implemented the indexing of relationships using AlchemyDB’s Pure Lua Function Index (which for the record may be one of the top 5 worst names ever:). This index constructs & destructs itself via user defined lua function calls (i.e. you have to write them) declared in the CREATE INDEX statement, and leaves the population of the index (i.e. addToIndex() & deleteFromIndex()) to user defined functions (again you have to write them) AND it allows the index to be used in SQL’s where-clause, exactly the same as an indexed column is used (this part you dont have to write:). The GraphDB Pure-Lua-Function-Indexes every “USER who HAS_VISITED CityX” inside the Lua routines responsible for addRelationships() via function hooks registered during “construction” (sounds fuck all tricky, but its 10 lines of code). If someone then wants to find all the users who have visited CityX, it can be done in SQL, w/ a query like “SELECT * FROM users WHERE pure_lua_func_index() = CityX.pk”. A SQL query like this could also be used to find multiple start nodes (via an indexed lookup) in a complex graph traversal. This level of flexibility is not needed in most use cases, but it again represents a cyclical calling ability (Lua<->SQL) on a very deep level, and cyclical calling enables usecases².

This little experiment houses a RDBMS, a Nosql datastore (redis), a Document-Store (via the LUATABLE column type), and a GraphDB under ONE roof. They are unified, to a sane degree, by extending SQL and they can be integrated tightly via Lua function calls, as is needed. Any use-case that requires multiple types of data-stores and serves high velocity OLTP requests and fits w/in AlchemyDB’s various other quirks (InRam, single-threaded, single-node), can benefit greatly by using this approach to reduce the number of moving parts in their system, not need to sync data across/between data-stores, not need to store data redundantly, etc…

But the part I find really exciting about this: is that I added a GraphDB to a datastore in ten days, and the GraphDB is not too shabby. AlchemyDB’s bindings open up the possibility to add functionality to an already very able data-store in a highly extensible manner. It allows you to move your functionality to your data, and THIS is HUGE. It should be easy to create all sorts of efficient data-stores & data-based-services using this Data-platform, the GraphDB implementation proves the case.

And yes this was a lot of fun to code, and it was confusing as hell until it all came together and then it was surprisingly easy to code, it fit together smoothly, it felt right, and a GraphDB is neither a small nor a simple piece of software.

AlchemyDB: the lightweight OLTP Data Platform. Strong on functionality, flexibility, and being thoroughly misunderstood :)

p.s. The GraphDB code is brand new, and has been tested, but not enough, so if you use it be aware, there may be some bugs, but I’m busting my ass getting it production hardened, so if you have problems, just shoot me an email, I will fix’em. The branch for this code is called release_0.2_rc1.

Posted in Alchemy Database, redis | 14 Comments

The case for Datastore-Side-Scripting

I have recently posted on predictions that real time web applications are going in the direction of being entirely event driven, from client (WebSockets) to web-server (Node.js) to datastore (Redisql). My current project: Alchemy Database (formerly known as Redisql) hopes to be the final link in this event driven chain. Alchemy Database is taking a new step towards reducing communication between web-server and datastore (thereby increasing thruput), by implementing Datastore-Side-Scripting.

Often, in web applications, the communication between the web-server and the datastore requires 2+ trips involving sequential requests. For example: first the session is validated and then IFF the session is valid, the request’s data is retrieved from the datastore. The web-server must first wait for the “is-session-valid-lookup” to issue the “get-me-data-for-this-request-lookup” (the latter can even be a set of sequential lookups), which implies a good deal of blocking and waiting in the web-server’s backend. Having 2+ sequential steps in web-server datastore communication may seem trivial, but these steps are commonly the cause of SYSTEM bottlenecks. A single frontend request results in webserver threads blocking and waiting multiple times on sequential datastore requests. Each {block, wait, wake-up} in the web-server means 2 context switches, in addition to the latency of 1+ tcp request(s), which are 4-6 orders of magnitude slower than a RAM lookup.

In such cases, if sequential web-server-datastore request-response pairs can be reduced to a single request/response pair, overall system performance will increase substantially and the system will become more predictable/stable (fewer context-switches, less waiting, less requests, less I/O). Datastore-side-scripting can accomplish this, by pushing trivial logic into the datastore itself. In the example above, the only logic being performed is a “if(session_valid)” which has the cost of 2 context switches and a 2+ fold increase in response duration … which is absurd. In this use-case, pushing the “if(session_valid)” into the datastore makes sense on ALL levels.

Some argue, introducing scripting in the datastore is adding logic to the classic bottleneck in 3 tiered architectures, it will exacerbate bottlenecking, it is BAD. This point is valid in theory. In practice, the main bottleneck in ultra-high-performance in-memory-databases is network I/O. Adding NICs, not adding CPUs (or cores) is a better bet to scale vertically (ref: Handlersocket blog). This means, computers running Alchemy Database spend most of their time transporting packets, on request from: NIC->RAM->CPU and then on response from: CPU->RAM->NIC. The operating system’s marshaling of TCP packets takes up far more resources/time than the trivial in-memory lookup to GET/SET the request’s data. Meaning if a few more trivial commands (e.g. an IF block) are packed into the TCP request packet, the request’s duration is not significantly effected, but the overall system is benefited greatly, by taking (very lengthy) blocking/waiting steps in the web-server out of the equation.

Another name for Datastore-side-scripting is “Stored Procedures”, a term w/ lots of baggage. Stored Procedures are flawed on many levels, yet there was and is a need for them. The reality of Stored Procedures is their syntax is ugly in SQL and they open up all sorts of possibilities for developers, which have often been abused. Yet the basic concept of pushing logic into the database was recently mentioned as a future goal for the NOSQL movement. Tokyo Cabinet has embedded Lua deeply into its product (w/ 200 Lua commands), to allow datastore-side-scripting. Voltdb, a next generation SQL server, supports ONLY Stored Procedures and argues that in production they make more sense than ad-hoc SQL.

Stored Procedures are done in a hacked together SQL-like syntax, Datastore-side-scripting is done by embedding Lua. “The Lua programming language is a small scripting language specifically designed to be embedded in other programs. Lua’s C API allows exceptionally clean and simple code both to call Lua from C, and to call C from Lua”.

Lua provides a full (and tested) scripting language and you can define functions in Lua that access your datastore natively in C. I will explain this further, as it confused me at first. Alchemy Database has Lua embedded into it’s server, both Alchemy Database and the Lua engine are written in C. Alchemy Database will pass text given as the 1st argument of the Alchemy Database “LUA” command to the Lua engine. Additionally Alchemy Database has a Lua function called “client()” that can make Alchemy Database SET/GET/INSERT/SELECT/etc.. calls from w/in Lua. The Lua “client” function is written in C via Lua’s C bindings, so calling “client()” in Lua is actually running code natively in C. See its confusing, but the net effect is the following Alchemy Database command:
LUA 'user_id = client("GET", "session:XYZ"); if (user_id) then return client("GET", "data:" .. user_id); else return ""; end'
will perform the use case described above in a single webserver-to-datastore request and the Lua code is pretty easy to understand/write/extend.

Alchemy Database’s datastore-side-scripting was implemented in about 200 lines of C code, and defines a single Lua function “client()” in C, yet it opens up a whole new level of functionality. Performance tests have shown that running Lua has about a 42% performance hit server-side (i.e. 50K req/s) as compared to a single native Alchemy Database call (e.g. SET vs. LUA ‘return client(“SET”);), but packing multiple “client()” calls into a single LUA function does not cause a linear slowdown, meaning use cases that bottleneck on sequential webserver-datastore I/O will see a HUGE performance boost from pushing logic into the datastore.

I do want to stress Alchemy Database’s datastore-side-scripting is meant to be used w/ caution. Proper usage (IMO) of Lua in Alchemy Database is to implement very trivial logic datastore-side (e.g. do not do a for loop w/ 10K lookups), especially because Alchemy Database is single threaded and long running queries block other queries during their execution. Recommended usage is to try and pack trivial logical blocks into a single datastore request, effectively avoiding sequential webserver-to-datastore requests. Lua scripting does open up the possibility for map-reduce, Alchemy Database-as-a-RDBMS-proxy/cache, (and for pathological hackers) even usage as a zero-security front-end server … but that was not the intent of embedding Lua; if you use Lua in Alchemy Database in this manner, you are hacking, please know what you are doing.

Datastore-side-scripting is a balancing act that can be exploited for gain by careful/savvy developers or wrongly implemented for loss by sloppy programming/data-modelling … It gives more power to the Developer and consequently the developer needs to wield said power wisely.

Posted in Alchemy Database, concurrency, node.js, Redisql | 15 Comments

Redisql: the lightning fast data polyglot

For about a year, I have been using the NOSQL datastore redis, in various web-serving environments, as a very fast backend to store and retrieve key-value data and data that best fits in lists, sets, and hash-tables. In addition to redis, my backend also employed mysql, because some data fits much better in a relational table. Getting certain types of data to fit into redis data objects would have added to the complexity of the system and in some cases: it’s simply not doable. BUT, I hated having 2 data-stores, especially when one (mysql) is fundamentally slower, this created a misbalance in how my code was architected. The Mysql calls can take orders of magnitude longer to execute, which is exacerbated when traffic surges. So I wrote Redisql which is an extension of redis that also supports a large subset of SQL. The Idea was to have a single roof to house both relational data and redis data and both types of data would exhibit similar lookup/insert latencies under similar concurrency levels, i.e. a balanced backend.

Redisql supports all redis data types and functionality (as it’s an extension of redis) and it also supports SQL SELECT/INSERT/UPDATE/DELETE (including joins, range-queries, multiple indices, etc…) -> lots of SQL, short of stuff like nested joins and Datawarehousing functionality (e.g. FOREIGN KEY CONSTRAINTS). So using a Redisql library (in your environment’s native language), you can either call redis operations on redis data objects or SQL operations on relational tables, its all in one server accessed from one library. Redisql morph commands convert relational tables (including range query and join results) into sets of redis data objects. They can also convert the results of redis commands on redis data objects into relational tables. Denormalization from relation tables to sets of redis hash-tables is possible, as is normalization from sets of redis hash-tables (or sets of redis keys) into relational tables. Data can be reordered and shuffled into the data structure (relational table, list, set, hash-table, OR ordered-set) that best fits your use cases, and the archiving of redis data objects into relational tables is made possible.

Not only is all the data under a single data roof in Redisql, but the lookup/insert speeds are uniform, you can predict the speed of a SET, an INSERT, an LPOP, a SELECT range query … so application code runs w/o kinks (no unexpected bizarro waits due to mysql table locks -> that lock up an apache thread -> that decrease the performance of a single machine -> which creates an imbalance in the cluster).

Uniform data access patterns between front-end and back-end can fundamentally change how application code behaves. On a 3.0Ghz CPU core, Redis SET/GET run at 110K/s and Redisql INSERT/SELECT run at 95K/s, both w/ sub millisecond mean-latencies, so all of a sudden the application server can fetch data from the datastore w/ truly minimal delay. The oh-so-common bottleneck: “I/O between app-server and datastore” is cut to a bare minimum, which can even push the bottleneck back into the app-servers, and that’s great news as app-servers are dead simple (e.g. add server) to scale horizontally. Redisql is an event-driven non-blocking asynchronous-I/O in-memory database, which i have dubbed an Evented Relational Database, for brevity’s sake.

During the development of Redisql, it became evident that optimizing the number of bytes a row occupied was an incredibly important metric, as Redisql is an In-Memory database (w/ disk persistence snapshotting). Unlike redis, Redisql can function if you go into swap space, but this should be done w/ extreme care. Redisql has lots of memory optimisations, it has been written from the ground up to allow you to put as much data as is possible into your machine’s RAM. Relational table per-row overhead is minimal and TEXT columns are stored in compressed form, when possible (using algorithms w/ negligible performance hits). Analogous to providing predictable request latencies at high concurrency levels, Redisql gives predictable memory usage overhead for data storage and provides detailed per-table, per-index memory usage via the SQL DESC command, as well as per row memory usage via the “INSERT … RETURN SIZE” command. The predictability of Redisql, which translates into tweakability for the seasoned programmer, changes the traditional programming landscape where the datastore is slower than the app-server.

Redisql is architected to handle the c10K problem, so it is world class in terms of networking speed AND all of Redisql’s data is in RAM, so there are no hard disk seeks to engineer around, you get all your data in a predictably FAST manner AND you can pack a lot of data into RAM as Redisql aggressively minimizes memory usage AND Redisql combines SQL and NOSQL under one roof, unifying them w/ commands to morph data betwixt them …. the sum of these parts, when integrated correctly w/ a fast app-server architecture is unbeatable as a dynamic web page serving platform with low latency at high concurrency.

The goal of Redisql is to be the complete datastore solution for applications that require the fastest data lookups/inserts possible. Pairing Redisql w/ an event driven language like Node.js, Ruby Eventmachine, or Twisted Python, should yield a dynamic web page serving platform capable of unheard of low latency at high concurrency, which when paired w/ intelligent client side programming, could process user events in the browser quickly enough to finally realize the browser as an applications platform.

Redisql: the polyglot that speaks SQL and redis, was written to be the Evented Relational Database, the missing piece in the 100% event driven architecture spanning from browser to app-server to database-server and back.

Posted in concurrency, node.js, redis, Redisql | 7 Comments

subway improvement

I live in silicon valley, and I am forever going to San Fran or San Jose on the caltrain. The caltrain is clean, but its too damn slow.

Then I thought about why its slow. It makes too many stops. Every stop it needs to slow down, stop, people board, it speeds up again. The overhead involved in the stops take up all the time, not the actual travelling.

We all want more stops, so statistically the caltrain is closer to our front door, but its the stops that are slowing the caltrain down, the express trains get me to San Fran 2X quicker, but they serve maybe 1/5 the stops, so they are not a solution either.

So this post (which is just for the record, cause no doubt it already exists in Japan) wants to present an idea to abstract out the stops.

Caltrain is not the best or worst example, but the simplest example is to imagine a subway line running from east to west in a major city, spanning 50 miles and having 25 stops. If the train can start at the east side and travel at 50 miles an hour w/o stop, you can traverse the city in an hour regardless of traffic, its optimal for moving people through a city.

So the tricky part is when someone gets on at mile 20 and wants to get off at mile 22 (which would be city center and the most common usage).

So you have little mini trains at each stop. You board them, and their doors close one minute before the big train (running at constant 50mph) passes the station. Once the doors of the mini train close the mini train starts to accelerate on a mini track next to the main track. The timing is such that when the mini-train reaches 50mph the main train is right next to it, and they dock in motion, allowing people to board (from mini-train to big-train) and if the next stop is close, people can disembark from the big-train to the mini-train which will then close its door and coast to the next stop.

The whole point is to have the big train move at a constant speed, never stopping. Take the stops out of the equation and mass transit would be FAST.

Just an idea, it is a hot spell in Silicon Valley, hard to sleep, my mind wanders :)

Update: I ran this by my friend faisal, and he has an idea to treat the train cars like lego blocks, opening up all sorts of new strategies, then i felt inspired and drew a pathetic one minute sketch on my tablet …

Posted in Rants | 3 Comments