Monday, December 16, 2013
Holiday Coding the SageMath Cloud
I'm also continuing to work on adding a Google Compute Engine data center; the web server part is hosted there right now at https://108.59.84.126/, but the really interesting part will be making compute nodes available, since the GCE compute nodes are very fast. I'll be making 30GB RAM 8-core instances available, so one can start a project there and just get access to that -- free for SMC users, despite the official price being 0ドル.829/hour. I hope this happens soon.
Tuesday, December 10, 2013
The Sagemath Cloud: a one-minute "elevator description"
Saturday, October 19, 2013
Jason Grout's description of the Sagemath Cloud
Jason Grout's description of the Sagemath Cloud:
William Stein, the lead developer of Sage, has been developing a new online interface to Sage, the Sage Cloud at https://cloud.sagemath.com. Currently in beta status, it is already a powerful computation and collaboration tool. Work is organized into projects which can be shared with others. Inside a project, you can create any number of files, folders, Sage worksheets, LaTeX documents, code libraries, and other resources. Real-time collaborative editing allows multiple people to edit and chat about the same document simultaneously over the web. The LaTeX editor features near real-time preview, forward and reverse search, and real-time collaboration. Also, it is easy to have Sage do computations or draw figures and have those automatically embedded into a LaTeX document using the SageTeX package (for example, after including the sagetex package, typing \sageplot{plot(sin(x))} in a TeX document inserts the plot of sin(x)). A complete Linux terminal is also available from the browser to work within the project directory. Snapshots are automatically saved and backed up every minute to ensure work is never lost. William is rapidly adding new features, often within days of a user requesting them.
Saturday, October 12, 2013
"A Symphony of Cursors" (guest post by Jason Grout)
The other day some students and I met to do some development on the Sage cell server. We each opened up our shared project on cloud.sagemath.com on our own laptops, and started going through the code. We had a specific objective. The session went something like this:
Jason: Okay, here's the function that we need to modify. We need to change this line to do X, and we need to change this other line to do Y. We also need to write this extra function and put it here, and change this other line to do Z. James: can you do X? David: can you look up somewhere on the net how to do Y and write that extra function? I'll do Z.
Then in a matter of minutes, cursors scattering out to the different parts of the code, we had the necessary changes written. I restarted the development sage cell server running inside the cloud account and we were each able to test the changes. We realized a few more things needed to be changed, we divided up the work, and in a few more minutes each had made the necessary changes.
It was amazing: watching all of the cursors scatter out into the code, each person playing a part to make the vision come true, and then quickly coming back together to regroup, reassess, and test the final complete whole. Forgive me for waxing poetic, but it was like a symphony of cursors, each playing their own tune in their lines of the code file, weaving together a beautiful harmony. This fluid syncing William wrote takes distributed development to a new level.
Thanks!
Thursday, October 3, 2013
Backing up the Sagemath Cloud
Bup
I spent a lot of time building a snapshot system for user projects on top of bup. Bup is a highly efficient de-duplicating compressed backup system built on top of git; unlike other approaches, you can store arbitrary data, huge files, etc. I looked at many open source options for making efficient de-duplicated distributed snapshots, and I think bup is overall the best, especially because the source code is readable. Right now https://cloud.sagemath.com makes several thousand bup snapshots every day, and in practice it has saved people many, many hours of potentially lost work (due to accidentally deleted or corrupted files).
You can access these snapshots by clicking on the camera icon on the right side of the file listing page.
Some lessons learned when implementing the snapshot system
- Avoid creating a large number of branches/commits -- creating an almost-empty repo, but with say 500 branches, even with very little in them, makes things painfully slow, e.g., due to an enormous number of separate calls to git. When users interactively get directory listings, it should take at most about 1 second to get a listing, or they will be annoyed. I made some possibly-hackish optimizations -- mainly caching -- to offset this issue, which are here in case anyone is interested: https://github.com/williamstein/bup (I think they are too hackish to be included in bup, but anybody is welcome to them.)
- Run a regular test of how long it takes to access the file listing in the latest commit, and if it gets above a threshold, create a new bup repo. So in fact the bup backup daemons really manage a sequence of bup repos (see the sketch after this list). There are a bunch of these daemons running on different computers, and it was critical to implement locking, since in my experience bad things happen if you try to back up an account using two different bups at the same time. Right now, typically a bup repo will have about 2000 commits before I switch to another one.
- When starting a commit, I wrote code to save information about the current state, so that everything could be rolled back in case an error occurs, due to files moving, network issues, the snapshot being massive due to a nefarious user, power loss, etc. This was critical to avoid the bup repo getting corrupted, and hence broken.
- In the end, I stopped using branches, due to complexity and inefficiency, and just make all the commits in the same branch. I keep track of what is what in a separate database. Also, when making a snapshot, I record the list of changed files (as output when making the snapshot) in the database with the commit, since this information can be really useful, and is essentially impossible to recover from the backups themselves, due to using a single branch, the bup archives being on multiple computers, and also there being multiple bup archives on each computer. NOTE: I've been recording this information for cloud.sagemath for months, but it is not yet exposed in the user interface; it will be soon.
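To make the repo-rotation and locking ideas above concrete, here is a minimal Python sketch of what one snapshot pass could look like. This is not the actual cloud.sagemath daemon code; the paths, the 1-second threshold, the lock-file convention, and the helper names are all made up for illustration, though the bup commands themselves (init, index, save, ls) are real.

    # Hypothetical sketch of a snapshot pass with repo rotation and per-project locking.
    import fcntl
    import os
    import subprocess
    import time
    from pathlib import Path

    BUP_ROOT = Path("/backups/bup")   # hypothetical: holds bup_repo_0, bup_repo_1, ...
    LISTING_THRESHOLD = 1.0           # seconds; above this, roll over to a new repo

    def bup_env(repo):
        return dict(os.environ, BUP_DIR=str(repo))

    def new_repo(n):
        repo = BUP_ROOT / ("bup_repo_%d" % n)
        repo.mkdir(parents=True, exist_ok=True)
        subprocess.check_call(["bup", "init"], env=bup_env(repo))
        return repo

    def current_repo():
        repos = sorted(BUP_ROOT.glob("bup_repo_*"))
        return repos[-1] if repos else new_repo(0)

    def listing_time(repo, branch):
        # Time how long it takes to list the latest snapshot; slow listings
        # mean this repo has accumulated too many commits.
        t = time.time()
        subprocess.call(["bup", "ls", branch + "/latest"], env=bup_env(repo),
                        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return time.time() - t

    def snapshot(project_path, branch):
        repo = current_repo()
        if listing_time(repo, branch) > LISTING_THRESHOLD:
            repo = new_repo(len(list(BUP_ROOT.glob("bup_repo_*"))))
        # Lock per project, so two daemons never run bup on the same account at once.
        with open(BUP_ROOT / (branch + ".lock"), "w") as lock:
            fcntl.flock(lock, fcntl.LOCK_EX)
            subprocess.check_call(["bup", "index", project_path], env=bup_env(repo))
            subprocess.check_call(["bup", "save", "-n", branch, project_path], env=bup_env(repo))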
Availability
The snapshots are distributed around the Sagemath Cloud cluster, so failure of single machines doesn't mean that backups become unavailable. I also have scripts that automatically rsync all of the snapshot repositories to machines in other locations, and keep offsite copies as well. It is thus unlikely that any file you create in cloud.sagemath could just get lost. For better or worse, it is also impossible to permanently delete anything. Given the target audience of mathematicians and math students, and the terms of usage, I hope this is reasonable.
Friday, September 13, 2013
IPython Notebooks in the Cloud with Realtime Synchronization and Support for Collaborators
Here's how to try it out
- Go to https://cloud.sagemath.com and make an account; this is a free service hosted on computers at University of Washington.
- Create a new project.
- Click +New, then click "IPython"; alternatively, paste in a link to an IPython notebook (e.g., anything at http://nbviewer.ipython.org/ -- you might need to get the actual link to the .ipynb file itself!), or upload a file.
- An IPython notebook server will start, the given .ipynb file should load in a same-domain iframe, and then some of the IPython notebook code and iframe contents are monkey-patched, in order to support sync and better integration with https://cloud.sagemath.com.
- Open the ipynb file in multiple browsers, and see that changes in one appear in the other, including moving cells around, creating new cells, editing markdown (the rendered version appears elsewhere), etc.
IPython development
Regarding the monkey patching mentioned above, the right thing to do would be to explain exactly what hooks/changes in the IPython html client I need in order to do sync, etc., make sure these make sense to the IPython devs, and send a pull request. As an example, in order to do sync efficiently, I have to be able to set a given cell from JSON -- it's critical to do this in place when possible, since the overhead of creating a new cell is huge (probably due to the overhead of creating CodeMirror editors); however, the fromJSON method in IPython assumes that the cell is brand new -- it would be nice to add an option to make a cell fromJSON without assuming it is empty. The ultimate outcome of this could be a clean, well-defined way of doing sync for IPython notebooks using any third-party sync implementation. IPython might provide their own sync service, and there are starting to be others available these days -- e.g., Google has one, and maybe Guido van Rossum helped write one for Dropbox recently?
How it works
Earlier this year, I implemented Neil Fraser's differential synchronization algorithm, since I needed it for file and Sage worksheet editing in https://cloud.sagemath.com. There are many approaches to realtime synchronization, and Fraser makes a good argument for his. For example, Google Wave involved a different approach (Operational Transforms), whereas Google Drive/Docs uses Fraser's approach (and code -- he works at Google), and you can see which succeeded. The main idea of his approach is an eventually stable iterative process that involves heuristically making and applying patches on a "best effort" basis; it allows all live versions of the document to be modified simultaneously -- the only locking is during the moment when a patch is applied to the live document. He also explains how to handle packet loss gracefully. I did a complete implementation from scratch (except for using the beautiful Google diff-match-patch library). There might be a Python implementation of the algorithm as part of mobwrite.
The hardest part of this project was using Fraser's algorithm, which is designed for unstructured text documents, to deal with IPython's notebook format, which is a structured JSON document. I ended up defining another, less structured format for IPython notebooks, which gets used purely for synchronization and nothing else. It's a plain text file whose first line is a JSON object giving metainformation; all other lines correspond, in order, to the JSON for individual cells. When patching, it is in theory possible in edge cases involving conflicts to destroy the JSON structure -- if this happens, the destruction is isolated to a single cell, and that part of the patch just gets rejected.
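To illustrate the format, here is a small Python sketch of converting between (simplified) notebook data and the line-based sync document; the real code runs in the browser, and the actual .ipynb structure has more fields (worksheets, nbformat version, etc.) than shown here.

    import json

    def notebook_to_syncdoc(meta, cells):
        # First line: JSON metainformation.  Every following line: the JSON of
        # one cell, in order.  json.dumps escapes newlines, so each cell is
        # guaranteed to occupy exactly one line.
        return "\n".join([json.dumps(meta)] + [json.dumps(cell) for cell in cells])

    def syncdoc_to_notebook(doc):
        # Inverse direction; a line whose JSON was destroyed by a conflicting
        # patch is simply rejected, so damage stays isolated to that one cell.
        lines = doc.split("\n")
        meta = json.loads(lines[0])
        cells = []
        for line in lines[1:]:
            try:
                cells.append(json.loads(line))
            except ValueError:
                pass
        return meta, cells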
The IPython notebook is embedded as an iframe in the main https://cloud.sagemath.com page, but with exactly the same domain, so the main page has full access to the DOM and Javascript of the iframe. Here's what happens when a user makes changes to a synchronized IPython notebook (and at least 1 second has elapsed):
- The outer page notices that the notebook's dirty flag is set, which could be caused by anything from typing a character to deleting a bunch of cells to output appearing.
- It computes the JSON representation of the notebook, and from that the document representation (with 1 line per cell) described above. This takes a couple of milliseconds, even for large documents, due to caching.
- The document representation of the notebook gets synchronized with the version stored on the server that the client connected with. (This server is one of many node.js programs that handles many clients at once, and in turn synchronizes with another server that is running in the VM where the IPython notebook server is running. The sync architecture itself is complicated and distributed, and I haven't described it publicly yet.)
- In the previous step, we in fact get a patch that we apply -- in a single atomic operation (so the user is blocked for a few milliseconds) -- to our document representation of the notebook in the iframe. If there are any changes, the outer page modifies the iframe's notebook in place to match the document. My first implementation of this update used IPython's notebook.fromJSON, which could easily take 5 seconds (!!) or more on some of the online IPython notebook samples. I spent about two days just optimizing this step.
The main ideas are:
- Map each of the lines of the current document and the new document to a unicode character,
- Use diff-match-patch to find an efficient sequence of deletions, insertions, and swaps that transforms one document into the other (i.e., swapping cells, moving cells, etc.) -- this is critical to do (a sketch follows this list),
- Change cells in place when possible.
- Send a broadcast message about the position of your cursor, so the other clients can draw it. (Symmetrically, render the cursor on receiving a broadcast message.)
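Here is a rough Python sketch of the first two ideas, using the Python port of Google's diff-match-patch library (the real code is JavaScript running in the browser): each line of the sync document -- i.e., each cell -- is encoded as a single character, the diff is computed over those short strings, and the result tells us which cells can be kept or updated in place rather than recreated.

    from diff_match_patch import diff_match_patch

    dmp = diff_match_patch()

    def cell_level_diff(old_doc, new_doc):
        # Encode every distinct line (cell) as one unicode character, so the
        # diff is over a short string of "cell characters" instead of the full
        # JSON text of the notebook.
        chars1, chars2, line_array = dmp.diff_linesToChars(old_doc, new_doc)
        diffs = dmp.diff_main(chars1, chars2, False)
        # Translate the character-level diff back into line (cell) level diffs.
        dmp.diff_charsToLines(diffs, line_array)
        # Each entry is (op, cells) with op in {-1: delete, 0: keep, 1: insert}.
        return [(op, text.splitlines()) for op, text in diffs]

A delete immediately followed by an insert of the same number of cells is the case where changing cells in place pays off, since it avoids the cost of creating new CodeMirror editors.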
Monday, September 2, 2013
Status report: integrating IPython into https://cloud.sagemath.com -- my approach
I spent the last few days (it took longer than expected) creating a generic way to *securely* proxy arbitrary http-services from cloud projects, which is now done. I haven't updated the page yet, but I implemented code so that
https://cloud.sagemath.com/[project-id]/port/[port number]/... gets all http requests automatically proxied to the given port at the indicated project. Only logged-in users with write access to that project can access this url -- with a lot of work, I think I've set things up so that one can safely create password-less non-ssl web services for a group of collaborators, and all the authentication just piggybacks on cloud.sagemath accounts and projects: it's SSL-backed (with a valid cert) security almost for free, which solves what I know to be a big problem users have.
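The routing rule can be summarized with a short Python sketch; this only illustrates the URL scheme and the access check, it is not the actual proxy code (which lives in the node.js backend), and check_write_access / host_of_project are hypothetical placeholder names.

    import re

    PROXY_PATH = re.compile(
        r"^/(?P<project_id>[0-9a-f-]{36})/port/(?P<port>\d+)(?P<rest>/.*)?$")

    def route(path, user):
        # Map an incoming URL path to (host, port, path) inside the project's VM,
        # or refuse if the user is not a collaborator on the project.
        m = PROXY_PATH.match(path)
        if m is None:
            return None
        project_id = m.group("project_id")
        if not check_write_access(user, project_id):
            raise PermissionError("not a collaborator on this project")
        return host_of_project(project_id), int(m.group("port")), m.group("rest") or "/"

    def check_write_access(user, project_id):
        ...  # placeholder: consult the project's collaborator list

    def host_of_project(project_id):
        ...  # placeholder: look up which compute VM currently hosts the project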
The above approach is also nice, since I can embed IPython notebooks via an iframe in cloud.sagemath pages, and the url is exactly the same as cloud.sagemath's, which avoids subtle issues with firewalls, same-source origin, etc. For comparison, here's what the iframe that contains a single ipynb worksheet looks like for wakari.io:
<iframe class="notebookiframe" id="" src="https://prod-vz-10.wakari.io:9014/auto_login/acd84627972f91a0838e512f32e09c9823782ec0?next=/notebook_relative/Listing 2.ipynb">
and here's what it's going to look like in cloud.sagemath:
<iframe class="notebookiframe" id="" src="https://cloud.sagemath.com/70a37ef3-4c3f-4bda-a81b-34b894c89701/port/9100/Listing 2.ipynb">
With the wakari.io approach, some users will find that notebooks just don't work, e.g., students at the University of Arizona, at least if their wifi still doesn't allow connecting to nonstandard ports, as it didn't when I tried to set up a Sage notebook server there once for a big conference. By having exactly the same page origin and no nonstandard ports, the way I set things up, the parent page can also directly call javascript functions in the iframe (and vice versa), which is potentially very useful.
IPython notebook servers will be the first to use this framework; then I'll use something similar to serve static files directly out of projects. I'll likely also add the sage cell server and the classic sage notebook at some point, and maybe wikis, etc.
Having read and learned a lot about the IPython notebook, my main concern now is their approach to multiple browsers opening the same document. If you open a single worksheet with multiple browsers, there is absolutely no synchronization at all, since there is no server-side state. Either browser can and will silently overwrite the work of the other when you (auto-)save. It's worse than the Sage Notebook, where at least there is a sequence number and the browser that is behind gets a forced refresh (and a visible warning message about there being another viewer). For running your own IPython notebook on your own computer, this probably isn't a problem (just like a desktop app), but for a long-running web service, where a single user may use a bunch of different computers (home laptop, tablet, office computer, another laptop, etc.) or there may be multiple people involved, I'm uncomfortable that it is so easy for all your work to just get overwritten, so I feel I must find some way to address this problem before releasing IPython support.
With cloud.sagemath, a lot of people will likely quickly start running IPython notebook servers for groups of users, since it would take about 1 minute to set up a project with a few collaborators -- then they all get secure access to a collection of IPython notebooks (and other files). So I'm trying to figure out what to do about this. I'll probably just implement a mechanism so that the last client to open an IPython notebook gets that notebook, and all older clients get closed or locked. Maybe in a year IPython will implement proper sync, and I can remove the lock. (On the other hand, maybe they won't -- having no sync has its advantages regarding simplicity and *speed*.)
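To make the "last client wins" idea concrete, here is a tiny Python sketch of what I have in mind; nothing like this is implemented yet, and the names are purely hypothetical.

    # Hypothetical server-side registry: the newest client to open a notebook
    # becomes its editor, and any older client is told to lock itself.
    active_clients = {}   # notebook path -> client id currently allowed to edit

    def open_notebook(path, client_id, notify):
        previous = active_clients.get(path)
        if previous is not None and previous != client_id:
            notify(previous, {"event": "locked",
                              "reason": "notebook was opened elsewhere"})
        active_clients[path] = client_id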