Jupyter Notebooks: The Third Generation Analytics Engine
What is a Jupyter Notebook?
According to the Python download logs, 106 million people have installed Jupyter Notebooks over the past five years, including 48.6 million in 2020 and seven million in July 2020 alone. This raises the obvious questions: what are these things, and why are so many people using them? This brief answers those questions.
There are many ways to describe a Jupyter Notebook, but most simply substitute one jargon sentence for another. The important thing about a Jupyter Notebook is what it does and the role it plays for both individuals and companies. A Jupyter Notebook plays the same role that a spreadsheet does: it’s the modern tool for data analysis and numerical modeling, for everything from the timing of pulsars to operating models for businesses to discovering people’s tastes in entertainment to ensuring that Netflix gets delivered to your TV without glitches.
So why use a Notebook instead of a spreadsheet? Well, the differences are:
- Much better analytics and computing. A spreadsheet offers simple formulae and a fixed set of built-in functions, and that’s it; a Jupyter Notebook has a full programming environment underneath it, with all the power and flexibility that implies. The programming language underneath has a very simple syntax, so for the kinds of things that people do in a spreadsheet the formulae look about the same; but Notebooks go far beyond the things people can do with spreadsheets.
- Much better access to a much richer array of data sources. A spreadsheet’s data source is, well, the spreadsheet itself, and hooking it up to other data sources is a complex and awkward operation (download the data using some other tool, convert it to a format the spreadsheet likes, then open it up as a spreadsheet). A Notebook can read files, databases, and web services directly from its code.
- Much better interactive graphs, charts, and visualizations. If you’re using a spreadsheet, you’re pretty much stuck with the graphs the spreadsheet vendor included. If you’re using Jupyter Notebooks with modern tools, literally every charting and mapping product available through the web can be used with your Notebook. You can also attach interactive elements (buttons, sliders, menus, and so on) to Notebook-based charts, so that the data presented can be explored interactively, in real time.
- Much better documentation and explanation of models, data, and calculations. A Notebook is just that: a notebook, which combines formatted text with the calculations. The Notebook has such good documentation that it’s frequently used for tutorials, presentations (as a replacement for PowerPoint), school and university classroom materials, and even technical publication. Project Callysto in Canada sponsors the development of high-school curricular materials using Jupyter Notebooks; UC Berkeley’s Data Science program runs on Jupyter Notebooks, from lectures through classroom assignments through student projects. The North American Nanohertz Gravitational Wave Observatory (NANOGrav) uses Jupyter Notebooks as a means of scientific collaboration and publication throughout its community. There is now a nascent mathematics journal, Interacta Mathematica, which replaces papers with Jupyter Notebooks. This journal may or may not succeed, but it’s fair to say that nobody ever attempted an Excel-based journal.
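To give the first two points some shape, here is a minimal sketch of what a Code Cell might contain. The sales figures, the column names, and the `total_and_growth` helper are all hypothetical, invented for illustration. The sketch uses only Python’s standard library, though a real Notebook would more likely lean on a data-analysis library. The comments show the rough spreadsheet equivalents.

```python
import csv
import io

# Hypothetical data, inlined so the sketch is self-contained. In a real
# Notebook this could just as well come from a file, a database, or a URL.
RAW_CSV = """month,revenue
Jan,1200
Feb,1350
Mar,1280
Apr,1500
"""


def total_and_growth(csv_text):
    """Return (total revenue, list of month-over-month growth rates)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    revenue = [float(row["revenue"]) for row in rows]

    # Spreadsheet equivalent: =SUM(B2:B5)
    total = sum(revenue)

    # Spreadsheet equivalent: =(B3-B2)/B2, filled down the column
    growth = [(curr - prev) / prev for prev, curr in zip(revenue, revenue[1:])]
    return total, growth


total, growth = total_and_growth(RAW_CSV)
print(f"Total revenue: {total:.0f}")
for rate in growth:
    print(f"Growth: {rate:+.1%}")
```

The point of the comparison is that the simple cases stay simple, while the same cell could go on to do things no spreadsheet formula can.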
Learning Curves and Complexity
Jupyter Notebooks aren’t for everything and for every job. Yes, they’re strictly more powerful than spreadsheets, but they also take more knowledge to use and require a fair amount more machinery on the host computer than a spreadsheet does (and making this a lot easier and less burdensome is what engageLively does for a living). But when a person’s job is most easily done on a spreadsheet, that’s the right tool to use.
There’s an analogy here to the humble calculator, the first generation of analytics engine (and an engine not to be sneered at; Subrahmanyan Chandrasekhar worked out the mass limit for white dwarf stars using a mechanical calculator, the first Nobel-winning physics discovery made by mechanical calculation). Sure, spreadsheets automated and made easier a lot of what calculators did, and significantly extended analytics capabilities. But calculators require no infrastructure other than (at most) a battery and they are dirt-simple to use, with almost no learning curve. So calculators still adorn the desks of analysts, engineers, and physicists, and get put to good use. Spreadsheets, the second generation, are a little more complex to use, aren’t quite as intuitive, and require a personal computer and a dedicated application. Jupyter Notebooks, the third generation, require learning at least the rudiments of some programming language, usually Python, and the coordinated action of several pieces of software running on one or more computers. It’s worth the effort and machinery when spreadsheets become cumbersome or just aren’t up to the task; but when a spreadsheet works fine, it’s the tool to use.
The tradeoff is shown in the accompanying figure. Calculators have limited capability but can be learned instantly; spreadsheets offer a little more capability, but take a little more time to learn. A user bumps into the limitations of spreadsheets pretty quickly, but can extend their capabilities a bit at the price of learning what amounts to a fairly arcane programming language (Excel “macros”). Jupyter Notebooks offer far greater capabilities, at the cost of more learning than basic Excel requires. The authors have written Excel macros and have created Jupyter Notebooks, and can attest that the latter is a lot easier than writing and debugging the former.
OK, So What Are They Really?
We’ve put off a concrete description of the implementation till now, so the reader will have a context to fit this into. A Jupyter Notebook is just a web page, divided between “Text Cells” and “Code Cells”. A “Text Cell” is just that: text, which is typically used to describe what problem the Notebook is solving and what the next Code Cell is doing. A Text Cell can be written in either of two formats: HTML, which a web browser understands directly, or a simplified form, “markdown”, which is converted to HTML for display. A “Code Cell” is a program fragment, written in one of the many languages Jupyter Notebooks support. When a Code Cell is executed, the output of the program fragment appears directly below the Code Cell.
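On disk, a Notebook is stored as a JSON file (with the extension `.ipynb`) that lists its cells in order. The sketch below builds a two-cell Notebook by hand using only Python’s standard library. The cell contents and the `make_notebook` helper are hypothetical, and real Notebooks usually carry more metadata than is shown here.

```python
import json


def make_notebook():
    """Build a minimal two-cell Notebook as a plain Python dictionary."""
    text_cell = {
        "cell_type": "markdown",
        "metadata": {},
        "source": "# Compound interest\nThe next cell computes a balance.",
    }
    code_cell = {
        "cell_type": "code",
        "execution_count": None,  # the cell has not been executed yet
        "metadata": {},
        "outputs": [],  # filled in when the cell is run
        "source": "1000 * (1 + 0.05) ** 10",
    }
    return {
        "cells": [text_cell, code_cell],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }


notebook = make_notebook()
notebook_json = json.dumps(notebook, indent=1)
print(notebook_json)
```

Saving this text to a file ending in `.ipynb` should give something a Jupyter server can open, with the Text Cell rendered as a heading and a sentence, and the Code Cell waiting to be run.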
Again, the generalization of spreadsheets is very clear. A spreadsheet, after all, is a grid of cells, any one of which can be text, a number, or a formula. A Jupyter Notebook has only a single column of cells, but each is far more capable than the corresponding spreadsheet cell. For example, a text cell in a spreadsheet contains exactly that: unformatted text. A Text Cell in a Notebook, by contrast, can contain a full Web page, which can be used to offer a deep explanation of what’s going on in the Notebook, images, or anything else that can appear on a Web page. We use this to include arbitrary charts, maps, and visualizations in a Notebook. And we’ve covered the greater capabilities of Code Cells above.
What Resources and Infrastructure Do They Need?
A lot. This is obvious from the description above. The Notebook is a web page, so you need a web server (the Jupyter server, or a “Jupyter Hub” when it serves many users) to serve up the Notebook; a Python process to execute the Code Cells; and a web browser to view the page and the results. This runtime environment is much more complex than the spreadsheet environment, which requires only a program. To build and test their Notebooks today, authors typically install a package that sets up a server on their local computer and hooks up a Python runtime. The authors then edit and view the Notebook by logging into the server on their local laptop at the highly-memorable address http://127.0.0.1:8888 (or whatever), after ensuring that the server is actually up and running.
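The last step, making sure the server is actually running, can be checked from Python itself. This is a minimal sketch, not part of Jupyter: `server_is_up` is a hypothetical helper, and the host and port are assumptions (8888 is the usual default for a local Jupyter server, but yours may differ). It only tests whether something is accepting connections at that address, not whether that something is Jupyter.

```python
import socket


def server_is_up(host="127.0.0.1", port=8888, timeout=1.0):
    """Return True if something accepts TCP connections at host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if server_is_up():
    print("A server is listening at http://127.0.0.1:8888")
else:
    print("Nothing is listening yet; start one with `jupyter notebook`")
```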
If it seems to you that running a Notebook requires entirely too much knowledge of IT magic…well, you’re right, it does, and that’s where the Cloud comes in.
For a data scientist conversant with Python, authoring a Jupyter Notebook should be no more complex than authoring a spreadsheet. As we saw in the previous section, most of the complexity comes from bringing up and coordinating the various bits of IT infrastructure required to develop and execute a Notebook. A variety of nonprofit and commercial services automate the creation of that server for customers, so they can focus on writing Notebooks, not dealing with Jupyter Hubs. Some examples include engageLively’s Galyleo family of services; Google’s Colaboratory; UC Berkeley’s Data Hub; and https://syzygy.ca, a Cloud-based Jupyter service operated by a consortium including 23 Canadian universities and the University of Washington.
Authors: Andreas Bergen, CTO, and Rick McGeer, CEO, engageLively