UC Berkeley JupyterHubs Documentation - Division of Data Sciences Technical Staff
CONTENTS

1 Using DataHub
  1.1 Using DataHub

2 Modifying DataHub to fit your needs
  2.1 Contributing to DataHub
This repository contains configuration and documentation for the many JupyterHubs used by various organizations at UC Berkeley.
CHAPTER ONE

USING DATAHUB

1.1 Using DataHub

1.1.1 Services Offered

This page lists the various services we offer as part of DataHub. Not all of these are available on every hub, but we can easily enable them as you wish.

User Interfaces

Our diverse user population has diverse needs, so we offer many different user interfaces for instructors to choose from.

Jupyter Notebook (Classic)

This familiar interface is what many people mean when they say 'Jupyter'. It is the default for most of our introductory classes: document oriented, no-frills, and well known by a lot of people.
RStudio

We want to provide first-class support for teaching with R, which means providing strong support for RStudio. This includes Shiny support.

Try without a berkeley.edu account:
Try with a berkeley.edu account: R DataHub
JupyterLab

JupyterLab is a more modern version of the classic Jupyter notebook from the Jupyter project. It is more customizable and better supports some advanced use cases. Many of our more advanced classes use it, and we may help all classes move to it once a simpler document-oriented mode is available.
Linux Desktop (Experimental)

Sometimes you just need to use something that requires a full desktop environment to run. Instead of trying to get students to install things locally, we offer a full-fledged Linux desktop environment they can access from inside their browser. This is just a different 'UI' on the same infrastructure as the notebook environment, so it uses the same libraries and home directories.

Try without a berkeley.edu account:
Try with a berkeley.edu account: EECS DataHub
Visual Studio Code (Experimental)

Sometimes you just want an IDE, not a notebook environment. We are experimenting with a hosted web version of the popular Visual Studio Code editor, to see if it would be useful for teaching more traditional CS classes.

Try without a berkeley.edu account:
Try with a berkeley.edu account: EECS DataHub

More?

If you have a web based environment, we can almost certainly make it run under a hub. Contact us and we'll see what we can do :)
Services

Sometimes you need something custom to get your class going. Very, very interesting things can happen here, so we're always looking for new services to add.

PostgreSQL

Some of our classes require real databases for teaching. We now experimentally offer a PostgreSQL server for each user on the Data 100 hub. The data does not persist right now, but we can turn persistence on whenever needed.

Programming languages

We support the usual suspects: Python, R & Julia. However, there are no limits on which languages we can actually support, so if you are planning to use a different (open source) programming language, contact us and we'll set you up.

More?

We want to find solutions to your interesting problems, so please bring us your interesting problems.

1.1.2 Accessing private GitHub repos

GitHub is used to store class materials (lab notebooks, lecture notebooks, etc), and nbgitpuller is used to distribute them to students. By default, nbgitpuller only supports public GitHub repositories. However, Berkeley's JupyterHubs are set up to allow pulling from private repositories as well. Public repositories are still preferred, but if you want to distribute a private repository to your students, you can do so.

1. Go to the GitHub app for the hub you are interested in:
   1. R Hub
   2. DataHub
   3. PublicHealth Hub
   4. Open an issue if you want more hubs supported.
2. Click the 'Install' button.
3. Select the organization / user containing the private repository you want to distribute on the JupyterHub. If you are not the owner or an administrator of this organization, you might need extra permissions to perform this action.
4. Select 'Only select repositories', and below that select the private repositories you want to distribute to this JupyterHub.
5. Click the 'Install' button. The JupyterHub you picked now has access to this private repository. You can revoke this at any time by coming back to this page and removing the repo from the list of allowed repos. You can also uninstall the GitHub app entirely.
6. You can now make a link for your repo at nbgitpuller.link, as sketched below. If you have just created your repo, you might have to specify main instead of master as the branch name, since GitHub recently changed the name of the default branch.
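For reference, the links generated at nbgitpuller.link are plain URLs pointing at the hub's git-pull endpoint, with the repository, branch, and target path encoded as query parameters. The organization, repository, and notebook name below are hypothetical examples, not real course material:

    https://datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fmy-org%2Fmy-private-repo&branch=main&urlpath=tree%2Fmy-private-repo%2Flab01.ipynb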
That's it! You're all set. You can distribute these links to your students, and they'll be able to access your materials. You can also use more traditional methods (like the git command line tool, or RStudio's git interface) to access the repo.

Note: Everyone on the selected JupyterHub can clone your private repo if you do this. They won't be able to see that the repo exists, but if they get their hands on your nbgitpuller link they can fetch it too. More fine-grained permissions are coming soon.

1.1.3 JupyterHubs in this repository

DataHub

datahub.berkeley.edu is the 'main' JupyterHub for use on the UC Berkeley campus. It is the largest and most active hub, and has many Python & R packages installed. It runs on Google Cloud Platform in the ucb-datahub-2018 project. You can see all config for it under deployments/datahub.

Classes

• The big data8 class.
• Active connector courses
• Data Science Modules
• Astro 128/256

This hub is also the 'default' when folks want to use a hub for a short period of time for any reason, without super specific requirements.

Prob140 Hub

A hub specifically for prob140. Some of the admin users on DataHub are students in prob140, and admin access there would allow them to see the work of other prob140 students. Hence, this hub is kept separate until JupyterHub gains features for restricting admin use. It runs on Google Cloud Platform in the ucb-datahub-2018 project. You can see all config for it under deployments/prob140.

Data 100

This hub is for Data 100, which has a unique user and grading environment. It runs on Google Cloud Platform in the ucb-datahub-2018 project. You can see all config for it under deployments/data100.

Data 100 also has shared folders between staff (professors and GSIs) and students. Staff, assuming they have been added as admins in config/common.yaml, can see a shared and a shared-readwrite folder. Students can only see the shared folder, which is read-only. Anything that gets put in shared-readwrite is automatically viewable in shared, but as read-only files. The purpose of this is to share large data files once, instead of keeping one copy per student.
Data 102

Data 102 runs on Google Cloud Platform in the ucb-datahub-2018 project. You can see all config for it under deployments/data102.

Data8X Hub

A hub for the data8x course on EdX. This hub is open to use by anyone in the world, using LTI Authentication to provide login capability from inside EdX. It runs on Google Cloud Platform in the data8x-scratch project. You can see all config for it under deployments/data8x.

1.1.4 User Authentication

UC Berkeley runs a Canvas instance, bcourses.berkeley.edu. Almost all our hubs use it for authentication, although not all yet (issue).

Who has access?

Anyone who can log in to bcourses can log in to our JupyterHubs. This includes all Berkeley affiliates. If you have a working berkeley.edu email account, you can most likely log in to bcourses, and hence to our JupyterHubs.

Students have access for 9 months after they graduate. If they have an incomplete, they have 13 months of access instead.

Non-Berkeley affiliates

If someone who doesn't have a berkeley.edu account wants to use the JupyterHubs, they need to get a CalNet Sponsored Guest account. This gives them access to bcourses, and hence to all the JupyterHubs.

Troubleshooting

If you can log in to bcourses but not to any of the JupyterHubs, please contact us. If you can not log in to bcourses, please contact bcourses support.

1.1.5 Storage Retention Policy

Policy Criteria

No non-hidden files in the user's home directory have been modified in the last 12 months.
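A minimal sketch of how this criterion could be checked from the NFS server. The directory layout is an assumption for illustration, not the actual path used in production:

    # Flag home directories with no non-hidden files modified in the last 12 months.
    for home in /export/homedirs/datahub/*; do
        # Look for any non-hidden file modified within the last 365 days; stop at the first hit.
        recent=$(find "$home" -not -path '*/.*' -type f -mtime -365 -print -quit)
        if [ -z "$recent" ]; then
            echo "candidate for archival: $home"
        fi
    done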
Archival

1. Zip the whole home directory.
2. Upload it to the Google Drive of a SPA created for this purpose.
3. Share the ZIP file in Google Drive with the user.

A command-line sketch of these steps appears at the end of this section.

Rationale

Today (6 Feb 2020), we have 18,623 home directories on datahub. Most of these users used datahub in previous semesters, have not logged in for a long time, and will probably never log in again. This costs us a lot of money in disk space, which we would otherwise have to keep expanding forever. By cleaning up after 12 months of non-usage, we will not affect any current users, just folks who haven't logged in for a long time. Archiving the contents makes sure people still have access to their old work, without leaving the burden of maintaining it forever on us.

Why Google Drive?

For UC Berkeley users, Google Drive offers unlimited free space. We can also perform access control easily with Google Drive.

Alternatives

1. Email it to our users. This will most likely be rejected by most mail servers, as the home directory will be too big an attachment.
2. Put it in Google Cloud Nearline storage, build a token based access control mechanism on top, and email this link to the users. We would probably need to clean this up every 18 months or so for cost reasons. This is the viable alternative if we decide not to use Google Drive.
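A hedged sketch of the archival steps for a single user. The paths, the rclone remote name, and the destination folder are all assumptions for illustration; the actual tooling used for the upload is not specified here:

    # Archive one user's home directory and push it to the SPA's Google Drive.
    user=jane-doe                                   # hypothetical username
    home=/export/homedirs/datahub/$user             # assumed home directory layout
    zip -r "/tmp/${user}.zip" "$home"
    # 'spa-gdrive' is a hypothetical rclone remote configured against the SPA's Google Drive.
    rclone copy "/tmp/${user}.zip" "spa-gdrive:home-directory-archives/"

Sharing the uploaded ZIP with the user would then be done through the Drive sharing controls (or the Drive API) of the SPA account.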
CHAPTER TWO

MODIFYING DATAHUB TO FIT YOUR NEEDS

Our infrastructure can serve the diverse needs of our students only if it is built by a diverse array of people.

2.1 Contributing to DataHub

2.1.1 Pre-requisites

Smoothly working with the JupyterHubs maintained in this repository requires a number of pre-requisite skills. The rest of the documentation assumes you have at least a basic level of these skills, and know how to get help related to these technologies when necessary.

Basic

These skills let you interact with the repository in a basic manner. This lets you do most 'self-service' tasks, such as adding admin users or libraries and making changes to resource allocation. It doesn't give you the skills to debug things when they break, however.

1. Basic git & GitHub skills. The Git Book & GitHub Help are good resources for this.
2. Familiarity with YAML syntax.
3. Understanding of how packages are installed in the languages we support.
4. Rights to merge changes into this repository on GitHub.

Full

In addition to the basic skills, you'll need the following skills to 'fully' work with this repository. Primarily, you need these to debug issues when things break, since we strive to never have things break in the same way more than twice.

1. Knowledge of our tech stack:
   1. Kubernetes
   2. Google Cloud
   3. Helm
   4. Docker
   5. repo2docker
   6. Jupyter
   7. Languages we support: Python & R
2. Understanding of our JupyterHub distribution, Zero to JupyterHub.
3. Full access to the various cloud providers we use.

2.1.2 Repository Structure

Hub Configuration

Each hub has a directory under deployments/ where all configuration for that particular hub is stored in a standard format. For example, all the configuration for the primary hub used on campus (datahub) is stored under deployments/datahub/.

User Image (image/)

The contents of the image/ directory determine the environment provided to the user. For example, it controls:

1. Versions of Python / R / Julia available
2. Libraries installed, and which versions of those are installed
3. Specific config for Jupyter Notebook or IPython

repo2docker is used to build the actual user image, so you can use any of the supported config files to customize the image as you wish.

Hub Config (config/ and secrets/)

All our JupyterHubs are based on Zero to JupyterHub (z2jh). z2jh uses configuration files in YAML format to specify exactly how the hub is configured. For example, it controls:

1. RAM available per user
2. Admin user lists
3. User storage information
4. Per-class & per-user RAM overrides (when classes or individuals need more RAM)
5. Authentication secret keys

These files are split between files that are visible to everyone (config/) and files that are visible only to a select few illuminati (secrets/). To get access to the secret files, please consult the illuminati.

Files are further split into the following (a hedged example of common.yaml appears after this list):

1. common.yaml - Configuration common to the staging and production instances of this hub. Most config should be here.
2. staging.yaml - Configuration specific to the staging instance of the hub.
3. prod.yaml - Configuration specific to the production instance of the hub.
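As an illustration, an admin-user entry in a hub's config/common.yaml might look like the following. This is a hedged sketch: the usernames are invented, and the exact key layout depends on the z2jh chart version in use (older chart versions expose this under different keys than hub.config):

    # config/common.yaml (illustrative sketch, not an actual deployment)
    jupyterhub:
      hub:
        config:
          Authenticator:
            admin_users:
              - instructor-1   # hypothetical usernames
              - gsi-2

Values set in staging.yaml or prod.yaml with the same keys would typically override the common values for that instance, since they are layered on top of common.yaml at deploy time.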
hubploy.yaml

We use hubploy to deploy our hubs in a repeatable fashion. hubploy.yaml contains the information required for hubploy to work, such as cluster name, region, provider, etc.

Various secret keys used to authenticate to cloud providers are kept under secrets/ and referred to from hubploy.yaml.

Documentation

Documentation is under the docs/ folder, and is generated with the Sphinx project. It is written in the reStructuredText (rst) format. Documentation is automatically published to https://ucb-jupyterhubs.readthedocs.io/.

2.1.3 User home directory storage

All users on all the hubs get a home directory with persistent storage.

Why NFS?

NFS isn't a particularly cloud-native technology. It isn't highly available nor fault tolerant by default, and is a single point of failure. However, it is currently the best of the alternatives available for user home directories, so we use it.

1. Home directories need to be fully POSIX compliant file systems that work with minimal edge cases, since this is what most instructional code assumes. This rules out object-store backed filesystems such as s3fs.
2. Users don't usually need guaranteed space or IOPS, so providing each of them a persistent cloud disk gets unnecessarily expensive, since we pay for it whether it is used or not. When we did use one persistent disk per user, the storage cost dwarfed everything else by an order of magnitude, for no apparent benefit. Attaching cloud disks to user pods also takes on average about 30s on Google Cloud, and much longer on Azure. NFS mounts pretty quickly, getting this down to a second or less.

We'll probably be on some form of NFS for the foreseeable future.

NFS Server

We currently have two approaches to running NFS servers; a hedged example export for the first approach is shown after this list.

1. Run a hand-maintained NFS server with ZFS SSD disks. This gives us control over performance, size and, most importantly, server options. We use anonuid=1000, so all reads / writes from the cluster are treated as if they have uid 1000, which is the uid all user processes run as. This prevents us from having to muck about with permissions & chowns, particularly since Kubernetes creates new directories on volumes as root with strict permissions (see issue).
2. Use a hosted NFS service like Google Cloud Filestore. We do not have to perform any maintenance if we use this, but we have no control over the host machine either. This necessitates some extra work to deal with the permission issues; see jupyterhub.singleuser.initContainers in the common.yaml of a hub that uses this method.
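For the hand-maintained server, the options described above would translate into an /etc/exports entry along these lines. This is a sketch under assumed paths and network ranges, not the actual server configuration:

    # /etc/exports (illustrative): export the home directory tree to the cluster's network,
    # squashing all client access to uid/gid 1000 to match the uid user pods run as.
    /export/homedirs 10.0.0.0/8(rw,sync,no_subtree_check,all_squash,anonuid=1000,anongid=1000)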
Right now, every hub except data8x uses the first approach, primarily because Google Cloud Filestore was not available when they were first set up. data8x uses the second approach, and if it proves reliable we will switch everything to it next semester.

Home directory paths

Each user on each hub gets their own directory on the server that is treated as their home directory. The staging & prod servers share home directory paths, so users get the same home directories on both.

For most hubs, the user's home directory path relative to the exported NFS directory is <hub-name>/home/<username>. Prefixing the path with the name of the hub allows us to use the same NFS share for many hubs.

NFS Client

We currently have two approaches for mounting the user's home directory into each user's pod.

1. Mount the NFS share once per node to a well known location, and use hostPath volumes with a subPath on the user pod to mount the correct directory into the user pod. This lets us get away with one NFS mount per node, rather than one per pod. See hub/templates/nfs-mounter.yaml to see how we mount this on the nodes. It's a bit of a hack, and if we want to keep using this method it should be turned into a CSI driver instead.
2. Use the Kubernetes NFS volume provider. This doesn't require hacks, but leads to at least 2 NFS mounts per user per node, often leading to hundreds of NFS mounts per node. This might or might not be a problem.

Most hubs use the first method, while data8x is trialing the second. If it goes well, we might switch to the second method for everything.

We also try to mount everything as soft, since we would rather have a write fail than have processes go into uninterruptible sleep mode (D), where they usually can not be killed, when the NFS server runs into issues.

2.1.4 Kubernetes Cluster Configuration

We use Kubernetes to run our JupyterHubs. It has a healthy open source community, managed offerings from multiple vendors & a fast pace of development. Running on top of Kubernetes lets us run easily on many different cloud providers with similar config, so it is also our cloud agnostic abstraction layer. We prefer using a managed Kubernetes service (such as Google Kubernetes Engine). This document lays out our preferred cluster configuration on various cloud providers.

Google Kubernetes Engine

In our experience, Google Kubernetes Engine (GKE) has been the most stable, performant, and reliable managed Kubernetes service. We prefer running on it when possible.

A gcloud container clusters create command can succinctly express the configuration of our Kubernetes cluster. The following commands represent the currently favored configuration.
UC Berkeley JupyterHubs Documentation gcloud container clusters create \ --enable-ip-alias \ --enable-autoscaling \ --max-nodes=20 --min-nodes=1 \ --region=us-central1 --node-locations=us-central1-b \ --image-type=cos \ --disk-size=100 --disk-type=pd-balanced \ --machine-type=n1-highmem-8 \ --cluster-version latest \ --no-enable-autoupgrade \ --enable-network-policy \ --create-subnetwork="" \ --tags=hub-cluster \ gcloud container node-pools create \ --machine-type n1-highmem-8 \ --num-nodes 1 \ --enable-autoscaling \ --min-nodes 1 --max-nodes 20 \ --node-labels hub.jupyter.org/pool-name=-pool \ --node-taints hub.jupyter.org_dedicated=user:NoSchedule \ --region=us-central1 \ --image-type=cos_containerd \ --disk-size=200 --disk-type=pd-balanced \ --no-enable-autoupgrade \ --tags=hub-cluster \ --cluster=fall-2019 \ user---- IP Aliasing --enable-ip-alias creates VPC Native Clusters. This becomes the default soon, and can be removed once it is the default. Autoscaling We use the kubernetes cluster autoscaler to scale our node count up and down based on demand. It waits until the cluster is completely full before triggering creation of a new node - but that’s ok, since new node creation time on GKE is pretty quick. --enable-autoscaling turns the cluster autoscaler on. --min-nodes sets the minimum number of nodes that will be maintained regardless of demand. This should ideally be 2, to give us some headroom for quick starts without requiring scale ups when the cluster is completely empty. --max-nodes sets the maximum number of nodes that the cluster autoscaler will use - this sets the maximum number of concurrent users we can support. This should be set to a reasonably high number, but not too high - to protect against runaway creation of hundreds of VMs that might drain all our credits due to accident or security breach. 2.1. Contributing to DataHub 17
Highly available master

The Kubernetes cluster's master nodes are managed by Google Cloud automatically. By default, the master is deployed in a non-highly-available configuration with only one node. This means that upgrades and master configuration changes cause a few minutes of downtime for the Kubernetes API, causing new user server starts / stops to fail.

We request highly available masters with the --region parameter. This specifies the region where our 3 master nodes will be spread across different zones. It costs us extra, but it is totally worth it.

By default, asking for highly available masters also asks for 3x the node count, spread across multiple zones. We don't want that, since all our user pods have in-memory state & can't be relocated. Specifying --node-locations explicitly lets us control how many and which zones the nodes are located in.

Region / Zone selection

We generally use the us-central1 region and a zone in it for our clusters, simply because that is where we have asked for quota. There are regions closer to us, but latency hasn't really mattered, so we are currently still in us-central1. There are also unsubstantiated rumors that us-central1 is their biggest data center and hence less likely to run out of quota.

Disk Size

--disk-size sets the size of the root disk on all the Kubernetes nodes. This isn't used for any persistent storage such as user home directories. It is only used ephemerally for the operations of the cluster, primarily storing docker images and other temporary storage. We can make this larger if we use a large number of big images, or if we want our image pulls to be faster (since disk performance increases with disk size).

--disk-type=pd-standard gives us standard spinning disks, which are cheaper. We can also request SSDs with --disk-type=pd-ssd; they are much faster, but also much more expensive. We compromise with --disk-type=pd-balanced, which is faster than spinning disks but not always as fast as SSDs.

Node size

--machine-type lets us select how much RAM and CPU each of our nodes has. For non-trivial hubs, we generally pick n1-highmem-8, with 52G of RAM and 8 cores. This is based on the following heuristics; a hedged configuration sketch follows the list.

1. Students are generally memory limited rather than CPU limited. In fact, while we have a hard limit on memory use per user pod, we do not have a CPU limit - it hasn't proven necessary.
2. We try to overprovision clusters by about 2x, so we try to fit about 100G of total RAM use in a node with about 50G of RAM. This is accomplished by setting the memory request to be about half of the memory limit on user pods. This leads to massive cost savings, and works out ok.
3. There is a Kubernetes limit of 100 pods per node.

Based on these heuristics, n1-highmem-8 currently seems to give us the most bang for the buck. We should revisit this for every cluster creation.
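The 2x overprovisioning heuristic maps directly onto the z2jh memory settings: the guarantee (the Kubernetes request) is set to roughly half the limit. A hedged sketch of what that could look like in a hub's config; the numbers are illustrative, not any class's actual allocation:

    jupyterhub:
      singleuser:
        memory:
          limit: 2G        # hard per-user cap
          guarantee: 1G    # scheduler reservation: ~half the limit gives ~2x overprovisioning
        # No cpu limit is set, matching heuristic 1 above.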
Cluster version

GKE automatically upgrades cluster masters, so there is generally no harm in being on the latest version available.

Node autoupgrades

When node autoupgrades are enabled, GKE will automatically try to upgrade our nodes whenever needed (our GKE version falling off the support window, security issues, etc). However, since we run stateful workloads, we disable this for now so we can do the upgrades manually.

Network Policy

Kubernetes Network Policy lets you firewall internal access inside a Kubernetes cluster, whitelisting only the flows you want. The JupyterHub chart we use supports setting up the appropriate NetworkPolicy objects it needs, so we should turn it on for additional security depth. Note that any extra in-cluster services we run must have a NetworkPolicy set up for them to work reliably.

Subnetwork

We put each cluster in its own subnetwork, since there seems to be a limit on how many clusters you can create in the same network with IP aliasing on - you just run out of addresses. This also gives us some isolation: subnetworks are isolated by default and can't reach other resources. You must add firewall rules to provide access, including access to any manually run NFS servers. We add tags for this.

Tags

To help with firewalling, we add network tags to all our cluster nodes. This lets us add firewall rules to control traffic between subnetworks.

Cluster name

We try to use a descriptive name as much as possible.

2.1.5 Cloud Credentials

Google Cloud Service Accounts

Service accounts are identified by a service key, and help us grant specific access to an automated process. Our CI process needs two service accounts to operate; a command line sketch for creating one follows the list.

1. A gcr-readwrite key. This is used to build and push the user images. Based on the docs, this is assigned the role roles/storage.admin.
2. A gke key. This is used to interact with the Google Kubernetes cluster. The roles roles/container.clusterViewer and roles/container.developer are granted to it.
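If you prefer the command line over the web console, creating one of these service accounts and granting it a role could look roughly like this. The account name, key path, and display name are illustrative placeholders; the roles to grant are the ones listed above:

    # Create a service account for CI image pushes (names are illustrative).
    gcloud iam service-accounts create gcr-readwrite \
      --project=ucb-datahub-2018 \
      --display-name="CI: push user images"

    # Grant it the role used for pushing images.
    gcloud projects add-iam-policy-binding ucb-datahub-2018 \
      --member="serviceAccount:gcr-readwrite@ucb-datahub-2018.iam.gserviceaccount.com" \
      --role="roles/storage.admin"

    # Create a JSON key to place under the deployment's secrets/ directory.
    gcloud iam service-accounts keys create secrets/gcr-readwrite.json \
      --iam-account=gcr-readwrite@ucb-datahub-2018.iam.gserviceaccount.com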
These keys are currently copied into the secrets/ dir of every deployment, and explicitly referenced from hubploy.yaml in each deployment. They should be rotated every few months.

You can create service accounts through the web console or the command line. Remember not to leave copies of the private key lying around elsewhere on your local computer!

2.1.6 Incident reports

Blameless incident reports are very important for the long term sustainability of resilient infrastructure. We publish them here for transparency, and so we may learn from them for future incidents.

2017-02-09 - JupyterHub db manual overwrite

Summary

Datahub was reported down at 1 AM. Users attempting to log in to datahub were greeted with a proxy error. The hub pod was up, but its log was full of sqlite errors. After the hub pod was deleted and a new one came up, students logging in to datahub found their notebooks were missing and their home directories were empty. Once this was fixed, some students were still being logged in as one particular other user. Finally, students with a '.' in their username were still having issues after everyone else was fine. This was all fixed and an all-clear signalled at about 2017-02-09 11:35 AM.

Timeline

2017-02-09 00:25 - 00:29 AM

While attempting to debug some earlier 400 errors, we tried to set base_url and ip to something incorrect to see if it would cause a problem (values shown as <placeholders> were specific to the affected account):

    kubectl exec hub-deployment-something --namespace=datahub -it bash
    apt-get install sqlite3
    sqlite3

    ATTACH 'jupyterhub.sqlite' AS my_db;
    SELECT name FROM my_db.sqlite_master WHERE type='table';
    SELECT * FROM servers;
    SELECT * FROM servers WHERE base_url LIKE '%<username>%';
    UPDATE servers SET ip='<incorrect-ip>' WHERE base_url LIKE '%<username>%';
    UPDATE servers SET base_url='/<incorrect-base-url>' WHERE base_url LIKE '%<username>%';

Ctrl+D (exit back into the bash shell). We checked datahub.berkeley.edu, and nothing had happened to the account. We saw that the sql db was not updated, and attempted to run .save:

    sqlite3
    .save jupyterhub.sqlite

This replaced the db with an empty one, since ATTACH had not been run beforehand.
00:25:59 AM

The following exception shows up in the hub logs:

    sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such table: proxies
    [SQL: 'SELECT proxies.id AS proxies_id, proxies._public_server_id AS proxies__public_server_id,
    proxies._api_server_id AS proxies__api_server_id \nFROM proxies \nWHERE proxies.id = ?']
    [parameters: (1,)]

This continues for the hub table as well, since those two seem to be the most frequently used.

1:12 AM

Sam's roommate notices that he can log in to datahub but all his notebooks are gone. We notice that there are only ~50 users on the JupyterHub admin panel when there used to be ~1000, so we believe the JupyterHub sqlite user database got wiped/corrupted, a new account was then created for the roommate when he logged in, and a new persistent disk was provisioned since the hub had lost track of his old one. This is confirmed soon after (username elided):

    $ kubectl --namespace=datahub get pvc | grep claim-<username>
    claim-<username>-257   Bound   pvc-3b405e13-ddb4-11e6-98ef-42010af000c3   10Gi   RWO   21d
    claim-<username>-51    Bound   pvc-643dd900-eea7-11e6-a291-42010af000c3   10Gi   RWO   5m

1:28 AM

We shut down the hub pod by scaling the replicas to 0. We then begin recreating the JupyterHub sqlite database by taking the Kubernetes PVCs and matching them back with the user ids. We could do this because the name of each PVC contains a sanitized form of the username and the userid. The notebook that was used to recreate the db from PVCs is 2017-02-09-datahub-db-outage-pvc-recreate-script.ipynb.

2:34 AM

We recreate the sqlite3 database. Initially each user's cookie_id was set to a dummy cookie value.

2:42 AM

User cookie_id values are changed to null rather than the dummy value. The sqlite file is then attached back to datahub. The number of users shown on the admin page is back to ~1000. The hub is up, and a spot check of starting other users' servers seems to work. Some users get redirected to one particular user, but deleting and recreating the affected user seems to fix this.
10:11 AM

Attempt to log everyone out by changing the cookie secret in the hub pod at /srv/jupyterhub/jupyterhub_cookie_secret. Just one character near the end was changed, and the pod restarted. No effect. One character at the beginning of the secret was changed next, and the pod restarted again; this caused an actual change, and logged all users out. People are still being redirected to one particular user's account when they log in. More looking around required.

10:17 AM

John DeNero advises students to use ds8.berkeley.edu for now. ds8.berkeley.edu promptly starts crashing because it does not have resources for a data8 level class.

10:29 AM

All user pods are deleted, which finally properly logs everyone out. However, people logging in are still all getting the same user's pods.

10:36 AM

Notice that the cookie_id column in the user database table is empty for many users, and that the user everyone is being logged in as has an empty cookie_id too and is the 'first' in the table when sorted ascending by id. Looking at the JupyterHub code, cookie_id is always supposed to be set to a uuid, and never supposed to be empty. Setting cookie_id for users fixes their issues, and seems to spawn them into their own notebooks.

10:45 AM

A script is run that populates cookie_id for all users, and the hub is restarted to make sure there's no stale cache in RAM. All user pods are deleted again. Most users are back online now! More users start testing and confirming things are working for them.

10:53 AM

A user with a '.' in their name reports that they're getting an empty home directory. More investigation shows two user records: one with a '.' in the name that is newer, and one with a '-' in the name instead of a '.' that is older. The hypothesis is that the older one is the 'original', but such users are all attaching to the new, empty one. Looking at PVCs confirms this: there are two PVCs for each user with a '.' in their name who has tried to log in, and they differ only by ids.

There is some confusion about users ending up on prob140, because the data8.org homework link was temporarily changed to point there.
11:05 AM

Directly modifying the user table to rename the user with the '-' in the name to have a '.' seems to work for people.

11:15 AM

A script is run that modifies the database user table for all users with a '-' in their name, replacing the '-' with a '.'. The new users created with the '.' in their name are dropped before this.

11:17 AM

All clear given for datahub.berkeley.edu.

11:19 AM

Locally verified that running .save in sqlite3 will overwrite the db file without any confirmation, and is the most likely cause of the issue.

Conclusion

Accidental overwriting of the sqlite file during a routine debugging operation led to all tables being deleted. Users were then assigned new user ids when they logged in, causing new disks to be provisioned for them - and these disks were empty. During reconstruction of the db, cookie_id was missing for several users, causing them all to log in to one particular user's notebook. Users with a '.' in their name were also set up slightly incorrectly - their pods have '-' in them, but the user name should have a '.'.

Action items

Upstream bug reports for JupyterHub

1. JupyterHub only uses a certain length of the cookie secret, and discards the rest. This causes confusion when trying to change it to log people out. Issue
2. The cookie_id column in the users table should have UNIQUE and NOT NULL constraints. Issue

Upstream bug reports for KubeSpawner

1. Support using username hashes in PVC and pod names rather than user ids, so that pod and PVC names remain constant even when the DB is deleted. Issue

Upstream bug reports for OAuthenticator

1. Support setting the id of the user in the user table to be the same as the 'id' provided by the Google authenticator, thus providing a stable userid regardless of when the user first logged in. Issue
DataHub deployment changes

1. Switch to using Google Cloud SQL, which provides a hosted and managed MySQL database.
2. Perform regular and tested backups of the database.
3. Start writing an operational FAQ for things to do and not do.
4. Set up better monitoring and paging systems.
5. Document escalation procedures explicitly.

2017-02-24 - Custom Autoscaler gone haywire

Summary

On the evening of February 24, 2017, a premature version of the Autoscaler script for the Datahub deployment was mistakenly run against the prod cluster, resulting in a large number of nodes (roughly 30-40) being set as unschedulable for about 20 minutes. Though no information was lost and no service was critically disturbed, it was necessary to manually re-enable scheduling on these nodes.

Timeline

As of this commit in the Autoscaler branch history, there exists a scale.py file that would, based on the utilization of the cluster, mark a certain number of nodes unschedulable before attempting to shut down nodes with no pods in them. Unfortunately, this script was executed prematurely and without configuration, so it ran against whatever context was currently specified in .kube/config, which ended up being the production cluster rather than the dev cluster.

2017-02-24 11:14 PM

The script is mistakenly executed. A bug in the calculation of the cluster's utilization leads to about 40 nodes being marked as unschedulable. The mistake is noted immediately.

2017-02-24 11:26 PM

The unschedulability of these nodes is reverted. All nodes in the cluster are first set back to schedulable, to ensure that no students, current or future, would be disturbed. Immediately after, the 10 most idle nodes on the cluster are manually set to be unschedulable using kubectl cordon <node>, to facilitate manually descaling them later (to deal with https://github.com/data-8/infrastructure/issues/6).

Conclusion

A cluster autoscaler script was accidentally run against the production cluster instead of the dev cluster, reducing capacity for new user logins for about 12 minutes. There was still enough capacity, so we had no adverse effects.
Action Items

Datahub Deployment Changes

1. The Autoscaler should not run unless the context is explicitly set via environment variables or command line arguments. This is noted in the comments of the pull request for the Autoscaler.
2. The idea of the 'current context' should be abolished in all the tools we build / read.

Future organizational change

1. Use a separate billing account for production vs development clusters. This makes it harder to accidentally run things on the wrong cluster.

2017-02-24 - Proxy eviction strands users

Summary

On the evening of Feb 23, several students started experiencing 500 errors when trying to access datahub. The proxy had died because of a known issue, and it took a while for the hub to re-add all the user routes to the proxy. Some students needed their servers to be manually restarted, due to a JupyterHub spawner bug that shows up at scale. Everything was fixed in about 40 minutes.

Timeline

All times in PST

21:10:57

The proxy pod is evicted, due to a known issue that is currently being worked on. Users start running into connection failures now.

21:11:04

A new proxy pod is started by Kubernetes, and starts accepting connections. However, the JupyterHub model currently has the proxy starting with no state about user routes, so users' requests aren't being routed to their notebook pods. This manifests as errors for users.

The hub process is supposed to poll the proxy every 300s, and repopulate the route table when it notices it is empty. The hub does this at some point in the next 300s (we do not know exactly when), and starts repopulating the route table. As routes get added for current users, their notebooks start working again.
21:11:52

The repopulate process starts running into issues: it makes so many HTTP requests (to the Kubernetes and proxy APIs) that it starts hitting client-side limits in the tornado HTTP client (which is what we use to make these requests). This causes requests to time out on the request queue. We were running into https://github.com/tornadoweb/tornado/issues/1400. Not all requests fail; for those that succeed, the students are able to access their notebooks.

The repopulate process takes a while, and errors out for a lot of students, who are left with notebooks in an inconsistent state - JupyterHub thinks their notebook is running but it isn't, or vice versa. Lots of 500s for users.

21:14

Reports of errors start reaching the Slack channel + Piazza. The repopulate process keeps being retried, and notebooks for users slowly come back. Some users are 'stuck' in a bad state, however - their notebook isn't running, but JupyterHub thinks it is (or vice versa).

21:34

Most users are fine by now. For those still with problems, a forced delete from the admin interface + a start works, since this forces JupyterHub to really check whether their servers are there or not.

22:03

The last reported user with a 500 error is fixed, and datahub is fully operational again.

Conclusion

This was almost a 'perfect storm' event. Three things colluded to make this outage happen:

1. The inodes issue, which causes containers to fail randomly.
2. The fact that the proxy is a single point of failure with a longish recovery time in the current JupyterHub architecture.
3. KubeSpawner's current design is inefficient at very high user volumes, and its request timeouts & other performance characteristics had not been tuned (because we had not needed to before).

We have both long term (~1-2 months) architectural fixes and short term tuning in place for all three of these issues.

Action items

Upstream JupyterHub

1. Work on abstracting the proxy interface, so the proxy is no longer a single point of failure. Issue
Upstream KubeSpawner

1. Re-architect the spawner to make a much smaller number of HTTP requests. DataHub has become big enough that this is a problem. Issue
2. Tune the HTTP client kubespawner uses. This would be an interim solution until (1) gets fixed. Issue

DataHub configuration

1. Set resource requests explicitly for the hub and proxy, so they have less chance of getting evicted. Issue
2. Reduce the interval at which the hub checks whether the proxy is running. PR
3. Speed up the fix for the inodes issue, which is what triggered this whole incident.

2017-03-06 - Non-matching hub image tags cause downtime

Summary

On the evening of Mar 6, the hub on prod would not come up after an upgrade. The upgrade was to accommodate a new disk for cogneuro that had been tested on dev. After some investigation it was determined that the helm chart's config did not match the hub's image. The hub image was rebuilt, pushed out, and tested on dev, then pushed out to prod. The problem was fixed in about 40 minutes.

A few days later (March 12), a similar near-outage was avoided when -dev broke and deployment was put on hold. More debugging showed that the underlying cause was that git submodules are hard to use. More documentation was provided, and downtime averted!

Timeline

All times in PST

March 6 2017

22:59

dev changes are deployed, but the hub does not start correctly. The describe output for the hub shows repeated instances of:

    Error syncing pod, skipping: failed to "StartContainer" for "hub-container" with CrashLoopBackOff:
    "Back-off 10s restarting failed container=hub-container
    pod=hub-deployment-3498421336-91gp3_datahub-dev(bfe7d8bd-0303-11e7-ade6-42010a80001a)"

The helm chart for -dev is deleted and reinstalled.
23:11

dev changes are deployed successfully and tested. cogneuro's latest data is available.

23:21

Changes are deployed to prod. The hub does not start properly. get pod -o=yaml on the hub pod shows that the hub container has terminated. The hub log shows that it failed due to a bad configuration parameter.

21:31

While the helm chart had been updated from git recently, the latest tag for the hub did not correspond with the one in either prod.yaml or dev.yaml.

21:41

The hub image is rebuilt and pushed out.

21:45

The hub is deployed on -dev.

21:46

The hub is tested on -dev, then deployed on -prod.

21:50

The hub is tested on -prod. Students are reporting that the hub had been down.

March 12

19:57

A new deploy is attempted on -dev, but runs into the same error. Deployments are halted for more debugging this time, and more people are called on.

23:21

More debugging reveals that the commit update looked like this:

    diff --git a/chart b/chart
    index e38aba2..c590340 160000
    --- a/chart
    +++ b/chart
    @@ -1 +1 @@
    -Subproject commit e38aba2c5601de30c01c6f3c5cad61a4bf0a1778
    +Subproject commit c59034032f8870d16daba7599407db7e6eb53e04
    diff --git a/data8/dev.yaml b/data8/dev.yaml
    index 2bda156..ee5987b 100644
    --- a/data8/dev.yaml
    +++ b/data8/dev.yaml
    @@ -13,7 +13,7 @@ publicIP: "104.197.166.226"
     singleuser:
       image:
    -    tag: "e4af695"
    +    tag: "1a6c6d8"
       mounts:
         shared:
           cogneuro88: "cogneuro88-20170307-063643"

The tag should have been the only thing updated. Instead, the chart submodule was also moved to c59034032f8870d16daba7599407db7e6eb53e04, which is from February 25 (almost two weeks old). This is the cause of the hub failing, since it is using a really old chart commit with a new hub image.

23:27

It is determined that incomplete documentation about deployment processes caused git submodule update to not be run after a git pull, and so the chart was being accidentally moved back to older commits. Looking at the commit that caused the outage on March 6 showed the exact same root cause.

Conclusion

Git submodules are hard to use, and break most people's mental model of how git works. Since our deployment requires that the submodule be in sync with the images used, this caused an outage.

Action items

Process

1. Make sure we treat any errors in -dev exactly as we would in prod. Any deployment error in prod should immediately halt future deployments & require a rollback or resolution before proceeding.
2. Write down actual deployment documentation & a checklist.
3. Move away from git submodules to a separate versioned chart repository.
2017-03-20 - Too many volumes per node leave students stuck

Summary

From sometime early on March 20 2017 until about 13:00, some new student servers were stuck in Pending forever, giving those students 500 errors. This was an unintended side effect of reducing the student memory limit to 1G while keeping the size of our nodes constant, causing us to hit a limit on the number of disks that can be attached to each node. This was fixed by spawning more, smaller nodes.

Timeline

March 18, 16:30

RAM per student is reduced from 2G to 1G, as a resource optimization measure. The size of our nodes remains the same (26G RAM), and many are cordoned off and slowly decommissioned over the coming few days. Life seems fine, given the circumstances.

March 20, 12:44

New student servers report a 500 error preventing them from logging on. This is deemed widespread & not an isolated incident.

12:53

A kubectl describe pod on an affected student's pod shows it is stuck in the Pending state, with an error message:

    pod failed to fit in any node
    fit failure on node (XX): MaxVolumeCount

This seems to be a common problem for all the new student servers, which are all stuck in the Pending state. Googling leads to https://github.com/kubernetes/kubernetes/issues/24317 - even though Google Compute Engine can handle more than 16 disks per node (we had checked this before deploying), Kubernetes itself still can not. This wasn't foreseen, and seemed to be the direct cause of the incident.

13:03

A copy of the instance template used by Google Container Engine is made and then modified to spawn smaller nodes (n1-highmem-2 rather than n1-highmem-4). The managed instance group used by Google Container Engine is then modified to use the new template. This was the easiest way to avoid disrupting students for whom things were working, while also allowing new students to log in. The new instance group was then set to expand by 30 new nodes, which will provide capacity for about 12 students each. populate.bash was also run to make sure that student pods start up on time on the new nodes.
13:04

The simple autoscaler is stopped, out of fear that it will be confused by the unusual mixed state of the nodes and do something wonky.

13:11

All the new nodes are online, and populate.bash has completed. Pods start leaving the Pending state. However, since more time has passed than the specified timeout JupyterHub will wait before giving up on a pod (5 minutes), JupyterHub doesn't know those pods exist. This causes the state of the cluster and the state in JupyterHub to go out of sync, causing the dreaded 'redirected too many times' error. Admins need to manually stop and start user pods in the control panel as users report this issue.

14:23

The hub and proxy pods are restarted, since there were plenty of 'redirected too many times' errors. This seems to catch most users' state, although some requests still fail with a 599 timeout (similar to an earlier incident, but much less frequent). A long tail of manual user restarts is performed by admins over the next few days.

Action Items

Upstream: Kubernetes

1. Keep an eye on the status of the bug we ran into.

Upstream: JupyterHub

1. Track down and fix the 'too many redirects' issue at the source. Issue

Cleanup

1. Delete all the older, larger nodes that are no longer in use. (Done!)

Monitoring

1. Have alerting for when any number of pods are stuck in the Pending state for a non-negligible amount of time. There is always something wrong when this happens.
2017-03-23 - Weird upstream ipython bug kills kernels

Summary

A seemingly unrelated change caused user kernels to die on start (making notebook execution impossible) for newly started user servers from about Mar 22 19:30 to Mar 23 09:45. Most users didn't see any errors until the start of class at about 9 AM, since they were on servers that had been started earlier.

Timeline

March 22, around 19:30

A deployment is performed, finally deploying https://github.com/data-8/jupyterhub-k8s/pull/146 to production. It seemed to work fine on -dev, and on prod as well. However, the testing regimen was only to see if a notebook server would show up - not whether a kernel would spawn.

Mar 23, 09:08

Students report that their kernels keep dying. This is confirmed to be a problem for all newly launched notebooks, on both prod and dev.

09:16

The last change to the repo (an update of the single-user image) is reverted, to check if that was causing the problem. This does not improve the situation. Debugging continues, but with no obvious angles of attack.

09:41

After debugging produces no obvious culprits, the state of the entire infrastructure for prod is reverted to a known good state from a few days ago. This was done with:

    ./deploy.py prod data8 25abea764121953538713134e8a08e0291813834

25abea764121953538713134e8a08e0291813834 is the hash of a known good commit from March 19. Our disciplined adherence to immutable & reproducible deployment paid off, and we were able to restore new servers to working order with this! Students are now able to resume working after a server restart. A mass restart is also performed to aid this.

Dev is left in a broken state in an attempt to debug.
09:48

A core Jupyter Notebook dev at BIDS attempts to debug the problem, since it seems to be with the notebook itself and not with JupyterHub.

11:08

The core Jupyter Notebook dev confirms that this makes no sense.

14:55

Attempts to isolate the bug start again, mostly by using git bisect to deploy different versions of our infrastructure to dev until we find what broke.

15:30

https://github.com/data-8/jupyterhub-k8s/pull/146 is identified as the culprit. It continues to not make sense.

17:25

A very involved and laborious revert of the offending part of the patch is done in https://github.com/jupyterhub/kubespawner/pull/37. The core Jupyter Notebook dev continues to confirm this makes no sense. https://github.com/data-8/jupyterhub-k8s/pull/152 is also merged, and deployed shortly after verifying that everything (including starting kernels & executing code) works fine on dev. It is then deployed to prod, and everything is fine.

Conclusion

Insufficient testing procedures caused a new kind of outage (kernels dying) that we had not seen before. However, since our infrastructure was immutable & reproducible, our outage really only lasted about 40 minutes (from the start of lab, when students were starting containers, until the revert). Deeper debugging produced a fix, but attempts to understand why the fix works are ongoing.

Update: We have found and fixed the underlying issue.

Action items

Process

1. Document and formalize the testing process for post-deployment checks.
2. Set a short timeout (maybe ten minutes?) after which investigation temporarily stops and we revert our deployment to a known good state.
Upstream KubeSpawner

1. Continue investigating https://github.com/jupyterhub/kubespawner/issues/31, which was the core issue that prompted the changes that eventually led to the outage.

2017-04-03 - Custom autoscaler does not scale up when it should

Summary

On April 3, 2017, as students were returning from spring break, the cluster wasn't scaled up in time and several students had errors spawning. This was because the simple autoscaler was 'stuck' on a populate call. More capacity was manually added, the pending pods were deleted, and this seemed to fix the outage.

Timeline

Over spring break week

The cluster is scaled down to a much smaller size (7 machines), and the simple scaler is left running.

2017-04-03 11:32

Students report on Piazza that datahub isn't working, and there are lots of pods in the Pending state. Running kubectl --namespace=datahub describe pod showed the pods were unschedulable because there wasn't enough RAM in the cluster. This clearly implied the cluster wasn't big enough. Looking at the simple scaler showed it was 'stuck' at a populate.bash call, and wasn't scaling up fast enough.

11:35

The cluster is manually scaled up to 30 nodes:

    gcloud compute instance-groups managed resize gke-prod-highmem-pool-0df1a536-grp --size=30

At the same time, pods stuck in the Pending state are deleted so they don't become ghost pods:

    kubectl --namespace=datahub get pod | grep -v Running | grep -P 'm$' | awk '{print $1;}' | xargs -L1 kubectl --namespace=datahub delete pod

11:40

The nodes have come up, so a populate.bash call is performed to pre-populate all user container images on the new nodes. Pods stuck in the Pending state are deleted again.
11:46

The populate.bash call is complete, and everything is back online!

Conclusion

Our simple scaler didn't scale up fast enough when a large number of students came back online quickly after a quiet period (spring break). It took a while for this to get noticed, and manual scaling fixed everything.

Action items

Process

1. When coming back from breaks, pre-scale the cluster back up.
2. Consider cancelling spring break.

Monitoring

1. Have monitoring for pods stuck in non-Running states.

2017-05-09 - Oops, we forgot to pay the bill

Summary

On May 9, 2017, the compute resources associated with the data-8 project on GCE were suspended. All hubs, including datahub, stat28, and prob140, were unreachable. This happened because the grant that backed the project's billing account ran out of funds. The project was moved to a different funding source and the resources gradually came back online.

Timeline

2017-05-09 16:51

A report in the Data 8 Spring 2017 Staff slack, #jupyter channel, says that datahub is down. This is confirmed. Attempting to access the provisioner via gcloud compute ssh provisioner-01 fails with:

    ERROR: (gcloud.compute.ssh) Instance [provisioner-01] in zone [us-central1-a] has not
    been allocated an external IP address yet. Try rerunning this command later.
17:01

The Google Cloud console shows that the billing account has run out of the grant that supported the data-8 project. The project is moved to another billing account which has resources left. The billing state is confirmed by gcloud messages:

    Google Compute Engine: Project data-8 cannot accept requests to setMetadata while in an
    inactive billing state. Billing state may take several minutes to update.

17:09

provisioner-01 is manually started. All pods in the datahub namespace are deleted.

17:15

datahub is back online. The stat28 and prob140 hub pods are manually killed. After a few moments those hubs are back online. The autoscaler is started.

17:19

The slack duplicator is started.

2017-05-10 10:48

A report in uc-jupyter #jupyterhub says that try.datahub is down. This is confirmed, and the hub in the tmp namespace is killed. The hub comes online a couple of minutes later.

Conclusion

There was insufficient monitoring of the billing status.

Action items

Process

1. Identify channels for billing alerts.
2. Identify billing threshold functions that predict when funds will run out.
3. Establish off-cloud backups. The plan is to do this via nbgdrive.
4. Start the autoscaler automatically. It is manually started at the moment.