UC Berkeley JupyterHubs Documentation - Division of Data Sciences Technical Staff
CONTENTS

1 Using DataHub
  1.1 Using DataHub

2 Modifying DataHub to fit your needs
  2.1 Contributing to DataHub
This repository contains configuration and documentation for the many JupyterHubs used by various organizations at UC Berkeley.
CHAPTER ONE

USING DATAHUB

1.1 Using DataHub

1.1.1 Services Offered

This page lists the various services we offer as part of DataHub. Not all of these are available on every hub, but we can easily enable them as you wish.

User Interfaces

Our diverse user population has diverse needs, so we offer many different user interfaces for instructors to choose from.

Jupyter Notebook (Classic)

This familiar interface is what many people mean when they say 'Jupyter'. It is the default for most of our introductory classes: document oriented, no-frills, and well known by a lot of people.
RStudio

We want to provide first-class support for teaching with R, which means providing strong support for RStudio. This includes Shiny support.

Try without a berkeley.edu account:
Try with a berkeley.edu account: R DataHub
JupyterLab

JupyterLab is a more modern version of the classic Jupyter notebook from the Jupyter project. It is more customizable and better supports some advanced use cases. Many of our more advanced classes use it, and we may help all classes move to it once a simpler document-oriented mode is available.
Linux Desktop (Experimental)

Sometimes you just need to use something that requires a full desktop environment to run. Instead of trying to get students to install things locally, we offer a full-fledged Linux desktop environment they can access from inside their browser. This is just a different 'UI' on the same infrastructure as the notebook environment, so it uses the same libraries and home directories.

Try without a berkeley.edu account:
Try with a berkeley.edu account: EECS DataHub
Visual Studio Code (Experimental)

Sometimes you just want an IDE, not a notebook environment. We are experimenting with a hosted web version of the popular Visual Studio Code editor, to see if it would be useful for teaching more traditional CS classes.

Try without a berkeley.edu account:
Try with a berkeley.edu account: EECS DataHub

More?

If you have a web based environment, we can almost certainly make it run under a hub. Contact us and we'll see what we can do :)
Services

Sometimes you need something custom to get your class going. Very, very interesting things can happen here, so we're always looking for new services to add.

PostgreSQL

Some of our classes require real databases for teaching. We now experimentally offer a PostgreSQL server for each user on the Data 100 hub. The data does not persist right now, but we can turn persistence on whenever needed.

Programming languages

We support the usual suspects: Python, R & Julia. However, there are no limits on which languages we can actually support, so if you are planning to use a different (open source) programming language, contact us and we'll set you up.

More?

We want to find solutions to your interesting problems, so please bring us your interesting problems.

1.1.2 Accessing private GitHub repos

GitHub is used to store class materials (lab notebooks, lecture notebooks, etc), and nbgitpuller is used to distribute them to students. By default, nbgitpuller only supports public GitHub repositories. However, Berkeley's JupyterHubs are set up to allow pulling from private repositories as well. Public repositories are still preferred, but if you want to distribute a private repository to your students, you can do so.

1. Go to the GitHub app for the hub you are interested in:
   1. R Hub
   2. DataHub
   3. PublicHealth Hub
   4. Open an issue if you want more hubs supported.
2. Click the 'Install' button.
3. Select the organization / user containing the private repository you want to distribute on the JupyterHub. If you are not the owner or an administrator of this organization, you might need extra permissions to perform this action.
4. Select 'Only select repositories', and below that select the private repositories you want to distribute to this JupyterHub.
5. Click the 'Install' button. The JupyterHub you picked now has access to this private repository. You can revoke this at any time by coming back to this page and removing the repo from the list of allowed repos. You can also uninstall the GitHub app entirely.
6. You can now make a link for your repo at nbgitpuller.link, as sketched below. If you have just created your repo, you might have to specify main instead of master as the branch name, since GitHub recently changed the name of the default branch.
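For reference, the links generated at nbgitpuller.link are plain URLs pointing at the hub's git-pull endpoint, with the repository, branch, and target path encoded as query parameters. The organization, repository, and notebook name below are hypothetical examples, not real course material:

    https://datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fmy-org%2Fmy-private-repo&branch=main&urlpath=tree%2Fmy-private-repo%2Flab01.ipynb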
That's it! You're all set. You can distribute these links to your students, and they'll be able to access your materials. You can also use more traditional methods (like the git command line tool, or RStudio's git interface) to access the repo.

Note: Everyone on the selected JupyterHub can clone your private repo if you do this. They won't be able to see that the repo exists, but if they get their hands on your nbgitpuller link they can fetch it too. More fine-grained permissions are coming soon.

1.1.3 JupyterHubs in this repository

DataHub

datahub.berkeley.edu is the 'main' JupyterHub for use on the UC Berkeley campus. It is the largest and most active hub, and has many Python & R packages installed. It runs on Google Cloud Platform in the ucb-datahub-2018 project. You can see all config for it under deployments/datahub.

Classes

• The big data8 class.
• Active connector courses
• Data Science Modules
• Astro 128/256

This hub is also the 'default' when folks want to use a hub for a short period of time for any reason, without super specific requirements.

Prob140 Hub

A hub specifically for prob140. Some of the admin users on DataHub are students in prob140, and admin access there would allow them to see the work of other prob140 students. Hence, this hub is kept separate until JupyterHub gains features for restricting admin use. It runs on Google Cloud Platform in the ucb-datahub-2018 project. You can see all config for it under deployments/prob140.

Data 100

This hub is for Data 100, which has a unique user and grading environment. It runs on Google Cloud Platform in the ucb-datahub-2018 project. You can see all config for it under deployments/data100.

Data 100 also has shared folders between staff (professors and GSIs) and students. Staff, assuming they have been added as admins in config/common.yaml, can see a shared and a shared-readwrite folder. Students can only see the shared folder, which is read-only. Anything that gets put in shared-readwrite is automatically viewable in shared, but as read-only files. The purpose of this is to share large data files once, instead of keeping one copy per student.
Data 102

Data 102 runs on Google Cloud Platform in the ucb-datahub-2018 project. You can see all config for it under deployments/data102.

Data8X Hub

A hub for the data8x course on EdX. This hub is open to use by anyone in the world, using LTI Authentication to provide login capability from inside EdX. It runs on Google Cloud Platform in the data8x-scratch project. You can see all config for it under deployments/data8x.

1.1.4 User Authentication

UC Berkeley runs a Canvas instance, bcourses.berkeley.edu. Almost all our hubs use it for authentication, although not all yet (issue).

Who has access?

Anyone who can log in to bcourses can log in to our JupyterHubs. This includes all Berkeley affiliates. If you have a working berkeley.edu email account, you can most likely log in to bcourses, and hence to our JupyterHubs.

Students have access for 9 months after they graduate. If they have an incomplete, they have 13 months of access instead.

Non-Berkeley affiliates

If someone who doesn't have a berkeley.edu account wants to use the JupyterHubs, they need to get a CalNet Sponsored Guest account. This gives them access to bcourses, and hence to all the JupyterHubs.

Troubleshooting

If you can log in to bcourses but not to any of the JupyterHubs, please contact us. If you can not log in to bcourses, please contact bcourses support.

1.1.5 Storage Retention Policy

Policy Criteria

No non-hidden files in the user's home directory have been modified in the last 12 months.
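A minimal sketch of how this criterion could be checked from the NFS server. The directory layout is an assumption for illustration, not the actual path used in production:

    # Flag home directories with no non-hidden files modified in the last 12 months.
    for home in /export/homedirs/datahub/*; do
        # Look for any non-hidden file modified within the last 365 days; stop at the first hit.
        recent=$(find "$home" -not -path '*/.*' -type f -mtime -365 -print -quit)
        if [ -z "$recent" ]; then
            echo "candidate for archival: $home"
        fi
    done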
Archival

1. Zip the whole home directory.
2. Upload it to the Google Drive of a SPA created for this purpose.
3. Share the ZIP file in Google Drive with the user.

A command-line sketch of these steps appears at the end of this section.

Rationale

Today (6 Feb 2020), we have 18,623 home directories on datahub. Most of these users used datahub in previous semesters, have not logged in for a long time, and will probably never log in again. This costs us a lot of money in disk space, which we would otherwise have to keep expanding forever. By cleaning up after 12 months of non-usage, we will not affect any current users, just folks who haven't logged in for a long time. Archiving the contents makes sure people still have access to their old work, without leaving the burden of maintaining it forever on us.

Why Google Drive?

For UC Berkeley users, Google Drive offers unlimited free space. We can also perform access control easily with Google Drive.

Alternatives

1. Email it to our users. This will most likely be rejected by most mail servers, as the home directory will be too big an attachment.
2. Put it in Google Cloud Nearline storage, build a token based access control mechanism on top, and email this link to the users. We would probably need to clean this up every 18 months or so for cost reasons. This is the viable alternative if we decide not to use Google Drive.
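A hedged sketch of the archival steps for a single user. The paths, the rclone remote name, and the destination folder are all assumptions for illustration; the actual tooling used for the upload is not specified here:

    # Archive one user's home directory and push it to the SPA's Google Drive.
    user=jane-doe                                   # hypothetical username
    home=/export/homedirs/datahub/$user             # assumed home directory layout
    zip -r "/tmp/${user}.zip" "$home"
    # 'spa-gdrive' is a hypothetical rclone remote configured against the SPA's Google Drive.
    rclone copy "/tmp/${user}.zip" "spa-gdrive:home-directory-archives/"

Sharing the uploaded ZIP with the user would then be done through the Drive sharing controls (or the Drive API) of the SPA account.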
CHAPTER TWO

MODIFYING DATAHUB TO FIT YOUR NEEDS

Our infrastructure can serve the diverse needs of our students only if it is built by a diverse array of people.

2.1 Contributing to DataHub

2.1.1 Pre-requisites

Smoothly working with the JupyterHubs maintained in this repository requires a number of pre-requisite skills. The rest of the documentation assumes you have at least a basic level of these skills, and know how to get help related to these technologies when necessary.

Basic

These skills let you interact with the repository in a basic manner. This lets you do most 'self-service' tasks, such as adding admin users or libraries and making changes to resource allocation. It doesn't give you the skills to debug things when they break, however.

1. Basic git & GitHub skills. The Git Book & GitHub Help are good resources for this.
2. Familiarity with YAML syntax.
3. Understanding of how packages are installed in the languages we support.
4. Rights to merge changes into this repository on GitHub.

Full

In addition to the basic skills, you'll need the following skills to 'fully' work with this repository. Primarily, you need these to debug issues when things break, since we strive to never have things break in the same way more than twice.

1. Knowledge of our tech stack:
   1. Kubernetes
   2. Google Cloud
   3. Helm
   4. Docker
   5. repo2docker
   6. Jupyter
   7. Languages we support: Python & R
2. Understanding of our JupyterHub distribution, Zero to JupyterHub.
3. Full access to the various cloud providers we use.

2.1.2 Repository Structure

Hub Configuration

Each hub has a directory under deployments/ where all configuration for that particular hub is stored in a standard format. For example, all the configuration for the primary hub used on campus (datahub) is stored under deployments/datahub/.

User Image (image/)

The contents of the image/ directory determine the environment provided to the user. For example, it controls:

1. Versions of Python / R / Julia available
2. Libraries installed, and which versions of those are installed
3. Specific config for Jupyter Notebook or IPython

repo2docker is used to build the actual user image, so you can use any of the supported config files to customize the image as you wish.

Hub Config (config/ and secrets/)

All our JupyterHubs are based on Zero to JupyterHub (z2jh). z2jh uses configuration files in YAML format to specify exactly how the hub is configured. For example, it controls:

1. RAM available per user
2. Admin user lists
3. User storage information
4. Per-class & per-user RAM overrides (when classes or individuals need more RAM)
5. Authentication secret keys

These files are split between files that are visible to everyone (config/) and files that are visible only to a select few illuminati (secrets/). To get access to the secret files, please consult the illuminati.

Files are further split into the following (a hedged example of common.yaml appears after this list):

1. common.yaml - Configuration common to the staging and production instances of this hub. Most config should be here.
2. staging.yaml - Configuration specific to the staging instance of the hub.
3. prod.yaml - Configuration specific to the production instance of the hub.
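As an illustration, an admin-user entry in a hub's config/common.yaml might look like the following. This is a hedged sketch: the usernames are invented, and the exact key layout depends on the z2jh chart version in use (older chart versions expose this under different keys than hub.config):

    # config/common.yaml (illustrative sketch, not an actual deployment)
    jupyterhub:
      hub:
        config:
          Authenticator:
            admin_users:
              - instructor-1   # hypothetical usernames
              - gsi-2

Values set in staging.yaml or prod.yaml with the same keys would typically override the common values for that instance, since they are layered on top of common.yaml at deploy time.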
hubploy.yaml

We use hubploy to deploy our hubs in a repeatable fashion. hubploy.yaml contains the information required for hubploy to work, such as cluster name, region, provider, etc.

Various secret keys used to authenticate to cloud providers are kept under secrets/ and referred to from hubploy.yaml.

Documentation

Documentation is under the docs/ folder, and is generated with the Sphinx project. It is written in the reStructuredText (rst) format. Documentation is automatically published to https://ucb-jupyterhubs.readthedocs.io/.

2.1.3 User home directory storage

All users on all the hubs get a home directory with persistent storage.

Why NFS?

NFS isn't a particularly cloud-native technology. It isn't highly available nor fault tolerant by default, and is a single point of failure. However, it is currently the best of the alternatives available for user home directories, so we use it.

1. Home directories need to be fully POSIX compliant file systems that work with minimal edge cases, since this is what most instructional code assumes. This rules out object-store backed filesystems such as s3fs.
2. Users don't usually need guaranteed space or IOPS, so providing each of them a persistent cloud disk gets unnecessarily expensive, since we pay for it whether it is used or not. When we did use one persistent disk per user, the storage cost dwarfed everything else by an order of magnitude, for no apparent benefit. Attaching cloud disks to user pods also takes on average about 30s on Google Cloud, and much longer on Azure. NFS mounts pretty quickly, getting this down to a second or less.

We'll probably be on some form of NFS for the foreseeable future.

NFS Server

We currently have two approaches to running NFS servers; a hedged example export for the first approach is shown after this list.

1. Run a hand-maintained NFS server with ZFS SSD disks. This gives us control over performance, size and, most importantly, server options. We use anonuid=1000, so all reads / writes from the cluster are treated as if they have uid 1000, which is the uid all user processes run as. This prevents us from having to muck about with permissions & chowns, particularly since Kubernetes creates new directories on volumes as root with strict permissions (see issue).
2. Use a hosted NFS service like Google Cloud Filestore. We do not have to perform any maintenance if we use this, but we have no control over the host machine either. This necessitates some extra work to deal with the permission issues; see jupyterhub.singleuser.initContainers in the common.yaml of a hub that uses this method.
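For the hand-maintained server, the options described above would translate into an /etc/exports entry along these lines. This is a sketch under assumed paths and network ranges, not the actual server configuration:

    # /etc/exports (illustrative): export the home directory tree to the cluster's network,
    # squashing all client access to uid/gid 1000 to match the uid user pods run as.
    /export/homedirs 10.0.0.0/8(rw,sync,no_subtree_check,all_squash,anonuid=1000,anongid=1000)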
Right now, every hub except data8x uses the first approach, primarily because Google Cloud Filestore was not available when they were first set up. data8x uses the second approach, and if it proves reliable we will switch everything to it next semester.

Home directory paths

Each user on each hub gets their own directory on the server that is treated as their home directory. The staging & prod servers share home directory paths, so users get the same home directories on both.

For most hubs, the user's home directory path relative to the exported NFS directory is <hub-name>/home/<username>. Prefixing the path with the name of the hub allows us to use the same NFS share for many hubs.

NFS Client

We currently have two approaches for mounting the user's home directory into each user's pod.

1. Mount the NFS share once per node to a well known location, and use hostPath volumes with a subPath on the user pod to mount the correct directory into the user pod. This lets us get away with one NFS mount per node, rather than one per pod. See hub/templates/nfs-mounter.yaml to see how we mount this on the nodes. It's a bit of a hack, and if we want to keep using this method it should be turned into a CSI driver instead.
2. Use the Kubernetes NFS volume provider. This doesn't require hacks, but leads to at least 2 NFS mounts per user per node, often leading to hundreds of NFS mounts per node. This might or might not be a problem.

Most hubs use the first method, while data8x is trialing the second. If it goes well, we might switch to the second method for everything.

We also try to mount everything as soft, since we would rather have a write fail than have processes go into uninterruptible sleep mode (D), where they usually can not be killed, when the NFS server runs into issues.

2.1.4 Kubernetes Cluster Configuration

We use Kubernetes to run our JupyterHubs. It has a healthy open source community, managed offerings from multiple vendors & a fast pace of development. Running on top of Kubernetes lets us run easily on many different cloud providers with similar config, so it is also our cloud agnostic abstraction layer. We prefer using a managed Kubernetes service (such as Google Kubernetes Engine). This document lays out our preferred cluster configuration on various cloud providers.

Google Kubernetes Engine

In our experience, Google Kubernetes Engine (GKE) has been the most stable, performant, and reliable managed Kubernetes service. We prefer running on it when possible.

A gcloud container clusters create command can succinctly express the configuration of our Kubernetes cluster. The following commands represent the currently favored configuration.
UC Berkeley JupyterHubs Documentation gcloud container clusters create \ --enable-ip-alias \ --enable-autoscaling \ --max-nodes=20 --min-nodes=1 \ --region=us-central1 --node-locations=us-central1-b \ --image-type=cos \ --disk-size=100 --disk-type=pd-balanced \ --machine-type=n1-highmem-8 \ --cluster-version latest \ --no-enable-autoupgrade \ --enable-network-policy \ --create-subnetwork="" \ --tags=hub-cluster \ gcloud container node-pools create \ --machine-type n1-highmem-8 \ --num-nodes 1 \ --enable-autoscaling \ --min-nodes 1 --max-nodes 20 \ --node-labels hub.jupyter.org/pool-name=-pool \ --node-taints hub.jupyter.org_dedicated=user:NoSchedule \ --region=us-central1 \ --image-type=cos_containerd \ --disk-size=200 --disk-type=pd-balanced \ --no-enable-autoupgrade \ --tags=hub-cluster \ --cluster=fall-2019 \ user---- IP Aliasing --enable-ip-alias creates VPC Native Clusters. This becomes the default soon, and can be removed once it is the default. Autoscaling We use the kubernetes cluster autoscaler to scale our node count up and down based on demand. It waits until the cluster is completely full before triggering creation of a new node - but that’s ok, since new node creation time on GKE is pretty quick. --enable-autoscaling turns the cluster autoscaler on. --min-nodes sets the minimum number of nodes that will be maintained regardless of demand. This should ideally be 2, to give us some headroom for quick starts without requiring scale ups when the cluster is completely empty. --max-nodes sets the maximum number of nodes that the cluster autoscaler will use - this sets the maximum number of concurrent users we can support. This should be set to a reasonably high number, but not too high - to protect against runaway creation of hundreds of VMs that might drain all our credits due to accident or security breach. 2.1. Contributing to DataHub 17
Highly available master

The Kubernetes cluster's master nodes are managed by Google Cloud automatically. By default, the master is deployed in a non-highly-available configuration with only one node. This means that upgrades and master configuration changes cause a few minutes of downtime for the Kubernetes API, causing new user server starts / stops to fail.

We request highly available masters with the --region parameter. This specifies the region where our 3 master nodes will be spread across different zones. It costs us extra, but it is totally worth it.

By default, asking for highly available masters also asks for 3x the node count, spread across multiple zones. We don't want that, since all our user pods have in-memory state & can't be relocated. Specifying --node-locations explicitly lets us control how many and which zones the nodes are located in.

Region / Zone selection

We generally use the us-central1 region and a zone in it for our clusters, simply because that is where we have asked for quota. There are regions closer to us, but latency hasn't really mattered, so we are currently still in us-central1. There are also unsubstantiated rumors that us-central1 is their biggest data center and hence less likely to run out of quota.

Disk Size

--disk-size sets the size of the root disk on all the Kubernetes nodes. This isn't used for any persistent storage such as user home directories. It is only used ephemerally for the operations of the cluster, primarily storing docker images and other temporary storage. We can make this larger if we use a large number of big images, or if we want our image pulls to be faster (since disk performance increases with disk size).

--disk-type=pd-standard gives us standard spinning disks, which are cheaper. We can also request SSDs with --disk-type=pd-ssd; they are much faster, but also much more expensive. We compromise with --disk-type=pd-balanced, which is faster than spinning disks but not always as fast as SSDs.

Node size

--machine-type lets us select how much RAM and CPU each of our nodes has. For non-trivial hubs, we generally pick n1-highmem-8, with 52G of RAM and 8 cores. This is based on the following heuristics; a hedged configuration sketch follows the list.

1. Students are generally memory limited rather than CPU limited. In fact, while we have a hard limit on memory use per user pod, we do not have a CPU limit - it hasn't proven necessary.
2. We try to overprovision clusters by about 2x, so we try to fit about 100G of total RAM use in a node with about 50G of RAM. This is accomplished by setting the memory request to be about half of the memory limit on user pods. This leads to massive cost savings, and works out ok.
3. There is a Kubernetes limit of 100 pods per node.

Based on these heuristics, n1-highmem-8 currently seems to give us the most bang for the buck. We should revisit this for every cluster creation.
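The 2x overprovisioning heuristic maps directly onto the z2jh memory settings: the guarantee (the Kubernetes request) is set to roughly half the limit. A hedged sketch of what that could look like in a hub's config; the numbers are illustrative, not any class's actual allocation:

    jupyterhub:
      singleuser:
        memory:
          limit: 2G        # hard per-user cap
          guarantee: 1G    # scheduler reservation: ~half the limit gives ~2x overprovisioning
        # No cpu limit is set, matching heuristic 1 above.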
Cluster version

GKE automatically upgrades cluster masters, so there is generally no harm in being on the latest version available.

Node autoupgrades

When node autoupgrades are enabled, GKE will automatically try to upgrade our nodes whenever needed (our GKE version falling off the support window, security issues, etc). However, since we run stateful workloads, we disable this for now so we can do the upgrades manually.

Network Policy

Kubernetes Network Policy lets you firewall internal access inside a Kubernetes cluster, whitelisting only the flows you want. The JupyterHub chart we use supports setting up the appropriate NetworkPolicy objects it needs, so we should turn it on for additional security depth. Note that any extra in-cluster services we run must have a NetworkPolicy set up for them to work reliably.

Subnetwork

We put each cluster in its own subnetwork, since there seems to be a limit on how many clusters you can create in the same network with IP aliasing on - you just run out of addresses. This also gives us some isolation: subnetworks are isolated by default and can't reach other resources. You must add firewall rules to provide access, including access to any manually run NFS servers. We add tags for this.

Tags

To help with firewalling, we add network tags to all our cluster nodes. This lets us add firewall rules to control traffic between subnetworks.

Cluster name

We try to use a descriptive name as much as possible.

2.1.5 Cloud Credentials

Google Cloud Service Accounts

Service accounts are identified by a service key, and help us grant specific access to an automated process. Our CI process needs two service accounts to operate; a command line sketch for creating one follows the list.

1. A gcr-readwrite key. This is used to build and push the user images. Based on the docs, this is assigned the role roles/storage.admin.
2. A gke key. This is used to interact with the Google Kubernetes cluster. The roles roles/container.clusterViewer and roles/container.developer are granted to it.
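If you prefer the command line over the web console, creating one of these service accounts and granting it a role could look roughly like this. The account name, key path, and display name are illustrative placeholders; the roles to grant are the ones listed above:

    # Create a service account for CI image pushes (names are illustrative).
    gcloud iam service-accounts create gcr-readwrite \
      --project=ucb-datahub-2018 \
      --display-name="CI: push user images"

    # Grant it the role used for pushing images.
    gcloud projects add-iam-policy-binding ucb-datahub-2018 \
      --member="serviceAccount:gcr-readwrite@ucb-datahub-2018.iam.gserviceaccount.com" \
      --role="roles/storage.admin"

    # Create a JSON key to place under the deployment's secrets/ directory.
    gcloud iam service-accounts keys create secrets/gcr-readwrite.json \
      --iam-account=gcr-readwrite@ucb-datahub-2018.iam.gserviceaccount.com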
These keys are currently copied into the secrets/ dir of every deployment, and explicitly referenced from hubploy.yaml in each deployment. They should be rotated every few months.

You can create service accounts through the web console or the command line. Remember not to leave copies of the private key lying around elsewhere on your local computer!

2.1.6 Incident reports

Blameless incident reports are very important for the long term sustainability of resilient infrastructure. We publish them here for transparency, and so we may learn from them for future incidents.

2017-02-09 - JupyterHub db manual overwrite

Summary

Datahub was reported down at 1 AM. Users attempting to log in to datahub were greeted with a proxy error. The hub pod was up, but its log was full of sqlite errors. After the hub pod was deleted and a new one came up, students logging in to datahub found their notebooks were missing and their home directories were empty. Once this was fixed, some students were still being logged in as one particular other user. Finally, students with a '.' in their username were still having issues after everyone else was fine. This was all fixed and an all-clear signalled at about 2017-02-09 11:35 AM.

Timeline

2017-02-09 00:25 - 00:29 AM

While attempting to debug some earlier 400 errors, we tried to set base_url and ip to something incorrect to see if it would cause a problem (values shown as <placeholders> were specific to the affected account):

    kubectl exec hub-deployment-something --namespace=datahub -it bash
    apt-get install sqlite3
    sqlite3

    ATTACH 'jupyterhub.sqlite' AS my_db;
    SELECT name FROM my_db.sqlite_master WHERE type='table';
    SELECT * FROM servers;
    SELECT * FROM servers WHERE base_url LIKE '%<username>%';
    UPDATE servers SET ip='<incorrect-ip>' WHERE base_url LIKE '%<username>%';
    UPDATE servers SET base_url='/<incorrect-base-url>' WHERE base_url LIKE '%<username>%';

Ctrl+D (exit back into the bash shell). We checked datahub.berkeley.edu, and nothing had happened to the account. We saw that the sql db was not updated, and attempted to run .save:

    sqlite3
    .save jupyterhub.sqlite

This replaced the db with an empty one, since ATTACH had not been run beforehand.
00:25:59 AM

The following exception shows up in the hub logs:

    sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such table: proxies
    [SQL: 'SELECT proxies.id AS proxies_id, proxies._public_server_id AS proxies__public_server_id,
    proxies._api_server_id AS proxies__api_server_id \nFROM proxies \nWHERE proxies.id = ?']
    [parameters: (1,)]

This continues for the hub table as well, since those two seem to be the most frequently used.

1:12 AM

Sam's roommate notices that he can log in to datahub but all his notebooks are gone. We notice that there are only ~50 users on the JupyterHub admin panel when there used to be ~1000, so we believe the JupyterHub sqlite user database got wiped/corrupted, a new account was then created for the roommate when he logged in, and a new persistent disk was provisioned since the hub had lost track of his old one. This is confirmed soon after (username elided):

    $ kubectl --namespace=datahub get pvc | grep claim-<username>
    claim-<username>-257   Bound   pvc-3b405e13-ddb4-11e6-98ef-42010af000c3   10Gi   RWO   21d
    claim-<username>-51    Bound   pvc-643dd900-eea7-11e6-a291-42010af000c3   10Gi   RWO   5m

1:28 AM

We shut down the hub pod by scaling the replicas to 0. We then begin recreating the JupyterHub sqlite database by taking the Kubernetes PVCs and matching them back with the user ids. We could do this because the name of each PVC contains a sanitized form of the username and the userid. The notebook that was used to recreate the db from PVCs is 2017-02-09-datahub-db-outage-pvc-recreate-script.ipynb.

2:34 AM

We recreate the sqlite3 database. Initially each user's cookie_id was set to a dummy cookie value.

2:42 AM

User cookie_id values are changed to null rather than the dummy value. The sqlite file is then attached back to datahub. The number of users shown on the admin page is back to ~1000. The hub is up, and a spot check of starting other users' servers seems to work. Some users get redirected to one particular user, but deleting and recreating the affected user seems to fix this.
10:11 AM

Attempt to log everyone out by changing the cookie secret in the hub pod at /srv/jupyterhub/jupyterhub_cookie_secret. Just one character near the end was changed, and the pod restarted. No effect. One character at the beginning of the secret was changed next, and the pod restarted again; this caused an actual change, and logged all users out. People are still being redirected to one particular user's account when they log in. More looking around required.

10:17 AM

John DeNero advises students to use ds8.berkeley.edu for now. ds8.berkeley.edu promptly starts crashing because it does not have resources for a data8 level class.

10:29 AM

All user pods are deleted, which finally properly logs everyone out. However, people logging in are still all getting the same user's pods.

10:36 AM

Notice that the cookie_id column in the user database table is empty for many users, and that the user everyone is being logged in as has an empty cookie_id too and is the 'first' in the table when sorted ascending by id. Looking at the JupyterHub code, cookie_id is always supposed to be set to a uuid, and never supposed to be empty. Setting cookie_id for users fixes their issues, and seems to spawn them into their own notebooks.

10:45 AM

A script is run that populates cookie_id for all users, and the hub is restarted to make sure there's no stale cache in RAM. All user pods are deleted again. Most users are back online now! More users start testing and confirming things are working for them.

10:53 AM

A user with a '.' in their name reports that they're getting an empty home directory. More investigation shows two user records: one with a '.' in the name that is newer, and one with a '-' in the name instead of a '.' that is older. The hypothesis is that the older one is the 'original', but such users are all attaching to the new, empty one. Looking at PVCs confirms this: there are two PVCs for each user with a '.' in their name who has tried to log in, and they differ only by ids.

There is some confusion about users ending up on prob140, because the data8.org homework link was temporarily changed to point there.
11:05 AM

Directly modifying the user table to rename the user with the '-' in the name to have a '.' seems to work for people.

11:15 AM

A script is run that modifies the database user table for all users with a '-' in their name, replacing the '-' with a '.'. The new users created with the '.' in their name are dropped before this.

11:17 AM

All clear given for datahub.berkeley.edu.

11:19 AM

Locally verified that running .save in sqlite3 will overwrite the db file without any confirmation, and is the most likely cause of the issue.

Conclusion

Accidental overwriting of the sqlite file during a routine debugging operation led to all tables being deleted. Users were then assigned new user ids when they logged in, causing new disks to be provisioned for them - and these disks were empty. During reconstruction of the db, cookie_id was missing for several users, causing them all to log in to one particular user's notebook. Users with a '.' in their name were also set up slightly incorrectly - their pods have '-' in them, but the user name should have a '.'.

Action items

Upstream bug reports for JupyterHub

1. JupyterHub only uses a certain length of the cookie secret, and discards the rest. This causes confusion when trying to change it to log people out. Issue
2. The cookie_id column in the users table should have UNIQUE and NOT NULL constraints. Issue

Upstream bug reports for KubeSpawner

1. Support using username hashes in PVC and pod names rather than user ids, so that pod and PVC names remain constant even when the DB is deleted. Issue

Upstream bug reports for OAuthenticator

1. Support setting the id of the user in the user table to be the same as the 'id' provided by the Google authenticator, thus providing a stable userid regardless of when the user first logged in. Issue
DataHub deployment changes

1. Switch to using Google Cloud SQL, which provides a hosted and managed MySQL database.
2. Perform regular and tested backups of the database.
3. Start writing an operational FAQ for things to do and not do.
4. Set up better monitoring and paging systems.
5. Document escalation procedures explicitly.

2017-02-24 - Custom Autoscaler gone haywire

Summary

On the evening of February 24, 2017, a premature version of the Autoscaler script for the Datahub deployment was mistakenly run against the prod cluster, resulting in a large number of nodes (roughly 30-40) being set as unschedulable for about 20 minutes. Though no information was lost and no service was critically disturbed, it was necessary to manually re-enable scheduling on these nodes.

Timeline

As of this commit in the Autoscaler branch history, there exists a scale.py file that would, based on the utilization of the cluster, mark a certain number of nodes unschedulable before attempting to shut down nodes with no pods in them. Unfortunately, this script was executed prematurely and without configuration, so it ran against whatever context was currently specified in .kube/config, which ended up being the production cluster rather than the dev cluster.

2017-02-24 11:14 PM

The script is mistakenly executed. A bug in the calculation of the cluster's utilization leads to about 40 nodes being marked as unschedulable. The mistake is noted immediately.

2017-02-24 11:26 PM

The unschedulability of these nodes is reverted. All nodes in the cluster are first set back to schedulable, to ensure that no students, current or future, would be disturbed. Immediately after, the 10 most idle nodes on the cluster are manually set to be unschedulable using kubectl cordon <node>, to facilitate manually descaling them later (to deal with https://github.com/data-8/infrastructure/issues/6).

Conclusion

A cluster autoscaler script was accidentally run against the production cluster instead of the dev cluster, reducing capacity for new user logins for about 12 minutes. There was still enough capacity, so we had no adverse effects.
Action Items

Datahub Deployment Changes

1. The Autoscaler should not run unless the context is explicitly set via environment variables or command line arguments. This is noted in the comments of the pull request for the Autoscaler.
2. The idea of the 'current context' should be abolished in all the tools we build / read.

Future organizational change

1. Use a separate billing account for production vs development clusters. This makes it harder to accidentally run things on the wrong cluster.

2017-02-24 - Proxy eviction strands users

Summary

On the evening of Feb 23, several students started experiencing 500 errors when trying to access datahub. The proxy had died because of a known issue, and it took a while for the hub to re-add all the user routes to the proxy. Some students needed their servers to be manually restarted, due to a JupyterHub spawner bug that shows up at scale. Everything was fixed in about 40 minutes.

Timeline

All times in PST

21:10:57

The proxy pod is evicted, due to a known issue that is currently being worked on. Users start running into connection failures now.

21:11:04

A new proxy pod is started by Kubernetes, and starts accepting connections. However, the JupyterHub model currently has the proxy starting with no state about user routes, so users' requests aren't being routed to their notebook pods. This manifests as errors for users.

The hub process is supposed to poll the proxy every 300s, and repopulate the route table when it notices it is empty. The hub does this at some point in the next 300s (we do not know exactly when), and starts repopulating the route table. As routes get added for current users, their notebooks start working again.
21:11:52

The repopulate process starts running into issues: it makes so many HTTP requests (to the Kubernetes and proxy APIs) that it starts hitting client-side limits in the tornado HTTP client (which is what we use to make these requests). This causes requests to time out on the request queue. We were running into https://github.com/tornadoweb/tornado/issues/1400. Not all requests fail; for those that succeed, the students are able to access their notebooks.

The repopulate process takes a while, and errors out for a lot of students, who are left with notebooks in an inconsistent state - JupyterHub thinks their notebook is running but it isn't, or vice versa. Lots of 500s for users.

21:14

Reports of errors start reaching the Slack channel + Piazza. The repopulate process keeps being retried, and notebooks for users slowly come back. Some users are 'stuck' in a bad state, however - their notebook isn't running, but JupyterHub thinks it is (or vice versa).

21:34

Most users are fine by now. For those still with problems, a forced delete from the admin interface + a start works, since this forces JupyterHub to really check whether their servers are there or not.

22:03

The last reported user with a 500 error is fixed, and datahub is fully operational again.

Conclusion

This was almost a 'perfect storm' event. Three things colluded to make this outage happen:

1. The inodes issue, which causes containers to fail randomly.
2. The fact that the proxy is a single point of failure with a longish recovery time in the current JupyterHub architecture.
3. KubeSpawner's current design is inefficient at very high user volumes, and its request timeouts & other performance characteristics had not been tuned (because we had not needed to before).

We have both long term (~1-2 months) architectural fixes and short term tuning in place for all three of these issues.

Action items

Upstream JupyterHub

1. Work on abstracting the proxy interface, so the proxy is no longer a single point of failure. Issue
Upstream KubeSpawner

1. Re-architect the spawner to make a much smaller number of HTTP requests. DataHub has become big enough that this is a problem. Issue
2. Tune the HTTP client kubespawner uses. This would be an interim solution until (1) gets fixed. Issue

DataHub configuration

1. Set resource requests explicitly for the hub and proxy, so they have less chance of getting evicted. Issue
2. Reduce the interval at which the hub checks whether the proxy is running. PR
3. Speed up the fix for the inodes issue, which is what triggered this whole incident.

2017-03-06 - Non-matching hub image tags cause downtime

Summary

On the evening of Mar 6, the hub on prod would not come up after an upgrade. The upgrade was to accommodate a new disk for cogneuro that had been tested on dev. After some investigation it was determined that the helm chart's config did not match the hub's image. The hub image was rebuilt, pushed out, and tested on dev, then pushed out to prod. The problem was fixed in about 40 minutes.

A few days later (March 12), a similar near-outage was avoided when -dev broke and deployment was put on hold. More debugging showed that the underlying cause was that git submodules are hard to use. More documentation was provided, and downtime averted!

Timeline

All times in PST

March 6 2017

22:59

dev changes are deployed, but the hub does not start correctly. The describe output for the hub shows repeated instances of:

    Error syncing pod, skipping: failed to "StartContainer" for "hub-container" with CrashLoopBackOff:
    "Back-off 10s restarting failed container=hub-container
    pod=hub-deployment-3498421336-91gp3_datahub-dev(bfe7d8bd-0303-11e7-ade6-42010a80001a)"

The helm chart for -dev is deleted and reinstalled.
23:11

dev changes are deployed successfully and tested. cogneuro's latest data is available.

23:21

Changes are deployed to prod. The hub does not start properly. get pod -o=yaml on the hub pod shows that the hub container has terminated. The hub log shows that it failed due to a bad configuration parameter.

21:31

While the helm chart had been updated from git recently, the latest tag for the hub did not correspond with the one in either prod.yaml or dev.yaml.

21:41

The hub image is rebuilt and pushed out.

21:45

The hub is deployed on -dev.

21:46

The hub is tested on -dev, then deployed on -prod.

21:50

The hub is tested on -prod. Students are reporting that the hub had been down.

March 12

19:57

A new deploy is attempted on -dev, but runs into the same error. Deployments are halted for more debugging this time, and more people are called on.

23:21

More debugging reveals that the commit update looked like this:

    diff --git a/chart b/chart
    index e38aba2..c590340 160000
    --- a/chart
    +++ b/chart
    @@ -1 +1 @@
    -Subproject commit e38aba2c5601de30c01c6f3c5cad61a4bf0a1778
    +Subproject commit c59034032f8870d16daba7599407db7e6eb53e04
    diff --git a/data8/dev.yaml b/data8/dev.yaml
    index 2bda156..ee5987b 100644
    --- a/data8/dev.yaml
    +++ b/data8/dev.yaml
    @@ -13,7 +13,7 @@ publicIP: "104.197.166.226"
     singleuser:
       image:
    -    tag: "e4af695"
    +    tag: "1a6c6d8"
       mounts:
         shared:
           cogneuro88: "cogneuro88-20170307-063643"

The tag should have been the only thing updated. Instead, the chart submodule was also moved to c59034032f8870d16daba7599407db7e6eb53e04, which is from February 25 (almost two weeks old). This is the cause of the hub failing, since it is using a really old chart commit with a new hub image.

23:27

It is determined that incomplete documentation about deployment processes caused git submodule update to not be run after a git pull, and so the chart was being accidentally moved back to older commits. Looking at the commit that caused the outage on March 6 showed the exact same root cause.

Conclusion

Git submodules are hard to use, and break most people's mental model of how git works. Since our deployment requires that the submodule be in sync with the images used, this caused an outage.

Action items

Process

1. Make sure we treat any errors in -dev exactly as we would in prod. Any deployment error in prod should immediately halt future deployments & require a rollback or resolution before proceeding.
2. Write down actual deployment documentation & a checklist.
3. Move away from git submodules to a separate versioned chart repository.
2017-03-20 - Too many volumes per node leave students stuck

Summary

From sometime early on March 20 2017 until about 13:00, some new student servers were stuck in Pending forever, giving those students 500 errors. This was an unintended side effect of reducing the student memory limit to 1G while keeping the size of our nodes constant, causing us to hit a limit on the number of disks that can be attached to each node. This was fixed by spawning more, smaller nodes.

Timeline

March 18, 16:30

RAM per student is reduced from 2G to 1G, as a resource optimization measure. The size of our nodes remains the same (26G RAM), and many are cordoned off and slowly decommissioned over the coming few days. Life seems fine, given the circumstances.

March 20, 12:44

New student servers report a 500 error preventing them from logging on. This is deemed widespread & not an isolated incident.

12:53

A kubectl describe pod on an affected student's pod shows it is stuck in the Pending state, with an error message:

    pod failed to fit in any node
    fit failure on node (XX): MaxVolumeCount

This seems to be a common problem for all the new student servers, which are all stuck in the Pending state. Googling leads to https://github.com/kubernetes/kubernetes/issues/24317 - even though Google Compute Engine can handle more than 16 disks per node (we had checked this before deploying), Kubernetes itself still can not. This wasn't foreseen, and seemed to be the direct cause of the incident.

13:03

A copy of the instance template used by Google Container Engine is made and then modified to spawn smaller nodes (n1-highmem-2 rather than n1-highmem-4). The managed instance group used by Google Container Engine is then modified to use the new template. This was the easiest way to avoid disrupting students for whom things were working, while also allowing new students to log in. The new instance group was then set to expand by 30 new nodes, which will provide capacity for about 12 students each. populate.bash was also run to make sure that student pods start up on time on the new nodes.
13:04

The simple autoscaler is stopped, out of fear that it will be confused by the unusual mixed state of the nodes and do something wonky.

13:11

All the new nodes are online, and populate.bash has completed. Pods start leaving the Pending state. However, since more time has passed than the specified timeout JupyterHub will wait before giving up on a pod (5 minutes), JupyterHub doesn't know those pods exist. This causes the state of the cluster and the state in JupyterHub to go out of sync, causing the dreaded 'redirected too many times' error. Admins need to manually stop and start user pods in the control panel as users report this issue.

14:23

The hub and proxy pods are restarted, since there were plenty of 'redirected too many times' errors. This seems to catch most users' state, although some requests still fail with a 599 timeout (similar to an earlier incident, but much less frequent). A long tail of manual user restarts is performed by admins over the next few days.

Action Items

Upstream: Kubernetes

1. Keep an eye on the status of the bug we ran into.

Upstream: JupyterHub

1. Track down and fix the 'too many redirects' issue at the source. Issue

Cleanup

1. Delete all the older, larger nodes that are no longer in use. (Done!)

Monitoring

1. Have alerting for when any number of pods are stuck in the Pending state for a non-negligible amount of time. There is always something wrong when this happens.
2017-03-23 - Weird upstream ipython bug kills kernels

Summary

A seemingly unrelated change caused user kernels to die on start (making notebook execution impossible) for newly started user servers from about Mar 22 19:30 to Mar 23 09:45. Most users didn't see any errors until the start of class at about 9 AM, since they were on servers that had been started earlier.

Timeline

March 22, around 19:30

A deployment is performed, finally deploying https://github.com/data-8/jupyterhub-k8s/pull/146 to production. It seemed to work fine on -dev, and on prod as well. However, the testing regimen was only to see if a notebook server would show up - not whether a kernel would spawn.

Mar 23, 09:08

Students report that their kernels keep dying. This is confirmed to be a problem for all newly launched notebooks, on both prod and dev.

09:16

The last change to the repo (an update of the single-user image) is reverted, to check if that was causing the problem. This does not improve the situation. Debugging continues, but with no obvious angles of attack.

09:41

After debugging produces no obvious culprits, the state of the entire infrastructure for prod is reverted to a known good state from a few days ago. This was done with:

    ./deploy.py prod data8 25abea764121953538713134e8a08e0291813834

25abea764121953538713134e8a08e0291813834 is the hash of a known good commit from March 19. Our disciplined adherence to immutable & reproducible deployment paid off, and we were able to restore new servers to working order with this! Students are now able to resume working after a server restart. A mass restart is also performed to aid this.

Dev is left in a broken state in an attempt to debug.
09:48

A core Jupyter Notebook dev at BIDS attempts to debug the problem, since it seems to be with the notebook itself and not with JupyterHub.

11:08

The core Jupyter Notebook dev confirms that this makes no sense.

14:55

Attempts to isolate the bug start again, mostly by using git bisect to deploy different versions of our infrastructure to dev until we find what broke.

15:30

https://github.com/data-8/jupyterhub-k8s/pull/146 is identified as the culprit. It continues to not make sense.

17:25

A very involved and laborious revert of the offending part of the patch is done in https://github.com/jupyterhub/kubespawner/pull/37. The core Jupyter Notebook dev continues to confirm this makes no sense. https://github.com/data-8/jupyterhub-k8s/pull/152 is also merged, and deployed shortly after verifying that everything (including starting kernels & executing code) works fine on dev. It is then deployed to prod, and everything is fine.

Conclusion

Insufficient testing procedures caused a new kind of outage (kernels dying) that we had not seen before. However, since our infrastructure was immutable & reproducible, our outage really only lasted about 40 minutes (from the start of lab, when students were starting containers, until the revert). Deeper debugging produced a fix, but attempts to understand why the fix works are ongoing.

Update: We have found and fixed the underlying issue.

Action items

Process

1. Document and formalize the testing process for post-deployment checks.
2. Set a short timeout (maybe ten minutes?) after which investigation temporarily stops and we revert our deployment to a known good state.
Upstream KubeSpawner

1. Continue investigating https://github.com/jupyterhub/kubespawner/issues/31, which was the core issue that prompted the changes that eventually led to the outage.

2017-04-03 - Custom autoscaler does not scale up when it should

Summary

On April 3, 2017, as students were returning from spring break, the cluster wasn't scaled up in time and several students had errors spawning. This was because the simple autoscaler was 'stuck' on a populate call. More capacity was manually added, the pending pods were deleted, and this seemed to fix the outage.

Timeline

Over spring break week

The cluster is scaled down to a much smaller size (7 machines), and the simple scaler is left running.

2017-04-03 11:32

Students report on Piazza that datahub isn't working, and there are lots of pods in the Pending state. Running kubectl --namespace=datahub describe pod showed the pods were unschedulable because there wasn't enough RAM in the cluster. This clearly implied the cluster wasn't big enough. Looking at the simple scaler showed it was 'stuck' at a populate.bash call, and wasn't scaling up fast enough.

11:35

The cluster is manually scaled up to 30 nodes:

    gcloud compute instance-groups managed resize gke-prod-highmem-pool-0df1a536-grp --size=30

At the same time, pods stuck in the Pending state are deleted so they don't become ghost pods:

    kubectl --namespace=datahub get pod | grep -v Running | grep -P 'm$' | awk '{print $1;}' | xargs -L1 kubectl --namespace=datahub delete pod

11:40

The nodes have come up, so a populate.bash call is performed to pre-populate all user container images on the new nodes. Pods stuck in the Pending state are deleted again.
11:46

The populate.bash call is complete, and everything is back online!

Conclusion

Our simple scaler didn't scale up fast enough when a large number of students came back online quickly after a quiet period (spring break). It took a while for this to get noticed, and manual scaling fixed everything.

Action items

Process

1. When coming back from breaks, pre-scale the cluster back up.
2. Consider cancelling spring break.

Monitoring

1. Have monitoring for pods stuck in non-Running states.

2017-05-09 - Oops, we forgot to pay the bill

Summary

On May 9, 2017, the compute resources associated with the data-8 project on GCE were suspended. All hubs, including datahub, stat28, and prob140, were unreachable. This happened because the grant that backed the project's billing account ran out of funds. The project was moved to a different funding source and the resources gradually came back online.

Timeline

2017-05-09 16:51

A report in the Data 8 Spring 2017 Staff slack, #jupyter channel, says that datahub is down. This is confirmed. Attempting to access the provisioner via gcloud compute ssh provisioner-01 fails with:

    ERROR: (gcloud.compute.ssh) Instance [provisioner-01] in zone [us-central1-a] has not
    been allocated an external IP address yet. Try rerunning this command later.
17:01

The Google Cloud console shows that the billing account has run out of the grant that supported the data-8 project. The project is moved to another billing account which has resources left. The billing state is confirmed by gcloud messages:

    Google Compute Engine: Project data-8 cannot accept requests to setMetadata while in an
    inactive billing state. Billing state may take several minutes to update.

17:09

provisioner-01 is manually started. All pods in the datahub namespace are deleted.

17:15

datahub is back online. The stat28 and prob140 hub pods are manually killed. After a few moments those hubs are back online. The autoscaler is started.

17:19

The slack duplicator is started.

2017-05-10 10:48

A report in uc-jupyter #jupyterhub says that try.datahub is down. This is confirmed, and the hub in the tmp namespace is killed. The hub comes online a couple of minutes later.

Conclusion

There was insufficient monitoring of the billing status.

Action items

Process

1. Identify channels for billing alerts.
2. Identify billing threshold functions that predict when funds will run out.
3. Establish off-cloud backups. The plan is to do this via nbgdrive.
4. Start the autoscaler automatically. It is manually started at the moment.