JupyterCon 2020

These are my notes from some of the talks at JupyterCon 2020. They are not sorted in any particular order, neither by talk nor by when I wrote them, but I hope you will learn something from them as I did.

Elyra

Elyra is a big JupyterLab extension developed at IBM. Have you ever created a super long notebook that did just about everything? If so, then maybe it would be clearer and more readable if you divided it into parts. You can then create a pipeline file using Elyra and connect, for example, notebooks for collecting data, cleaning, analysis and predictive modeling.
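
If you want to try it, the extension installs like a regular Python package; at the time of the conference the documented steps were roughly these (rebuilding JupyterLab was still needed back then):

pip install elyra
jupyter lab build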

What immediately came to my mind was the Kaggle competition from last year, "Predicting Molecular Properties". There was basically just a single target, but the models worked best when the data was divided into eight groups by certain criteria. So I had A LOT of notebooks: one for cleaning, multiple notebooks for feature engineering, and then eight pipelines for hyperparameter optimization, training, predicting and ensembling. I did not develop everything in order; late in the competition I found new features I wanted to use and then had to run every single notebook again. It would have been just so much easier if I could have used this extension, defined a pipeline and run everything at once.

Another thing that Elyra enables you to do is to diff a notebook against its previous version right in JupyterLab. However, I would still recommend using Jupytext instead and viewing diffs in .py files.

Jupytext

Versioning Jupyter notebooks used to be hard, but it is a solved problem now thanks to Jupytext! Earlier this year my friend Juraj Palka was trying to find the ideal editor for Python. I told him about JupyterLab, but he didn't like it much because of the difficulty of versioning notebooks and instead went for Visual Studio Code, where he used notebooks anyway, but there was a single button to convert .ipynb to .py. I have actually been using it ever since, even though I have VSCode installed almost exclusively for this single purpose.

I had that it-is-amazing-that-something-like-this-exists! moment when Marc Wouts introduced Jupytext as a JupyterLab extension which automatically creates a .py file for your .ipynb file and keeps them synchronized. You then continue to work in JupyterLab, and when you decide to version your work, you simply commit the .py file.
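
A minimal sketch of how the pairing can be set up from the command line (the notebook name is just a placeholder; in JupyterLab you can do the same from the Jupytext commands):

pip install jupytext
jupytext --set-formats ipynb,py:percent analysis.ipynb  # pair the notebook with a .py file
jupytext --sync analysis.ipynb  # update whichever of the paired files is older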

BigQuery

Google showed how to use their magic commands in Jupyter. Basically, you just install the package:

pip install google-cloud-bigquery

Afterwards, load the extension in Jupyter:

%load_ext google.cloud.bigquery

Then you can call the magic command (optionally with the name of a dataframe where the resulting data should be saved) and paste the query right into the Jupyter notebook. Note that it is a cell magic, so it takes two percent signs:

%%bigquery df
SELECT * FROM ...

If you want to find out more about it, just search for "Jupyter IPython BigQuery magic" or follow this link for Training Data Analyst.

I most certainly won't use this, though. The idea is great: basically, let BigQuery do the heavy lifting and the aggregation of big data, and then just collect the aggregated table into a dataframe. However, I am actually already doing this. I usually have long queries, because the data I work with is messy, so I prefer to save the .sql files in a separate folder. I have a function that prints the cost of a query and asks whether I want to proceed, in order to avoid paying dollars for a mistake. The function also caches the data, so when it is run again, it does not rerun the query. Plus I can add arguments, which are written into the query using the .format method.
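
Roughly, such a helper can look like the sketch below. This is a simplified illustration rather than my exact function: the cache format, the file paths and the price constant (on-demand BigQuery pricing was around $5 per TB at the time) are just assumptions.

import os

import pandas as pd
from google.cloud import bigquery

def run_query(sql_file, cache_dir="cache", **params):
    # read the query from a separate .sql file and fill in the arguments
    with open(sql_file) as f:
        sql = f.read().format(**params)

    # if the query was already run, return the cached result instead
    cache_path = os.path.join(cache_dir, os.path.basename(sql_file) + ".parquet")
    if os.path.exists(cache_path):
        return pd.read_parquet(cache_path)

    client = bigquery.Client()

    # dry run first to see how much data the query would process
    dry = client.query(sql, job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False))
    tb = dry.total_bytes_processed / 1e12
    if input(f"Query will process {tb:.3f} TB (roughly ${tb * 5:.2f}). Proceed? [y/N] ") != "y":
        return None

    # run the query, cache the dataframe and return it
    df = client.query(sql).to_dataframe()
    os.makedirs(cache_dir, exist_ok=True)
    df.to_parquet(cache_path)
    return df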

nbdev and fastdoc

Two years back, at JupyterCon 2018, Joel Grus presented how he does not like notebooks. It is a really amusing presentation considering that it was given at a conference specifically about notebooks. If you have not seen it, I recommend watching it.

This year Jeremy Howard presented "Creating delightful libraries and books with nbdev and fastdoc" in response to Joel's presentation - showing use cases of notebooks, solving the problems Joel had with notebooks, and also presenting nbdev and fastdoc.

nbdev is a package that enables you to effectively develop libraries right in a notebook. Why should you do that? I recommend watching Jeremy's presentation, but in short: because you can have everything in one place and describe why functions were written the way they are. Let's say you get an idea for a function that will ease your work - you start by describing your idea, then you write the function, then you try whether the function works (because you are in a Jupyter notebook and you simply can) and then turn that try into a test. You mark the function as to-be-exported, and the test as well. Run nbdev and you have a package. But you don't get just the package: you also have a notebook explaining your idea and your complete thought process, and you can quickly introduce it to somebody else using the notebook.
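
A rough sketch of what the notebook cells can look like with the nbdev version from around that time (the module and function names are made up; each snippet below would be its own cell):

# default_exp utils

#export
def add_one(x):
    "Add one to x - here you would describe the idea behind the function."
    return x + 1

# try the function right away and keep the try as a test
assert add_one(41) == 42

The first cell (# default_exp utils) tells nbdev which module the notebook exports to, #export marks the cells that belong to the package, and the plain assert cell doubles as a test. Running nbdev_build_lib then builds the library and nbdev_test_nbs runs the notebooks as a test suite.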

Lux

Lux is a visualization library that adds a simple toggle button to a dataframe that is produced as the output of a cell. When you click the button, the printed table is replaced by graphs of the variables that Lux considers important. I have to check this library out; I am curious whether it actually produces the graphs I find important. Also, pandas_profiling analyses the whole dataframe as well, but if the dataframe is big, it just does not work and you have to explore only a sample; I am afraid the same might be a problem here.
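
From the demo, using it should be as simple as importing the library before displaying the dataframe (the CSV path is just a placeholder):

import lux
import pandas as pd

df = pd.read_csv("data.csv")
df  # the output now has a toggle between the pandas table and the Lux visualizations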

xfeat

xfeat is a library for combining features used for machine learning. Instead of having to combine the features yourself, you just specify the dataframe, the operation and the columns to be excluded from combining, and xfeat combines the features for you, adding them to the dataframe.
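
If I read the documentation correctly, the usage is roughly like this (a sketch only - the column names are made up and the exact parameter names may differ between versions):

import pandas as pd
from xfeat import ArithmeticCombinations

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "target": [0, 1]})

# add pairwise sums of all columns except the target
encoder = ArithmeticCombinations(exclude_cols=["target"], operator="+", r=2, output_suffix="_plus")
df = encoder.fit_transform(df)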

JupyterBook

Five years back, when I was about to start writing my bachelor thesis, my consultant and friend Petr Zikán pushed me towards making my thesis, which was about simulations of probe measurements in plasma, interactive using IPython notebooks. I tried it back then with nbconvert, but couldn't make it work the way I wanted.

However, if JupyterBook had existed back then, I wouldn't have hesitated for a moment. If you don't believe me: Jeremy Howard and Sylvain Gugger wrote a book about deep learning using JupyterBook. Firas Moosvi, who presented JupyterBook at JupyterCon 2020, showed an example of a JupyterBook he uses for teaching (he also created a template). I really wish this were used more in high schools and universities, because books that can include educational videos and, what is more, let you try the simulations and computations yourself and fiddle with the parameters, are definitely more engaging and educational.
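
Getting started is apparently just a couple of commands (the book name is a placeholder):

pip install jupyter-book
jupyter-book create mybook/
jupyter-book build mybook/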

Hmmm, now that I think about it, I still want to do a Ph.D. some day, and an interactive thesis would be really cool.

jupyter-fs

Currently I work with Jupyter only locally, but if you do not and you have notebooks stored in different locations, you may find the Jupyter extension jupyter-fs very useful. Using it, you can add additional locations, so you can easily switch between your local notebooks and, for example, your notebooks stored in an S3 bucket.
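
If I understand the README correctly, the extra locations are configured as resources with PyFilesystem-style URLs in the Jupyter server config, roughly like this (the names, paths and exact config keys here are my assumptions, so check the project documentation):

# a sketch of jupyter_server_config.py / jupyter_notebook_config.py
c.Jupyterfs.resources = [
    {"name": "local notebooks", "url": "osfs:///home/me/notebooks"},
    {"name": "team bucket", "url": "s3://my-team-bucket/notebooks"},
]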

Commenting - the unsolved problem

Comments on notebooks still seem like an unsolved problem for Jupyter. At work I still have to export solutions from Jupyter notebooks to Google Sheets or Docs, because people who do not code but need to see the solution must be able to comment and ask questions. Yet this is still not possible to do in Jupyter.

There are a few companies and notebook products working on this, for example Iooxa (mainly for teaching and academic purposes), which seems a bit similar to Deepnote. Facebook presented their Bento notebooks, and it seemed that it is possible to make comments in them, but Bento won't be released to the public for at least a few years.

There are some solutions, e.g. using Sidestickies or using Google Colab, but apart from these, we will just need to wait and hope that this essential functionality will be added.

Tensorly

Animashree Anandkumar, a professor at Caltech, described tensorly, a library for tensor operations. Along with it she described her research. I liked most the part about low-rank decompositions of tensors that make machine learning models more robust: they have fewer parameters and are more accurate at the same time.
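
As a tiny illustration of what the library does, here is a minimal CP (PARAFAC) decomposition of a random tensor (the shape and rank are arbitrary, and the exact return value of parafac differs a bit between tensorly versions):

import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

tensor = tl.tensor(np.random.rand(20, 20, 20))  # 8000 numbers

# rank-3 CP decomposition: three 20x3 factor matrices plus weights,
# i.e. a small fraction of the original parameters
weights, factors = parafac(tensor, rank=3)
print([f.shape for f in factors])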

Other things I want to write about

Fastpages, Voila, Bokeh + ipywidgets, pydantic - confloat, FastAPI, nbcelltests.

Thoughts

From Omoju Miller's talk: "Tolerate ignorance so all can participate." What follows are my thoughts about that sentence, not a record of the talk.

When I was younger, my best friend Mirek started to teach me programming in Python. I am still terrified when I think about how much of his time I spent asking him questions. Nevertheless, he always answered me, even though I could have found the answer myself by googling. I started with programming and after some time moved on to finding the answers to my questions by myself. I learned to have respect for other people's time. Now when I ask somebody something, I do the research and really narrow down the question before asking, because unspecific questions are just a waste of other people's time.

I have a few people asking me questions about Python now, and I am always happy when they ask something. One of them told me that it has to be super bothersome for me when they ask so many questions, but I replied that it is not, because I know they are just learning. I also know that asking somebody is just the first step. Later, when they know more, they will find the answers themselves and will not need to ask anymore, and if they do, they will do the research and ask a specific question.

Similarly, when somebody asks you something and the question is not specific, consider spending your time with them and finding out what it is they are really asking for, because they might have just started learning about something, on the first step of the way, and are not being disrespectful of your time.

People should think about this more, especially when answering on StackOverflow or GitHub, because those who ask could lose interest in continuing with whatever they are learning when there is nobody to help them start.