Chapters for Technical Podcasts
Course-style video experiment for podcast content - Henry AI Labs
Hey everyone!
To new readers: Henry AI Labs is a YouTube channel that primarily publishes videos explaining scientific papers related to Deep Learning. Henry AI Labs has recently partnered with SeMI Technologies to explain the science of Deep Learning for Search. My name is Connor Shorten, and I am a Ph.D. student primarily working on Henry AI Labs. If you are interested, my academic publications can be found here. I am always open to academic collaboration on Twitter @CShorten30.
My primary medium on YouTube has been 20-minute paper explanation videos. These videos really help me learn about cutting-edge Deep Learning, but it is very challenging to keep up with the pace of new research. I have been interested in the idea of stacking several videos together into a 3-hour episode. This 3-hour format has seen success with The Lex Fridman Podcast and Machine Learning Street Talk. However, in my experience, a 3-hour stack of paper explanations is much more artistically challenging than a podcast. My guess is that we sustain attention longer for conversations between people than for a single monologue.
I have recently been working on the Weaviate Vector Search Engine with SeMI Technologies. Weaviate offers a very interesting bridge between product and science. Additionally, Weaviate has developed a vibrant Slack community that really facilitates podcast discussions.
What follows is a decomposition of the podcast into the topics we discussed. In total, this podcast covers 15 major research topics in Deep Learning for Search and the applications of Weaviate to Slack chats. Please see more in the first video: Chapters for Weaviate Podcast #5 (NLP for Slack Chats - Michael Wechner).
TLDR: Outline
Deep Learning for Search - January 2022
I have spent the last 3 months studying the Weaviate Vector Search Engine. I am a Ph.D. student who has been studying Deep Learning and making videos about emerging ideas. Search really caught my attention when I was studying Deep Learning applications for COVID-19. I was heavily inspired by the CO-Search system from Salesforce Research, as well as parallel work in NLP such as the Text-to-Text Transfer Transformer (T5) and Retrieval-Augmented Generation (RAG). I was delighted when Bob van Luijt contacted me to collaborate, and I have since learned so much about Deep Learning for Search. This video presents an overview of these ideas. I intend to update it as I continue making podcasts for SeMI Technologies / Weaviate.
Natural Language Processing for Slack
Group chat technology, such as Slack, has been one of the most interesting advances in software for collaboration. Slack chats keep a public record of question-answer pairs, discussions, and community announcements. Michael Wechner is developing Katie, a system that adds Search and NLP functionality to these Slack chats. I think the idea of detecting duplicate questions could have a massive impact, and it could especially help with the psychological bottleneck of not wanting to ask too many questions!
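Katie's actual implementation isn't shown in the episode, so as a rough illustration of the core idea, here is a minimal duplicate-question check using sentence embeddings. The model name and the 0.8 threshold are my own illustrative choices, not Katie's setup:

```python
# Minimal sketch of duplicate-question detection with sentence embeddings.
# The model and the 0.8 threshold are illustrative choices, not Katie's setup.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

new_question = "How do I set up replication in Weaviate?"
known_questions = [
    "What is the best way to configure Weaviate replication?",
    "How do I import data with the Python client?",
]

# Embed the new question and all previously answered questions.
new_emb = model.encode(new_question, convert_to_tensor=True)
known_embs = model.encode(known_questions, convert_to_tensor=True)

# Cosine similarity against every known question.
scores = util.cos_sim(new_emb, known_embs)[0]
best = scores.argmax().item()
if scores[best] > 0.8:  # flag as a likely duplicate
    print(f"Possible duplicate of: {known_questions[best]} ({scores[best]:.2f})")
```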
Working in Public: The Making and Maintenance of Open-Source Software
This is one of my favorite books I have read -- to be honest, one of very few I have recently read all the way through. Working in Public describes the ecosystem around software contributions, blog posts, YouTube videos, and more. As described in the video, the book has two parts. The first is about the platforms and why people contribute to open-source in the first place. The second is about the challenge of maintaining these projects, especially the strain on the core developer team's time. Michael Wechner is building a tool to add question answering, duplicate question detection, and search to exactly this kind of problem! I am really excited about the impact this technology could have on the further development of open-source collaboration!
Academic Datasets and Real-World Applications!
Michael Wechner is developing Katie, a duplicate question detection system for Slack chats. In addition to academic datasets such as SQuAD for question answering and FEVER for fact verification, we have one of the best academic datasets out there for duplicate question detection in QQP (Quora Question Pairs). Quora has published over 400K duplicate question annotations and even hosted a Kaggle competition around the task. I think this is an extremely interesting case study in how well these academic benchmarks generalize to startup ideas and real-world applications!
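If you want to explore QQP yourself, it ships as part of the GLUE benchmark on the Hugging Face Hub. A minimal sketch for loading it (note the GLUE train split holds roughly 364K of the annotated pairs; the full Quora release is the over-400K figure):

```python
# Load Quora Question Pairs via the GLUE benchmark (Hugging Face datasets).
from datasets import load_dataset

qqp = load_dataset("glue", "qqp")
print(qqp["train"].num_rows)  # ~364K annotated question pairs in the train split

example = qqp["train"][0]
print(example["question1"])
print(example["question2"])
print(example["label"])  # 1 = duplicate, 0 = not duplicate
```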
Confidence and Certainty in Deep Learning
This video touches on the topic of Confidence in Deep Learning. In our conversation, we are primarily concerned with how Confidence and Certainty can aid Human-Computer Interaction and trust in our search systems. Confidence is additionally used in all sorts of ways, from regularizing self- and semi-supervised learning (sorry, I forgot to include that in the video), to early-exiting architectures such as PonderNet, to Active Learning. Mind your Outliers! is an interesting paper that makes us question how well confidence really works in Active Learning.
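For a concrete reference point, the simplest notions of confidence are the maximum softmax probability of a classifier's output and, inversely, its predictive entropy. A toy sketch of both:

```python
# Two common confidence measures over a classifier's logits:
# max softmax probability and (negative) predictive entropy.
import torch
import torch.nn.functional as F

logits = torch.tensor([2.1, 0.3, -1.0])  # toy output for a 3-class model
probs = F.softmax(logits, dim=-1)

confidence = probs.max().item()                # high value = confident
entropy = -(probs * probs.log()).sum().item()  # low value = confident

print(f"max-prob confidence: {confidence:.3f}, entropy: {entropy:.3f}")
```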
Off-the-Shelf Models versus Fine-Tuning
This is one of the most important topics in Deep Learning at the moment. GPT-3 has, somewhat unreasonably, been able to perform Few-Shot Learning by repeatedly applying a fixed task description with a few input-output examples in the prompt. Although this is amazing, many people studying Machine Learning may be skeptical that it can surpass Fine-Tuned models. Fine-Tuned models in NLP are particularly adapted to the vocabulary, with custom tokenizers, as well as to the nuances of the domain. This gets really interesting with Retrieve-then-Read pipelines, where we might not need to fine-tune both the retriever AND the reader -- maybe just the reader. I hope you find this interesting!
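To make the few-shot setup concrete: "few-shot" here just means packing a fixed task description and a handful of input-output demonstrations into the prompt itself, with no gradient updates. A sketch of the prompt shape (the sentiment task is my illustrative choice, not an example from the episode):

```python
# The shape of a GPT-3 style few-shot prompt: a fixed task description plus a
# few input-output demonstrations, then the new input. No weights are updated.
prompt = """Classify the sentiment of each review as positive or negative.

Review: The battery lasts for days.
Sentiment: positive

Review: It broke after one week.
Sentiment: negative

Review: Setup was quick and painless.
Sentiment:"""
# The model is expected to continue the text with " positive".
```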
The Katie Architecture for Search
In this video, Michael explains the architecture they are using to bring Deep Learning for Search, NLP, and Weaviate to Slack chats. I think it is really interesting to see how each application / use case customizes the general framework of Search components and which pieces are the most useful. If you like this video, you will probably also find "Deep Learning for Search - January 15th, 2022" useful; it is also in this playlist.
Understanding Queries with Nearest Neighbor Visualization
Debugging is a common practice in software engineering for understanding why a system isn't behaving as expected. There are some special considerations when debugging Deep Learning-based software systems, such as robustness and domain generalization in addition to i.i.d. train-test data splits. When debugging Question-Answering systems, it may be useful to visualize the nearest neighbors of the question's vector embedding to get a sense of what the model is predicting. This idea is very similar to viewing the retrieved context; however, a graph-structured user interface may be more intuitive for human developers and debuggers.
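As a sketch of what pulling those nearest neighbors out of Weaviate could look like (this assumes a local instance with a text2vec vectorizer module; the "Question" class and "content" property are hypothetical names, not from the episode):

```python
# Sketch: inspect a query's nearest neighbors in Weaviate to see what the
# embedding model "thinks" the question is about. Class and property names
# are hypothetical; assumes a local instance with a text2vec module enabled.
import weaviate

client = weaviate.Client("http://localhost:8080")

result = (
    client.query
    .get("Question", ["content"])
    .with_near_text({"concepts": ["Why is my search returning nothing?"]})
    .with_limit(5)
    .do()
)

# Print the five most similar stored questions for manual inspection.
for hit in result["data"]["Get"]["Question"]:
    print(hit["content"])
```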
Thinking Fast and Slow - Application in Search
Nature-inspired Artificial Intelligence is one of the most interesting ways of thinking about the technology. More particularly, the System 1 / System 2 framing, outlined in the book "Thinking, Fast and Slow", has been an exciting framework for this. System 1 thinking refers to subconscious intuition, or quick thinking. System 2 refers to conscious, deliberate, logical reasoning -- slow thinking. This chapter tries to relate this idea to the study of Search systems, such as quick retrieval and slow reasoning over retrieved context.
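One way to cash out the analogy in code -- my own mapping, not something from the episode -- is a fast bi-encoder doing the System 1 retrieval over the whole corpus, with a slower cross-encoder doing System 2 re-ranking over the shortlist:

```python
# System 1 / System 2 analogy in a retrieval pipeline (my own mapping):
# a fast bi-encoder shortlists candidates, a slow cross-encoder re-scores them.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

docs = [
    "Weaviate supports GraphQL queries.",
    "You can run Weaviate with Docker Compose.",
    "The nearText filter performs semantic search.",
]
query = "How do I do semantic search?"

# System 1: cheap embedding lookup over the whole corpus.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embs = bi_encoder.encode(docs, convert_to_tensor=True)
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
shortlist = util.semantic_search(query_emb, doc_embs, top_k=2)[0]

# System 2: expensive joint scoring of (query, doc) pairs, shortlist only.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, docs[hit["corpus_id"]]) for hit in shortlist]
scores = cross_encoder.predict(pairs)
print(pairs[scores.argmax()][1])
```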
Scientific Papers as Cellular Automata
I recently read a really interesting survey paper about Cellular Automata and Self-Organizing systems. I first became aware of this idea from the Distill publication and their awesome animations of regeneration with Cellular Automata. I was thinking about how language models similarly try to recover from the damage of self-supervised masking. This short video presents the idea of local damage recovery with global message passing. This idea isn't very well developed, but hopefully there is something of interest in there for you. I like this idea a lot but, again, am still not sure how to really bring it to life.
Robustness to Question “Style”
Robustness to the somewhat esoteric concept of "style" was well explored in Computer Vision with the construction of the "Stylized ImageNet" dataset. This dataset was used to show things like CNNs' bias towards texture rather than the more human-like shape bias. The concept of "style" has similarly been used in images to render a photograph of a dog as if it had been painted by Vincent van Gogh. This chapter is focused on the analog of "style" in Natural Language Processing. For example, people ask questions of each other in a different style than they use with a search engine like Google or Bing. I think there are a lot of interesting ideas here, including the notion of style transfer between casual and formal text. I hope you find this chapter interesting!
Multimodal Search
Deep Learning has had remarkable success processing data domains such as images, text, audio, video, graph-structured data, tabular data, and more. One of the most exciting emerging applications is the combination of data types for a single task -- for example, combining images and text for visual question answering and text-based image search. This chapter discusses the idea of using text-based queries to search through tabular descriptions of bicycle models in order to find a local repair shop, as well as generalizing this to image-image search and other ideas for helping us better connect with our local communities through more targeted search!
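As a taste of what multimodal matching looks like in practice, here is a small sketch using CLIP's shared image-text embedding space to score one image against free-text descriptions (the image path and the candidate descriptions are placeholders of my own):

```python
# Score one image against several text descriptions with CLIP's shared
# image-text embedding space. "my_bike.jpg" is a placeholder path.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("my_bike.jpg")
texts = ["a road bike", "a mountain bike", "a bicycle repair stand"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape: (1, num_texts)
probs = logits.softmax(dim=-1)[0]

for text, p in zip(texts, probs):
    print(f"{p:.2f}  {text}")
```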
Robustness and Compositional Generalization
The discussion around categories of Generalization has been very exciting. We are used to evaluating these models with independent and identically distributed (i.i.d.) train-test splits. However, this doesn't really capture the Distribution Shift that happens from train to test sets in real-world deployment or the kind of behavior we are trying to achieve. This video outlines two sides of Generalization that I think are really interesting. Robustness has obvious implications for these systems, probably most vividly communicated with self-driving cars and corruption tests like artificially adding rain or snow to an image. Compositional generalization is probably the more exciting one, especially with ideas like text-to-image generation.
Democratic AI through General Purpose Readers
Democratic AI is an important goal to enable entrepreneurship and the development of AI technology. What we mean by this is overcoming the bottlenecks of needing very expensive computers, massive private datasets, and long training times to get started with Deep Learning. I think the decomposition of Retrieve-then-Read can be very promising for overcoming this bottleneck, in addition to ideas around efficient training. For those interested, I highly recommend checking out the "methods" outlined on the MosaicML website. They are a leading research lab doing amazing work on efficiency in Deep Learning, which has additional implications such as limiting the climate damage from these systems.
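To make the Retrieve-then-Read decomposition concrete, here is a minimal sketch that keeps an off-the-shelf retriever frozen and pairs it with a small extractive reader -- in this split, only the reader would be a fine-tuning target. The model choices and toy documents are mine, not recommendations from the video:

```python
# Retrieve-then-Read with a frozen off-the-shelf retriever and a small
# extractive reader. Only the reader would need task-specific fine-tuning.
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

docs = [
    "Weaviate stores objects together with their vector embeddings.",
    "HNSW is the approximate nearest neighbor index used by Weaviate.",
]
question = "Which ANN index does Weaviate use?"

# Retriever: frozen general-purpose embeddings, no fine-tuning required.
retriever = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
doc_embs = retriever.encode(docs, convert_to_tensor=True)
q_emb = retriever.encode(question, convert_to_tensor=True)
top = util.semantic_search(q_emb, doc_embs, top_k=1)[0][0]
context = docs[top["corpus_id"]]

# Reader: a small extractive QA model; this is the piece you would fine-tune.
reader = pipeline("question-answering",
                  model="distilbert-base-cased-distilled-squad")
print(reader(question=question, context=context)["answer"])
```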
Stripe for Federated Learning
This video quickly presents the idea of having some kind of trusted 3rd-party library to handle data privacy for Deep Learning applications. The current state of Deep Learning applications is very data-heavy, although recent advances such as Data Augmentation and Transfer Learning may circumvent that bottleneck. Federated Learning is a promising solution for data privacy in Deep Learning. In this framework, the model weights are sent to a local user and the updates are computed on the local machine; only the parameters are traded globally, rather than having a central data store with sensitive user data. Federated Learning seems to be the leading technique for this, and Differential Privacy seems to be the next step. I don't know too much about it, but I recommend checking out Andrew Trask's videos on the topic and OpenMined. In addition to these techniques, Dataset Distillation may be the solution we are after. This is the idea of optimizing a compressed representation of the dataset, similar to Generative Teaching Networks, so that you do not need to store the full dataset; the optimized images/text/etc. are not a subset of the original data, and it would be very difficult to invert them. I hope you find this quick video interesting -- I expect the development of something like this to have a very large impact on Deep Learning.
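For a feel of the mechanics, the core of Federated Learning in its simplest (FedAvg-style) form is just averaging locally trained weights on a server. A toy sketch, not a production recipe:

```python
# Toy FedAvg step: each client trains locally on private data, then only the
# weight tensors travel back to the server for averaging. No raw data leaves
# a client.
import copy
import torch
import torch.nn as nn

def local_update(model, data, targets, lr=0.1):
    """One local SGD step on a client's private data."""
    model = copy.deepcopy(model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss = nn.functional.mse_loss(model(data), targets)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return model.state_dict()

global_model = nn.Linear(4, 1)

# Each client holds its own (private) data; only state_dicts are returned.
client_states = [
    local_update(global_model, torch.randn(8, 4), torch.randn(8, 1))
    for _ in range(3)
]

# Server: average the parameters across clients (FedAvg) and redistribute.
avg_state = {
    key: torch.stack([s[key] for s in client_states]).mean(dim=0)
    for key in client_states[0]
}
global_model.load_state_dict(avg_state)
```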
Conclusion, Thanks for reading!
Thank you so much for checking out this first Substack publication. I hope you find it useful alongside the podcast chapters for organizing this content. My goal for this newsletter is to keep readers updated on miscellaneous developments from Henry AI Labs. Recently, I have been very excited about the Weaviate Vector Search Engine, and I hope you share that interest! Thanks again!