O'Reilly Data Show Podcast show

O'Reilly Data Show Podcast

Summary: The O'Reilly Data Show Podcast explores the opportunities and techniques driving big data, data science, and AI.

Join Now to Subscribe to this Podcast

Podcasts:

 A framework for building and evaluating data products | File Type: audio/mpeg | Duration: 00:22:18

In this episode of the Data Show, I spoke with Grace Huang, data science lead at Pinterest. With its combination of a large social graph, enthusiastic users, and multimedia data, I’ve long regarded Pinterest as a fascinating lab for data science. Huang described the challenge of building a sustainable content ecosystem and shared lessons from the front lines of machine learning product launches. We also discussed recommenders, the emergence of deep learning as a technique used within Pinterest, and the role of data science within the company.

 Building a next-generation platform for deep learning | File Type: audio/mpeg | Duration: 00:27:50

In this episode of the Data Show, I speak with Naveen Rao, VP and GM of the Artificial Intelligence Products Group at Intel. In an earlier episode, we learned that scaling current deep learning models requires innovations in both software and hardware. Through his startup Nervana (since acquired by Intel), Rao has been at the forefront of building a next generation platform for deep learning and AI. I wanted to get his thoughts on what the future infrastructure for machine learning would look like. At least for now, we’re seeing a variety of approaches, and many companies are using heterogeneous processors (even specialized ones) and proprietary interconnects for deep learning. Nvidia and Intel Nervana are set to release processors that excel at both training and inference, but as Rao pointed out, at large-scale there are many considerations—including utilization, power consumption, and convenience—that come into play.

 A scalable time-series database that supports SQL | File Type: audio/mpeg | Duration: 00:49:12

In this episode of the Data Show, I spoke with Michael Freedman, CTO of Timescale and professor of computer science at Princeton University. When I first heard that Freedman and his collaborators were building a time-series database, my immediate reaction was: “Don’t we have enough options already?” The early incarnation of Timescale was a startup focused on IoT, and it was while building tools for the IoT problem space that Freedman and the rest of the Timescale team came to realize that the database they needed wasn’t available (at least out in open source). Specifically, they wanted a database that could easily support complex queries and the sort of real-time applications many have come to associate with streaming platforms. Based on early reactions to TimescaleDB, many users concur.

 Programming collective intelligence for financial trading | File Type: audio/mpeg | Duration: 00:26:53

In this episode of the Data Show, I spoke with Geoffrey Bradway, VP of engineering at Numerai, a new hedge fund that relies on contributions of external data scientists. The company hosts regular competitions where data scientists submit machine learning models for classification tasks. The most promising submissions are then added to an ensemble of models that the company uses to trade in real-world financial markets.

 Creating large training data sets quickly | File Type: audio/mpeg | Duration: 00:47:09

In this episode of the Data Show, I spoke with Alex Ratner, a graduate student at Stanford and a member of Christopher Ré’s Hazy research group. Training data has always been important in building machine learning algorithms, and the rise of data-hungry deep learning models has heightened the need for labeled data sets. In fact, the challenge of creating training data is ongoing for many companies; specific applications change over time, and what were gold standard data sets may no longer apply to changing situations.

 Data science and deep learning in retail | File Type: audio/mpeg | Duration: 00:49:28

In this episode of the Data Show, I spoke with Jeremy Stanley, VP of data science at Instacart, a popular grocery delivery service that is expanding rapidly. As Stanley describes it, Instacart operates a four-sided marketplace comprised of retail stores, products within the stores, shoppers assigned to the stores, and customers who order from Instacart. The objective is to get fresh groceries from popular retailers delivered to customers in a timely fashion. Instacart’s goals land them in the center of the many opportunities and challenges involved in building high-impact data products.

 Language understanding remains one of AI’s grand challenges | File Type: audio/mpeg | Duration: 00:38:04

In this episode of the Data Show, I spoke with David Ferrucci, founder of Elemental Cognition and senior technologist at Bridgewater Associates. Ferrucci served as principal investigator of IBM’s DeepQA project and led the Watson team that became champion of the Jeopardy! quiz show. Elemental Cognition (EC) is a research group focused on building an AI system that will be equipped with state-of-the-art natural language understanding technologies. Ferrucci envisions that EC will ship with foundational knowledge in many subject areas, but will be able to very quickly acquire knowledge in other (specialized) domains with the help of “human mentors.” Having built and deployed several prominent AI systems through the years, I also wanted to get Ferrucci’s perspective on the evolution of AI technologies, and how enterprises can take advantage of all the exciting recent developments.

 Data preparation in the age of deep learning | File Type: audio/mpeg | Duration: 00:36:16

In this episode of the Data Show, I spoke with Lukas Biewald, co-founder and chief data scientist at CrowdFlower. In a previous episode we covered how the rise of deep learning is fueling the need for large labeled data sets and high-performance computing systems. CrowdFlower has a service that many leading companies have come to rely on to provide them with labeled data sets to train machine learning models. As deep learning models get larger and more complex, they require training data sets that are bigger than those required by other machine learning techniques.

 Scaling machine learning | File Type: audio/mpeg | Duration: 00:56:47

In this episode of the Data Show, I spoke with Reza Zadeh, adjunct professor at Stanford University, co-organizer of ScaledML, and co-founder of Matroid, a startup focused on commercial applications of deep learning and computer vision. Zadeh also is the co-author of the forthcoming book TensorFlow for Deep Learning (now in early release). Our conversation took place on the eve of the recent ScaledML conference, and much of our conversation was focused on practical and real-world strategies for scaling machine learning. In particular, we spoke about the rise of deep learning, hardware/software interfaces for machine learning, and the many commercial applications of computer vision.

 Architecting and building end-to-end streaming applications | File Type: audio/mpeg | Duration: 00:45:10

In this episode of the Data Show, I spoke with Karthik Ramasamy, adjunct faculty member at UC Berkeley, former engineering manager at Twitter, and co-founder of Streamlio. Ramasamy managed the team that built Heron, an open source, distributed stream processing engine, compatible with Apache Storm.  While Ramasamy has seen firsthand what it takes to build and deploy large-scale distributed systems (within Twitter, he worked closely with the team that built DistributedLog), he is first and foremost interested in designing and building end-to-end applications. As someone who organizes many conferences, I’m all too familiar with the vast array of popular big data frameworks available. But, I also know that engineers and architects are most interested in content and material that helps them cut through the options and decisions.

 Becoming a machine learning engineer | File Type: audio/mpeg | Duration: 00:41:03

In this episode of the Data Show, I spoke with Aurélien Géron, a serial entrepreneur, data scientist, and author of a popular, new book entitled Hands-on Machine Learning with Scikit-Learn and TensorFlow. Géron’s book is aimed at software engineers who want to learn machine learning and start deploying machine learning models in real-world products.

 Natural language analysis using Hierarchical Temporal Memory | File Type: audio/mpeg | Duration: 00:51:04

In this episode of the Data Show, I spoke with Francisco Webber, founder of Cortical.io, a startup that is applying tools based on Hierarchical Temporal Memory (HTM) to natural language understanding. While HTM has been around for more than a decade, there aren’t many companies that have released products based on it (at least compared to other machine learning methods). Numenta, an organization developing open source machine intelligence based on the biology of the neocortex, maintains a community site featuring showcase applications. Webber’s company has been building tools based on HTM and applying them to big text data in a variety of industries; financial services has been a particularly strong vertical for Cortical.

 Saving the world—or at least the world’s scientific and government data | File Type: audio/mpeg | Duration: 00:40:33

In this special episode of the Data Show, O'Reilly's Jenn Webb speaks with Maxwell Ogden, director of Code for Science and Society. Recently, Ogden and Code for Science have been working on the ongoing rescue of data.gov and assisting with other data rescue projects, such as Data Refuge; they’re also the nonprofit developers supporting Dat, a data versioning and distribution manager, which came out of Ogden's work making government and scientific data open and accessible.

 Deep learning that's easy to implement and easy to scale | File Type: audio/mpeg | Duration: 00:36:41

In this episode of the Data Show, I spoke with Anima Anandkumar, a leading machine learning researcher, and currently a principal research scientist at Amazon. I took the opportunity to get an update on the latest developments on the use of tensors in machine learning. Most of our conversation centered around MXNet—an open source, efficient, scalable deep learning framework. I’ve been a fan of MXNet dating back to when it was a research project out of CMU and UW, and I wanted to hear Anandkumar’s perspective on its recent progress as a framework for enterprises and practicing data scientists.

 Building machine learning solutions that can withstand adversarial attacks | File Type: audio/mpeg | Duration: 00:44:47

In this episode of the Data Show, I spoke with Parvez Ahammad, who leads the data science and machine learning efforts at Instart Logic. He has applied machine learning in a variety of domains, most recently to computational neuroscience and security. Along the way, he has assembled and managed teams of data scientists and has had to grapple with issues like explainability and interpretability, ethics, insufficient amount of labeled data, and adversaries who target machine learning models. As more companies deploy machine learning models into products, it’s important to remember there are many other factors that come into play aside from raw performance metrics.

Comments

Login or signup comment.