Spike Jonze’s Her, a 2013 film about a lonely writer who falls in love with an advanced operating system, explores a poignant relationship between a human and an artificial intelligence entity. The Joaquin Phoenix-starring film depicts a fictional world where the AI entity Samantha, voiced by Scarlett Johansson, can feel emotion. The idea seemed far-fetched, as robots portrayed in the movies do not reflect real life.

In reality, robots aren’t as advanced, and their failure rate for performing basic tasks is much higher. However, a Pittsburgh-based robotics startup aims to build the world’s largest foundational robotics model, making robots more functional and intelligent. Founded in 2023 by Abhinav Gupta and Deepak Pathak, IIT-Kanpur graduates and Carnegie Mellon University professors, Skild AI is creating a general-purpose brain for robots that can outperform the individual hand-designed systems currently deployed.

Pathak says the idea for Skild AI was sparked by the shift in robotics from lab demos to live demonstrations, indicating a readiness for real-world applications. Pathak and his team saw the potential to bring robots into everyday tasks, leading to the founding of the startup.

Simplifying robotics

Pathak, who has a PhD in Artificial Intelligence from the University of California, Berkeley, argues that robotics faces a chicken-and-egg problem due to a lack of data compared to vision and language models. This is why the company is building the world’s largest foundational model for robotics at scale. Simply put, it is designing a brain trained on a larger database than any other model in robotics.

“A robot is essentially a collection of motors. Each motor takes electricity as input and uses it to move the robot. You can’t use English data to determine how much power to apply to the robot at each point. Instead, you need to understand how much power to apply, how to coordinate the robot, how to handle failures, how the robot can recover, and how to ensure safety,” Pathak tells indianexpress.com, explaining why the company thought it was necessary to create a foundational model, or a ‘general-purpose brain’, to train robots.

Tesla’s Optimus robot can be seen at the company’s store in the Meatpacking District, New York. (Image: Anuj Bhatia/The Indian Express)

“Language models, like those from OpenAI and Google, are trained on patterns of word occurrence. For example, they learn that words like ‘microwave’ often appear with terms related to appliances. They analyze the distribution of words and how they follow each other. However, this does not mean that the models understand how to perform tasks, such as opening a microwave. They lack the knowledge of specific actions, like the angles needed to apply force or the physical mechanics involved,” Gupta chimes in, adding that a robotics foundational model directly maps objects to actions, similar to how humans learn through physical interactions. “This is something that does not exist in the LLM world,” he adds.
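Gupta’s point about word patterns can be illustrated with a toy example. The sketch below is a deliberate simplification and not how OpenAI’s or Google’s models are built: it merely counts which word follows which in a tiny made-up corpus. The function names and the sample sentences are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def bigram_counts(text):
    """Count how often each word is followed by each other word."""
    words = text.lower().split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows


def most_likely_next(follows, word):
    """Return the word seen most often after `word`, or None if unseen."""
    if word not in follows:
        return None
    return follows[word].most_common(1)[0][0]


# Hypothetical three-sentence corpus, run together as one stream of words.
corpus = (
    "open the microwave door "
    "the microwave is a kitchen appliance "
    "close the microwave door"
)

follows = bigram_counts(corpus)
print(most_likely_next(follows, "microwave"))  # prints "door"
```

The counts can say that “door” tends to follow “microwave”, but nothing in them describes the angle or force needed to pull a handle, which is the gap Gupta describes.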

“You have to figure out how to extend your arm, how to close your fingers, how to grasp a pan, and how to tear a tea bag apart to take the tea out. These actions require precise coordination of your fingers and the ability to observe and interact with the physical world. This knowledge is not innate; it is something you acquire through years of experience and interaction with your environment. In contrast, language models are trained primarily by reading text from sources like Wikipedia. They understand high-level concepts, such as ‘you should do this and then do that,’ but they lack practical understanding. For instance, they don’t know what it actually means to execute these actions or how to handle situations like your hand slipping on the pan and how to recover from it,” says Gupta, highlighting the core difference between the foundational models (LLMs) used in ChatGPT and the approach Skild AI is taking for robotics.

Pathak and Gupta, former AI researchers at Facebook, attribute the gap to a hardware-focused approach and a lack of scalable data, which is why robots fail at many of the simple tasks they are given. Their aim is to simplify robotics rather than make it more complex, as has been the case for years.

Testing real-world scenarios

“Robotics is not a hardware problem but a software problem,” says Gupta, who has a PhD in Computer Science from the University of Maryland and worked as a part-time faculty advisor on computer vision and large-scale visual learning projects at Google.

He argues that the core issue in robotics can be solved by creating a single intelligence that can be fitted into different robots, regardless of form factor, to enable basic functions such as climbing steep slopes, walking over objects obstructing their path, and identifying and picking up items. This approach is similar to how language models have evolved from customized solutions to large-scale, general-purpose models.

The company is training its AI model on a massive database of text, images, and video, which it claims is 1,000 times larger than those used by its competitors. “The idea is that the brain can make any kind of robot work — indoors, outdoors, across various tasks and scenarios — seamlessly,” says Pathak.

Skild AI has seen success across different form factors, including humanoids performing tasks in the kitchen. (Image generated using Meta AI)

“If you look at any kind of robot, as well as animals or humans, we are all operating in a common, shared physical world. This means knowledge can be shared across different systems to provide general-purpose behavior across all of them. Since the underlying knowledge is already shared, we capitalize on that,” Pathak responds when asked how Skild’s foundation model can be plugged into different robots.

The startup works with various hardware partners and has seen success across different form factors, including humanoids performing tasks in kitchens, robots traversing mountains, and robotic arms handling manipulation tasks. The testing involves real-world scenarios to validate the model’s accuracy and performance. Pathak says the company plans to continue to develop and refine its model through collaboration with hardware partners and real-world testing.

“Action and vision are the foundations of intelligence in the human brain, which is exactly how we are approaching this. We are building our model to be grounded in the physical world through action and vision, with alignment occurring afterwards. This approach is akin to developing our intelligence through large-scale practice in the real world,” Pathak said.

The AI startup recently raised $300 million at a $1.5 billion valuation in a Series A funding round led by Lightspeed Ventures, SoftBank, Coatue Management, and Amazon founder Jeff Bezos, with participation from CRV, Felicis Ventures, Menlo Ventures, Amazon, Sequoia Capital, General Catalyst, SV Angel, and Carnegie Mellon University.

The Skild AI team includes robotics and AI experts from Meta, Tesla, Nvidia, Amazon, and Google, as well as top schools including Carnegie Mellon University, Stanford, and UC Berkeley. The company has offices in Pittsburgh and San Francisco and is currently hiring in India.