Tuesday, August 16, 2022
SCIENMAG: Latest Science and Health News
No Result
View All Result
  • Login
  • HOME PAGE
  • BIOLOGY
  • CHEMISTRY AND PHYSICS
  • MEDICINE
    • Cancer
    • Infectious Emerging Diseases
  • SPACE
  • TECHNOLOGY
  • CONTACT US
  • HOME PAGE
  • BIOLOGY
  • CHEMISTRY AND PHYSICS
  • MEDICINE
    • Cancer
    • Infectious Emerging Diseases
  • SPACE
  • TECHNOLOGY
  • CONTACT US
No Result
View All Result
Scienmag - Latest science news from science magazine
No Result
View All Result
Home SCIENCE NEWS Ethics

New data science platform speeds up Python queries

July 1, 2021
in Ethics
0
Share on FacebookShare on Twitter

PROVIDENCE, R.I. [Brown University] — Researchers from Brown University and MIT have developed a new data science framework that allows users to process data with the programming language Python — without paying the “performance tax” normally associated with a user-friendly language.

The new framework, called Tuplex, is able to process data queries written in Python up to 90 times faster than industry-standard data systems like Apache Spark or Dask. The research team unveiled the system in research presented at SIGMOD 2021, a premier data processing conference, and have made the software freely available to all.

“Python is the primary programming language used by people doing data science,” said Malte Schwarzkopf, an assistant professor of computer science at Brown and one of the developers of Tuplex. “That makes a lot of sense. Python is widely taught in universities, and it’s an easy language to get started with. But when it comes to data science, there’s a huge performance tax associated with Python because platforms can’t process Python efficiently on the back end.”

Platforms like Spark perform data analytics by distributing tasks across multiple processor cores or machines in a data center. That parallel processing allows users to deal with giant data sets that would choke a single computer to death. Users interact with these platforms by inputting their own queries, which contain custom logic written as “user-defined functions” or UDFs. UDFs specify custom logic, like extracting the number of bedrooms from the text of a real estate listing for a query that searches all of the real estate listings in the U.S. and selects all the ones with three bedrooms.

Because of its simplicity, Python is the language of choice for creating UDFs in the data science community. In fact, the Tuplex team cites a recent poll showing that 66% of data platform users utilize Python as their primary language. The problem is that analytics platforms have trouble dealing with those bits of Python code efficiently.

Data platforms are written in high-level computer languages that are compiled before running. Compilers are programs that take computer language and turn it into machine code — sets of instructions that a computer processor can quickly execute. Python, however, is not compiled beforehand. Instead, computers interpret Python code line by line while the program runs, which can mean far slower performance.

“These frameworks have to break out of their efficient execution of compiled code and jump into a Python interpreter to execute Python UDFs,” Schwarzkopf said. “That process can be a factor of 100 less efficient than executing compiled code.”

If Python code could be compiled, it would speed things up greatly. But researchers have tried for years to develop a general-purpose Python compiler, Schwarzkopf says, with little success. So instead of trying to make a general Python compiler, the researchers designed Tuplex to compile a highly specialized program for the specific query and common-case input data. Uncommon input data, which account for only a small percentage of instances, are separated out and referred to an interpreter.

“We refer to this process as dual-case processing, as it splits that data into two cases,” said Leonhard Spiegelberg, co-author of the research describing Tuplex. “This allows us to simplify the compilation problem as we only need to care about a single set of data types and common-case assumptions. This way, you get the best of two worlds: high productivity and fast execution speed.”

And the runtime benefit can be substantial.

“We show in our research that a wait time of 10 minutes for an output can be reduced to a second,” Schwarzkopf said. “So it really is a substantial improvement in performance.”

In addition to speeding things up, Tuplex also has an innovative way of dealing with anomalous data, the researchers say. Large datasets are often messy, full of corrupted records or data fields that don’t follow convention. In real estate data, for example, the number of bedrooms could either be a numeral or a spelled-out number. Inconsistencies like that can be enough to crash some data platforms. But Tuplex extracts those anomalies and sets them aside to avoid a crash. Once the program has run, the user then has the option of repairing those anomalies.

“We think this could have a major productivity impact for data scientists,” Schwarzkopf said. “To not have to run out to get a cup of coffee while waiting for an output, and to not have a program run for an hour only to crash before it’s done would be a really big deal.”

###

The research was supported by the National Science Foundation (DGE-2039354, IIS-1453171) and the U.S. Air Force (FA8750-19-2-1000).

Media Contact
Kevin Stacey
[email protected]
https://www.brown.edu/news/2021-06-24/tuplex

Tags: Computer ScienceSoftware EngineeringTechnology/Engineering/Computer ScienceTheory/Design
Share25Tweet16Share4ShareSendShare
  • Amanda Poholek, Ph.D.

    Reinvigorating ‘lost cause’ exhausted T cells could improve cancer immunotherapy

    139 shares
    Share 56 Tweet 35
  • The best way to take pills according to science

    66 shares
    Share 26 Tweet 17
  • Nuclear war would cause a global famine and kill billions, Rutgers-led study finds

    66 shares
    Share 26 Tweet 17
  • 1 in 3 parents worry that school traffic is a danger for kids

    65 shares
    Share 26 Tweet 16
  • The North American Menopause Society releases its 2022 Hormone Therapy Position Statement

    149 shares
    Share 60 Tweet 37
  • New chip could make treating metastatic cancer easier and faster

    64 shares
    Share 26 Tweet 16
ADVERTISEMENT

About us

We bring you the latest science news from best research centers and universities around the world. Check our website.

Latest NEWS

Reinvigorating ‘lost cause’ exhausted T cells could improve cancer immunotherapy

Experts optimistic about converting coal plants to production of clean geothermal energy

A new method boosts wind farms’ energy output, without new equipment

Subscribe to Blog via Email

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

Join 193 other subscribers

© 2022 Scienmag- Science Magazine: Latest Science News.

No Result
View All Result
  • HOME PAGE
  • BIOLOGY
  • CHEMISTRY AND PHYSICS
  • MEDICINE
    • Cancer
    • Infectious Emerging Diseases
  • SPACE
  • TECHNOLOGY
  • CONTACT US

© 2022 Scienmag- Science Magazine: Latest Science News.

Welcome Back!

Login to your account below

Forgotten Password?

Retrieve your password

Please enter your username or email address to reset your password.

Log In