In the Author Spotlight series, TDS Editors chat with members of our community about their career path in data science and AI, their writing, and their sources of inspiration. Today, we’re thrilled to share our conversation with Marco Hening Tallarico.
Marco is a graduate student at the University of Toronto and a researcher at Risklab, with a deep interest in applied statistics and machine learning. Born in Brazil and raised in Canada, Marco appreciates the universal language of mathematics.
What motivates you to take dense academic concepts (like Stochastic Differential Equations) and turn them into accessible tutorials for the broader TDS community?
It’s natural to want to learn everything in its natural order: algebra, calculus, statistics, and so on. But if you want to make fast progress, you have to abandon this inclination. When you’re trying to solve a maze, it’s cheating to start at a place in the middle, but in learning, there is no such rule. Start at the end and work your way back if you like. It makes it less tedious.
Your Data Science Challenge article focused on spotting data leakage in code rather than just theory. In your experience, which silent leak is the most common one that still makes it into production systems today?
It’s really easy to let data leakage seep in during exploratory analysis, or when using aggregates as inputs to the model, especially now that aggregates can be computed in real time relatively easily. Before graphing, before even running the .head() function, I think it’s important to make the train-test split. Think about how the split should be made, whether by user, by size, by chronology, or with stratification: there are many choices you can make, and it’s worth taking the time.
Also, when using features like average users per month, you need to double-check that the aggregate wasn’t calculated over the month you’re using as your test set. These leaks are trickier because they are indirect. It’s not always as obvious as not using black-box data when you’re trying to predict which planes will crash. If you have the black box, it’s not a prediction; the plane already crashed.
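To make that discipline concrete, here’s a minimal sketch (our illustration, not Marco’s code; the events.csv file and its user_id and timestamp columns are hypothetical) that splits chronologically before any exploration and computes an aggregate feature from the training window only:

```python
import pandas as pd

# Hypothetical events table with user_id and timestamp columns.
df = pd.read_csv("events.csv", parse_dates=["timestamp"])

# Split first: before .head(), before any plots or summary stats.
# Here, a chronological split, so the test set is strictly "the future".
cutoff = df["timestamp"].quantile(0.8)
train = df[df["timestamp"] <= cutoff]
test = df[df["timestamp"] > cutoff]

# Aggregates used as features (e.g., average unique users per month)
# are computed from the training window only, never from test months.
monthly_users = train.groupby(train["timestamp"].dt.to_period("M"))["user_id"].nunique()
avg_users_per_month = monthly_users.mean()
```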
You mention that learning grammar from data alone is computationally costly. Do you believe hybrid models (statistical + formal) are the only way to achieve sustainable AI scaling in the long run?
If we take LLMs, for example, there are a lot of easy tasks that they struggle with, like adding a list of numbers or turning a page of text into uppercase. It’s not unreasonable to think that just making the model larger will solve these problems, but it’s not a good solution. It’s a lot more reliable to have it invoke a .sum() or .upper() function on your behalf and use its language reasoning to select the inputs. This is likely what the major AI models are already doing with clever prompt engineering.
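As a rough sketch of that delegation pattern (our illustration; no claim about how any particular model is wired up), the model’s only job is to name a tool and its inputs, while the deterministic work happens in plain code:

```python
# Hypothetical tool registry: the model picks the tool and arguments;
# Python does the arithmetic and string handling deterministically.
TOOLS = {
    "sum": lambda numbers: sum(numbers),
    "upper": lambda text: text.upper(),
}

def run_tool(name, arg):
    """Execute whichever tool the model selected."""
    return TOOLS[name](arg)

# Suppose the model emits {"tool": "sum", "arg": [3, 1, 4, 1, 5]}:
print(run_tool("sum", [3, 1, 4, 1, 5]))           # 14
print(run_tool("upper", "towards data science"))  # TOWARDS DATA SCIENCE
```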
It’s a lot easier to use formal grammar to remove unwanted artifacts, like the em dash problem, than it is to scrape another third of the internet’s data and perform further training.
You contrast forward and inverse problems in PDE theory. Can you share a real-world scenario outside of temperature modeling where an inverse problem approach could be the solution?
The forward problem tends to be what most people are comfortable with. If we look at the Black-Scholes model, the forward problem would be: given some market assumptions, what is the option price? But there is another question we can ask: given a set of observed option prices, what are the model’s parameters? This is the inverse problem: it’s inference, it’s implied volatility.
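To show the two directions side by side, here is a minimal sketch (our illustration, with made-up quote values): the forward map prices a European call with the Black-Scholes formula, and the inverse problem recovers the implied volatility by root-finding on that same formula.

```python
from math import exp, log, sqrt
from scipy.optimize import brentq
from scipy.stats import norm

def bs_call(S, K, T, r, sigma):
    """Forward problem: Black-Scholes price of a European call."""
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm.cdf(d1) - K * exp(-r * T) * norm.cdf(d2)

def implied_vol(price, S, K, T, r):
    """Inverse problem: find the sigma that reproduces an observed price."""
    return brentq(lambda s: bs_call(S, K, T, r, s) - price, 1e-6, 5.0)

# Hypothetical quote: spot 100, strike 105, one year, 2% risk-free rate.
print(implied_vol(price=8.0, S=100, K=105, T=1.0, r=0.02))
```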
We can also think in terms of the Navier-Stokes equation, which models fluid dynamics. The forward problem: given a wing shape, initial velocity, and air viscosity, compute the velocity or pressure field. But we could also ask, given a velocity and pressure field, what the shape of our airplane wing is. This tends to be much harder to solve. Given the causes, it’s much easier to compute the effects. But if you are given a bunch of effects, it’s not necessarily easy to compute the cause. This is because multiple causes can explain the same observation.
It’s also part of why PINNs (physics-informed neural networks) have taken off recently: they show how neural networks can learn solutions efficiently from data, and they bring the whole deep learning toolbox (Adam, SGD, backpropagation) to bear on solving PDEs, which is ingenious.
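As a toy illustration of that idea (a sketch we’ve added, using a simple ODE in place of a full PDE), the PINN loss is just the equation residual at sampled points plus the boundary condition, minimized with Adam and backpropagation:

```python
import torch
import torch.nn as nn

# Minimal PINN sketch for the toy ODE u'(x) = u(x), u(0) = 1 on [0, 1];
# the exact solution is u(x) = e^x. The same pattern extends to PDEs.
net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

for step in range(2000):
    x = torch.rand(64, 1, requires_grad=True)  # random collocation points
    u = net(x)
    # Differentiate the network output w.r.t. its input (the "physics" term).
    du = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]
    residual = ((du - u) ** 2).mean()                         # enforce u' = u
    boundary = (net(torch.zeros(1, 1)) - 1.0).pow(2).mean()   # enforce u(0) = 1
    loss = residual + boundary
    opt.zero_grad()
    loss.backward()
    opt.step()

print(net(torch.ones(1, 1)).item())  # should approach e, about 2.718
```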
As a Master’s student who is also a prolific technical writer, what advice would you give to other students who want to start sharing their research on platforms like Towards Data Science?
I think in technical writing, there are two competing choices that you have to actively make; you can think of it as distillation or dilution. Research articles are a lot like a vodka shot; in the introduction, vast fields of study are summarized in a few sentences. While the bitter taste of vodka comes from evaporation, in writing, the main culprit is jargon. This verbal compression algorithm lets us discuss abstract ideas, such as the curse of dimensionality or data leakage, in just a few words. It’s a tool that can also be your undoing.
The original deep learning paper is 7 pages. There are also deep learning textbooks that are 800 pages (a piña colada by comparison). Both are great for the same reason: they provide the right level of detail for the appropriate audience. To learn the right level of detail, you have to read in the genre you want to publish in.
Of course, how you dilute spirits matters; no one wants a 1-part warm water, 1-part Tito’s monstrosity. Some recipes that make the writing more palatable include using memorable analogies (they make the content stick, like piña colada on a tabletop), focusing on a few pivotal concepts, and elaborating with examples.
But there is also distillation happening in technical writing, and it comes down to “omit[ting] needless words,” Strunk & White’s old maxim that will always ring true, and a reminder to read about the craft of writing itself. Roy Peter Clark is a favorite of mine.
You also write research articles. How do you tailor your content differently when writing for a general data science audience versus a research-focused one?
I would definitely avoid any alcohol-related metaphors. Any figurative language, in fact. Stick to the concrete. In research articles, the main thing you need to communicate is what progress has been made. Where the field was before, and where it is now. It’s not about teaching; you assume the audience knows. It’s about selling an idea, advocating for a method, and supporting a hypothesis. You have to show how there was a gap and explain how your paper filled it. If you can do those two things, you have a good research paper.
To learn more about Marco’s work and stay up-to-date with his latest articles, you can visit his website and follow him on TDS or LinkedIn.