Big Data Platform – Mental Models 4 Life

Data Security vs Cyber Security

July 20, 2025

Cyber security and data security are closely related concepts that operate at different levels and provide different safeguards. Cyber security is primarily about controlling access to systems and data through different security protection mechanisms, from the physical network layer all the way to the application layer. These security mechanisms come primarily in the form of …

Secure and Ephemeral AI Workloads in Data Mesh Environments

June 3, 2025

A colleague and I have just released on arXiv a paper titled “Enabling Secure and Ephemeral AI Workloads in Data Mesh Environments”. The key innovation is in pushing the now well-established idea of minimal immutable data structures up and down the software infrastructure stack a bit further than what others have done, resulting in a …

Puzzles and Mysteries in Generative AI

May 13, 2025

Of the many questions we wish to answer using LLMs, it can be useful to distinguish between puzzles and mysteries. As Gregory Treverton explained in his many articles, a puzzle is a problem that has a definite and verifiable answer, but a mystery is one that “poses a question that has no definitive answer because …

DeepSeek and All That

February 10, 2025

I wrote the following comments very quickly on LinkedIn about 10 days ago in the middle of the DeepSeek frenzy, and the post turned out to be my second most-read post ever and people continue to look at it. I think the comments are holding up well so I am sharing them here as well, …

Winners and Losers in the AI Commercial Landscape

December 3, 2024

With NVIDIA seemingly steaming ahead in their latest quarterly result, Apple Intelligence receiving a lukewarm response from users, Wall Street increasingly worried about the return-on-investment from the hyperscalers’ massive capital investments, stories that CIOs are struggling to find ROI for AI, and news in the last two days that Intel and Samsung are both struggling …

How To Deal with Database Reconstruction Attacks

March 9, 2024

I have been thinking about data security issues, in particular database-reconstruction attacks. To quote Wikipedia, a reconstruction attack is any method for partially reconstructing a private database from public aggregate information. The question I am specifically interested in is this: Can an attacker with general interactive query access to a dataset recover a piece of …

Influence Flower

December 25, 2023

Regular users of arXiv.org may have noticed that on every paper’s page, under the Related Papers tab, one can now find the paper’s Influence Flower, which is a nice way to visualise citation influences among academic entities, including papers, authors, institutions, and research topics. The following, for example, are the author-centric and venue-centric influence flowers …

A Note on Large Scale Data Matching and Entity Resolution

April 6, 2022

Data matching and entity resolution is a common first step in data preparation and there are a thousand academic papers written on the subject. In practice, for large datasets – anything more than a million records will do as a definition of large here because most data-matching algorithms can’t handle that because …

Large-Scale Distributed Analytics: A Research Program

April 15, 2018

Since starting my part-time appointment as an associate professor at the Australian National University, I have been thinking about spending more time on fundamental research. As Don Knuth counsels, “if you find that you’re spending almost all your time on theory, start turning some attention to practical things; it will improve your theories. If you …

Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases

January 1, 2018

My colleagues and I have just published on arXiv a simple but highly effective Entity Resolution algorithm that can scale to billions of records and handle significant data quality issues. The paper is titled Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases and it is an extension of our previous paper on linking millions of addresses …