Attacking an AI system

February 17, 2023

A few personal notes on the CNIL article titled “Petite taxonomie des attaques des systèmes d’IA” (a short taxonomy of attacks on AI systems), available here.

ArXiv musings - 2023, week 6

February 10, 2023

(It’s been a while.) Here is a summary of a few papers that caught my eye this week.

ArXiv musings - 2022, week 39

September 30, 2022

(It’s been a while.) Here is a summary of a few papers that caught my eye this week.

ArXiv musings - 2022, week 35

August 29, 2022

Here is a summary of a few papers that caught my eye this week.

Privacy risks

August 22, 2022

This article is part of a series on data privacy¹. It follows the introductory article that motivates the problem, Qu’est-ce que la confidentialité ? (What is privacy?).

In the previous article, I defined privacy (very loosely!) as the guarantee that a dataset cannot be used to obtain sensitive information about the people in it, and I added that if a dataset is private, I run no risk by having my data in it.

But what are these risks?

This series draws heavily on two books: The Ethical Algorithm, by Kearns and Roth, and The Algorithmic Foundations of Differential Privacy, by Dwork and Roth.

ArXiv musings - 2022, week 33

August 18, 2022

Here is a summary of a few papers that caught my eye this week.

Why do tree-based models still outperform deep learning on tabular data?

This paper by three French researchers proposes a new, extensive benchmark procedure for machine learning models on structured (tabular) data. The benchmark comprises 45 tabular datasets, some with numerical features only and some with mixed features. The procedure rates the performance of each algorithm using accuracy (for classification) and \(R^2\) (for regression) after an increasing number of iterations of a random hyperparameter search.
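To make the evaluation idea concrete, here is a minimal sketch of a "best score so far" curve under random hyperparameter search, scored with \(R^2\). It is not the paper's code: the names `r2_score`, `random_search_curve`, `sample_config` and `evaluate` are hypothetical, and the toy one-parameter model stands in for a real learner.

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def random_search_curve(evaluate, sample_config, n_iter, seed=0):
    """Return the best score seen after each random search iteration."""
    rng = random.Random(seed)
    best = float("-inf")
    curve = []
    for _ in range(n_iter):
        config = sample_config(rng)   # draw one random configuration
        score = evaluate(config)      # score it (higher is better)
        best = max(best, score)
        curve.append(best)
    return curve


# Toy setup (an assumption for illustration): fit y = slope * x,
# where the only "hyperparameter" is the slope itself.
XS = [0.0, 1.0, 2.0, 3.0]
YS = [0.0, 2.0, 4.0, 6.0]


def sample_config(rng):
    return {"slope": rng.uniform(0.0, 4.0)}


def evaluate(config):
    predictions = [config["slope"] * x for x in XS]
    return r2_score(YS, predictions)


if __name__ == "__main__":
    for i, best in enumerate(random_search_curve(evaluate, sample_config, 10), 1):
        print(f"after {i:2d} iterations: best R^2 = {best:.3f}")
```

Plotting such a curve for each model gives a comparison that accounts for tuning budget, rather than comparing a single tuned score per model.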

What is privacy?

August 17, 2022

This article is part of a series on data privacy, which I am writing out of order. This one serves as an introduction and motivates the problem.

Privacy and anonymity

August 05, 2022

I have been wanting to write down what I know about differential privacy for a while. Here I borrow most of the explanations from chapter 1 of The Ethical Algorithm, by Kearns and Roth.

In this first article, I describe the motivation for a rigorous definition of privacy, and I give a first (imperfect) definition of what we can expect from an anonymous dataset.

Dall-E and bias

August 03, 2022

I have been reading a lot about bias and fairness in AI recently, and one example in particular caught my eye: DALL-E.

DALL-E

DALL-E is a transformer model developed by OpenAI to generate images from text prompts. It is based on a modified version of GPT-3. DALL-E was originally released in January 2021, and its successor, DALL-E 2, was announced in April 2022.

Like all AI models (except some that might be specifically tuned to avoid this pitfall), DALL-E reflects the biases in its training data. Ask it to represent a lawyer, and you’ll get pictures of grey-haired white men. If you ask for a nurse instead, all the pictures will represent women. “Convicted criminal” will skew heavily non-Caucasian, while “police officer” will be, again, all white. Although the model is built for fun and games, this perpetuates harmful stereotypes and needs addressing.

ArXiv musing: 2022, week 31

August 02, 2022

Here are a few papers of interest I found in my arXiv feed. I’ll add more if there are more this week.

Jazz Contrafact Detection

A contrafact is a melody that shares the same underlying chord progression as another melody, sometimes reharmonized. The authors propose a way to detect whether a melody is a contrafact of another, using music theory to inform chord vector embedding.