Redaction

Preserving stories, protecting secrets

Jan 12, 2024

Hey everyone, hope you're all doing great! I'm here with another edition of Git Checkout. It's been a hectic week for me. I planned to get this issue ready early, but time just flew with a lot on my plate. Luckily, I made it happen!

So, this week, an idea about ChatGPT for business reports popped into my head. You know, those reports, filled with financial jargon like profit and loss statements. Wouldn't it be cool to have a GPT that simplifies all that? If anyone wants to take a crack at it, feel free, and I can assist. It could really help people who get lost in accounting terms.

But here's the hitch: I wouldn't be keen on sharing my company's financials with a publicly hosted model. That's where the need for an anonymiser comes in. And guess what? Today's focus is on a Python-based anonymiser.

As always, I'll dig into the code, highlight some noteworthy features, and suggest ways you can chip in as an open-source developer. Maybe you'll even learn a couple of Python tricks along the way.

Let's jump in. We're covering:

- The redaction repo

- A thought on redaction

The redaction repo:

This repo's got a handy Python tool that helps find and hide personal info in documents - stuff like emails and other private details. I picked this repo because I really liked how it can keep sensitive data safe. With this, I can make sure all the confidential bits are covered up before I use any personal document in ChatGPT. Now, let's dive into some of the cool parts of this tool.

The Repository: anonymiser

PII Detection and Anonymisation Techniques:

Detectors: The detectors identify different types of sensitive data, such as credit card numbers, SSNs, email addresses, and phone numbers. Each one uses a regular expression to match a specific pattern. Studying these files is a practical way to learn how to construct effective regex patterns for data matching.

Here is a good example of a regex pattern that identifies email addresses:

def __init__(self):
    self.name = "EMAIL"
    # local part, "@", first domain label, a literal dot, rest of the domain
    self.pattern = (
        RegEx().one_of("a-zA-Z0-9_.+").one_or_more_occurrences().literal("@")
        .one_of("a-zA-Z0-9-").one_or_more_occurrences().literal("\\.")
        .one_of("a-zA-Z0-9-.").one_or_more_occurrences().build()
    )

The pattern matches one or more characters from the local part (letters, digits, underscores, dots, and plus signs), followed by the @ symbol, then one or more letters, digits, or hyphens, then a literal dot, and finally the rest of the domain (letters, digits, hyphens, and dots).
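If the fluent builder is unfamiliar, it may help to see roughly the same pattern written as a plain regular expression with Python's `re` module. This is only a sketch: I'm assuming the builder turns each `one_of(...)` into a character class and each `one_or_more_occurrences()` into `+`, and the names `EMAIL_PATTERN` and `find_emails` are mine, not the repo's.

```python
import re
from typing import List

# Assumed raw-regex equivalent of the builder chain; the hyphen is placed
# last in the final character class so it is read as a literal hyphen.
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9_.+]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+")


def find_emails(text: str) -> List[str]:
    """Return every substring of text that matches EMAIL_PATTERN."""
    return EMAIL_PATTERN.findall(text)


sample = "Contact jane.doe+news@example.com or bob@mail.example.org today."
print(find_emails(sample))
# ['jane.doe+news@example.com', 'bob@mail.example.org']
```

One thing to notice: a full stop typed straight after an address would be captured too, because the last character class allows dots.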

Anonymisation: Once this tool finds sensitive data, it uses a process called 'redaction' to hide it. Basically, it takes any private info it finds and deletes it from the text, leaving nothing in its place. This method is pretty simple and makes sure that when you use or share your data, you're not giving away any private details. However, taking out this info can make the data less useful, especially if you're removing a lot of details.

Here's the code snippet that redacts information:

from typing import List

def redact(text: str, analyzer_results: List[AnalyzerResult]) -> str:
    # Remove every occurrence of each detected value from the text
    for result in analyzer_results:
        text = text.replace(result.text, "")
    return text

The `redact` method works like a filter for sensitive information. It takes two parameters: the original text, which might contain private details, and a list of findings (`analyzer_results`), each carrying a piece of sensitive text that was detected. For each finding, the method removes every occurrence of that text. So, in the end, you get the original text back, minus the private details.
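To see it run end to end, here's a small self-contained sketch. The `AnalyzerResult` class below is a hypothetical stand-in of mine with just enough fields for the demo (the repo's real class may look different), and the sample sentence is made up. The `redact` function follows the snippet above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AnalyzerResult:
    """Hypothetical stand-in: redact only reads the .text attribute."""
    text: str  # the sensitive value that was detected
    kind: str  # assumed label for the finding, e.g. "EMAIL"


def redact(text: str, analyzer_results: List[AnalyzerResult]) -> str:
    # Remove every occurrence of each detected value from the text.
    for result in analyzer_results:
        text = text.replace(result.text, "")
    return text


findings = [AnalyzerResult(text="jane@example.com", kind="EMAIL")]
original = "Mail jane@example.com for the report."
print(redact(original, findings))
# Mail  for the report.
```

The output keeps the two spaces that surrounded the address, since only the matched text itself is removed.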

A thought on redaction:

"Redaction is the thoughtful art of omission, ensuring the story is told while the secrets remain untold." - Anonymous

When I read that quote, it reminded me of being careful when telling stories. It's about sharing the fun parts but keeping the private stuff out, so the story is still good without spilling any secrets.

Have a good one,

—Krish
