Hey everyone, hope you're all doing great! I'm here with another edition of Git Checkout. It's been a hectic week for me. I planned to get this issue ready early, but time just flew with a lot on my plate. Luckily, I made it happen!
So, this week, an idea about ChatGPT for business reports popped into my head. You know, those reports, filled with financial jargon like profit and loss statements. Wouldn't it be cool to have a GPT that simplifies all that? If anyone wants to take a crack at it, feel free, and I can assist. It could really help people who get lost in accounting terms.
But here's a hitch – I wouldn’t be keen on sharing my company's financials with an open model. That's where the need for an anonymiser comes in. And guess what? Today's focus is on a Python-based anonymiser.
As always, I'll dig into the code, highlight some noteworthy features, and suggest ways you can chip in as an open-source developer. Maybe you'll even learn a couple of Python tricks along the way.
Let's jump in. We're covering:
- The redaction repo
- A thought on redaction
The redaction repo:
This repo's got a handy Python tool that helps find and hide personal info in documents - stuff like emails and other private details. I picked this repo because I really liked how it can keep sensitive data safe. With this, I can make sure all the confidential bits are covered up before I use any personal document in ChatGPT. Now, let's dive into some of the cool parts of this tool.
The Repository: anonymiser
PII Detection and Anonymisation Techniques:
Detectors: Each detector identifies one type of sensitive data, such as credit card numbers, SSNs, emails, or phone numbers, by matching a regular expression against the text. Studying these files is a practical way to learn how to construct effective regex patterns for data matching.
Here is a good example of a regex pattern that identifies email addresses:
def __init__(self):
    self.name = "EMAIL"
    self.pattern = RegEx().one_of("a-zA-Z0-9_.+").one_or_more_occurrences().literal("@").one_of("a-zA-Z0-9-")\
        .one_or_more_occurrences().literal("\\.").one_of("a-zA-Z0-9-.").one_or_more_occurrences().build()
The pattern matches a local part made of one or more letters, digits, underscores, dots, or plus signs, followed by the @ symbol, then a domain name of letters, digits, and hyphens, a literal dot, and finally one or more letters, digits, hyphens, or dots (which covers endings such as "com" or "co.uk").
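If the builder chain is hard to read, here is roughly the same pattern written with Python's built-in re module. This is my own hand translation of the chain above, so treat it as an assumption about what build() produces rather than the repo's actual output. The names EMAIL_PATTERN and find_emails are made up for this sketch.

```python
import re

# Hand-written equivalent of the builder chain (an assumption, not the
# repo's generated pattern): local part, "@", domain name, ".", ending.
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9_.+]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+")


def find_emails(text: str) -> list[str]:
    """Return every substring of `text` that matches EMAIL_PATTERN."""
    return EMAIL_PATTERN.findall(text)


sample = "Contact jane.doe@example.com or sales+eu@my-shop.co.uk for details."
print(find_emails(sample))
# ['jane.doe@example.com', 'sales+eu@my-shop.co.uk']
```

One thing to watch: because the last character class allows dots, an address sitting right before a full stop would pick up that trailing dot too. A stricter pattern would be a nice small contribution.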
Anonymisation: Once the tool finds sensitive data, it uses a process called 'redaction' to hide it. Basically, it takes any private info it finds and removes it from the text, putting nothing in its place. This method is pretty simple and makes sure that when you use or share your data, you're not giving away any private details. However, taking out this info can make the data less useful, especially if you're removing a lot of details.
Here's the code snippet that redacts information:
def redact(text: str, analyzer_results: [AnalyzerResult]):
    for result in analyzer_results:
        text = text.replace(result.text, "")
    return text
The `redact` method works like a filter for sensitive information. It takes in two parameters: the original text where we might have private details, and a list of findings (we call these `analyzer_results`), each carrying the exact piece of sensitive text that a detector matched. For each finding, the method replaces every occurrence of that text with an empty string. So, in the end, you get the same text back, but without any of the private details.
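To try this out without cloning the repo, here is a minimal self-contained sketch. The AnalyzerResult dataclass below is a hypothetical stand-in (I only assume it has a `text` field, since that is all the snippet reads), and the `placeholder` parameter is my own addition to show what leaving a visible marker would look like.

```python
from dataclasses import dataclass


@dataclass
class AnalyzerResult:
    # Hypothetical stand-in for the repo's class; only `text` is needed here.
    text: str
    kind: str


def redact(text: str, analyzer_results: list[AnalyzerResult], placeholder: str = "") -> str:
    """Replace every occurrence of each detected value with `placeholder`."""
    for result in analyzer_results:
        text = text.replace(result.text, placeholder)
    return text


findings = [AnalyzerResult("jane.doe@example.com", "EMAIL")]
original = "Mail jane.doe@example.com today; jane.doe@example.com replies fast."

redacted = redact(original, findings)
print(redacted)                               # Mail  today;  replies fast.
print(redact(original, findings, "[EMAIL]"))  # Mail [EMAIL] today; [EMAIL] replies fast.
```

Notice that the default output keeps the spaces on both sides of the removed value, so you end up with a double space. A placeholder like "[EMAIL]" keeps the sentence readable and tells the reader what kind of thing was removed.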
Possible contributions:
Contributing to an open source repository can be an excellent way to engage with a project. Here are some potential areas of contribution I could think of.
Enhancing Anonymisation Techniques: Adding new ways to hide private info in this tool could really improve it. There are some cool methods like data masking (where you partly cover up the info), pseudonymisation (using fake names), or differential privacy (slightly changing the data) that could be fun to try out.
Data Masking: Think about a document that has important stuff like phone numbers on it. Data masking is like using a marker to cover up the actual numbers. So, a phone number like "123-456-7890" changes to something like "XXX-XXX-7890". You can still tell it's a phone number, but you can't see the real digits.
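Here is a minimal sketch of that masking idea. The function name mask_phone and its parameters are made up for illustration; nothing like this exists in the repo as far as I know.

```python
def mask_phone(phone: str, visible: int = 4, mask_char: str = "X") -> str:
    """Mask every digit except the last `visible` ones, keeping separators."""
    total_digits = sum(ch.isdigit() for ch in phone)
    to_mask = max(total_digits - visible, 0)  # how many leading digits to hide
    out = []
    for ch in phone:
        if ch.isdigit() and to_mask > 0:
            out.append(mask_char)
            to_mask -= 1
        else:
            out.append(ch)  # separators and the trailing digits pass through
    return "".join(out)


print(mask_phone("123-456-7890"))  # XXX-XXX-7890
```

Because separators are left alone, the result still looks like a phone number, which is the whole point of masking over plain removal.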
Pseudonymisation: This is like giving someone a nickname that has no obvious connection to their real name. For instance, if someone's name is "John Smith", in the data, he might be referred to as "Blue Rabbit". This way, without additional information, you can't tell who "Blue Rabbit" really is. It protects John's real identity while allowing us to use the data for analysis or research.
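A rough sketch of how this could work: keep a private lookup table from real names to aliases, and always hand out the same alias for the same name. The Pseudonymiser class and the "PERSON_n" alias format are my own invention for this example, and I'm assuming the list of names comes from some detector.

```python
class Pseudonymiser:
    """Maps each real value to a stable, meaningless alias."""

    def __init__(self, prefix: str = "PERSON"):
        self.prefix = prefix
        # real value -> alias; whoever holds this table can reverse the
        # mapping, so it must be stored separately from the shared data
        self._aliases: dict[str, str] = {}

    def alias_for(self, real_value: str) -> str:
        if real_value not in self._aliases:
            self._aliases[real_value] = f"{self.prefix}_{len(self._aliases) + 1}"
        return self._aliases[real_value]

    def apply(self, text: str, names: list[str]) -> str:
        for name in names:
            text = text.replace(name, self.alias_for(name))
        return text


p = Pseudonymiser()
text = "John Smith emailed Mary Jones. John Smith then called."
print(p.apply(text, ["John Smith", "Mary Jones"]))
# PERSON_1 emailed PERSON_2. PERSON_1 then called.
```

Unlike redaction, you can still tell that the same person shows up twice, so the data stays useful for analysis.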
Differential Privacy: Think of this as adding a bit of "noise" or "fuzziness" to the data. It's like telling a story but changing some small details each time. As a loose illustration, if a patient is 30 years old, the released data might say 29 or 31 instead. In practice the noise is carefully calibrated and is usually added to aggregate results such as counts or averages, so that the output barely changes whether or not any one person is in the dataset, while the overall age distribution stays useful for analysis.
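For the curious, here is a small sketch of one standard way to do this, the Laplace mechanism, applied to a count rather than to individual ages. The function names, the sample ages, and the epsilon value are all assumptions for illustration. A smaller epsilon means more noise and more privacy.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(ages: list[int], threshold: int, epsilon: float, rng: random.Random) -> float:
    """Noisy count of people aged `threshold` or older.

    Adding or removing one person changes the count by at most 1
    (sensitivity 1), so the Laplace noise uses scale 1 / epsilon.
    """
    true_count = sum(age >= threshold for age in ages)
    return true_count + laplace_noise(1 / epsilon, rng)


rng = random.Random(42)  # seeded only so the demo is repeatable
ages = [23, 30, 31, 45, 52, 29, 38]  # made-up sample data
print(private_count(ages, threshold=30, epsilon=0.5, rng=rng))
```

The true count here is 5, and each run with a different seed prints a value scattered around it. A production system would also track how much privacy budget has been spent across queries, which this sketch leaves out.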
Adding New Detectors: Another idea is to introduce detectors for more kinds of sensitive information, such as names, addresses, or different national ID formats. You could also improve the existing detectors with more sophisticated pattern matching or with machine learning-based NER (Named Entity Recognition) approaches.
Creating an Address Detector: Let's say an address looks like "123 Main St" or "456 Elm St, Springfield, 12345". We could write a set of rules (in technical terms, these are called regular expressions) that match these patterns. A basic rule might look for a sequence of digits (for the house number), followed by words (for the street name), and optionally more details like a city or zip code.
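As a starting point, here is a minimal sketch of such a rule using Python's re module. The pattern, the short list of street suffixes, and the name find_addresses are all my own assumptions. Real-world addresses vary far more than this, so take it as a seed for a detector, not a finished one.

```python
import re

ADDRESS_PATTERN = re.compile(
    r"\b\d{1,5}\s+"                    # house number
    r"(?:[A-Z][a-z]+\s+){1,3}"         # one to three capitalised street-name words
    r"(?:St|Ave|Rd|Blvd|Ln|Dr)\b\.?"   # small, illustrative list of street suffixes
    r"(?:,\s*[A-Z][a-z]+,\s*\d{5})?"   # optional ", City, ZIP"
)


def find_addresses(text: str) -> list[str]:
    """Return every substring of `text` that matches ADDRESS_PATTERN."""
    return [m.group(0) for m in ADDRESS_PATTERN.finditer(text)]


sample = "Ship to 123 Main St or to 456 Elm St, Springfield, 12345 by Friday."
print(find_addresses(sample))
# ['123 Main St', '456 Elm St, Springfield, 12345']
```

Multi-word city names, apartment numbers, and non-US formats would all slip through, which is exactly where a contributor could make a difference.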
Named Entity Recognition (NER): Think of NER as a smart assistant who reads along and points out when someone mentions specific things like people's names, organisations, or places. Under the hood it's a machine learning model trained on labelled text. For example, as it reads through a document, it spots "Jane Doe" and flags it as a person's name. It knows this from the surrounding context and from patterns it learned during training, which is why it can catch names that no fixed rule could list in advance. Structured items like email addresses or phone numbers are usually still easier to catch with regex.
Building a User Interface: Finally, thinking from a SaaS point of view, one could create a web-based or desktop GUI to make the tool more accessible to non-technical users, with a dashboard for visualising the detection and anonymisation results.
I hope this rundown has made the main parts of the code easier to understand. The tool can be used to redact sensitive information in any text document, and the ideas above show where it could go next. If you're stuck in a creative block, I'm hoping this might help you overcome it and encourage you to start making something cool or inspire you to make some open-source contributions.
A thought on redaction:
"Redaction is the thoughtful art of omission, ensuring the story is told while the secrets remain untold." - Anonymous
When I read that quote, it reminded me of being careful when telling stories. It's about sharing the fun parts but keeping the private stuff out, so the story is still good without spilling any secrets.
Have a good one,
—Krish