sanityze.cleanser
Module Contents
Classes
- class sanityze.cleanser.Cleanser(include_default_spotters=True, hash_spotted=False)
The main class for the sanityze package. Its purpose is to clean the data frame before it is consumed by the training or prediction pipeline.
- Parameters:
include_default_spotters (bool, optional) – If True, the default spotters will be added to the Cleanser. The default is True.
hash_spotted (bool, optional) – If True, the spotters will hash the values within the columns they spot. The default is False.
- add_spotter(spotter) → bool
Add a specific spotter to the Cleanser.
- Parameters:
spotter (Spotter) – An instance of a Spotter subclass to add to the Cleanser. Spotters are appended to the end of the list. Adding a spotter that is already present returns False.
- Return type:
True if the spotter was added, False if it was not added.
Examples
>>> c = Cleanser(include_default_spotters=False)
>>> s1 = EmailSpotter("EMAILS",True)
>>> c.add_spotter(s1)
- remove_spotter(spotter_id) → bool
Remove a specific spotter from the Cleanser using the spotter’s id.
- Parameters:
spotter_id (str) – The id of the spotter to remove
verbose (bool, optional) – If True, the spotter will print out debug information. The default is False.
- Return type:
True if the spotter was removed, False if it was not removed.
Examples
>>> c = Cleanser(include_default_spotters=False)
>>> s1 = EmailSpotter("EMAILADDRS",True)
>>> c.add_spotter(s1)
>>> c.remove_spotter("EMAILADDRS")
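The return values of add_spotter and remove_spotter described above can be illustrated with a small stand-alone sketch. It does not use sanityze: ToySpotter and ToyRegistry are hypothetical names, and the sketch assumes that "the same spotter" means a spotter with an id that is already registered, which the documentation does not state explicitly.

```python
from dataclasses import dataclass, field


@dataclass
class ToySpotter:
    """Hypothetical stand-in for a Spotter; only the id matters here."""
    uid: str


@dataclass
class ToyRegistry:
    """Illustrative registry, not the sanityze implementation."""
    spotters: list = field(default_factory=list)

    def add_spotter(self, spotter: ToySpotter) -> bool:
        # Assumption: a spotter counts as a duplicate when its id is already present.
        if any(s.uid == spotter.uid for s in self.spotters):
            return False
        # New spotters are appended to the end of the list.
        self.spotters.append(spotter)
        return True

    def remove_spotter(self, spotter_id: str) -> bool:
        # Remove the first spotter whose id matches; report whether one was found.
        for i, s in enumerate(self.spotters):
            if s.uid == spotter_id:
                del self.spotters[i]
                return True
        return False


registry = ToyRegistry()
first_add = registry.add_spotter(ToySpotter("EMAILADDRS"))    # added
second_add = registry.add_spotter(ToySpotter("EMAILADDRS"))   # duplicate id, rejected
removed = registry.remove_spotter("EMAILADDRS")               # found and removed
removed_again = registry.remove_spotter("EMAILADDRS")         # nothing left to remove
```

Under these assumptions, removing an id that was never added (as happens when the add step is skipped) yields False, not an error.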
- _log(message: str, verbose: bool) → None
Internal utility function to log messages to the console
- Parameters:
message (str) – The message to log
verbose (bool) – The verbosity of the log
- Return type:
None
Examples
(called by clean())
- clean(df: pandas.DataFrame, verbose=False) → pandas.DataFrame
Sanitizes the data frame using the spotters added to the Cleanser.
- Parameters:
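df (pd.DataFrame) – The data frame to sanitize
verbose (bool, optional) – If True, log messages are printed to the console while cleaning. The default is False.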
- Return type:
The sanitized data frame
Examples
>>> import pandas as pd
>>> from sanityze.cleanser import Cleanser
>>> df = pd.DataFrame(data = {
...     'product_name': ['laptop', 'printer foo@gaga.com', 'tablet', 'desk 5555 5555 5555 4444', 'chair'],
...     'price': [1200, 150, 300, 450, 200]})
>>> c = Cleanser()
>>> c.clean(df, verbose=False)
               product_name  price
0                    laptop   1200
1        printer EMAILADDRS    150
2                    tablet    300
3  desk 5555 5555 5555 4444    450
4                     chair    200
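To make the difference between replacing a spotted value with the spotter's id and hashing it (the hash_spotted option) more concrete, here is a minimal stand-alone sketch on plain strings. It is not how sanityze is implemented: redact_emails, the simplified email regex, and the choice of SHA-256 are all assumptions made for illustration only.

```python
import hashlib
import re

# Deliberately simple email pattern (an assumption); real-world detection is harder.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str, label: str = "EMAILADDRS", hash_spotted: bool = False) -> str:
    """Replace each email with a label, or with its SHA-256 hex digest if hash_spotted is True."""

    def _replace(match: re.Match) -> str:
        if hash_spotted:
            # Hypothetical hashing scheme: SHA-256 of the matched text.
            return hashlib.sha256(match.group(0).encode("utf-8")).hexdigest()
        return label

    return EMAIL_RE.sub(_replace, text)


values = ["laptop", "printer foo@gaga.com", "tablet"]
redacted = [redact_emails(v) for v in values]
hashed = [redact_emails(v, hash_spotted=True) for v in values]
```

In this sketch, redacted keeps the surrounding text and swaps the address for the label, as in the product_name column above, while hashed swaps it for a 64-character hex digest so that equal inputs still map to equal outputs.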