sanityze.cleanser

Module Contents

Classes

Cleanser

The main class of the sanityze package. Its purpose is to clean the data frame before it is consumed by the training or prediction pipeline.

class sanityze.cleanser.Cleanser(include_default_spotters=True, hash_spotted=False)

The main class of the sanityze package. Its purpose is to clean the data frame before it is consumed by the training or prediction pipeline.

Parameters:
  • include_default_spotters (bool, optional) – If True, the default spotters will be added to the Cleanser. The default is True.

  • hash_spotted (bool, optional) – If True, the spotters will hash the values within the columns they spot. The default is False.
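The difference between replacing a spotted value with a placeholder and hashing it can be pictured with a small standalone sketch. It does not use sanityze: the regular expression, the `redact_emails` helper, and the choice of SHA-256 are assumptions made for illustration, and the library's own spotters may behave differently.

```python
import hashlib
import re

# Hypothetical, deliberately simple e-mail pattern (not sanityze's own).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact_emails(text: str, spotter_id: str = "EMAILADDRS",
                  hash_spotted: bool = False) -> str:
    """Replace each e-mail address with the spotter id, or with a hash of the match."""

    def _replace(match):
        if hash_spotted:
            # Assumption: SHA-256 hex digest of the matched text.
            return hashlib.sha256(match.group(0).encode("utf-8")).hexdigest()
        return spotter_id

    return EMAIL_RE.sub(_replace, text)


masked = redact_emails("printer foo@gaga.com")
hashed = redact_emails("printer foo@gaga.com", hash_spotted=True)
print(masked)                  # printer EMAILADDRS
print(len(hashed.split()[1]))  # 64 (length of a SHA-256 hex digest)
```

In this sketch, hashing keeps equal inputs mapped to equal outputs, whereas the placeholder discards the original value entirely.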

add_spotter(spotter) → bool

Add a specific spotter to the Cleanser

Parameters:

spotter (Spotter) – An instance of a Spotter subclass to add to the Cleanser. Spotters are appended to the end of the list. Adding a spotter that has already been added returns False.

Return type:

True if the spotter was added, False if it was not added.

Examples

>>> c = Cleanser(include_default_spotters=False)
>>> s1 = EmailSpotter("EMAILS", True)
>>> c.add_spotter(s1)

remove_spotter(spotter_id) → bool

Remove a specific spotter from the Cleanser using the spotter’s id

Parameters:
  • spotter_id (str) – The id of the spotter to remove

  • verbose (bool, optional) – If True, the spotter will print out debug information. The default is False.

Return type:

True if the spotter was removed, False if it was not removed.

Examples

>>> c = Cleanser(include_default_spotters=False)
>>> s1 = EmailSpotter("EMAILADDRS", True)
>>> c.add_spotter(s1)
>>> c.remove_spotter("EMAILADDRS")
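The boolean return values of add_spotter and remove_spotter can be pictured with a standalone sketch. `SpotterRegistry` and `FakeSpotter` are hypothetical names that are not part of sanityze, and treating two spotters with the same id as duplicates is an assumption about what "the same spotter" means.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeSpotter:
    """Hypothetical stand-in for a Spotter subclass; only the id matters here."""
    uid: str


@dataclass
class SpotterRegistry:
    """Keeps spotters in insertion order and reports success as a bool."""
    spotters: List[FakeSpotter] = field(default_factory=list)

    def add_spotter(self, spotter: FakeSpotter) -> bool:
        # Assumption: a spotter is a duplicate when its id is already present.
        if any(s.uid == spotter.uid for s in self.spotters):
            return False
        self.spotters.append(spotter)  # new spotters go at the end of the list
        return True

    def remove_spotter(self, spotter_id: str) -> bool:
        for i, s in enumerate(self.spotters):
            if s.uid == spotter_id:
                del self.spotters[i]
                return True
        return False  # nothing with that id was registered


registry = SpotterRegistry()
print(registry.add_spotter(FakeSpotter("EMAILADDRS")))  # True
print(registry.add_spotter(FakeSpotter("EMAILADDRS")))  # False
print(registry.remove_spotter("EMAILADDRS"))            # True
print(registry.remove_spotter("EMAILADDRS"))            # False
```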

_log(message: str, verbose: bool) → None

Internal utility function to log messages to the console

Parameters:
  • verbose (bool) – The verbosity of the log

  • message (str) – The message to log

Return type:

None

Examples

(called by clean())

clean(df: pandas.DataFrame, verbose=False) → pandas.DataFrame

Sanitizes the data frame using the spotters added to the Cleanser

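Parameters:
  • df (pd.DataFrame) – The data frame to sanitize

  • verbose (bool, optional) – If True, log messages are printed to the console. The default is False.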

Return type:

The sanitized data frame

Examples

>>> import pandas as pd
>>> from sanityze.cleanser import Cleanser
>>> df = pd.DataFrame(data={'product_name': ['laptop', 'printer foo@gaga.com', 'tablet', 'desk 5555 5555 5555 4444', 'chair'],
...                         'price': [1200, 150, 300, 450, 200]})
>>> c = Cleanser()
>>> c.clean(df, verbose=False)
               product_name  price
0                    laptop   1200
1        printer EMAILADDRS    150
2                    tablet    300
3  desk 5555 5555 5555 4444    450
4                     chair    200
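
The general flow of running a list of spotters over every column can be sketched without pandas or sanityze, so that it runs on the standard library alone. The dict-of-lists table, the `apply_spotters` helper, the simplified e-mail pattern, and the rule that only string cells are scanned are all assumptions made for illustration, not a description of how clean() is implemented.

```python
import re
from typing import Callable, Dict, List

# A "spotter" here is just a function from text to redacted text (hypothetical).
Spotter = Callable[[str], str]


def email_spotter(text: str) -> str:
    # Simplified pattern for illustration only.
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "EMAILADDRS", text)


def apply_spotters(table: Dict[str, List[object]], spotters: List[Spotter],
                   verbose: bool = False) -> Dict[str, List[object]]:
    """Return a cleaned copy of the table; this sketch leaves the input unmodified."""
    cleaned: Dict[str, List[object]] = {}
    for column, values in table.items():
        new_values = []
        for value in values:
            if isinstance(value, str):  # assumption: only text cells are scanned
                for spot in spotters:   # spotters run in the order they were added
                    value = spot(value)
            new_values.append(value)
        if verbose and new_values != values:
            print(f"column {column!r}: redacted values found")
        cleaned[column] = new_values
    return cleaned


table = {
    "product_name": ["laptop", "printer foo@gaga.com", "tablet"],
    "price": [1200, 150, 300],
}
result = apply_spotters(table, [email_spotter])
print(result["product_name"])  # ['laptop', 'printer EMAILADDRS', 'tablet']
print(result["price"])         # [1200, 150, 300]
```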