Documentation: DataScience_Utils (dsx)

class dsx.ds_utils.dsx(pandas_obj)

The dsx module (it shares its name with the package, but the two should not be confused) contains a collection of wrapper functions that simplify common operations in data analytics tasks. The core module, ds_utils (data science utilities), is designed to work with pandas DataFrames.

classmethod activate_lolviz()

Import the lolviz package as lz and add the graphviz directory to os.environ["path"].

Parameters

lolviz_dir (str, optional) –

Return type

lolviz instance

classmethod backup(df, name: str = 'last')

To back up the DataFrame or List (or any object with a .copy() method)

Parameters
  • df – DataFrame or List (or any object with .copy() method)

  • name – Name of the backup. To be used to retrieve the data.

Return type

None

bk(bk_name: Optional[str] = None)

To back up the DataFrame.

Parameters

bk_name

Return type

None

ci(col, n=1000, func=<function mean>, p=0.05)

Generate n bootstrap samples, evaluating func at each resampling. This method returns a function, which can be called to obtain the confidence intervals of interest.

Parameters
  • n (int, optional) – Sample size for the sampling distribution (default = 1,000)

  • func (function, optional) – The statistic function whose sampling distribution is bootstrapped (default = np.mean)

  • p (float, optional) – p-value specifying the two-sided symmetric confidence interval

Returns

Function to be called to obtain the confidence intervals of interest. It returns the specified two-sided symmetric confidence interval.

Return type

function
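To make the "returns a function" behaviour concrete, here is a minimal NumPy sketch of a percentile bootstrap. It is not the dsx implementation: the names bootstrap_ci and interval, the seed argument, and the choice to let the returned function accept its own p are all assumptions made for illustration.

```python
import numpy as np

def bootstrap_ci(values, n=1000, func=np.mean, p=0.05, seed=0):
    """Bootstrap func over values; return a function that yields the CI."""
    rng = np.random.default_rng(seed)  # seed is a hypothetical extra for repeatability
    values = np.asarray(values)
    # Evaluate func on n resamples drawn with replacement.
    stats = np.array([
        func(rng.choice(values, size=len(values), replace=True))
        for _ in range(n)
    ])

    def interval(p=p):
        # Two-sided symmetric percentile interval at level 1 - p.
        lower, upper = np.percentile(stats, [100 * p / 2, 100 * (1 - p / 2)])
        return float(lower), float(upper)

    return interval

get_ci = bootstrap_ci(np.arange(100))
low, high = get_ci()      # 95% interval for the mean
wide = get_ci(p=0.01)     # 99% interval, reusing the same resamples
```

Returning a closure means the expensive resampling runs once, and different interval widths can then be read off the stored statistics.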

cols_shift(col_names: Union[str, list], direction: Union[str, int] = 'right')

To shift a list of columns to the left-most or the right-most position of the DataFrame. Note: this method has no “inplace” option.

Parameters
  • col_names (str or list) –

  • direction (str or int) – Either a str (‘left’ or ‘right’) or an int (0 or 1)

  • inplace

Returns

df with reordered columns

Return type

pd.core.frame.DataFrame
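The reordering described for cols_shift can be approximated in plain pandas as follows. This is a sketch only: shift_columns is a hypothetical name, and the mapping of 0 to 'left' and 1 to 'right' is an assumption, since the parameter description does not state it.

```python
import pandas as pd

def shift_columns(df, col_names, direction="right"):
    """Return a new DataFrame with col_names moved to one end."""
    if isinstance(col_names, str):
        col_names = [col_names]
    # Columns that stay where they are, in their original order.
    rest = [c for c in df.columns if c not in col_names]
    to_left = direction in ("left", 0)  # assumption: 0 means left, 1 means right
    order = list(col_names) + rest if to_left else rest + list(col_names)
    return df[order]

df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})
moved_right = shift_columns(df, "a")
moved_left = shift_columns(df, ["c"], direction="left")
```

Because the result is a new DataFrame, the caller has to reassign it, which matches the note that there is no in-place variant.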

cols_std(inplace=True, camel=False)

To standardize the names of all columns so that they are compatible with IPython. This method removes spaces and special characters from the column names. Once standardized, the column names can be used as attributes of the DataFrame (with autocomplete) in IPython.

Parameters
  • inplace (bool) –

  • camel (bool) –

Returns

Returned only when the inplace parameter is set to False

Return type

pandas.core.frame.DataFrame
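As an illustration of the kind of clean-up cols_std performs, the sketch below normalizes a single column name with the standard library. The exact rule dsx applies is not stated here, so joining the remaining words with underscores (or capitalizing them when camel=True) is an assumption, and standardize_name is a hypothetical helper.

```python
import re

def standardize_name(name, camel=False):
    """Strip spaces and special characters from one column name."""
    # Split on runs of anything that is not a letter or digit.
    words = [w for w in re.split(r"[^0-9a-zA-Z]+", str(name)) if w]
    if camel:
        # Assumed camel style: capitalize each word and join without separators.
        return "".join(w[:1].upper() + w[1:] for w in words)
    return "_".join(words)

snake = standardize_name("Unit Price ($)")
camel = standardize_name("unit price ($)", camel=True)
```

Applied to every entry of df.columns, names cleaned this way become valid attribute names, provided they do not start with a digit.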

convert_dtypes()

To convert dtypes to Pandas 1.0 dtypes and stringify object columns

Return type

pd.core.frame.DataFrame

cumsum(col_name: str) pandas.core.frame.DataFrame

To generate the following using the unique values of a variable:
  • Count (raw count of records)

  • Percentage of the values over the total data

  • Accumulated percentage of the values

Parameters

col_name (str) –

Return type

pd.core.frame.DataFrame
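The three quantities listed for cumsum can be reproduced with value_counts in plain pandas. This is a sketch under assumptions: value_summary and the output column names are hypothetical, and the dsx output may be laid out differently.

```python
import pandas as pd

def value_summary(df, col_name):
    """Count, percentage and cumulative percentage per unique value."""
    # value_counts sorts by frequency, most common value first.
    out = df[col_name].value_counts().to_frame(name="count")
    out["percent"] = out["count"] / out["count"].sum() * 100
    out["cum_percent"] = out["percent"].cumsum()
    return out

df = pd.DataFrame({"grade": ["a", "a", "a", "b"]})
summary = value_summary(df, "grade")
```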

classmethod del_tempfiles(tempdata=False)

To delete the temporary files of the project.

Parameters

tempdata (bool, optional) – Default is ‘False’. Set to ‘True’ to delete temporary data in ‘data/temp’ directory

Return type

None

static delta_todate(num_yyyyddd)

To convert timedelta to date

Parameters

num_yyyyddd

Return type

datetime.datetime

dump(path: str, compression_level: int = 7)

To dump DataFrame to the project’s data/temp directory

Parameters
  • path (str) –

  • dir (str, optional) – Default = data/temp

  • compression_level (int, optional) –

Return type

None

duplicated(colname_list: Union[str, list], return_dups=False, keep: bool = False) int

To count the duplicated rows, given a list of columns that contain the unique key.

Parameters
  • colname_list (Union[str, list]) –

  • return_dups (bool, optional) – Default = False. Set to True to return a tuple containing (count, df_duplicates).

  • keep (bool, optional) –

Returns

Number of Duplicated Rows

Return type

int
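For orientation, the count described above corresponds closely to pandas' own DataFrame.duplicated. The sketch below is an approximation rather than the dsx source; count_duplicates is a hypothetical name.

```python
import pandas as pd

def count_duplicates(df, colname_list, return_dups=False, keep=False):
    """Count rows whose key columns repeat; optionally return those rows."""
    # keep=False flags every member of a duplicated group, not just the repeats.
    mask = df.duplicated(subset=colname_list, keep=keep)
    count = int(mask.sum())
    if return_dups:
        return count, df[mask]
    return count

df = pd.DataFrame({"id": [1, 1, 2, 3, 3, 3], "v": list("abcdef")})
n_all = count_duplicates(df, "id")                   # all rows in repeated groups
n_extra = count_duplicates(df, ["id"], keep="first") # repeats beyond the first
n_dups, dup_rows = count_duplicates(df, "id", return_dups=True)
```

With keep=False the count includes the first occurrence of each repeated key, which is usually what is wanted when checking whether a column is a valid unique key.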

get_dfname(set=True)

To get name of the variable.

Only works in IPython.

Parameters

var

Returns

variable_name

Return type

str

static get_varname(var: object)

To get name of the variable.

Only works in IPython.

Parameters

var

Returns

variable_name

Return type

str

info()

To generate the meta-data of the DataFrame. Meta-data includes the following:
  • Column Names

  • Missing Count

  • Missing Percentage

  • Unique Value Count (nunique)

  • Unique Value Percentage

Return type

pandas.core.frame.DataFrame
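A per-column report of this shape can be assembled from isnull and nunique in plain pandas. The sketch below is illustrative only: describe_columns and the column labels are hypothetical, and the dsx output may use different names or ordering.

```python
import pandas as pd

def describe_columns(df):
    """One row per column: missing and unique-value counts and percentages."""
    n_rows = len(df)
    meta = pd.DataFrame({
        "missing_count": df.isnull().sum(),
        "nunique": df.nunique(),   # NaN is not counted as a unique value
    })
    meta["missing_pct"] = meta["missing_count"] / n_rows * 100
    meta["nunique_pct"] = meta["nunique"] / n_rows * 100
    meta.index.name = "column"
    return meta

df = pd.DataFrame({"a": [1, None, 3, 3], "b": ["x", "y", "y", None]})
meta = describe_columns(df)
```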

static interactive()

Apply the following interactive settings:
  • Set InteractiveShell.ast_node_interactivity = "all"

  • Set mpl.use("module://backend_interagg")

  • Set plt.ion()

isnull(colname: str) tuple

Count the rows (and the %) of missing values in the specified column

Parameters

colname (str) – Single column name

Returns

(Count of Missing Rows, Percentage of Missing Rows)

Return type

tuple

isnull_list(col_names_list=None) pandas.core.frame.DataFrame

Generate a report of cases with missing values

Parameters

col_names_list (list, optional) – List of columns to be included in the report. If not specified, all columns will be used.

Return type

pandas.core.frame.DataFrame

len_compare(df_to_compare, overwrite_df1=None) tuple

Compare the length of two DataFrames (or any other enumerable objects)

Parameters
  • df_to_compare

  • overwrite_df1 (bool, optional) – To ignore this instance of DataFrame and use the DataFrame in parameter as the copy to be compared.

static matplotlib_config()

Print matplotlib configurations

Returns

lines of texts

Return type

str

merge(right, how='left', on=None, left_on=None, right_on=None, isnull=None) pandas.core.frame.DataFrame

To merge with another DataFrame. A wrapper method for ‘merge’ in pandas, with additional checking mechanisms. The method also creates a backup of the original DataFrame with the key ‘last’ in dsx.backup_repo (dictionary).

Parameters
  • right (pd.core.frame.DataFrame) –

  • isnull (str) –

Return type

pd.core.frame.DataFrame
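The description does not say which checks the wrapper performs, so the following is only one plausible shape for a "merge with checks" in plain pandas. Everything here is an assumption for illustration: the checked_merge name, the module-level backup_repo dict, and the choice of checks (row counts and unmatched left rows).

```python
import pandas as pd

backup_repo = {}  # hypothetical stand-in for the backup dictionary

def checked_merge(left, right, how="left", on=None):
    """Merge, keep a copy of the original, and report basic sanity checks."""
    backup_repo["last"] = left.copy()
    # indicator=True adds a _merge column saying where each row came from.
    merged = left.merge(right, how=how, on=on, indicator=True)
    report = {
        "rows_before": len(left),
        "rows_after": len(merged),
        "unmatched_left": int((merged["_merge"] == "left_only").sum()),
    }
    return merged.drop(columns="_merge"), report

left = pd.DataFrame({"key": [1, 2, 3], "x": ["a", "b", "c"]})
right = pd.DataFrame({"key": [1, 2, 2], "y": [10, 20, 21]})
merged, report = checked_merge(left, right, on="key")
```

A row count that grows after a left merge, as it does here, is a common sign that the right-hand key is not unique.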

nunique(col_names_list=None) pandas.core.frame.DataFrame
To generate:
  1. the number of unique values

  2. the percentage of unique values over the total records (or rows)

Parameters

col_names_list (list) – If not specified, all column names will be used

Return type

pd.core.frame.DataFrame

static plt_labels(percent=False, fontsize=None, color=None, denominator=None)

To insert a label for each element in the current axes (last chart created).

Parameters
  • percent (bool) –

  • fontsize (float) –

  • color (str) –

  • denominator (float) –

Return type

None

static progress(iterable: collections.abc.Iterable, counter: int) str

To return a string template for the progress of a loop operation.

Parameters
  • iterable (Iterable) –

  • counter (int) –

Return type

str
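The format of the returned string is not documented here, so the sketch below simply shows one way such a progress string could be built with the standard library. The progress_text name and the "done/total (percent)" layout are assumptions, and the iterable is assumed to support len().

```python
def progress_text(iterable, counter):
    """Return a short progress message for position counter in iterable."""
    total = len(iterable)  # assumption: the iterable is sized
    percent = counter / total * 100 if total else 0.0
    return f"{counter}/{total} ({percent:.1f}%)"

items = range(10)
halfway = progress_text(items, 5)
messages = [progress_text(items, i) for i, _ in enumerate(items, start=1)]
```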

rename(col_index_or_name: Union[str, int], col_name_new, inplace: bool = True)

To rename a single column

Parameters
  • col_index_or_name

  • col_name_new

  • inplace

Returns

renamed_DataFrame – Only if inplace is set to False.

Return type

pd.core.frame.DataFrame
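Renaming by either position or label can be sketched in plain pandas as below. This is not the dsx code: rename_column is a hypothetical name, and only the non-inplace behaviour is shown.

```python
import pandas as pd

def rename_column(df, col_index_or_name, col_name_new):
    """Return a copy of df with one column renamed, addressed by index or name."""
    if isinstance(col_index_or_name, int):
        # Translate a positional index into the current column label.
        old_name = df.columns[col_index_or_name]
    else:
        old_name = col_index_or_name
    return df.rename(columns={old_name: col_name_new})

df = pd.DataFrame({"a": [1], "b": [2]})
by_index = rename_column(df, 0, "first")
by_name = rename_column(df, "b", "second")
```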

reset_index(index_label: str = 'RID', inplace: bool = True)

To reset the index and immediately rename the old ‘index’ column to the index_label provided.

Parameters
  • index_label (str, optional) –

  • inplace (bool, optional) –

Returns

ONLY when inplace == False

Return type

pd.core.frame.DataFrame
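In plain pandas the same effect takes two steps, which is presumably what this method saves. The sketch below is an approximation with a hypothetical name, reset_index_as, and shows only the non-inplace path.

```python
import pandas as pd

def reset_index_as(df, index_label="RID"):
    """Move the index into a column named index_label."""
    # Naming the index first makes reset_index emit the column under that label.
    return df.rename_axis(index_label).reset_index()

df = pd.DataFrame({"x": [10, 20]})
out = reset_index_as(df)
```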

classmethod restore(name: str = 'last')

To restore the DataFrame or List (or any object with .copy() method)

Parameters
  • df – DataFrame or List (or any object with .copy() method)

  • name – Name of the backup. To be used to retrieve the data.

Return type

Object
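The backup and restore pair can be pictured as a name-keyed dictionary of copies. The following standard-library sketch illustrates that idea under assumptions: the function bodies are not taken from dsx, and the module-level backup_repo dict merely mirrors the dictionary mentioned in the merge description.

```python
# Hypothetical stand-in for the backup store: a plain dict keyed by name.
backup_repo = {}

def backup(obj, name="last"):
    """Store a copy of obj (anything with a .copy() method) under name."""
    backup_repo[name] = obj.copy()

def restore(name="last"):
    """Return a fresh copy of the object stored under name."""
    return backup_repo[name].copy()

rows = [3, 1, 2]
backup(rows)           # saved under the default key "last"
rows.sort()            # mutate the original
original = restore()   # the backup still holds the unsorted data
```

Copying on both the way in and the way out keeps the stored snapshot isolated from later changes to either object.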

rs(bk_name: Optional[str] = None, inplace=True)

To restore the dataframe.

Parameters
  • bk_name

  • inplace

Return type

pandas.core.frame.DataFrame

classmethod set_dirs(root=False)

Set the project root folder.

Parameters

root (bool, optional) – To indicate whether the current active directory is the root or sub-directory of the project

Return type

None

static set_ipython(node_interactivity: str = 'last')

Set ast_node_interactivity in IPython.core.InteractiveShell

Parameters

node_interactivity (str, optional) – Default is ‘last’. DSX uses ‘all’ if kernel is detected.

Return type

None

classmethod setup_project(root=True, get_xfiles=False, xfiles_url=None, git_files=False)

Set up project directories for new projects. Existing directories will not be overwritten.

Parameters
  • root (bool, optional) –

  • get_xfiles (bool, optional) –

  • git_files (bool, optional) –

Return type

None

split(col: str, sep: str, index_label: str = 'RID', drop_innerindex: bool = True, reset_index_inplace: bool = True)

To generate a DataFrame by splitting the values in a string, where the values are separated by a separator character.

This method improves upon the original split method in pandas: where a row contains no separator, its value is still carried over to the newly generated DataFrame.

Parameters
  • col (str) –

  • sep (str) –

  • index_label (str) –

  • drop_innerindex (bool) –

  • reset_index_inplace

Return type

pd.core.frame.DataFrame
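One way to get a comparable result in plain pandas is str.split followed by explode, which also keeps rows that contain no separator. The sketch below is an assumption about the output shape (one row per value, keyed by the original row index); split_rows is a hypothetical name and the inner-index options are not modelled.

```python
import pandas as pd

def split_rows(df, col, sep, index_label="RID"):
    """One output row per separated value, keyed by the original row index."""
    # Rows without the separator become one-element lists and survive explode.
    parts = df[col].str.split(sep).explode()
    return parts.rename_axis(index_label).reset_index()

df = pd.DataFrame({"tags": ["a;b", "c"]})
long_form = split_rows(df, "tags", ";")
```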

to_dict(key_col: str, val_col: str) pandas.core.frame.DataFrame

To generate a dictionary from two columns

Parameters
  • key_col (str) –

  • val_col (str) –

Return type

pd.core.frame.DataFrame
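Building a lookup from two columns can be sketched in plain pandas by zipping them together. This illustration returns a plain dict, which is an assumption; columns_to_dict is a hypothetical name, and later duplicate keys overwrite earlier ones.

```python
import pandas as pd

def columns_to_dict(df, key_col, val_col):
    """Map each value in key_col to the value in val_col on the same row."""
    return dict(zip(df[key_col], df[val_col]))

df = pd.DataFrame({"code": ["a", "b", "a"], "score": [1, 2, 3]})
lookup = columns_to_dict(df, "code", "score")
```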

to_excel_stringify(dir=None, strings_to_urls_bool=False)

Faster option to export an Excel file, with the option to stringify all hyperlinks in the table.

Parameters
  • dir

  • strings_to_urls_bool

static to_numeric(inputString)

To convert string to numeric

Parameters

inputString

xv(title=None, convert_time=True, width='100%', height='1200', dirhtml='../_temp', dirbase='_temp', **kwargs)
Parameters
  • title (str) – Title for the new viewer file.

  • convert (bool) – Convert datetime dtype to str for display.
