Preppin' Data 2020: Week 2 challenge solution
Here is my solution for Peppin’ Data 2020, Week 2. The goal for this challenge was to clean and format time entries to make them fit the 24-hour clock.
Original challenge and data set on Preppin’ Data’s blog.
Step 1: Identifying time entries in incorrect format
I started with adding a Cleaning step after the input and creating a new calculated field called Correct time using the following formula:
This calculation uses Regex or Regular expressions which are sequences of characters that allow you to identify, parse, or replace patterns in string data. There are four Regex functions in Tableau Prep: REGEXP_MATCH, REGEXP_EXTRACT, REGEXP_EXTRACT_NTH, and REGEXP_REPLACE.
If you would like to learn more about how you can use Regex functions in Tableau Prep, I would recommend starting with Carl’s post on Preppin’ Data’s blog.
In this case, I used the REGEXP_MATCH function to check entries in the Time field and see if they match the correct 24-hour time format, i.e. a set of two digits for an hour and a set of two digits for minutes separated by colon. In Regex syntax it is \d\d:\d\d
(where \d
is for a digit from 0 to 9). Once applied, the REGEXP_MATCH function returned ‘True’ for time entries in the correct format, and ‘False’ for time entries that should be fixed. As you can see in the screenshot below, there are 10 entries that should be fixed.
Step 2: Extracting time data from incorrectly formatted time entries
To make the fixing process easier, I filtered out all ‘True’ values from the Correct times field. Then I cleaned the Time field by removing punctuation and making all letters lowercase. After that, I used another Regex function, REGEXP_EXTRACT, to create a new field called Clean time that contained only digits from the Time field. Here the Regex \d+
means one or more digits from 0 to 9.
Step 3: Formatting incorrect time entries
Once I got to this stage, I updated the Clean time field to add a leading zero, if it was missing. For example, there is an entry for January 2nd that reads ‘719a’, but instead should read ‘07:19’. To check whether a zero should be added and add it, I created a calculated field that updated the existing Clean time field. To update an existing field with a new calculation, simply name the calculation with the exact name of the field you would like to update. In this case, it was the Clean time field.
Then I split the updated field Clean time into two new fields: Clean time_hour and Clean time_minute. For this I used another Regex function, REGEXP_EXTRACT_NTH. This function returns a part of the given string (the Clean time field), that is matched to the nth capturing group indicated in parenthesis (two groups of two digits: (\d\d)(\d\d)
), where n is the given index (1 and 2 respectively). For this challenge you can also use other string calculations such as LEFT, MID, and RIGHT to come to the same result.
For the Clean time_hour field
For the Clean time_minute field
I kept the data type for these new fields as string to preserve the leading zero for all time entries before midday.
The final step in this formatting process is to fix (if needed) the number for an hour, depending on whether the hour is before noon. Also, I needed to make sure that midnight is formatted as 00:xx rather than 12:xx. To do so, I created a new calculated field that updated the Clean time_hour field:
After fixing the hour format, I concatenated the hour and minutes fields into a Time new field:
After this calculation, I removed any unnecessary fields and renamed the Time new field into Time. At the end of this step, there were only two fields: Time and Date.
Step 4: Union two tables
Now I could union this updated table with the original dataset. After adding the Union step, I created a calculated field Correct times that checked the format of time entries. The formula is exactly the same as in the beginning of our journey:
Then I simply filtered all rows marked as ‘False’ (there were 10 rows that were originally incorrectly formatted), and got a clean data set with 50 rows of data, exactly the same as in the input file.
To finish the challenge, I removed any unnecessary fields and created a new calculated field to get a combined Date and Time field:
I changed the field’s data format to be of type Date & Time to get the final output as required:
You can download the completed workflow for this challenge in my GitHub repository. Let me know if you have any questions.