Week 2 Notes

Created by Prof. Christopher Kinson


Things Covered in this Week’s Notes


Accessing and importing data

One of the first things we do as data engineers is to see, touch, and feel a dataset. Accessing and importing involve knowing a little bit about the data beforehand, such as how the data is organized, the number of records and fields, where the records begin, the presence of delimiters, and any end-of-line specifics.

More details about this background knowledge are discussed below.


Exploring file formats and extensions

Structured data

Structured data sets are simply files with records (observations) and fields (columns). The fields are often organized in a particular way, either with delimiters or with fixed widths.

Delimiters are specific characters used to separate fields of information, such as commas, colons, spaces, tabs, and even pipes. In Europe, the semicolon is a typical delimiter.

When information is set to a fixed width, field boundaries can essentially be counted by eye. The widths are not required to be the same for all columns; each column only needs its own fixed width.

These structured data sets are usually saved (or named) with a file extension, and the file extension serves as a clue for the way the file is delimited or organized. A file name can have no extensions or even multiple extensions such as .nb.html and .tar.gz.

Here’s a table of common file formats, delimiters, and extensions for data we might work with. What’s represented here are conventions, not strict rules.

Format                             Delimiter     Extension
comma-separated                    ,             .csv
tab-delimited or tab-separated     \t            .txt or .tsv
fixed-width                        field width   .dat or .txt
delimited or delimiter-separated   | or : or ;   .txt, .csv, or .dat
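
As a minimal sketch of how these delimiters come into play at import time (assuming Python with pandas; the file names are hypothetical stand-ins):

import pandas as pd

# The delimiter is passed explicitly when it is not a comma.
commas = pd.read_csv("data.csv")                     # comma-separated (the default)
tabs = pd.read_csv("data.tsv", sep="\t")             # tab-separated
pipes = pd.read_csv("data.txt", sep="|")             # pipe-delimited
fixed = pd.read_fwf("data.dat", widths=[10, 2, 8])   # fixed-width fields (widths assumed)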

Below is a .csv file.

Victim_Name, Age, Race_Ethnicity, City, State, Date_of_Death, Before_Police_Arrived, Officer, Police_Department, Armed_Before_Police_Arrived, Officer_Charged, Officer_Race_Ethnicity
Rayshard Brooks, 27, Black, Atlanta, GA, 06-12-2020, Sleeping in car at Wendy's, Garret Rolfe, Atlanta Police Department, N, Y, White
Dominique Clayton, 32, Black, Oxford, MS, 05-19-2019, Sleeping in bed at home, Matthew Kinne, Oxford Police Department, N, Y, White
Michael Lorenzo Dean, 28, Black, Temple, TX, 12-02-2020, Driving in car on road, Carmen DeCruz, Temple Police Department, N, Y, Latinx
George Floyd, 46, Black, Minneapolis, MN, 05-25-2020, Driving in car from store, Derek Chauvin, Minneapolis Police Department, N, Y, White
Atatiana Jefferson, 28, Black, Fort Worth, TX, 10-12-2019, Babysitting nephew at home, Aaron Dean, Fort Worth Police Department, Y, Y, White
Sean Monterrosa, 22, Latinx, Vallejo, CA, 06-02-2020, In Walgreens parking lot, Jarrett Tonn, Vallejo Police Department, N, N, White
Eric Reason, 38, Black, Vallejo, CA, 11-10-2019, In JJ's Fish and Chicken parking lot, Virgil Thomas, Richmond Police Department, Y, N, Black
Breonna Taylor, 26, Black, Louisville, KY, 03-13-2020, Sleeping in bed at home, Brett Hankison, Louisville Metro Police Department, N, N, White

Data may be organized and delimited, and it is helpful to check that the data we imported honors that organization. One way to do that is to check the descriptor portion of the data. In different programming languages, the descriptor portion might also be known as the data’s contents, structure, or summary. In addition to the descriptor portion, we might just want to see the data by printing it out. Checking both the printout and the descriptor portion provides a quick verification that the data was imported successfully (assuming no processing or syntax errors have been detected).

[Image: printout of the .csv file and its descriptor portion]
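
A minimal sketch of producing that printout and descriptor portion in Python with pandas (the file name police_killings.csv is a hypothetical stand-in for the .csv above):

import pandas as pd

# Import the comma-separated file; skipinitialspace=True trims the blank
# space that follows each comma in the raw file above
df = pd.read_csv("police_killings.csv", skipinitialspace=True)

print(df)   # the printout: records and fields as imported
df.info()   # the descriptor portion: column names, types, and counts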

End-of-line specifics might play a role in importing the data successfully if the data was created on an operating system (OS) different from the one we are currently using. One common end-of-line issue is that files on the Windows OS mark the end of a line with a carriage return and line feed (CRLF), while Unix-like OSes mark it with a line feed (LF) alone. Thus, you may need to adjust for the different end-of-line types in your programming language.
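
A minimal sketch of normalizing Windows-style line endings in Python (the file names are hypothetical). Note that Python's text mode already translates CRLF to \n when reading, so explicit normalization like this is only needed in edge cases:

# Read with newline="" so the original CRLF endings are preserved
with open("windows_file.csv", "r", newline="") as f:
    raw = f.read()

# Convert CRLF to LF, then write the file back out with Unix endings
normalized = raw.replace("\r\n", "\n")
with open("unix_file.csv", "w", newline="\n") as f:
    f.write(normalized)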

Semi-structured data

Semi-structured data may be files that are organized with tags or attributes but whose information is in human-readable text; this mix gives rise to the name “semi-structured”. Some common examples of semi-structured data are stored in JSON, GeoJSON, XML, and XLSX. These file formats are quite popular in web development and are often created for communication between the web and applications. Due to the semi-structure, accessing and importing these files may require special packages and tools inside (or outside) your programming language.

Below is a JSON data file.

[{"Victim_Name":"Rayshard Brooks","Age":27,"Race_Ethnicity":"Black","City":"Atlanta","State":"GA","Date_of_Death":"06-12-2020","Before_Police_Arrived":"Sleeping in car at Wendy's","Officer":"Garret Rolfe","Police_Department":"Atlanta Police Department","Armed_Before_Police_Arrived":"N","Officer_Charged":"Y","Officer_Race_Ethnicity":"White"},{"Victim_Name":"Dominique Clayton","Age":32,"Race_Ethnicity":"Black","City":"Oxford","State":"MS","Date_of_Death":"05-19-2019","Before_Police_Arrived":"Sleeping in bed at home","Officer":"Matthew Kinne","Police_Department":"Oxford Police Department","Armed_Before_Police_Arrived":"N","Officer_Charged":"Y","Officer_Race_Ethnicity":"White"},{"Victim_Name":"Michael Lorenzo Dean","Age":28,"Race_Ethnicity":"Black","City":"Temple","State":"TX","Date_of_Death":"12-02-2020","Before_Police_Arrived":"Driving in car on road","Officer":"Carmen DeCruz","Police_Department":"Temple Police Department","Armed_Before_Police_Arrived":"N","Officer_Charged":"Y","Officer_Race_Ethnicity":"Latinx"},{"Victim_Name":"George Floyd","Age":46,"Race_Ethnicity":"Black","City":"Minneapolis","State":"MN","Date_of_Death":"05-25-2020","Before_Police_Arrived":"Driving in car from store","Officer":"Derek Chauvin","Police_Department":"Minneapolis Police Department","Armed_Before_Police_Arrived":"N","Officer_Charged":"Y","Officer_Race_Ethnicity":"White"},{"Victim_Name":"Atatiana Jefferson","Age":28,"Race_Ethnicity":"Black","City":"Fort Worth","State":"TX","Date_of_Death":"10-12-2019","Before_Police_Arrived":"Babysitting nephew at home","Officer":"Aaron Dean","Police_Department":"Fort Worth Police Department","Armed_Before_Police_Arrived":"Y","Officer_Charged":"Y","Officer_Race_Ethnicity":"White"},{"Victim_Name":"Sean Monterrosa","Age":22,"Race_Ethnicity":"Latinx","City":"Vallejo","State":"CA","Date_of_Death":"06-02-2020","Before_Police_Arrived":"In Walgreens parking lot","Officer":"Jarrett Tonn","Police_Department":"Vallejo Police Department","Armed_Before_Police_Arrived":"N","Officer_Charged":"N","Officer_Race_Ethnicity":"White"},{"Victim_Name":"Eric Reason","Age":38,"Race_Ethnicity":"Black","City":"Vallejo","State":"CA","Date_of_Death":"11-10-2019","Before_Police_Arrived":"In JJ's Fish and Chicken parking lot","Officer":"Virgil Thomas","Police_Department":"Richmond Police Department","Armed_Before_Police_Arrived":"Y","Officer_Charged":"N","Officer_Race_Ethnicity":"Black"},{"Victim_Name":"Breonna Taylor","Age":26,"Race_Ethnicity":"Black","City":"Louisville","State":"KY","Date_of_Death":"03-13-2020","Before_Police_Arrived":"Sleeping in bed at home","Officer":"Brett Hankison","Police_Department":"Louisville Metro Police Department","Armed_Before_Police_Arrived":"N","Officer_Charged":"N","Officer_Race_Ethnicity":"White"}]

Unstructured data

Unstructured data is often unorganized and in human-readable text. What I mean by unstructured data is text data. In text - when we think of documents, papers, journals, books - we think of the writing as structured with an introduction, main thesis, supporting paragraphs, and a conclusion. We often think of the writing as well-organized, with a logical flow of ideas, correct punctuation, and limited spelling and grammatical errors. These things are true of well-written text in paper or digital copy. But imagine this book or article existing as a dataset. How would you organize it? Would there be records and fields? What would the main information consist of? Any delimiters?

These questions are not easy to answer and are equally difficult for a computer to figure out. Yet, text is so readily available for analysis that new approaches in text mining and natural language processing are ripe for the taking.

Below are two texts.

Angela Davis from the Women's March (2017) transcribed in The Guardian. "We recognize that we are collective agents of history and that history cannot be deleted like web pages. We know that we gather this afternoon on indigenous land and we follow the lead of the first peoples who despite massive genocidal violence have never relinquished the struggle for land, water, culture, their people. We especially salute today the Standing Rock Sioux. The freedom struggles of black people that have shaped the very nature of this country's history cannot be deleted with the sweep of a hand. We cannot be made to forget that black lives do matter. This is a country anchored in slavery and colonialism, which means for better or for worse the very history of the United States is a history of immigration and enslavement. Spreading xenophobia, hurling accusations of murder and rape and building walls will not erase history."

Marsha P. Johnson from Kasino, Michael (2012). Pay It No Mind - The Life and Times of Marsha P. Johnson (Documentary film). "How many people have died for these two little statues to be put in the park to recognize gay people? How many years does it take for people to see that we're all brothers and sisters and human beings in the human race? I mean how many years does it take for people to see that we're all in this rat race together."

Typically, we can create a structure that is more friendly and organized for computers to handle. Often, the records are the different texts or documents (speeches, emails, tweets, articles) in plain text format. If there are any fields included in the text data, then they may be the document source, the ID of the document, the published date, etc. Alternatively, we could transpose this idea so that the records are the document source or text ID, while the fields are the texts. If the text could be separated into individual words such that each record is a unique term, then we may call this structure a term-document matrix (TDM). Its transpose is a document-term matrix (DTM). We will revisit TDMs and DTMs later in the semester.
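
As a minimal sketch of these structures (assuming Python with scikit-learn, which these notes do not require), a document-term matrix and its transpose:

from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents standing in for the speeches above
docs = ["we are collective agents of history",
        "we are all in this rat race together"]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)   # document-term matrix: one row per document
tdm = dtm.T                            # term-document matrix: one row per unique term

print(vectorizer.get_feature_names_out())   # the unique terms
print(dtm.toarray())                        # counts of each term in each document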

Reading Comprehension

After reading the notes, students should be able to answer these questions.

  • Is data saved as .xlsx structured, semi-structured, or unstructured?

  • Are comma-separated datasets only allowed to have commas as delimiters?

Try It Out!

After reading the notes, students should be able to attempt these problems.

  • Import the .csv and .json files above in your preferred programming language.

  • For each dataset, show the printout and the descriptor portion.


Formatting data

Formatting data, especially fields, is a common task prior to data wrangling. It is important to ensure the information you are accessing is represented accurately and easily understood. Two frequent aspects of formatting involve dates and times, and character encoding.

Date and time

When a data set contains date and time information in the fields (columns), the dates and times may be correctly imported internally by the programming language, but misinterpreted externally by the users. Most programming languages, operating systems, and software internally store dates and times as a value in reference to some specific date. For example, in SAS, the reference date is January 1, 1960. In R and Python, internal dates and times are in reference to January 1, 1970. You may find it necessary to convert strings into date values or re-format existing date values.

Below is a table of some standard date formats that work across Python and R. For more information about your programming language’s specific formatting for dates, see Python with the datetime module, SAS, R with the tidyverse, and R with base R functionality.

Code Meaning
%a Abbreviated weekday name
%A Full weekday name
%b Abbreviated month name
%B Full month name
%c Date and time
%d Day of the month (01-31)
%H Hours (24 hour)
%I Hours (12 hour)
%j Day of the year numbered (001-366)
%m Month numbered (01-12)
%M Minute numbered (00-59)
%p AM/PM
%S Second numbered (00-61)
%U Week of the year starting on Sunday numbered (00-53)
%w Weekday starting on Sunday numbered (0-6)
%W Week of the year starting on Monday numbered (00-53)
%y 2-digit year
%Y 4-digit year
%z Offset from UTC
%Z Time zone (character)
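
For instance, a minimal sketch of converting a string into a date value and re-formatting it with Python's datetime module, using the codes above:

from datetime import datetime

# Parse a string into an internal date value
d = datetime.strptime("06-12-2020", "%m-%d-%Y")

# Re-format the internal value for display
print(d.strftime("%A, %B %d, %Y"))   # Friday, June 12, 2020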

Below is a table of some common date formats that work in SAS.

Format External Date Example
MMDDYY10. 01/01/1960
MMDDYY8. 01/01/60
MMDDYY6. 010160
DDMMYY10. 31/12/1960
DDMMYY8. 31/12/60
DDMMYY6. 311260
DATE7. 31DEC59
DATE9. 31DEC1959
WORDDATE. January 1, 1960
WEEKDATE. Friday, January 1, 1960
MONYY7. JAN1960
YEAR4. 1960

Character encoding

In some text datasets, the files are saved with a particular character encoding. Some common encodings include Unicode (UTF-8), American Standard Code for Information Interchange (ASCII), Chinese Guobiao, and Korean encodings. The tricky part with text data is that if the encoding you assume for your data is different from the actual encoding, then your text file may not get imported successfully. You may run into issues with missing characters, and the accessing and importing process may stop because of these issues. Even bigger issues can arise in text analysis, because you will not have all the correct characters to identify features and unique terms.
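
A minimal sketch of declaring an encoding at import time in Python (the file name is hypothetical):

# Declare the expected encoding explicitly; a wrong guess raises
# UnicodeDecodeError rather than silently importing bad characters
with open("speeches.txt", encoding="utf-8") as f:
    text = f.read()

# errors="replace" substitutes a placeholder character instead of
# stopping the import, which can help diagnose encoding problems
with open("speeches.txt", encoding="ascii", errors="replace") as f:
    text = f.read()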

Try It Out!

After reading the notes, students should be able to attempt these problems.

  • Create one single data set for the two texts above. Use ASCII for the encoding.

  • Do you think the way that you organized the data can be improved?

  • Import the data from a URL in your preferred programming language, by uploading the data to your Box and creating a shared link.

  • Are there any encoding or data processing issues? If so, correct them and re-import the data.