The Internet of Things (IoT) results in tremendous amounts of data in many organizations. This data may be stored in the cloud, in on-premises databases, or even within edge devices themselves. The IoT solution should provide a way to get this data to the right location, in the right format, at the right time. Data access and transfer can be used to get the data to the right location. Data transformation can be used to get the data into the right format. Automation can be used to make it all happen at the right time so that the data is there when you need it.
Part of this process, in the world of data science, is called data wrangling. The term is borrowed from horse ranching (and ranching in general). A wrangler, in that world, is one who herds the horses, bringing them together at a target location for better control and care. Wrangling is the process used to accomplish the herding. Interestingly, another general definition of wrangling is to argue, quarrel, or dispute.
In the case of our IoT data, both of these definitions apply. First, data wrangling brings the data from multiple sources together into a single location. This action is very important in IoT because the data may originate in several disconnected systems. For example, some of the data may come from sensors that work through an MQTT broker to store data in a MongoDB database, while other data may have to be pulled directly from the sensors because it is stored only on the devices themselves. Wrangling can be used to bring this data together into a central location for analysis.
In practical terms, wrangling data from multiple sources can be achieved through specialized applications that support accessing data in multiple database source formats. It can also be achieved using scripting languages like Python as long as the language (or some added module) has the ability to communicate with the data source. Of course, the appropriate networking protocols and authentication protocols must also be supported to gain access to the systems.
To pull the data from multiple sources, several methods can be used, including:
- API access
- Web scraping
- Database connections
- Direct file access
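As a minimal sketch of two of these methods, the Python below reads readings from a CSV file (direct file access) and from a database table (database connection), then merges them into one list. Everything specific here is an assumption made for illustration: the field names, the sample values, the `readings` table, and the use of an in-memory SQLite database in place of a real store such as MongoDB.

```python
import csv
import io
import sqlite3

# Hypothetical CSV export from an edge device (direct file access).
# io.StringIO stands in for an open file so the sketch runs anywhere.
SENSOR_CSV = io.StringIO(
    "sensor_id,timestamp,temp\n"
    "edge-1,2024-01-01T00:00:00,21.5\n"
    "edge-1,2024-01-01T00:05:00,21.7\n"
)


def read_file_source(handle):
    """Read readings from a CSV file-like object."""
    return [
        {
            "sensor_id": row["sensor_id"],
            "timestamp": row["timestamp"],
            "temp": float(row["temp"]),
            "source": "file",
        }
        for row in csv.DictReader(handle)
    ]


def read_db_source(conn):
    """Read readings from a database table (the schema is an assumption)."""
    cursor = conn.execute("SELECT sensor_id, timestamp, temp FROM readings")
    return [
        {"sensor_id": s, "timestamp": t, "temp": v, "source": "db"}
        for s, t, v in cursor
    ]


# In-memory SQLite database standing in for a central data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor_id TEXT, timestamp TEXT, temp REAL)")
conn.execute("INSERT INTO readings VALUES ('hub-7', '2024-01-01T00:00:00', 70.2)")

# Bring both sources together into a single, time-ordered collection.
combined = read_file_source(SENSOR_CSV) + read_db_source(conn)
combined.sort(key=lambda r: r["timestamp"])
```

Tagging each record with its `source` is a design choice, not a requirement. It makes it easier to trace a suspicious value back to the system it came from.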
The first step in data wrangling is completed through data access and transfer; however, the wrangling process may not yet be complete. The horse wrangler will inspect the resulting herd to ensure the proper horses have been selected and no desired horses have been left out. In the same way, the IoT data must be evaluated to ensure that it is accurate – that it is the correct data. Has the right data been retrieved? Is the data structured correctly based on expected data structures? If not, some source of corruption may be in play. Is the data complete? If not, the sensors may be intermittently failing or communications with the network may be failing – resulting in lost data. Ultimately, you’re looking for missing data, incorrect data, improperly labeled data, and so on.
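The questions above can be turned into simple automated checks. The sketch below is one possible approach, not a complete validator: it flags records with missing fields, non-numeric temperatures, and long gaps between readings from the same sensor. The required field names and the 10-minute gap threshold are assumptions chosen for the example.

```python
from datetime import datetime, timedelta

REQUIRED_FIELDS = {"sensor_id", "timestamp", "temp"}  # assumed schema


def find_problems(readings, max_gap=timedelta(minutes=10)):
    """Return a list of (index, description) pairs for suspicious records."""
    problems = []
    previous_time = {}  # last valid timestamp seen per sensor
    for i, rec in enumerate(readings):
        # Structure check: are all expected fields present?
        missing = REQUIRED_FIELDS - rec.keys()
        if missing:
            problems.append((i, f"missing fields: {sorted(missing)}"))
            continue
        # Accuracy check: is the value the expected type?
        if not isinstance(rec["temp"], (int, float)):
            problems.append((i, "temp is not numeric"))
        # Completeness check: a long gap may indicate lost data.
        ts = datetime.fromisoformat(rec["timestamp"])
        last = previous_time.get(rec["sensor_id"])
        if last is not None and ts - last > max_gap:
            problems.append((i, f"gap of {ts - last} since previous reading"))
        previous_time[rec["sensor_id"]] = ts
    return problems


sample = [
    {"sensor_id": "edge-1", "timestamp": "2024-01-01T00:00:00", "temp": 21.5},
    {"sensor_id": "edge-1", "timestamp": "2024-01-01T00:30:00", "temp": 21.9},
    {"sensor_id": "edge-1", "timestamp": "2024-01-01T00:35:00"},
    {"sensor_id": "edge-1", "timestamp": "2024-01-01T00:40:00", "temp": "n/a"},
]

for index, description in find_problems(sample):
    print(index, description)
```

In this sample, the second record is flagged for a 30-minute gap, the third for a missing `temp` field, and the fourth for a non-numeric value.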
The final step is to clean, purge, prepare, or whatever you want to call it. This is about getting the data into the format needed for analysis and removing any unneeded data. For example, one system may be reporting temperatures in Celsius while another is reporting in Fahrenheit. Which format do you want to use for analysis? Converting all of the data into a consistent format first can reduce the workload in the actual analysis process.
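Here is a short sketch of the temperature example, assuming each record carries a `unit` field of "C" or "F" (a hypothetical convention; your sources may label units differently or not at all). It converts everything to Celsius using the standard formula C = (F - 32) × 5/9.

```python
def to_celsius(value, unit):
    """Convert a temperature to Celsius; unit is 'C' or 'F'."""
    unit = unit.upper()
    if unit == "C":
        return value
    if unit == "F":
        return (value - 32) * 5 / 9
    raise ValueError(f"unknown temperature unit: {unit!r}")


# Hypothetical readings from two systems that report in different units.
readings = [
    {"sensor_id": "edge-1", "temp": 21.5, "unit": "C"},
    {"sensor_id": "hub-7", "temp": 69.8, "unit": "F"},
]

# Build new records so the original data is left untouched.
normalized = [
    {**r, "temp": round(to_celsius(r["temp"], r["unit"]), 2), "unit": "C"}
    for r in readings
]
```

Raising an error on an unknown unit, rather than passing the value through, keeps improperly labeled data from slipping silently into the analysis.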
It is important to know that data wrangling may be required at multiple levels. Consider the wrangler who goes to the herd and selects out just 25 of the horses to place in a separate pen. Next, the wrangler may separate 7 of the horses from the 25 into yet another pen where each one is evaluated and treated as needed by a veterinarian. In the same way, the data set resulting from the above steps may still be too much for a given data analysis task. The data scientist may further wrangle the data for her specific needs.
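This second level of wrangling often amounts to filtering. As a rough sketch, the function below selects only the readings from chosen sensors within a time window. The `subset` helper, field names, and sample data are all hypothetical.

```python
from datetime import datetime


def subset(readings, sensor_ids, start, end):
    """Keep readings from the given sensors with start <= timestamp < end."""
    return [
        r
        for r in readings
        if r["sensor_id"] in sensor_ids
        and start <= datetime.fromisoformat(r["timestamp"]) < end
    ]


# Hypothetical wrangled data set from the earlier steps.
all_readings = [
    {"sensor_id": "edge-1", "timestamp": "2024-01-01T00:00:00", "temp": 21.5},
    {"sensor_id": "edge-1", "timestamp": "2024-01-02T00:00:00", "temp": 22.1},
    {"sensor_id": "hub-7", "timestamp": "2024-01-01T00:00:00", "temp": 21.0},
]

# The "separate pen": one sensor, one day.
pen = subset(
    all_readings,
    sensor_ids={"edge-1"},
    start=datetime(2024, 1, 1),
    end=datetime(2024, 1, 2),
)
```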
The intent of this article is not to walk you step by step through the entire data wrangling process or even to cover every action that you might take. However, with this basic understanding, you can begin to wrap your mind around what data wrangling actions might be required in your next IoT project.
-Tom