5 Transform and Load Data¶
The cannuckfind system does not perform the extraction routine against the API of the social network. Each application or situation will require its own extraction procedure, which will likely have to conform to various institutional requirements with respect to privacy, copyright, and so on. The libraries needed to get started are described below.
5.1 Import Library¶
To use these features, import the following from the cannuckfind package:
import geolocation
from geolocation import unknownC
import inDat
5.2 OpenFile¶
The first step is to import the file extracted from the data source. It is presumed that the extraction procedure has already been prepared and is unique to the source of the data. This procedure produces an external file that is assumed to hold a large pandas dataframe with a particular structure. This structure is documented in the Annex.
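As a rough illustration of the hand-off described above, an extraction procedure might end by writing its records to a CSV file that can later be read back as a pandas dataframe. This is only a sketch: the column names `user_id` and `location`, the file name `extract.csv`, and the records themselves are hypothetical, and the structure actually expected by cannuckfind is the one given in the Annex.

```python
import os
import tempfile

import pandas as pd

# Hypothetical records standing in for the output of an extraction step.
# The real column layout is the one documented in the Annex.
records = [
    {"user_id": 1, "location": "Toronto, ON"},
    {"user_id": 2, "location": "somewhere north"},
    {"user_id": 3, "location": "Vancouver"},
]

frame = pd.DataFrame(records)

# Write the extract to an external file (hypothetical name and location).
path = os.path.join(tempfile.mkdtemp(), "extract.csv")
frame.to_csv(path, index=False)

# Read it back to confirm the file holds the same rows and columns.
reloaded = pd.read_csv(path)
print(reloaded.shape)
```

Keeping the extraction output in a plain file means the privacy and copyright checks can be applied before any of the data reaches the loading step.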
- class inDat.inDat(thefile=101)¶
- SAMPSIZE¶
size of the random sample to load
- SEED¶
Default Seed for Random Sample
- datProg¶
stage of data analysis
- inDat.__init__()¶
Creates an object to import external data. Parameter exfile: path to the external file.
A separate method exists for loading subsequent files. When each string in a file contains a potential location, the file can be loaded with this method.
- inDat.loadrawfile()¶
load file extracted from raw source
5.3 Monitor Progress¶
An info method is available.
- inDat.inDatinfo()¶
Convenience function to give a quick look at the data. Returns 1 if the raw file has been read in, 2 if there is a sample of it, and -1 if something is wrong.
This implementation is peculiar to this package, as it returns a categorical variable indicating the stage of processing of the data. However, it also prints to the screen the summary data available at that stage.
5.3.1 Return Values for inDatinfo method¶
The method returns an integer value from -1 to 4, which is stored in the member datProg. These values are intended to be indicators of progress in the machine learning exercise. Their interpretations are:
Value | Progress |
---|---|
-1 | Error |
0 | Not Initialized |
1 | Raw Sample Only |
2 | Single Sample |
3 | Replacement Sampling |
4 | No Replacement Sampling |
The table above shows that six possible situations can exist.
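Calling code may find the integer codes easier to work with if they are given names. The following is a minimal sketch of one way to do that, not part of cannuckfind itself: the `DataProgress` enum and the `describe_progress` helper are hypothetical names, and only the values and labels are taken from the table.

```python
from enum import IntEnum


class DataProgress(IntEnum):
    """Hypothetical names for the documented datProg values."""
    ERROR = -1
    NOT_INITIALIZED = 0
    RAW_SAMPLE_ONLY = 1
    SINGLE_SAMPLE = 2
    REPLACEMENT_SAMPLING = 3
    NO_REPLACEMENT_SAMPLING = 4


def describe_progress(code: int) -> str:
    """Translate an integer status code into a readable label."""
    try:
        return DataProgress(code).name.replace("_", " ").title()
    except ValueError:
        # Any integer outside the documented range ends up here.
        return "Unknown code"


status = 2  # stand-in for a value such as the one returned by inDatinfo()
label = describe_progress(status)
print(label)
```

Because `IntEnum` members compare equal to plain integers, a check such as `status == DataProgress.SINGLE_SAMPLE` works directly on the returned value.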
5.4 Take sample¶
The primary purpose of the exercise is to produce a sample of a size that is amenable to analysis. A secondary purpose is to support statistical techniques that rely on bootstrapping.
- inDat.loadrandomsample()¶
In the following example, a random sample of 4 is drawn from a sample of 8.
testL1 = inDat.inDat('ex8.csv')
self.assertTrue(testL1.inDat.shape[0] == 8)
self.assertTrue(testL1.inDat.shape[1] == 16)
testS1 = testL1.loadrandomsample(smpsize = 4)
self.assertTrue(testS1.shape[0] == 4)
self.assertTrue(testS1.shape[1] == 4)
testL1 has a sample size of 8, as shown by the value of shape[0]. A random sample of 4 is selected from testL1 to create testS1.
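The difference between sampling without replacement and the replacement sampling used for bootstrapping can be sketched directly in pandas. This is an illustration of the general technique using `DataFrame.sample`, not a description of how loadrandomsample is implemented. The seed value, the column names, and the 8-row stand-in dataframe are all assumptions made for the sketch.

```python
import pandas as pd

SEED = 12345    # arbitrary seed, assumed only for this sketch
SAMPSIZE = 4    # sample size, matching the example above

# Stand-in for the loaded data: 8 rows with hypothetical column names.
raw = pd.DataFrame({
    "id": range(8),
    "location": [f"place {i}" for i in range(8)],
})

# Sampling without replacement: each row appears at most once.
no_repl = raw.sample(n=SAMPSIZE, replace=False, random_state=SEED)

# Sampling with replacement (bootstrap): rows may repeat, so the
# resample can be as large as the original data.
boot = raw.sample(n=len(raw), replace=True, random_state=SEED)

print(no_repl.shape, boot.shape)
```

Fixing `random_state` makes both samples reproducible, which is the usual reason for keeping a default seed alongside the sample size.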