Dataset objects can be constructed from text files that are formatted as follows. A data file should have one line (row) per observation (e.g. a gene), with its column values (e.g. the conditions) separated by a delimiter such as the tab or comma character. Tab is the default delimiter, but you can specify an alternative by passing an extra argument, e.g. delimiter=',', to the constructor. If the data values in the first column are not numeric (e.g. if they are gene names), this is detected and the entire first column is ignored. Every line in the file is read as data unless it begins with the Python comment character '#'. We are working to create a "smart" data file loader that will handle a wider range of input file formats.
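To make the file layout concrete, here is a minimal sketch of a reader that follows the rules described above. It is not the CompClust loader: the function name read_delimited is hypothetical, the sample gene rows are made up for illustration, and the choices to skip blank lines and to treat the first column as labels when any of its values fails to parse as a number are assumptions. It is written for Python 3 and uses only the standard library.

```python
import io


def read_delimited(handle, delimiter='\t'):
    """Parse rows in the layout described above (illustrative only)."""
    rows = []
    for line in handle:
        line = line.rstrip('\r\n')
        # Lines starting with '#' are comments. Skipping blank lines
        # is an assumption made for this sketch.
        if line.startswith('#') or not line.strip():
            continue
        rows.append(line.split(delimiter))

    def is_number(text):
        try:
            float(text)
            return True
        except ValueError:
            return False

    # Assumption: if any first-column value is non-numeric (e.g. gene
    # names), the whole first column is dropped.
    if rows and not all(is_number(row[0]) for row in rows):
        rows = [row[1:] for row in rows]
    return [[float(value) for value in row] for row in rows]


# Made-up sample: one comment line, then two observations.
sample = io.StringIO(
    "# expression levels, one gene per line\n"
    "YAL001C\t0.5\t1.25\t-0.75\n"
    "YAL002W\t2.0\t0.0\t1.5\n"
)
data = read_delimited(sample)
print(len(data), len(data[0]))
```

The two numbers printed are the row and column counts of the parsed values, which is a quick way to sanity-check a file before handing it to the Dataset constructor.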
When loading a dataset and its annotations from text files, there are two approaches for specifying the location of data files to load. You can change your current working directory to the directory that contains your dataset files, and then simply reference each file by its short file name (Option 1). Or you can construct the complete path name for each of the example files that you will be loading (Option 2). Either way, you have to know where your data files are prior to loading the data.
Where is your data?
If you are using CompClustShell for Windows (and you used the default installation options), the Cho example dataset files should be located in the directory:
C:\Program Files\CompClustShell\Examples\ChoCellCycling
If you are using the CompClust source distribution instead, the example dataset can be found within the CompClust source code in the directory (shown here using UNIX path separators):
compClust/gui/Examples/ChoCellCycling
For convenience, especially if you wish to specify full path names when you load your files, you can define a variable that holds the data root directory explicitly, e.g. in Windows (the r prefix makes this a raw string, so the backslashes are kept literally):
dataroot = r'C:\Program Files\CompClustShell\Examples\ChoCellCycling'
Or in a Unix-based operating system the path might be, for example:
dataroot = '/User/sam/compClust/CompClust/gui/examples/ChoCellCycling'
CompClust stores the location of the Cho example data directory in a configuration variable, which gives you a sure-fire way to find the example data no matter which operating system you use:
dataroot = config.cho_data_dir
Once you know where your data is, you can either change to that directory and load files by their short names, or include the data directory in the full file path.
Option 1: Change directories, then load files
Changing directories is very easy in CompClustShell and IPython (on which CompClustShell is based). You can use a simple "cd" command, e.g.:
cd Examples/ChoCellCycling
If you're using a basic Python shell, giving explicit examples for changing directories can be complicated because the exact form depends on the operating system. Here we'll keep it simple and assume you've defined the dataroot variable as above properly for your circumstances, so that you can simply type:
import os
os.chdir(dataroot)
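If you would rather not stay in the data directory for the rest of your session, one option is to change directories only temporarily. The sketch below assumes Python 3. The helper name working_directory is hypothetical and is not part of CompClust, and a temporary directory containing a one-line stand-in file takes the place of the real data directory so that the example runs anywhere.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def working_directory(path):
    """Temporarily switch the current working directory to `path`."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Always return to the starting directory, even if loading fails.
        os.chdir(previous)


# Stand-in for the real data directory (an assumption for this sketch).
demo_root = tempfile.mkdtemp()
with open(os.path.join(demo_root, 'ChoCycling.dat'), 'w') as handle:
    handle.write('YAL001C\t0.5\t1.25\n')

start = os.getcwd()
with working_directory(demo_root):
    # Inside the block, short file names resolve against demo_root.
    found = os.path.exists('ChoCycling.dat')
restored = os.getcwd() == start
```

With your own setup you would pass dataroot instead of demo_root and construct the Dataset inside the with block.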
Once you have changed to the directory containing the data files, you can then construct a Dataset from a text file (and name it 'cho') as follows:
cho = datasets.Dataset('ChoCycling.dat','cho')
Option 2: Loading files specified with full file paths
If instead you wish to specify the full path to the data files (rather than changing directories), and you have defined dataroot as above, you would use these commands:
import os
cho = datasets.Dataset(os.path.join(dataroot,'ChoCycling.dat'),'cho')
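The reason to use os.path.join rather than gluing strings together is that it inserts the correct separator for the operating system you are running on. As a small illustration (Python 3 assumed), the standard library modules ntpath and posixpath are the Windows and Unix implementations behind os.path, so calling them directly shows what the join produces on each platform. The directory values are the example locations given earlier.

```python
import ntpath
import posixpath

# Example locations from the text above.
windows_root = r'C:\Program Files\CompClustShell\Examples\ChoCellCycling'
unix_root = 'compClust/gui/Examples/ChoCellCycling'

# ntpath is what os.path resolves to on Windows, posixpath on Unix.
windows_path = ntpath.join(windows_root, 'ChoCycling.dat')
unix_path = posixpath.join(unix_root, 'ChoCycling.dat')

print(windows_path)
print(unix_path)
```

In your own session you only need os.path.join, which selects the right implementation automatically.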
As a final check that the Dataset was created successfully, calling cho.numRows() should return 384:
cho.numRows()
384
Joe Roden 2005-12-13