Loading a dataset from text files

The first step in loading a dataset from a text file is to create a Dataset object using the Dataset constructor. The constructor accepts a wide range of arguments, e.g. strings (which are interpreted as file names), lists of lists, and numeric arrays. In this tutorial we'll only demonstrate how to construct a Dataset from a text file, as that is the most common usage pattern.

Dataset objects can be constructed from text files that are formatted as follows. A data file should have one line (row) per observation (e.g. a gene), with its column values (e.g. the conditions) separated by a delimiter such as the tab or comma character. Tab is the default delimiter, but you can specify an alternative by passing an extra argument to the constructor, e.g. delimiter=','. If the values in the first column are not numeric (e.g. if they are gene names), this is detected and the entire first column is ignored. Every line in the file is read as data unless it begins with the Python comment character '#'. We are working to create a "smart" data file loader that will handle a wider range of input file formats.
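To make the format rules above concrete, here is a minimal sketch in plain Python that reads rows the way the text describes. It is not the CompClust loader: read_rows and _is_number are hypothetical helpers, the sample values are made up, and two details are assumptions (blank lines are skipped, and only the first row is inspected to decide whether the first column is non-numeric).

```python
import io


def _is_number(text):
    """Return True if text can be parsed as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def read_rows(lines, delimiter='\t'):
    """Hypothetical helper: parse delimited rows into lists of floats."""
    rows = []
    for line in lines:
        line = line.rstrip('\r\n')
        # Lines starting with '#' are comments. Skipping blank lines is an
        # assumption of this sketch, not documented CompClust behavior.
        if not line or line.startswith('#'):
            continue
        rows.append(line.split(delimiter))
    # Assumption: inspect only the first row to decide whether the first
    # column holds labels (e.g. gene names) and should be dropped.
    if rows and not _is_number(rows[0][0]):
        rows = [row[1:] for row in rows]
    return [[float(value) for value in row] for row in rows]


# Made-up sample: one comment line, then two labeled observations.
sample = "# made-up values\ngeneA\t0.5\t1.25\ngeneB\t-0.75\t2.0\n"
data = read_rows(io.StringIO(sample))
print(data)  # [[0.5, 1.25], [-0.75, 2.0]]
```

Passing delimiter=',' to read_rows handles a comma-separated file in the same way.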

When loading a dataset and its annotations from text files, there are two approaches for specifying the location of data files to load. You can change your current working directory to the directory that contains your dataset files, and then simply reference each file by its short file name (Option 1). Or you can construct the complete path name for each of the example files that you will be loading (Option 2). Either way, you have to know where your data files are prior to loading the data.

Where is your data?

If you are using CompClustShell for Windows (and you used the default installation options), the Cho example dataset files should be located in the directory:

  C:\Program Files\CompClustShell\Examples\ChoCellCycling

If you are using the CompClust source distribution instead, the example dataset can be found within the CompClust source code in the directory (shown here using UNIX path separators):


For convenience, especially if you wish to specify full path names when you load your files, you can define a variable that holds the data root directory explicitly. In Windows, for example (the r prefix makes this a raw string, so Python does not treat the backslashes as escape characters):

dataroot = r'C:\Program Files\CompClustShell\Examples\ChoCellCycling'

Or in a Unix-based operating system the path might be, for example:

dataroot = '/User/sam/compClust/CompClust/gui/examples/ChoCellCycling'
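Backslashes in Windows paths deserve some care, because Python treats a backslash in an ordinary string literal as the start of an escape sequence. A minimal sketch of the difference, using a made-up path that is not part of CompClust:

```python
# Hypothetical path chosen because it contains \t and \n.
plain = 'C:\temp\new'    # \t becomes a tab, \n becomes a newline
raw = r'C:\temp\new'     # raw string: backslashes are kept literally

print(len(plain))  # 9
print(len(raw))    # 11
print(raw)         # C:\temp\new
```

Writing the path with forward slashes, or building it with os.path.join, avoids the issue as well.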

In CompClust, the location of the Cho example data directory is stored in a configuration variable, which gives you a reliable way to find the example data no matter which operating system you use:

dataroot = config.cho_data_dir

Once you know where your data is, you can either change to that directory and load files by their short names, or include the data directory in each full file path.

Option 1: Change directories, then load files

Changing directories is very easy in CompClustShell and IPython (on which CompClustShell is based). You can use a simple "cd" command, e.g.:

cd Examples/ChoCellCycling

If you're using a basic Python shell, the exact form of a directory path depends on the operating system, which makes explicit examples awkward. To keep it simple, we'll assume you've defined the dataroot variable as described above for your system, so that you can change directories with the standard os module:

import os
os.chdir(dataroot)

Once you have changed to the directory containing the data files, you can construct a Dataset from a text file (and name it 'cho') as follows:

cho = datasets.Dataset('ChoCycling.dat','cho')

Option 2: Loading files specified with full file paths

If you would rather specify the full path to each data file (instead of changing directories), and you've defined dataroot as above, use this command:

import os
cho = datasets.Dataset(os.path.join(dataroot, 'ChoCycling.dat'), 'cho')
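Because a mistyped directory is a common cause of load failures, it can help to check the full path before handing it to the constructor. The following is a minimal sketch using only the standard library. The data_path helper is hypothetical, and a temporary directory with a one-line placeholder file stands in for the real data directory so the example runs anywhere.

```python
import os
import tempfile


def data_path(dataroot, filename):
    """Hypothetical helper: build a full path and fail early if it is missing."""
    path = os.path.join(dataroot, filename)
    if not os.path.isfile(path):
        raise FileNotFoundError(path)
    return path


# Stand-in for the real data directory (an assumption of this sketch).
dataroot = tempfile.mkdtemp()
with open(os.path.join(dataroot, 'ChoCycling.dat'), 'w') as handle:
    handle.write('0.1\t0.2\n')  # placeholder content, not the Cho data

full_path = data_path(dataroot, 'ChoCycling.dat')
print(full_path)
```

With the real dataroot in place, full_path is the string you would pass as the first argument to the Dataset constructor.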

As a final check that the Dataset was created successfully, evaluate the Python expression cho.numRows, which should return 384.


Joe Roden 2005-12-13