Primary tumor

machine-learning

Files Size Format Created Updated License Source
2 200kB csv zip 1 year ago OpenML - Primary tumor
This is a dataset about primary tumors in people. The location of a primary tumor is the site in the body where the tumor first appeared and from which it began to metastasize to other parts of the body. The dataset was found on OpenML (primary-tumor).

Data Files

Download files in this dataset

File | Description | Size | Last changed | Download
primary-tumor | | 32kB | | csv (32kB), json (104kB)
primary-tumor_zip | Compressed versions of dataset. Includes normalized CSV and JSON data with original data and datapackage.json. | 10kB | | zip (10kB)

primary-tumor  


Field information

Field Name | Order | Type (Format) | Description
age | 1 | string (default) | <30, 30-59, >=60
sex | 2 | string (default) | female, male
histologic-type | 3 | string (default) | epidermoid, adeno, anaplastic
degree-of-diffe | 4 | string (default) | well, fairly, poorly
bone | 5 | boolean (default) |
bone-marrow | 6 | boolean (default) |
lung | 7 | boolean (default) |
pleura | 8 | boolean (default) |
peritoneum | 9 | boolean (default) |
liver | 10 | boolean (default) |
brain | 11 | boolean (default) |
skin | 12 | boolean (default) |
neck | 13 | boolean (default) |
supraclavicular | 14 | boolean (default) |
axillar | 15 | boolean (default) |
mediastinum | 16 | boolean (default) |
abdominal | 17 | boolean (default) |
class | 18 | string (default) | lung, head & neck, esophasus, thyroid, stomach, duoden & sm.int, colon, rectum, anus, salivary glands, pancreas, gallblader, liver, kidney, bladder, testis, prostate, ovary, corpus uteri, cervix uteri, vagina, breast
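
The field table above can be turned into a small row parser. The following is a minimal sketch, not part of the dataset's tooling: it assumes the CSV stores booleans as true/false (in any letter case) and missing values as empty strings, as described in the Preparation section of the README. The sample rows are hypothetical and use only a subset of the columns.

```python
import csv
import io

# Boolean columns, in the order given in the field table (orders 5-17).
BOOLEAN_FIELDS = [
    "bone", "bone-marrow", "lung", "pleura", "peritoneum", "liver", "brain",
    "skin", "neck", "supraclavicular", "axillar", "mediastinum", "abdominal",
]

# Allowed values for the short categorical columns, copied from the field table.
ALLOWED = {
    "age": {"<30", "30-59", ">=60"},
    "sex": {"female", "male"},
    "histologic-type": {"epidermoid", "adeno", "anaplastic"},
    "degree-of-diffe": {"well", "fairly", "poorly"},
}


def parse_row(row):
    """Convert one CSV row (a dict of strings) to typed values.

    Empty strings become None (missing) and boolean columns become
    True/False. Raises ValueError for a value the field table does not list.
    """
    parsed = {}
    for name, value in row.items():
        if value == "":
            parsed[name] = None
        elif name in BOOLEAN_FIELDS:
            lowered = value.lower()
            if lowered not in ("true", "false"):
                raise ValueError(f"{name}: expected true/false, got {value!r}")
            parsed[name] = lowered == "true"
        elif name in ALLOWED and value not in ALLOWED[name]:
            raise ValueError(f"{name}: unexpected value {value!r}")
        else:
            parsed[name] = value
    return parsed


# Hypothetical sample using a subset of the columns, not real dataset rows.
SAMPLE_CSV = (
    "age,sex,bone,liver,class\n"
    ">=60,male,false,true,lung\n"
    "30-59,,true,,colon\n"
)

rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
for parsed_row in rows:
    print(parsed_row)
```

The class column is passed through unchecked here; its 22 labels could be added to ALLOWED in the same way, spelled exactly as they appear in the table.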

Integrate this dataset into your favourite tool

Use our data-cli tool designed for data wranglers:

data get https://datahub.io/machine-learning/primary-tumor
data info machine-learning/primary-tumor
tree machine-learning/primary-tumor

Alternatively, fetch the files directly with curl:

# Get a list of the dataset's resources
curl -L -s https://datahub.io/machine-learning/primary-tumor/datapackage.json | grep path

# Get resources
curl -L https://datahub.io/machine-learning/primary-tumor/r/0.csv
curl -L https://datahub.io/machine-learning/primary-tumor/r/1.zip
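
Piping datapackage.json through grep works for a quick look, but the descriptor is plain JSON and can be parsed instead. The sketch below shows one way to list resource names and paths. The embedded descriptor is a hypothetical, abbreviated stand-in: the resource names and paths in it are illustrative assumptions, not values copied from the real file.

```python
import json

# Hypothetical, abbreviated descriptor for illustration only; the real
# datapackage.json has more fields and its values may differ.
SAMPLE_DESCRIPTOR = json.loads("""
{
  "name": "primary-tumor",
  "resources": [
    {"name": "primary-tumor_csv", "format": "csv", "path": "data/primary-tumor_csv.csv"},
    {"name": "primary-tumor_zip", "format": "zip", "path": "data/primary-tumor_zip.zip"}
  ]
}
""")


def resource_paths(descriptor, fmt=None):
    """Return (name, path) pairs for resources, optionally filtered by format."""
    pairs = []
    for resource in descriptor.get("resources", []):
        if fmt is None or resource.get("format") == fmt:
            pairs.append((resource.get("name"), resource.get("path")))
    return pairs


print(resource_paths(SAMPLE_DESCRIPTOR))
print(resource_paths(SAMPLE_DESCRIPTOR, fmt="csv"))
```

To run this against the live dataset, the descriptor could be fetched with `urllib.request.urlopen` on the datapackage.json URL above and passed through `json.load` before calling `resource_paths`.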

If you are using R, here is how to load the data quickly:

install.packages("jsonlite", repos="https://cran.rstudio.com/")
library("jsonlite")

json_file <- 'https://datahub.io/machine-learning/primary-tumor/datapackage.json'
json_data <- fromJSON(paste(readLines(json_file), collapse=""))

# get list of all resources:
print(json_data$resources$name)

# print all tabular data (if any exists)
# seq_along() avoids iterating over 1:0 when there are no resources
for (i in seq_along(json_data$resources$datahub$type)) {
  if (json_data$resources$datahub$type[i] == 'derived/csv') {
    path_to_file <- json_data$resources$path[i]
    data <- read.csv(url(path_to_file))
    print(data)
  }
}

Note: You might need to run the script with root permissions if you are running it on a Linux machine.

Install the Frictionless Data `datapackage` library and pandas:

pip install datapackage
pip install pandas

Now you can load the Data Package into pandas:

import datapackage
import pandas as pd

data_url = 'https://datahub.io/machine-learning/primary-tumor/datapackage.json'

# load the Data Package descriptor
package = datapackage.Package(data_url)

# load only the tabular resources into DataFrames
resources = package.resources
for resource in resources:
    if resource.tabular:
        data = pd.read_csv(resource.descriptor['path'])
        print(data)

For Python, first install the `datapackage` library (all the datasets on DataHub are Data Packages):

pip install datapackage

To load the Data Package into your Python environment, run the following code:

from datapackage import Package

package = Package('https://datahub.io/machine-learning/primary-tumor/datapackage.json')

# print list of all resources:
print(package.resource_names)

# print processed tabular data (if any exists)
for resource in package.resources:
    # .get() avoids a KeyError for resources without a 'datahub' entry
    if resource.descriptor.get('datahub', {}).get('type') == 'derived/csv':
        print(resource.read())

If you are using JavaScript, follow the instructions below.

Install the data.js module using npm:

  $ npm install data.js

Once the package is installed, use the following code snippet:

const {Dataset} = require('data.js')

const path = 'https://datahub.io/machine-learning/primary-tumor/datapackage.json'

// We're using a self-invoking function here because we want to use async-await syntax:
;(async () => {
  const dataset = await Dataset.load(path)
  // get list of all resources:
  for (const id in dataset.resources) {
    console.log(dataset.resources[id]._descriptor.name)
  }
  // get all tabular data (if any exists)
  for (const id in dataset.resources) {
    if (dataset.resources[id]._descriptor.format === "csv") {
      const file = dataset.resources[id]
      // Get a raw stream
      const stream = await file.stream()
      // Alternative: the entire file as a buffer (be careful with large files!)
      const buffer = await file.buffer
      // print data by piping the stream to standard output
      stream.pipe(process.stdout)
    }
  }
})()

Read me

This is a dataset about primary tumors in people. The location of a primary tumor is the site in the body where the tumor first appeared and from which it began to metastasize to other parts of the body.

Data

This dataset was found on OpenML - primary-tumor

This primary tumor domain was obtained from the University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia. Thanks go to M. Zwitter and M. Soklic for providing the data. Please include this citation if you plan to use this database.

Data is located in the data directory:

data/primary-tumor.csv

Preparation

To produce the output data, the following changes are made to the input data:

  • missing values marked with “?” are replaced with “” (an empty string)
  • all " characters are removed
  • all ’ characters are removed
  • yes and no values are replaced with true and false

Scripts are located in the scripts directory:

scripts/main.py
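
The contents of scripts/main.py are not reproduced here. As an illustration only, the four preparation steps listed above could be sketched as follows. The function names, the order in which the steps are applied, the handling of the single quote as a straight `'` character, and the sample line are all assumptions rather than details taken from the actual script.

```python
def clean_value(value):
    """Apply the listed preparation steps to a single raw field."""
    # Remove all double-quote and single-quote characters.
    # (Assumption: the quote removed is the straight ' character.)
    value = value.replace('"', "").replace("'", "")
    # Missing values marked with "?" become empty strings.
    if value == "?":
        return ""
    # yes/no values become true/false.
    if value == "yes":
        return "true"
    if value == "no":
        return "false"
    return value


def clean_line(line, delimiter=","):
    """Clean every field of one delimited line (naive split, no quoted commas)."""
    fields = line.split(delimiter)
    return delimiter.join(clean_value(field.strip()) for field in fields)


# Hypothetical raw line for illustration, not copied from the source data.
RAW_LINE = "'>=60','male',?,'yes','no','lung'"
print(clean_line(RAW_LINE))
```

Quotes are stripped before the "?" check so that a quoted missing marker is also recognised. The naive split on commas is enough for this sketch because none of the values in the field table contains a comma.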

License

Licensed under the Public Domain Dedication and License (assuming either no rights or public domain license in source data).

