Cervical cancer

machine-learning

Files Size Format Created Updated License Source
2 976kB csv zip 1 year ago UCI - Cervical cancer (Risk Factors) Data Set
This is a dataset about cervical cancer occurrences. Cervical cancer is one of the most frequent cancers in women. The dataset lists some factors that might influence cervical cancer. It was found on UCI under the name Cervical cancer (Risk Factors) Data Set.

Data Files

Download files in this dataset

File Description Size Last changed Download
cervical-cancer 95kB csv (95kB) , json (724kB)
cervical-cancer_zip Compressed versions of dataset. Includes normalized CSV and JSON data with original data and datapackage.json. 33kB zip (33kB)


Field information

Field Name Order Type (Format) Description
Age 1 integer (default)
Number of sexual partners 2 number (default)
First sexual intercourse 3 number (default)
Num of pregnancies 4 number (default)
Smokes 5 number (default)
Smokes (years) 6 number (default)
Smokes (packs/year) 7 number (default)
Hormonal Contraceptives 8 number (default)
Hormonal Contraceptives (years) 9 number (default)
IUD 10 number (default)
IUD (years) 11 number (default)
STDs 12 number (default)
STDs (number) 13 number (default)
STDs:condylomatosis 14 number (default)
STDs:cervical condylomatosis 15 number (default)
STDs:vaginal condylomatosis 16 number (default)
STDs:vulvo-perineal condylomatosis 17 number (default)
STDs:syphilis 18 number (default)
STDs:pelvic inflammatory disease 19 number (default)
STDs:genital herpes 20 number (default)
STDs:molluscum contagiosum 21 number (default)
STDs:AIDS 22 number (default)
STDs:HIV 23 number (default)
STDs:Hepatitis B 24 number (default)
STDs:HPV 25 number (default)
STDs: Number of diagnosis 26 integer (default)
STDs: Time since first diagnosis 27 string (default)
STDs: Time since last diagnosis 28 string (default)
Dx:Cancer 29 integer (default)
Dx:CIN 30 integer (default)
Dx:HPV 31 integer (default)
Dx 32 integer (default)
Hinselmann 33 integer (default)
Schiller 34 integer (default)
Citology 35 integer (default)
Biopsy 36 integer (default)
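
The schema above types the two "time since diagnosis" fields as strings, while every other field is numeric. A minimal sketch of converting them to numbers with pandas after loading follows. The helper name `coerce_time_fields` is hypothetical, the sample rows are made up, and treating unparseable entries as missing is an assumption rather than something the dataset documentation specifies.

```python
import io
import pandas as pd

# Field names copied from the schema above; both are typed "string" there.
TIME_FIELDS = [
    "STDs: Time since first diagnosis",
    "STDs: Time since last diagnosis",
]

def coerce_time_fields(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with the string-typed time fields converted to numbers.

    Assumption: any value that does not parse as a number becomes NaN.
    """
    out = df.copy()
    for name in TIME_FIELDS:
        if name in out.columns:
            out[name] = pd.to_numeric(out[name], errors="coerce")
    return out

# Tiny made-up sample, not real rows from the dataset.
sample = io.StringIO(
    "Age,STDs: Time since first diagnosis,STDs: Time since last diagnosis\n"
    "30,,\n"
    "41,3.0,3.0\n"
)
# dtype=str keeps the raw text so the conversion step is visible.
frame = coerce_time_fields(pd.read_csv(sample, dtype=str))
print(frame.dtypes)
```
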

Integrate this dataset into your favourite tool

Use our data-cli tool designed for data wranglers:

data get https://datahub.io/machine-learning/cervical-cancer
data info machine-learning/cervical-cancer
tree machine-learning/cervical-cancer
# Get a list of dataset's resources
curl -L -s https://datahub.io/machine-learning/cervical-cancer/datapackage.json | grep path

# Get resources

curl -L https://datahub.io/machine-learning/cervical-cancer/r/0.csv

curl -L https://datahub.io/machine-learning/cervical-cancer/r/1.zip

If you are using R, here is how to load the data quickly:

install.packages("jsonlite", repos="https://cran.rstudio.com/")
library("jsonlite")

json_file <- 'https://datahub.io/machine-learning/cervical-cancer/datapackage.json'
json_data <- fromJSON(paste(readLines(json_file), collapse=""))

# get list of all resources:
print(json_data$resources$name)

# print all tabular data (if any exists)
for(i in seq_along(json_data$resources$datahub$type)){
  if(json_data$resources$datahub$type[i]=='derived/csv'){
    path_to_file = json_data$resources$path[i]
    data <- read.csv(url(path_to_file))
    print(data)
  }
}

Note: You might need to run the script with root permissions if you are running it on a Linux machine.

If you are using pandas, install the Frictionless Data `datapackage` library and pandas itself:

pip install datapackage
pip install pandas

Now you can load the Data Package into pandas:

import datapackage
import pandas as pd

data_url = 'https://datahub.io/machine-learning/cervical-cancer/datapackage.json'

# to load Data Package into storage
package = datapackage.Package(data_url)

# to load only tabular data
resources = package.resources
for resource in resources:
    if resource.tabular:
        data = pd.read_csv(resource.descriptor['path'])
        print(data)

For Python, first install the `datapackage` library (all the datasets on DataHub are Data Packages):

pip install datapackage

To load the Data Package into your Python environment, run the following code:

from datapackage import Package

package = Package('https://datahub.io/machine-learning/cervical-cancer/datapackage.json')

# print list of all resources:
print(package.resource_names)

# print processed tabular data (if exists any)
for resource in package.resources:
    if resource.descriptor['datahub']['type'] == 'derived/csv':
        print(resource.read())

If you are using JavaScript, follow the instructions below:

Install data.js module using npm:

  $ npm install data.js

Once the package is installed, use the following code snippet:

const {Dataset} = require('data.js')

const path = 'https://datahub.io/machine-learning/cervical-cancer/datapackage.json'

// We're using a self-invoking function here because we want to use async-await syntax:
;(async () => {
  const dataset = await Dataset.load(path)
  // get list of all resources:
  for (const id in dataset.resources) {
    console.log(dataset.resources[id]._descriptor.name)
  }
  // get all tabular data(if exists any)
  for (const id in dataset.resources) {
    if (dataset.resources[id]._descriptor.format === "csv") {
      const file = dataset.resources[id]
      // Get a raw stream
      const stream = await file.stream()
      // entire file as a buffer (be careful with large files!)
      const buffer = await file.buffer
      // print data
      stream.pipe(process.stdout)
    }
  }
})()

Read me

This is a dataset about cervical cancer occurrences. Cervical cancer is one of the most frequent cancers in women. The dataset lists some factors that might influence cervical cancer.

Data

This dataset was found on UCI under the name Cervical cancer (Risk Factors) Data Set

The dataset was collected at ‘Hospital Universitario de Caracas’ in Caracas, Venezuela. The dataset comprises demographic information, habits, and historic medical records of 858 patients. Several patients decided not to answer some of the questions because of privacy concerns (missing values).

  • 835 instances
  • 36 attributes
  • Missing values: yes
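
Because the file contains missing values, it can help to check how many each column has before any modelling. A minimal sketch with pandas follows, assuming missing entries appear as empty fields in the CSV. The helper name `missing_counts` and the sample rows are illustrative only.

```python
import io
import pandas as pd

def missing_counts(csv_source) -> pd.Series:
    """Count missing values per column, keeping only columns that have any."""
    df = pd.read_csv(csv_source)  # empty fields are parsed as NaN by default
    counts = df.isna().sum()
    return counts[counts > 0].sort_values(ascending=False)

# Made-up sample rows, not taken from the dataset.
sample = io.StringIO(
    "Age,Smokes,IUD\n"
    "18,0,\n"
    "25,,\n"
    "34,1,0\n"
)
print(missing_counts(sample))
```

For the real data, a path such as `data/cervical-cancer.csv` can be passed in place of the sample.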

Output data is located in the data directory:

data/cervical-cancer.csv

Attributes are the same as they were in input data.

Preparation

To produce the output data, one change is made to the input data:

  • missing values marked with “?” are replaced with “” (an empty string)

Python scripts are located in the scripts directory:

scripts/main.py
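
The script itself is not reproduced here. The following is only a minimal sketch of the preparation step described above, using Python's standard `csv` module. The function name `blank_out_missing` and the sample rows are hypothetical, and the actual `scripts/main.py` may work differently.

```python
import csv
import io

def blank_out_missing(src, dst, marker="?"):
    """Copy CSV rows from src to dst, replacing cells equal to marker with ""."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        writer.writerow(["" if cell == marker else cell for cell in row])

# Made-up input in the style of the source file, with "?" marking missing values.
raw = io.StringIO("Age,Smokes,IUD\n18,0.0,?\n25,?,?\n")
cleaned = io.StringIO()
blank_out_missing(raw, cleaned)
print(cleaned.getvalue())
```

Only cells that consist solely of the marker are blanked, so a "?" inside a longer value would be left untouched.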

License

Licensed under the Public Domain Dedication and License (assuming either no rights or public domain license in source data).
