How To Create Dataset For Machine Learning
A dataset contains related data values that are collected or measured as part of a cohort study. Datasets are keyed by both subject and time. Examples include laboratory tests or other information collected about a participant over time, where there are many rows per participant but only one row for each participant at each time.
A dataset's properties include identifiers, keys, and categorizations for the dataset. Its fields represent columns and establish the shape and content of the data "table". For example, a dataset for physical exams would typically include fields like height, weight, respiration rate, and blood pressure.
The set of fields ensures the upload of consistent data records by defining the acceptable types, and can also include validation and conditional formatting when necessary. There are system fields built into any dataset, such as creation date, and because datasets are part of studies, they must also include columns that map to participants and time.
This topic covers creating a dataset from within the study UI.
- Create Dataset
- Define Properties
  - Basic Properties
  - Data Row Uniqueness
- Define Fields
  - Infer Fields from a File
  - Set Column Mapping
  - Import Data with Inferral
  - Export/Import Field Definitions
  - Manually Define Fields
Create Dataset
- Navigate to the Manage tab of your study folder.
- Click Manage Datasets.
- Click Create New Dataset.
The following sections describe each panel within the creation wizard. If you later edit the dataset, you will return to these panels and be able to change most of the values.
Define Properties
- The first panel defines Dataset Properties.
- Enter the Basic Properties.
- Data Row Uniqueness: Select the appropriate value for how this dataset is keyed.
- Click Advanced Settings to control whether to show the dataset in the overview, manually set the dataset ID, associate this data with cohorts, and use tags as another way to categorize datasets. Learn more in this topic: Dataset Properties.
- Continue to define Fields for your dataset before clicking Save.
Basic Properties
- Name (Required): The dataset name is required and must be unique.
- Label: By default, the dataset Name is shown to users. You can define a Label to use instead if desired.
- Description: An optional short description of the dataset.
- Category: Assigning a category to the dataset will group it with other data in that category when viewed in the data browser. By default, new datasets are uncategorized. Learn more about categories in this topic: Manage Categories.
  - The dropdown menu for this field lists the currently defined categories. Select one, OR
  - Type a new category name to define it from here, then click the Create option... entry that appears in the dropdown menu to create and select it.
Data Row Uniqueness
Select how unique data rows in your dataset are determined:
- Participants only (demographic data):
  - There is one row per participant.
- Participants and timepoints/visits:
  - There is (at most) one row per participant at each timepoint or visit.
- Participants, timepoints, and additional key field:
  - There may be multiple rows for a participant/time combination, requiring an additional key field to ensure unique rows.
  - Learn more in this topic: Dataset Properties
  - Note that when using an additional key, you will temporarily see an error in the UI until you create the necessary field to select as a key in the next section.
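The three uniqueness options can be pictured as progressively wider row keys. The following is a minimal sketch, not LabKey code: the `duplicate_keys` helper and the sample column names `Test` and `Result` are hypothetical and used only for illustration. It shows why lab results with several tests per participant per date need an additional key field.

```python
from collections import Counter

def duplicate_keys(rows, key_columns):
    """Return the key tuples that appear in more than one row."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)

# Hypothetical lab data: two tests for participant P1 on the same date.
rows = [
    {"ParticipantID": "P1", "Date": "2024-01-05", "Test": "ALT", "Result": 31},
    {"ParticipantID": "P1", "Date": "2024-01-05", "Test": "AST", "Result": 28},
    {"ParticipantID": "P2", "Date": "2024-01-05", "Test": "ALT", "Result": 40},
]

# Keyed by participant and time only: P1's two rows collide.
dups_time = duplicate_keys(rows, ("ParticipantID", "Date"))

# Adding "Test" as an additional key makes every row unique.
dups_extra = duplicate_keys(rows, ("ParticipantID", "Date", "Test"))

print(dups_time)
print(dups_extra)
```

Running a check like this on a spreadsheet before creating the dataset can help you decide which uniqueness option fits the data.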
Define Fields
Click the Fields panel to open it. You can define fields for a new dataset in several ways:
- LabKey can infer the fields from an example data spreadsheet you upload
- You can import a set of field definitions in a JSON file
- You can define them manually
Infer Fields from a File
The Fields panel opens on the Import or infer fields from file option. You can click within the box to select a file or simply drag it from your desktop.
- Supported data file formats include: .csv, .tsv, .txt, .xls, .xlsx.
LabKey will make a best guess effort to infer the names and types for all the columns in the spreadsheet.
- You will now see them in the fields editor that you would use to manually define fields as described below.
- Note that if your file includes columns for reserved fields, they will not be shown as inferred. Reserved fields will always be created for you.
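To give a feel for what "best guess" inference means, here is a minimal sketch that guesses a type for each column of a CSV file from its values. It is an illustration under stated assumptions, not the server's actual algorithm: the function names are hypothetical, and the type names "Integer" and "Text" are assumed labels (only "Decimal" is named in this topic).

```python
import csv
import io

def infer_type(values):
    """Guess a simple type name for a column from its non-empty values."""
    non_empty = [v for v in values if v and v.strip()]
    if not non_empty:
        return "Text"
    # Try the narrowest type first, then widen.
    for type_name, parse in (("Integer", int), ("Decimal", float)):
        try:
            for v in non_empty:
                parse(v)
            return type_name
        except ValueError:
            continue
    return "Text"

def infer_fields(csv_text):
    """Map each column name in the CSV text to a guessed type name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return {name: infer_type([row[name] for row in rows])
            for name in reader.fieldnames}

sample = (
    "ParticipantID,Date,Weight,Pulse\n"
    "P1,2024-01-05,70.5,62\n"
    "P2,2024-01-05,81,70\n"
)
fields = infer_fields(sample)
print(fields)
```

A guess based only on the values present has limits: if every Weight value in the sample file happened to be a whole number, this sketch would label the column as an integer even though later data may contain decimals. That is why the inferred fields should be reviewed before saving.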
Make any adjustments needed.
- For instance, if a numeric column happens to contain integer values, but should be of type "Decimal", make the change here.
- If you want one of the inferred fields to be ignored, delete it by clicking its delete icon.
- If any fields should be required, check the box in the Required column.
Before you click Save you have the option to import the data from the spreadsheet you used for inferral to the dataset you are creating. Otherwise, you will create an empty structure and can import data later.
Set Column Mapping
When you infer fields, you will need to confirm that the Column mapping section below the fields is correct.
Datasets must map to both study subjects (participants) and some sense of time (either dates or sequence numbers for visits). During field inferral, the server will make a guess at these mappings. Use the dropdowns to make changes if needed.
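As a rough illustration of how such a guess could work, the sketch below matches column names against a short list of likely names. Everything in it is an assumption made for illustration (the `guess_mapping` helper and both hint lists are hypothetical); it does not describe how the server actually chooses its mappings.

```python
def guess_mapping(column_names):
    """Guess which columns hold the participant and the time value."""
    # Hypothetical name hints, compared after normalizing the column name.
    participant_hints = ("participantid", "subjectid", "ptid")
    time_hints = ("date", "visitdate", "sequencenum", "visit")
    mapping = {"participant": None, "time": None}
    for name in column_names:
        normalized = name.replace("_", "").replace(" ", "").lower()
        if mapping["participant"] is None and normalized in participant_hints:
            mapping["participant"] = name
        elif mapping["time"] is None and normalized in time_hints:
            mapping["time"] = name
    return mapping

mapping = guess_mapping(["Participant ID", "Visit Date", "Weight"])
unmapped = guess_mapping(["Name", "Weight"])
print(mapping)
print(unmapped)
```

When no column name matches, as in the second call, the mapping stays empty, which corresponds to the case where you need to pick the columns yourself from the dropdowns.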
Import Data with Inferral
Near the bottom, you can use the selector to control whether to import data or not. By default, it is set to Import Data and you will see the first three rows of the file. Click Save to create and populate the dataset.
If you want to create the dataset without importing data, click either the remove icon next to the file name or the selector itself. The file name and preview will disappear and the selector will read Don't Import. Click Save to create the empty dataset.
Adding data to a dataset is covered in the topic: Import Data to a Dataset.
Export/Import Field Definitions
The top bar of the list of fields includes an Export button. Click it to export the field definitions to a JSON format file. This file can be used to create the same field definitions in another list, either as is or with changes made offline.
To import a JSON file of field definitions, use the infer from file method, selecting the .fields.json file instead of a data-bearing file. Note that importing or inferring fields will overwrite any existing fields; it is intended only for new dataset creation. After importing a set of fields, check the column mapping as if you had inferred fields from data.
Learn more about exporting and importing sets of fields in this topic: Field Editor
Manually Define Fields
Instead of using a data spreadsheet or JSON field definitions, you can click Manually Define Fields. You can also use the manual field editor to adjust inferred or imported fields.
Note that the two required fields are predefined: ParticipantID and Date (or SequenceNum for visit-based studies). You cannot add these fields when defining a dataset manually; you only add the other fields in the dataset.
Click Add Field for each field you need. Use the Data Type dropdown to select the type, and click the expand icon to open the field details and set properties.
If you add a field by mistake, click its delete icon to remove it.
After adding all your fields, click Save. You will now have an empty dataset and can import data to it.
Related Topics
- Tutorial: Inferring Datasets from Excel and TSV Files: Use study reload to create multiple datasets
- Import Data to a Dataset: Import data to an existing dataset.
- Import From a Dataset Archive: Import a dataset archive via the pipeline.
- Dataset Properties: Understand and edit properties of datasets.
- Field Editor: Add or edit individual fields in a dataset.
- Dataset System Fields
- Date & Number Display Formats
Source: https://www.labkey.org/Documentation/wiki-page.view?name=createDataset
Posted by: rodriguezmolaing.blogspot.com