ICS 313: Programming Language Theory, Module 06: Data Types



(1)

ICS 313: Programming Language Theory

Module 06: Data Types


(2)

Objectives

To understand basic issues in the design and implementation of typical data types.


(3)

Central Goal of Typed Data

To model the real-world problem space as closely and efficiently as possible.

Evolution:
•Fortran I: numeric and array types; floating point modeled with ints
•PL/I: everything for everyone
•Algol: few basic types and user definitions
•Simula, Java: entities modeled with classes

Evolutionary progression:
•Association of data with functions.
•Abstractions for maintaining/assessing interdependencies automatically.


(4)

Primitive and Structured Data Types

Primitive Data Types:
•Data types not defined in terms of other types.
  -Reflect hardware support directly or with minor software support.
  -Examples: integers, floats, strings, etc.

Structured Data Types:
•Primitive types + “type constructors”


(5)

Built In Primitives

(These are the options: they aren’t all built in to all languages)


(6)

Numeric Types: Integer

Only primitive data type found in early languages (except Lisp)

Integer:
•Different sizes possible: 1-8 bytes
•Arbitrarily large in Lisp
•Representation: string of bits
  -leftmost bit can represent the sign
  -two's complement is better for computer arithmetic
•Direct support in hardware.


(7)

Numeric Types: Floating Point

Floating Point:
•Approximations for real numbers.
•Typically stored in binary (base 2)
  -means 0.1 cannot be represented exactly!
•Two levels of accuracy
  -Real (typically four bytes, 1/8/23)
  -Double (typically eight bytes, 1/11/52)
•Representation (IEEE Standard):

Field        Sign    Exponent    Fraction
Bits         1       8 or 11     23 or 52
Determines   -       Range       Precision (and range)

Try this in Python:
  x = 3.4
  x
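A small sketch of the experiment above, extended slightly. Current Python interpreters echo `3.4` because they print the shortest string that reads back as the same double, so the stored binary value is exposed here with `Decimal`. The variable names are illustrative only.

```python
from decimal import Decimal
from fractions import Fraction

x = 3.4

# Decimal(float) converts the stored binary value exactly, digit for digit.
exact = Decimal(x)
print(exact)  # a long expansion that is close to, but not equal to, 3.4

# 0.1 has no finite base-2 expansion, so sums of it drift away from 0.3.
total = 0.1 + 0.1 + 0.1
sum_is_exact = (total == 0.3)
print(sum_is_exact)  # False

# 0.5 is a power of two, so it is stored without any error.
half_is_exact = (Fraction(0.5) == Fraction(1, 2))
print(half_is_exact)  # True
```

When exact decimal behaviour matters (money, for instance), the `decimal` module or a scaled integer is the usual workaround.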


(8)

Numeric Types: Decimal

Decimal types:
•Store a fixed number of decimal digits with the decimal point in a fixed position.
•Mandatory for business data processing.
•Business computers (mainframes) have hardware support.
•Others implement decimal using integers and software.

Representation:
•1-2 digits encoded into each byte.
•Example: 9352.14 in three bytes

1001 0011 0101 0010 . 0001 0100
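The three-byte example can be reproduced with a short packing routine. This is a minimal sketch, assuming two digits per byte with the high nibble first; `to_packed_bcd` is a hypothetical helper, and the position of the decimal point is assumed to be stored separately (here, two places from the right).

```python
def to_packed_bcd(digits: str) -> bytes:
    """Pack a string of decimal digits two per byte, high nibble first."""
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected decimal digits only")
    if len(digits) % 2:
        digits = "0" + digits  # pad to a whole number of bytes
    return bytes(
        (int(digits[i]) << 4) | int(digits[i + 1])
        for i in range(0, len(digits), 2)
    )

# 9352.14 with the decimal point removed; the scale (2 places) is kept elsewhere.
packed = to_packed_bcd("935214")

# Show each byte as two 4-bit groups, one group per decimal digit.
bits = " ".join(f"{(b >> 4):04b} {(b & 0xF):04b}" for b in packed)
print(len(packed), bits)  # 3 1001 0011 0101 0010 0001 0100
```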


(9)

Numeric Types: Boolean

Simplest type: two values (true and false).
•Requires only one bit to implement.
•Typically implemented as a byte.
•Included in most languages since Algol.

Some languages lack a Boolean type, or also accept other values as conditions:
•In C (and C++):
  -0 is false, all other numeric values true.
•Lisp:
  -“nil” false, all others true.
•Python extends these conventions to many primitives:
  -0, ‘’, (), [], {} are all false.
•Scheme uses a mixture of Boolean and others:
  -#f false, #t and all others true
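The Python convention can be checked directly. A minimal sketch: the two lists are illustrative, and `None` is added as one more value that Python treats as false. Note that present-day Python also has a dedicated `bool` type, which coexists with these conventions.

```python
# Values the slide lists as false, plus None.
falsy = [0, 0.0, "", (), [], {}, None]

# Non-zero numbers and non-empty containers are true, whatever they contain.
truthy = [1, -1, "0", (0,), [0], {"k": 0}]

all_false = not any(bool(v) for v in falsy)
all_true = all(bool(v) for v in truthy)
print(all_false, all_true)  # True True

# bool is a subclass of int, so True and False still behave as 1 and 0.
bool_is_int = isinstance(True, int)
```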


(10)

Characters and Strings

Character type:
•Stored as numeric encodings
  -ASCII popular, but limited to 128 characters (codes 0-127).
  -Unicode used in Java; a superset of ASCII that covers most natural-language characters.

Character strings
•Design issues
  -Are strings a primitive or structured data type (i.e. array of chars)?
  -Are strings fixed or variable length?


(11)

String Operations

Required operations:
•equality, concat, <, >, substring, etc.

Pascal, Ada:
•strings are predefined as array of chars.
•built-in string operations
•Index slice (like immutable strings in Python: s[3:7])

C, C++:
•strings are implemented as an array of chars with special null character terminator.
•library package provides string operations.

Scheme: Strings are primitive constants

Java: String immutable, StringBuffer mutable


(12)

Pattern Matching

Built in (PERL, SNOBOL4, ICON) versus Library (Python, …)

Examples (PERL, Python): What do these represent?
•/[A-Za-z][A-Za-z\d]*/
•/\d+\.?\d*|\.\d+/
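One way to explore the two patterns is with Python's `re` library. In this sketch the names `IDENT`, `NUMBER`, and `matches` are made up for illustration. The first pattern accepts a letter followed by letters or digits (an identifier); the second accepts unsigned numeric literals with an optional fractional part.

```python
import re

IDENT = re.compile(r"[A-Za-z][A-Za-z\d]*")
NUMBER = re.compile(r"\d+\.?\d*|\.\d+")

def matches(pattern: re.Pattern, text: str) -> bool:
    """True when the whole string, not just a prefix, fits the pattern."""
    return pattern.fullmatch(text) is not None

for text in ["x1", "1x", "3.14", "42", ".5", "7.", "."]:
    print(f"{text!r:8} ident={matches(IDENT, text)} number={matches(NUMBER, text)}")
```

A lone "." is rejected by both alternatives of the second pattern, because each alternative requires at least one digit.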


(13)

String Length Choices

Static:
•Length specified in declaration of string.
•Example: Fortran. (Blank fill)

Limited dynamic:
•Maximum length specified in declaration.
•Example: C, C++ (null character ends)

Dynamic:
•No length specified; shrinks/grows as needed.
•Example: Common Lisp, Java


(14)

String Implementation

Static (descriptor used only at compile time):
  [ Length | Address ] -> “The String”

Limited Dynamic (descriptor used at run time, except in C, which uses a null terminator and doesn’t check):
  [ Curr Len. | Max Len. | Address ] -> “The String”

Dynamic (descriptor used at run time):
  [ CurLen | Address ] -> “The String”


(15)

String in memory

How is “The String” allocated?

Linked list:
•Faster allocation/deallocation as size changes
•Slower operations
•More memory

Contiguous storage:
•Slower reallocation during growth
•Faster operations


(16)

User Defined Primitives


(17)

Enumeration Types

All possible (symbolic) values are explicitly stated in the type declaration:
•type WEEKEND = (Sat, Sun);
•type DAY = (Mon, Tue, Wed, Thu, Fri, Sat, Sun);
•Of what type is ‘Sat’? (overloaded literal)
  -Pascal, C don’t allow it
  -Ada allows it

Advantages of enumerated types over “numeric encoding” (i.e. Sat = 1, Sun = 2):
•Provides greatly increased readability.
•Prevents use of inappropriate operations or values

Implemented with integers and range checks
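Python's `enum` library gives a rough feel for these advantages. This is a sketch, not a rendering of Pascal or Ada semantics: the class names mirror the slide's DAY and WEEKEND, and `try_add` and `try_invalid` are hypothetical helpers. Each `SAT` is qualified by its class, so the overloaded-literal question does not arise.

```python
from enum import Enum

class Day(Enum):
    MON = 1
    TUE = 2
    WED = 3
    THU = 4
    FRI = 5
    SAT = 6
    SUN = 7

class Weekend(Enum):
    SAT = 1
    SUN = 2

# The same symbolic name in two enumerations denotes two distinct values.
same_across_types = (Day.SAT == Weekend.SAT)

# An enumeration value is not interchangeable with its numeric encoding.
equals_encoding = (Weekend.SAT == 1)

def try_add():
    """Arithmetic on a plain Enum member is rejected."""
    try:
        return Weekend.SAT + 1
    except TypeError:
        return "TypeError"

def try_invalid():
    """Values outside the declared set are rejected (the range check)."""
    try:
        return Weekend(3)
    except ValueError:
        return "ValueError"

print(same_across_types, equals_encoding, try_add(), try_invalid())
```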


(18)

Subrange Types

Subsequence of an ordinal type, e.g.:
•Pascal: index = 1..100
•Python: for x in range(10)

Subtype:
•Restricted range of type
•Compatible with parent

Derived type:
•Also restricted range, but not compatible

Good for readability and reliability

Implemented like parent with range checks
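Python has no subrange types, but "implemented like parent with range checks" can be imitated. A minimal sketch, assuming a hypothetical class `Index` that models the Pascal range 1..100 as a checked subclass of `int`, so that it stays compatible with its parent type.

```python
class Index(int):
    """An int restricted to LOW..HIGH, checked when the value is created."""
    LOW, HIGH = 1, 100

    def __new__(cls, value: int):
        if not cls.LOW <= value <= cls.HIGH:
            raise ValueError(f"{value} is outside {cls.LOW}..{cls.HIGH}")
        return super().__new__(cls, value)

i = Index(100)

# Compatible with the parent type: ordinary int arithmetic still applies,
# although the result is a plain int and is no longer range checked.
next_value = i + 1

def out_of_range():
    try:
        return Index(101)
    except ValueError:
        return "ValueError"

print(i, next_value, out_of_range())
```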


(19)

Structured Types

Most of these are built-in types, although in the case of records (structures) and pointers the programmer then uses them to define specialized types.


(20)

Arrays

A homogeneous aggregate of typed data elements with elements identified by position.

Issues:
•Syntax: A(i), A[i]
•Subscript types: allow any ordinal type?
  -Definition: A[DAY]
  -Use: A[Mon]
•Range checked?


(21)

Array Categories

Static arrays:
•Subscript ranges (and data element types) are statically bound and storage allocation is done at compile time.
•FORTRAN up to 77
•Most time efficient, can waste memory

Fixed stack-dynamic arrays:
•Subscript ranges/element types statically bound but allocation done at run time.
•Supports re-use of large array spaces.
•Pascal, C


(22)

Array Categories (cont.)

Stack-dynamic arrays:
•Subscript ranges bound and storage allocated at run-time, but constant for lifetime of variable.

Heap-dynamic arrays:
•Subscript ranges bound and storage allocated at run-time and can change.
•Allows greatest flexibility (array can grow or shrink.)
•Java Vector
•Least efficient.


(23)

Array Operations

Operate on array as unit. Some languages provide no array operations.

Examples of operations:
•Assignment
•Concatenation
•Relational operations
•Pair-wise +, -, *, /
•Operations on Slices (FORTRAN90, Python)

APL is the most radical programming language for array processing
•Array reversal, transposition, inversion


(24)

Array Implementation

For 1-based, 1-dimensional array:
•address(A[k]) = (address(A[1]) - element_size) + (k * element_size)

Issue: when is array element address computed?
•Static arrays:
  -element_size and address(A[1]) computed at compile time.
  -Run-time computation: address(A[k]) = constant + (k * element_size), where constant = address(A[1]) - element_size
•Other array types require lookup of address(A[1]) at run-time.
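The address formula can be checked with plain integers. A minimal sketch: `element_address` and the sample base address 1000 are made up for illustration, and addresses are modelled as ordinary numbers rather than real memory.

```python
def element_address(base: int, k: int, element_size: int) -> int:
    """Address of A[k] for a 1-based array whose first element A[1] is at base."""
    # For a static array this subtraction can be folded at compile time,
    # leaving one multiply and one add to do at run time.
    constant = base - element_size
    return constant + k * element_size

# A[1] at address 1000, 4-byte elements.
first = element_address(1000, 1, 4)
third = element_address(1000, 3, 4)
print(first, third)  # 1000 1008
```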


(25)

Multidimensional Arrays

Map to linear memory. For the matrix

a b c
d e f
g h i

•Row-major storage (most languages):
  -lowest value of first subscript stored first
  -a b c d e f g h i
•Column-major storage (FORTRAN):
  -lowest value of last subscript stored first
  -a d g b e h c f i

For 1-based, 2-dimensional array in row-major order (n = number of columns):
•address(A[i,j]) = address(A[1,1]) + (((i - 1) * n) + (j - 1)) * elementSize

Why should a programmer care?
•Large arrays may cross page boundaries in virtual memory
•Access cells in the wrong order and you create a lot of swapping
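Both storage orders can be reproduced by sorting the cells of the 3x3 example by computed address. In this sketch the function names are hypothetical, and the column-major formula is the analogous one with the roles of the subscripts swapped (it is not given on the slide).

```python
def row_major_address(base, i, j, n_cols, element_size):
    """1-based A[i,j]; whole rows are stored one after another."""
    return base + ((i - 1) * n_cols + (j - 1)) * element_size

def column_major_address(base, i, j, n_rows, element_size):
    """1-based A[i,j]; whole columns are stored one after another."""
    return base + ((j - 1) * n_rows + (i - 1)) * element_size

matrix = [["a", "b", "c"],
          ["d", "e", "f"],
          ["g", "h", "i"]]

def layout(address_fn):
    """List the cells in increasing address order (base 0, 1-byte elements)."""
    cells = sorted(
        (address_fn(0, i, j, 3, 1), matrix[i - 1][j - 1])
        for i in range(1, 4)
        for j in range(1, 4)
    )
    return " ".join(value for _, value in cells)

print(layout(row_major_address))     # a b c d e f g h i
print(layout(column_major_address))  # a d g b e h c f i
```

A loop that walks the array in the same order as its layout touches consecutive addresses, which is the access pattern that avoids the swapping mentioned above.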


(26)

Associative Arrays

Also known as hash tables
•Index by key (part of data) rather than by position
•Store both key and value (take more space)
•Best when access is by data rather than index

Examples:
•Lisp alist:
  -((key1 . data1) (key2 . data2) (key3 . data3))
•Python Dictionary:
  -{key1 : data1, key2 : data2, key3 : data3}
•Java:
  -java.util.Hashtable
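The Lisp and Python examples can be placed side by side. A sketch only: the association list is modelled as a Python list of pairs, and `assoc` is a hypothetical helper named after the Lisp function. It searches the pairs in order, whereas the dictionary finds the entry by hashing the key.

```python
# Association list: key/value pairs kept in a sequence.
alist = [("key1", "data1"), ("key2", "data2"), ("key3", "data3")]

def assoc(key, pairs):
    """Linear search: return the value of the first pair with a matching key."""
    for k, value in pairs:
        if k == key:
            return value
    return None

# Dictionary: the same pairs, indexed by hashing the key.
table = dict(alist)

print(assoc("key2", alist), table["key2"])  # data2 data2
```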


(27)

Sets

Useful to shorten boolean expressions:
•If x in set …

Implemented as primitive only in Pascal
•Stored as bitstring in one word
•Implementation dependent limit on size
•Efficient intersection, union, equality

Some languages supply set operations applied to lists (Common Lisp, Prolog).

Java provides interface java.util.Set
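The bitstring representation is easy to imitate with an integer. A minimal sketch, assuming a fixed universe of seven day names with one bit per possible member; all function names here are made up. Union, intersection, and equality each become a single integer operation.

```python
UNIVERSE = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def to_bits(members):
    """Encode a collection of day names as one integer, one bit per day."""
    bits = 0
    for name in members:
        bits |= 1 << UNIVERSE.index(name)
    return bits

def from_bits(bits):
    """Decode the integer back into day names, in universe order."""
    return [name for pos, name in enumerate(UNIVERSE) if (bits >> pos) & 1]

def contains(name, bits):
    """Membership test: is the bit for this name set?"""
    return bool(bits & (1 << UNIVERSE.index(name)))

weekend = to_bits(["Sat", "Sun"])
late_week = to_bits(["Fri", "Sat"])

union_bits = weekend | late_week          # bitwise OR
intersection_bits = weekend & late_week   # bitwise AND

print(from_bits(union_bits))         # ['Fri', 'Sat', 'Sun']
print(from_bits(intersection_bits))  # ['Sat']
```

The size limit mentioned above corresponds to the number of bits in a machine word; Python integers are unbounded, so the sketch does not show that restriction.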


(28)

Record types

A heterogeneous aggregate of typed data elements with elements identified by name.

Operations:
•assignment
•equality
•assign corresponding fields.

Implementation:
•Simple and efficient, because field name references are literals bound at compile-time.
•Use offsets to determine address.
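Offset-based field access can be observed from Python through the `ctypes` library, which lays out C-style records. A sketch only: the `Employee` layout is hypothetical, and the exact offsets depend on the platform's alignment rules.

```python
import ctypes

class Employee(ctypes.Structure):
    # Hypothetical record layout, for illustration only.
    _fields_ = [
        ("name", ctypes.c_char * 20),
        ("age", ctypes.c_int32),
        ("salary", ctypes.c_double),
    ]

e = Employee(b"Ada", 36, 1234.5)
base = ctypes.addressof(e)

# Each field has a fixed offset, decided when the record type is defined.
offsets = {name: getattr(Employee, name).offset for name, _ in Employee._fields_}
print(offsets)

# address(field) = address(record) + offset(field)
age_via_offset = ctypes.c_int32.from_address(base + offsets["age"]).value
salary_via_offset = ctypes.c_double.from_address(base + offsets["salary"]).value
print(age_via_offset, salary_via_offset)
```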


(29)

Record types

Examples:
•COBOL Records:
  -NAME OF EMPLOYEE
  -MOVE CORRESPONDING EMPLOYEE TO REPORT
•Pascal Records:
  -employee.name
  -with employee do … name := …
•Ada also has records, uses dot notation
•Common Lisp “Structures”:
  -(employee-name …)
•C also has structures, uses dot notation
•Java: use Classes instead


(30)

Union types

Allow different types of values to be stored at different times during execution. Often used in records (e.g., Pascal record variants)

Example:
•Table of symbols and values
•Each value may be int, real, or string.
•Which would you prefer?
  -[ symbol | int_value | real_value | string_value ] (one field per possible type)
  -[ symbol | value (max) ] (one field sized for the largest variant)

Implementation:
•Allocate for largest variant
•Discriminated unions include tag field to indicate type
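The symbol-table example can be sketched with `ctypes`, which provides C-style unions. The field names follow the slide; the `Entry` record, the tag constants, and `read` are assumptions added to show a discriminated union, not part of the original example.

```python
import ctypes

class Value(ctypes.Union):
    # All three variants share the same storage.
    _fields_ = [
        ("int_value", ctypes.c_int32),
        ("real_value", ctypes.c_double),
        ("string_value", ctypes.c_char * 16),
    ]

INT, REAL, STRING = 0, 1, 2  # hypothetical tag values

class Entry(ctypes.Structure):
    _fields_ = [
        ("symbol", ctypes.c_char * 8),
        ("tag", ctypes.c_int),  # the discriminant: which variant is live
        ("value", Value),
    ]

def read(entry):
    """Check the tag at run time before interpreting the shared bytes."""
    if entry.tag == INT:
        return entry.value.int_value
    if entry.tag == REAL:
        return entry.value.real_value
    return entry.value.string_value.decode()

pi = Entry(symbol=b"pi", tag=REAL)
pi.value.real_value = 2.5

name = Entry(symbol=b"name", tag=STRING)
name.value.string_value = b"hello"

# Storage is allocated for the largest variant only.
union_size = ctypes.sizeof(Value)
print(union_size, read(pi), read(name))
```

Without the tag, nothing stops a program from writing `real_value` and then reading `int_value`, which is the loss of type checking discussed on the next slide.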


(31)

Union Type Evaluation

Advantages:
•Union types provide storage efficiency.
•Get around an overly restrictive type system
•Pointer arithmetic in a language that does not support it directly (access pointer as if int)

Disadvantages:
•Are more difficult to type check.
•May require run-time type checking.
•May lead to lack of any type checking.

Unnecessary in object-oriented languages like Java (why?) and functional languages (like ML) that support polymorphism and compile-time type checking.


(32)

Pointer Types

Pointer variables’ values are memory addresses or one distinguished value (nil).

Pointers provide two capabilities:
•Support indirect addressing.
•Enable dynamic memory management.

Note: heap dynamic variables have no name and must be referenced by pointer variables.

[Figure: foo_ptr holds the address FF03; the heap object “The String” is stored at address FF03.]

Will give example of binary trees in FORTRAN and pointers


(33)

Fundamental Pointer Operations

Assignment:
•Sets pointer variable to address of an object.
  -Direct addressing: assignment done implicitly during variable initialization.
  -Indirect addressing: requires an operator that takes an object and returns its address. (ptr = &object in C)

(Reference:)
•Occurrence of pointer variable refers to the pointer variable itself, just as with other variables (ptr).

Dereference:
•Occurrence of pointer variable refers to the object whose address is the value of the pointer variable. (*ptr in C)


(34)

Pointer Examples

Let’s diagram this C:
•int *ptr;
•int i, j;
•i = 3;
•ptr = &i;
•*ptr = 4;
•j = *ptr; // compare to j = ptr

Pointer Arithmetic in C
•double a[10];
•double *ptr;
•int index = 3;
•ptr = a; // assigns address(a[0])
•ptr = ptr + index; // advances by index elements (index * sizeof(double) bytes), so ptr now points to a[3]

Pointers to Records:
•(*ptr).name is same as ptr->name in C
•ptr^.name in Pascal
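The scaling in pointer arithmetic can be reproduced from Python with `ctypes`, which exposes raw addresses. A sketch under the assumption that the array holds ten doubles as in the C fragment; the fill values are arbitrary.

```python
import ctypes

# double a[10], filled with 0.0, 1.5, 3.0, ...
a = (ctypes.c_double * 10)(*[i * 1.5 for i in range(10)])
index = 3

base = ctypes.addressof(a)             # address(a[0])
step = ctypes.sizeof(ctypes.c_double)  # bytes per element

# What the C compiler does for ptr + index: scale by the element size.
addr = base + index * step

# Dereference: read a double from the computed address.
value = ctypes.c_double.from_address(addr).value
print(step, value, a[index])
```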


(35)

Pointer Problems

Type checking:

•If a pointer is allowed to point to more than one type of object, then static type checking is no longer possible (as in C, Lisp).


(36)

Dealing with Type Checking

Solution:
•Force all pointers to be typed (in terms of the object to which they are dereferenced)
•Example: FORTRAN90

Limits a prime use of pointers:
•Polymorphism (void * in C)


(37)

Pointer Problems (cont.)

Dangling Pointers:
•When a pointer points to an object, but the object has been deallocated.

Can occur when:
•The object goes out of scope but the pointer does not. A contrived example …
  -ptr1 = &ptr2
  -call foo(ptr1), in which *ptr1 = &localObject
  -after return, try *ptr2
•The object is explicitly deallocated
  -ptr1 = new Object(); ptr2 = ptr1; destroy(ptr1); *ptr2 …


(38)

Dealing with Dangling Pointers

Four strategies:
•Disallow (in language) explicit deallocation.
•Ignore (in compiler) explicit deallocation.
  -Then pointers will never point to nothing (but space will never be reclaimed).
•Allow deallocation, reset other pointers.
  -Incurs run-time overhead.
  -Tombstones
  -Locks and keys
•Allow deallocation and trust the programmer.
  -Efficient but allows dangling pointers.
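The tombstone idea can be sketched in a few lines. In this scheme, every heap object is reached through an extra cell (the tombstone), pointers refer to the tombstone rather than to the object, and deallocation clears the tombstone so that every alias sees the change. The class and function names below are hypothetical, and Python objects stand in for raw memory.

```python
class Tombstone:
    """The extra cell that sits between pointers and the heap object."""
    def __init__(self, target):
        self.target = target

class Pointer:
    def __init__(self, tombstone):
        self.tombstone = tombstone

    def deref(self):
        # The run-time overhead: one extra indirection and a check per access.
        if self.tombstone.target is None:
            raise RuntimeError("dangling pointer")
        return self.tombstone.target

def allocate(obj):
    return Pointer(Tombstone(obj))

def destroy(ptr):
    # Clearing the tombstone invalidates every pointer that shares it.
    ptr.tombstone.target = None

ptr1 = allocate({"name": "object"})
ptr2 = Pointer(ptr1.tombstone)  # ptr2 = ptr1
destroy(ptr1)

try:
    ptr2.deref()
    outcome = "no error"
except RuntimeError as err:
    outcome = str(err)
print(outcome)  # dangling pointer
```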


(39)

Pointer Problems (cont.)

Lost objects (garbage):
•When all pointers to a dynamic variable are removed, so that the variable’s value can no longer be referenced but the space is still allocated.
  -ptr1 = new Object(); … ptr1 = new Object();
•Common when beginners think that every declaration needs a value.
•Results in “memory leaks” (memory fills up)


(40)

Dealing with Lost Objects

The lost object problem can be solved if the language implements automatic storage management. (Java and Lisp)

Two approaches:

Reference counting (“eager” approach):
•Each object maintains a counter of how many pointers reference it; when the counter is decremented to zero, the object is deallocated.
•Reference counting incurs significant overhead on each pointer assignment, but the overhead is distributed throughout the session.
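A toy model of reference counting, to make the per-assignment bookkeeping concrete. Everything here is a simplifying assumption: objects are just names, the `Heap` class and its methods are hypothetical, and pointers stored inside objects are ignored.

```python
class Heap:
    def __init__(self):
        self.counts = {}  # object name -> number of pointers to it
        self.freed = []   # objects deallocated so far, in order

    def new(self, name):
        self.counts[name] = 0
        return name

    def assign(self, variables, var, target):
        """variables[var] = target, with the bookkeeping done on every assignment."""
        old = variables.get(var)
        if target is not None:
            self.counts[target] += 1   # count the new reference first
        variables[var] = target
        if old is not None:
            self.counts[old] -= 1      # then release the old one
            if self.counts[old] == 0:  # eager: reclaim immediately
                del self.counts[old]
                self.freed.append(old)

heap, variables = Heap(), {}
heap.assign(variables, "ptr1", heap.new("A"))
heap.assign(variables, "ptr2", variables["ptr1"])  # A now has two references
heap.assign(variables, "ptr1", heap.new("B"))      # A still reachable via ptr2
freed_early = list(heap.freed)
heap.assign(variables, "ptr2", None)               # last reference to A removed
print(freed_early, heap.freed)  # [] ['A']
```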


(41)

Dealing with Lost Objects (cont.)

Garbage collection (“lazy” approach):
•Wait until all storage is allocated, then collect the garbage
•Mark and Sweep GC:
  -Mark all objects in heap as garbage.
  -Follow all pointers through heap and reset mark on all objects encountered.
  -Deallocate all remaining marked objects.

Problems with Mark and Sweep GC:
•Causes the system to “halt” during GC.
•Most time-consuming when you really need it.

“Ephemeral” GC overcomes these problems.
•Runs before you need it
•Generations according to object age (so only part of memory is searched)
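The three mark-and-sweep steps can be sketched over a toy heap. The assumptions: the heap is a dictionary mapping each object name to the names it points to, `roots` lists the objects referenced directly by program variables, and `mark_and_sweep` is a hypothetical function name.

```python
def mark_and_sweep(heap, roots):
    """Remove unreachable objects from heap; return their names, sorted."""
    # Step 1: mark every object in the heap as garbage.
    garbage = set(heap)

    # Step 2: follow pointers from the roots, resetting the mark on
    # every object encountered.
    stack = list(roots)
    while stack:
        name = stack.pop()
        if name in garbage:  # not visited yet
            garbage.discard(name)
            stack.extend(heap[name])

    # Step 3: deallocate everything still marked.
    for name in garbage:
        del heap[name]
    return sorted(garbage)

heap = {
    "a": ["b"], "b": ["a"],  # reachable from the root
    "c": ["d"], "d": ["c"],  # an unreachable cycle
    "e": [],                 # an unreachable single object
}
collected = mark_and_sweep(heap, roots=["a"])
print(collected, sorted(heap))  # ['c', 'd', 'e'] ['a', 'b']
```

The unreachable cycle between c and d is collected here, whereas simple reference counts for those two objects would never drop to zero.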


(42)

Pointer Commentary

“Their introduction into high-level languages has been a step backward from which we may never recover.” (C. A. R. Hoare, 1973)

“Pointers are thought by many to be essential in imperative languages.” (R. Sebesta, 1996)

“Java has no pointer data type.” (P. Johnson, 1999)

“… it remains to be seen …” (R. Sebesta, 2002)

Java References
•Assignment to (heap dynamic) objects (class instances)
•No dereferencing, so no dangling pointers
•Runtime system manages memory, so no lost objects
•No pointer arithmetic: meaningless


(43)

End of module 06