Files in Python

The Basics

Why use Files?

• Very small amounts of data – just hardcode them into the program

• A few pieces of data – ask the user to input them

• More than this, you need an external file stored on secondary storage

External data files

• Handles large amounts of data

• Data is independent of the program, so the program can change without changing the data

• Easier to edit data in an editor than during a run of the program (can’t go back!)

• Use the same data as input to different programs

• Output files can be saved for later use

• Output of one program can be used as input to another

Text files versus Binary files

• Text files created by editors, stored as ASCII codes

• Binary files stored as raw binary numbers, have to be handled differently

• Text files are manipulated sequentially only

• Binary files can be manipulated sequentially or randomly (we will not do binary files in this class)

Creating a text data file

• This is done just like creating any other text file

• You can use Notepad

• You can use the editor of the IDE you use to write Python

• You can use a word processor like Word if you are careful to save as plain text

• Store the text file in the same folder as your source code

Delimiters

• The \n (newline) character, produced by the Enter or Return key, is a very important symbol in text files.

• It delimits what Python calls a ‘line’ in the file.

• It gets put into the file whenever you press Enter at the end of a line

• A blank line is represented by two newlines together: \n\n

• It matters whether you press Enter at the end of the last line of the file – some methods in Python will treat the last line differently because of the \n character
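As a rough illustration of the points above, the sketch below writes a small text file and then looks at the newline characters in it. The file name fruit.txt and its contents are made up for this example, and a temporary folder is used only so the sketch runs anywhere.

```python
import os
import tempfile

# Hypothetical contents: "apple", a blank line, "banana", "cherry".
# Enter was NOT pressed after the last line, so there is no final \n.
text = "apple\n\nbanana\ncherry"

folder = tempfile.mkdtemp()                 # scratch folder for the example
path = os.path.join(folder, "fruit.txt")

outfile = open(path, "w")
outfile.write(text)
outfile.close()

infile = open(path, "r")
contents = infile.read()
infile.close()

# repr() shows every \n explicitly instead of starting a new line on screen
print(repr(contents))

# Three newlines: one after "apple", one for the blank line, one after "banana"
newline_count = contents.count("\n")
print(newline_count)
```

Note that the blank line shows up as two \n characters side by side, and that the last line has no \n because Enter was not pressed there.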

Buffers

Why a buffer?

• Computer equipment runs at different speeds; the hard drive, and secondary storage in general, is MUCH slower than RAM and the CPU

• This creates a bottleneck where the faster pieces have to wait for the slower ones to deliver the data or service that is needed

• Buffers ease this bottleneck! They let the OS bring a bigger chunk of data into RAM and hold it until the program asks for it, which means fewer trips to the hard drive

What’s a buffer?

• A buffer is an area of RAM allocated by the OS to manage these bottlenecks

• Every file you open in your program will have a buffer

• There are also buffers for the keyboard, screen, network connections, modems, printers, etc.

• You the programmer do not have to worry about this happening, it’s automatic!

Buffer for input file

• When you read from a file, the buffer associated with the file is checked first – if there’s still data in the buffer, that’s what your program gets

• If the buffer is exhausted, then the OS is told to get some more data from the file on the HD and put it in the buffer

• This process continues until the program ends or until the file has no more data to read

• Think of a pantry in a house – it’s a buffer between the people in the house and the supermarket

Buffer for output file

• You write in your program to an output file

• The data does NOT go directly, immediately to the hard drive, but to an output buffer

• The OS monitors this buffer – when it is full, it is all written to the hard drive at one time

• Think of a garbage can in a house – it is a buffer to hold trash until it can all be taken to the landfill at one time

Why do I care about buffers?

• As you can see, most buffer activity is automatic from the programmer’s point of view

• BUT! if you forget to close your file when you are finished with it, the file can be left in an “unfinished” state!

• Some operating systems are bad at cleaning up when your program ends – they should close all files automatically, but sometimes they don’t!

Why do I care?

• A file in an “unfinished” state may be one of those files you run across after an application has crashed. If you try to erase it, the OS says “no, that file is still busy”, even though it’s not.

• Especially for output files, your file on the hard drive may not get that last buffer of data that you thought your program wrote to the file if you forget to close the file! The file will be missing data or possibly missing altogether if the file was small.
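The effect of the output buffer can be sketched as follows. The file name log.txt is hypothetical, and the size seen before the close depends on the system and the buffer size, so it is only printed, not relied on. What the sketch does show is that after close() the data is in the file.

```python
import os
import tempfile

folder = tempfile.mkdtemp()                 # scratch folder for the example
path = os.path.join(folder, "log.txt")      # hypothetical output file

outfile = open(path, "w")
outfile.write("last line of results\n")

# The data may still be sitting in the output buffer at this point,
# so the file on disk is often still 0 bytes long (this can vary).
size_before_close = os.path.getsize(path)
print("before close:", size_before_close)

outfile.close()                             # empties the buffer into the file

size_after_close = os.path.getsize(path)
print("after close:", size_after_close)
```

If the program stopped abnormally before the close, that last buffer of data might never reach the hard drive.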

[Figure sequence: before the open happens; after the open; after one readline(); after two more readline() calls]

Don’t forget!

• Don’t forget to close your files!

– The close statement must look like infile.close()

– No arguments in the parentheses, but they must be there!

Opening and Closing

Big Picture

• To use a file in a programming language

– You have to open the file

– Then you process the data in the file

– Then you close the file when you are done with it

• This is true for input files or output files

Opening a file

• To use a file, you first have to open it

• In Python the syntax is

    infile = open("xyz.txt", "r")       # for input (read)

or

    outfile = open("mydata.txt", "w")   # for output (write)

• It creates a link between that variable name in the program and the file known to the OS

Processing in general

• Processing is a general term

• In Python there are at least 4 ways to read from an input file

• And two ways to write to an output file

• They all use loops in one way or another

• See other talks for details

Closing a file

• When you are finished with the file (usually when you reach the end of the data, if it is input), you close the file

• In Python the syntax is

    infile.close()

• Works for input or output files

• Note: no arguments, but you MUST have the () !! Otherwise the function is not actually called!
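A small sketch of why the parentheses matter. It uses the closed attribute of a file object to check whether the file is really closed. The file name mydata.txt and the temporary folder are assumptions made only so the example can run on its own.

```python
import os
import tempfile

folder = tempfile.mkdtemp()                 # scratch folder for the example
path = os.path.join(folder, "mydata.txt")

# Create a small file so there is something to open for input
outfile = open(path, "w")
outfile.write("some data\n")
outfile.close()

infile = open(path, "r")

infile.close                    # no parentheses: names the method, does NOT call it
still_open = not infile.closed
print(still_open)               # the file is still open

infile.close()                  # with parentheses: the file is actually closed
now_closed = infile.closed
print(now_closed)
```

Python does not report an error for the line without parentheses, which is what makes this mistake easy to miss.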

Input techniques

Input from a file

• The type of data you will get from a file is always a string or a list of strings.

• There are two ways of reading that I call “bulk reads” because with one statement they totally exhaust the file. There is no more to read after that!

• The other two ways read a line at a time from the file

• Files are objects so most of these will be methods called with the dot notation as usual

read()

• The read method is called like this: datastr = infile.read()

• What does it do? It reads the entire file of data into one string variable

• The newlines and other whitespace in the file are stored in the string like every other character

• Be aware if you are reading a LARGE file, this may take some time and a lot of RAM!

• This is convenient if you do not care particularly where the newlines are in the file

• BULK
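A minimal sketch of read(), assuming a made-up file scores.txt that holds a few whole numbers. Because the newlines do not matter here, the whole string is split on whitespace.

```python
import os
import tempfile

folder = tempfile.mkdtemp()                 # scratch folder for the example
path = os.path.join(folder, "scores.txt")   # hypothetical data file

outfile = open(path, "w")
outfile.write("10 20\n30\n")
outfile.close()

infile = open(path, "r")
datastr = infile.read()         # the ENTIRE file as one string, newlines included
infile.close()

print(repr(datastr))

# split() with no argument splits on any whitespace: spaces and newlines alike
numbers = datastr.split()

total = 0
for item in numbers:
    total = total + int(item)   # items are strings, so convert before adding
print(total)
```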

readlines()

• The syntax: datalst = infile.readlines()

• This method reads in ALL the data from the file and uses the \n as a delimiter to break the data into strings in a list

• There is nothing more to read in the file after you execute one readlines call.

• This is convenient if you know the data in the file is organized by lines, i.e. each line needs to be processed by itself

• BULK
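A minimal sketch of readlines(), assuming a made-up file names.txt with one name per line. It also calls readlines() a second time to show that nothing is left to read.

```python
import os
import tempfile

folder = tempfile.mkdtemp()                 # scratch folder for the example
path = os.path.join(folder, "names.txt")    # hypothetical data file

outfile = open(path, "w")
outfile.write("Ann\nBob\nCy\n")
outfile.close()

infile = open(path, "r")
datalst = infile.readlines()    # every line, each still ending in \n
leftover = infile.readlines()   # the file is exhausted, so this list is empty
infile.close()

print(datalst)
print(leftover)

# Process each line by itself, removing the \n from the end first
names = []
for line in datalst:
    names.append(line.rstrip("\n"))
print(names)
```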

readline()

• Note that this is a different method from readlines – note the s!

• Syntax: datastr = infile.readline()

• Semantics: it reads in the next line of data from the file, up to the next newline

• Returns a string which has the data and a \n character at the end

• Useful when you don’t want to read in ALL the data at one time, or when you have more data than RAM space to hold it

• Usually used inside a while loop

• Indicates the end of the data in the file by returning an empty string. Note that this is different from having an empty or blank line in the file – that is returned as "\n"
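The usual while-loop pattern with readline() might look like the sketch below. The file name notes.txt and its contents (which include one blank line) are invented for the example.

```python
import os
import tempfile

folder = tempfile.mkdtemp()                 # scratch folder for the example
path = os.path.join(folder, "notes.txt")    # hypothetical data file

outfile = open(path, "w")
outfile.write("one\n\nthree\n")             # the middle line is blank
outfile.close()

infile = open(path, "r")
line_count = 0
blank_count = 0

line = infile.readline()        # read the first line before the loop starts
while line != "":               # an empty string means end of file
    line_count = line_count + 1
    if line == "\n":            # a blank line is "\n", not ""
        blank_count = blank_count + 1
    line = infile.readline()    # read the next line at the bottom of the loop
infile.close()

print(line_count, blank_count)
```

The blank line does not stop the loop, because it comes back as "\n" rather than as an empty string.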

Caution about readlines vs. read and split

You would think that

    lines = infile.readlines()

and

    line = infile.read()
    lines = line.split("\n")

would give the same result in the variable lines, that is, a list of strings from the file, delimited by the newline characters.

You would be surprised!

• readlines() gives you a list of strings, each with a \n at the end

• Except! if you did not press Enter on the last line of the data file, the last string in the list will not have a \n in it

• read() followed by split("\n") gives a list of strings, yes, but none of them will have \n in them (remember split removes the delimiters from its results)

And another surprise!

• If you did press Enter on the last line of the data file, readlines still works properly. The last string in the list will have a \n character just like all the others

• BUT the same file read with the read/split combination will have one extra entry, an empty string at the end of the list

• This is something you need to be aware of while processing your data – many programs crash because they assume that every string will be the same length, for example.
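Both surprises can be seen side by side in this sketch, which assumes a tiny made-up file data.txt where Enter WAS pressed after the last line. The final lines use the string method splitlines(), which is not mentioned in the slides and is shown only as one possible way to avoid the extra empty string.

```python
import os
import tempfile

folder = tempfile.mkdtemp()                 # scratch folder for the example
path = os.path.join(folder, "data.txt")     # hypothetical data file

outfile = open(path, "w")
outfile.write("a\nb\n")                     # Enter pressed after the last line
outfile.close()

infile = open(path, "r")
lines_from_readlines = infile.readlines()
infile.close()

infile = open(path, "r")
text = infile.read()
infile.close()
lines_from_split = text.split("\n")

print(lines_from_readlines)     # ['a\n', 'b\n']
print(lines_from_split)         # ['a', 'b', '']  <- extra empty string at the end

# Not from the slides: splitlines() drops the \n and adds no extra entry
lines_from_splitlines = text.splitlines()
print(lines_from_splitlines)    # ['a', 'b']
```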

Output techniques

Outputting to a file

• There are two ways to do this in Python

– print (more familiar, more flexible)

– write (more restrictive)

Using print to output to a file

• You add one argument to the print function call. At the end of the argument list, put file= followed by the name of the file object you have opened for output

• Example: print("hi", a, c*23, end="", file=outfile)

• You can use anything in this print that you would in printing to the screen: end=, sep=, escaped characters, etc.

• end= and sep= keep their defaults (a newline and a single space) unless you give different values, so every print ends with a newline by default

• Note it says file=outfile, NOT file="abc.txt"
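A short sketch of print with file=, reading the file back afterwards to see exactly what was written. The variables a and c and the file name out.txt are placeholders chosen for the example.

```python
import os
import tempfile

folder = tempfile.mkdtemp()                 # scratch folder for the example
path = os.path.join(folder, "out.txt")      # hypothetical output file

a = 5
c = 2

outfile = open(path, "w")
print("total", a, file=outfile)                 # default sep=" " and end="\n"
print("hi", a, c * 23, end="", file=outfile)    # end="" means no newline here
print("!", file=outfile)                        # so this lands on the same line
outfile.close()

infile = open(path, "r")
contents = infile.read()
infile.close()

print(repr(contents))           # 'total 5\nhi 5 46!\n'
```

The numbers did not have to be converted to strings, and the spaces between items came from the default sep=.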

Using write to output to a file

• write is a method, similar to the Text object in the graphics package

• It is called by the output file object (dot notation)

• It is allowed ONE and only one STRING argument, so you have to convert numbers to strings and concatenate strings together to make one argument

• Example: outfile.write("hi" + str(ct) + "\n")

• Does NOT output a newline automatically; if you want one, you have to put one in the string
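The restrictions on write can be sketched like this. The counter ct and the file name out.txt are placeholders. The try/except around the bad call is not part of the slides; it is there only so the sketch keeps running after the error.

```python
import os
import tempfile

folder = tempfile.mkdtemp()                 # scratch folder for the example
path = os.path.join(folder, "out.txt")      # hypothetical output file

ct = 3

outfile = open(path, "w")
outfile.write("hi" + str(ct) + "\n")        # one string, newline added by hand
outfile.write("no newline here")            # write adds nothing at the end
outfile.write(" so this continues the same line\n")

# Passing a number instead of a string is an error
try:
    outfile.write(ct)
    got_type_error = False
except TypeError:
    got_type_error = True

outfile.close()

infile = open(path, "r")
contents = infile.read()
infile.close()

print(repr(contents))
print(got_type_error)
```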

When does it crash?

How a file can make a program crash

• For input files, there are several things that can happen which can cause a program to crash

• Some are avoidable with some care, some are not

– the file does not exist that you are trying to open

– trying to read past the end of the file

– the data in the file is not laid out as the program expects

– the file exists but is empty
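Two of these situations are sketched below: a file that does not exist and a file that exists but is empty. The file names are made up, and the try/except statements are an assumption on my part (the slides do not cover them); they are used here only to show which error Python raises without stopping the sketch.

```python
import os
import tempfile

folder = tempfile.mkdtemp()                 # scratch folder for the example

# Case 1: the file does not exist, so open() raises FileNotFoundError
missing_path = os.path.join(folder, "no_such_file.txt")
try:
    infile = open(missing_path, "r")
    infile.close()
    opened = True
except FileNotFoundError:
    opened = False
    print("Could not find", missing_path)

# Case 2: the file exists but is empty
empty_path = os.path.join(folder, "empty.txt")
outfile = open(empty_path, "w")
outfile.close()                             # nothing written: a 0-byte file

infile = open(empty_path, "r")
first_line = infile.readline()              # "" because there is no data at all
infile.close()

try:
    value = int(first_line)                 # int("") raises ValueError
except ValueError:
    value = None
    print("The file had no number on its first line")
```

In the second case the read itself succeeds; the trouble starts when the program treats the empty string as if it were real data.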

Output files

• An output file is constructive and destructive

– If the file you are opening to write to does NOT exist, it is created

    • Note that if you gave the path to the folder as part of the file name, the open will NOT create folders!

    • In other words, outfile = open("c:\\My Documents\\cs115\\file1.txt", "w") will only work if the path already exists and you have permission to write to it

– If the file you are opening to write to DOES exist already, all data is destroyed

    • The open tells the OS to set the length of the file to zero bytes!

• If you try to write to a medium that is full, your program will crash
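Both the destructive side of "w" and the missing-folder case can be checked with a sketch like the following. The folder and file names are hypothetical, a temporary folder stands in for a real path, and the try/except is an assumption used only to keep the sketch running.

```python
import os
import tempfile

folder = tempfile.mkdtemp()                 # scratch folder for the example
path = os.path.join(folder, "file1.txt")    # hypothetical output file

# Put some data in the file
outfile = open(path, "w")
outfile.write("old data\n")
outfile.close()

# Opening the SAME file with "w" again destroys what was there
outfile = open(path, "w")
outfile.close()
size_after_reopen = os.path.getsize(path)
print(size_after_reopen)                    # 0 bytes

# open() creates files, but it will NOT create the folder "cs115"
bad_path = os.path.join(folder, "cs115", "file1.txt")
try:
    outfile = open(bad_path, "w")
    outfile.close()
    open_worked = True
except FileNotFoundError:
    open_worked = False
print(open_worked)
```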