Temas de Instrumentación Electrónica
CURSO 2002
This document tries to give the reader basic knowledge in compiling C
and C++ programs on a Unix system. If you've no knowledge as to how to
compile C programs under Unix (for instance, you did that until now on
other operating systems), you'd better read this tutorial first, and then
write a few programs before you try to get to gdb, makefiles or C libraries.
If you're already familiar with that, it's recommended to learn about makefiles,
and then go and learn other C programming topics and practice the usage of
makefiles, before going on to read about C libraries. This last issue is
only relevant to larger projects, while makefiles make sense even for a small
program composed of but a few source files.
As a policy, we'll stick with the basic features of programming tools mentioned here, so that the information will apply to more than a single tool version. This way, you might find the information here useful, even if the system you're using does not have the GNU tools installed.
In this lovely tutorial, we'll deal with compilation of a C program, using the compiler directly from the command line. It might be that you'll eventually use a more sophisticated interface (an IDE - Integrated Development Environment) of some sort, but the common denominator you'll always find is the plain command line interface. Furthermore, even if you use an IDE, knowing the command line could help you understand how things work "behind the scenes". We'll see how to compile a program, how to combine several source files into a single program, how to add debug information and how to optimize code.
The easiest case of compilation is when you have all your source code set in
a single file. This removes any unnecessary steps of synchronizing several files
or thinking too much. Let's assume there is a file named 'single_main.c' that
we want to compile. We will do so using a command line similar to this:
cc single_main.c
Note that we assume the compiler is called "cc". If you're using a GNU compiler,
you'll write 'gcc' instead. If you're using a Solaris system, you might use
'acc', and so on. Every compiler might show its messages (errors, warnings,
etc.) differently, but in all cases, you'll get a file 'a.out' as a result,
if the compilation completed successfully. Note that some older systems
(e.g. SunOS) come with a C compiler that does not understand ANSI-C, but rather
the older 'K&R' C style. In such a case, you'll need to use gcc (hopefully
it is installed), or learn the differences between ANSI-C and K&R C (not
recommended if you don't really have to), or move to a different
system.
You might complain that 'a.out' is too generic a name (where does it come from
anyway? - well, that's a historical name, due to the usage of something
called "a.out format" for programs compiled on older Unix systems). Suppose
that you want the resulting program to be called "single_main". In that case,
you could use the following line to compile it:
cc single_main.c -o single_main
Every compiler I've met so far (including the glorious gcc) recognized the '-o'
flag as "name the resulting executable file 'single_main'".
Once we have created the program, we wish to run it. This is usually done by simply
typing its name, as in:
single_main
However, this requires that the current directory be in our PATH (which is
a variable telling our Unix shell where to look for programs we're trying
to run). In many cases, this directory is not placed in our PATH. Aha! - we say.
Then let's show this computer who is smarter, and thus we try:
./single_main
This time we explicitly told our Unix shell that we want to run the program
from the current directory. If we're lucky enough, this will suffice. However,
yet one more obstacle could block our path - file permission flags.
When a file is created in the system, it is immediately given some access permission flags. These flags tell the system who should be given access to the file, and what kind of access will be given to them. Traditional Unix systems use 3 kinds of entities to which they grant (or deny) access: the user who owns the file, the group which owns the file, and everybody else. Each of these entities may be given access to read the file ('r'), write to the file ('w') and execute the file ('x').
Now, when the compiler created the program file for us, we became owners of
the file. Normally, the compiler would make sure that we get all permissions
to the file - read, write and execute. It might be, though, that something
went wrong, and the permissions are set differently. In that case, we can
set the permissions of the file properly (the owner of a file can normally
change the permission flags of the file), using a command like this:
chmod u+rwx single_main
This means "the user ('u') should be given ('+') permissions read ('r'),
write ('w') and execute ('x') to the file 'single_main'". Now we'll surely be
able to run our program. Again, normally you'll have no problem running the
file, but if you copy it to a different directory, or transfer it to a
different computer over the network, it might lose its original permissions,
and thus you'll need to set them properly, as shown above. Note too that you
cannot just move the file to a different computer and expect it to run - it has
to be a computer with a matching operating system (to understand the executable
file format), and matching CPU architecture (to understand the machine-language
code that the executable file contains).
Finally, the run-time environment has to match. For example, if we compiled the program on an operating system with one version of the standard C library, and we try to run it on a system with an incompatible standard C library, the program might crash, or complain that it cannot find the relevant C library. This is especially true for systems that evolve quickly (e.g. Linux with libc5 vs. Linux with libc6), so beware.
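Returning to the permission flags described earlier: the following is a minimal sketch, in C, of what a command like 'chmod u+rwx single_main' asks the system to do. It assumes a POSIX system. The helper names grant_owner_rwx and owner_can_execute are made up for this illustration and are not part of any library.

```c
#include <stdio.h>      /* perror */
#include <sys/stat.h>   /* stat, chmod, S_IRWXU, S_IXUSR (POSIX) */

/* Hypothetical helper: adds read, write and execute permission for the
 * owner, like "chmod u+rwx path". Returns 0 on success, -1 on failure. */
int grant_owner_rwx(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0) {
        perror(path);               /* e.g. the file does not exist */
        return -1;
    }
    /* Keep the existing permission bits and turn on the owner's three.
     * S_IRWXU is the 'r', 'w' and 'x' bits of the owning user. */
    if (chmod(path, (st.st_mode & 07777) | S_IRWXU) != 0) {
        perror(path);               /* e.g. we do not own the file */
        return -1;
    }
    return 0;
}

/* Hypothetical helper: 1 if the owner's 'x' flag is set, 0 if it is
 * not, -1 if the file cannot be examined. */
int owner_can_execute(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    return (st.st_mode & S_IXUSR) ? 1 : 0;
}
```

The group and "everybody else" flags are left untouched here, just as 'u+rwx' leaves them untouched on the command line.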
Normally, when we write a program, we want to be able to debug it - that is,
test it using a debugger that allows running it step by step, setting
a break point before a given command is executed, looking at contents
of variables during program execution, and so on. In order for the debugger
to be able to relate between the executable program and the original source
code, we need to tell the compiler to insert information into the resulting
executable program that'll help the debugger. This information is called
"debug information". In order to add that to our program, let's compile it
differently:
cc -g single_main.c -o single_main
The '-g' flag tells the compiler to include debug info, and is recognized by
almost any compiler out there. You will note that the resulting file is much
larger than that created without usage of the '-g' flag. The difference in size
is due to the debug information. We may still remove this debug information
using the strip command, like this:
strip single_main
You'll note that the size of the file now is even smaller than if we didn't use
the '-g' flag in the first place. This is because even a program compiled
without the '-g' flag contains some symbol information (function names,
for instance), that the strip command removes. You may want to read strip's
manual page (man strip) to understand more about what this command does.
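For something to try a debug build on, here is a small sketch of a function worth stepping through. The function name average is invented for this example. The point is only that, with '-g', a debugger can show the source lines and the variables by name.

```c
/* Returns the integer average of the first n values, or 0 if n is not
 * positive. When stepping through this in a debugger, 'i' and 'total'
 * can be inspected by name on every pass through the loop, and the
 * debugger can show which source line is about to run. Both rely on
 * the information that '-g' adds to the executable. */
int average(const int *values, int n)
{
    int total = 0;
    int i;

    if (n <= 0)
        return 0;
    for (i = 0; i < n; i++)
        total += values[i];
    return total / n;           /* integer division: fraction dropped */
}
```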
After we created a program and debugged it properly, we normally want it
to compile into efficient code, and the resulting file to be as small
as possible. The compiler can help us by optimizing the code, either
for speed (to run faster), or for space (to occupy a smaller space), or
some combination of the two. The basic way to create an optimized
program would be like this:
cc -O single_main.c -o single_main
The '-O' flag tells the compiler to optimize the code. This also means
the compilation will take longer, as the compiler tries to apply
various optimization algorithms to the code. This optimization is supposed
to be conservative, in that it ensures us the code will still perform the
same functionality as it did when compiled without optimization (well,
unless there are bugs in our compiler). Usually, we can define an optimization
level by adding a number to the '-O' flag. The higher the number - the
better optimized the resulting program will be, and the slower the compiler
will complete the compilation. One should note that because optimization
alters the code in various ways, as we increase the optimization level
of the code, the chances are higher that an improper optimization will
actually alter the behavior of our code, as some optimizations tend to be
non-conservative, or are simply rather complex, and contain bugs. For example,
for a long time it was known that using an optimization level higher than 2
(or was it higher than 3?) with gcc resulted in bugs in the executable program.
After being warned, if we still want to use a different optimization level
(let's say 4), we can do it this way:
cc -O4 single_main.c -o single_main
And we're done with it. If you read your compiler's manual page, you'll
soon notice that it supports an almost infinite number of command line options
dealing with optimization. Using them properly requires thorough understanding
of compilation theory and source code optimization theory, or you might damage
your resulting code. A good compilation theory course (preferably based on
"the Dragon Book" by Aho, Sethi and Ullman) could do you good.
Normally the compiler only generates error messages about erroneous code
that does not comply with the C standard, and warnings about things that
usually tend to cause errors during runtime. However, we can usually instruct
the compiler to give us even more warnings, which is useful to improve the
quality of our source code, and to expose bugs that will really bug us later.
With gcc, this is done using the '-W' flag. For example, to get the compiler
to enable the large set of common warnings that gcc groups under 'all' (which,
despite the name, is not every warning it knows), we'll use a command line
like this:
cc -Wall single_source.c -o single_source
At first this will annoy us - we'll get all sorts of warnings that might
seem irrelevant. However, it is better to eliminate the warnings than
to eliminate the usage of this flag. Usually, this option will save us
more time than it will cause us to waste, and if used consistently, we will
get used to coding proper code without thinking too much about it. One should
also note that some code that works on some architecture with one compiler,
might break if we use a different compiler, or a different system, to compile
the code on. When developing on the first system, we'll never see these bugs,
but when moving the code to a different platform, the bug will suddenly appear.
Also, in many cases we eventually will want to move the code to a new
system, even if we had no such intentions initially.
Note that sometimes '-Wall' will give you too many warnings, and then you could try to use some less verbose warning level. Read the compiler's manual to learn about the various '-W' options, and use those that would give you the greatest benefit. Initially they might sound too strange to make any sense, but if you are (or when you will become) a more experienced programmer, you will learn which could be of good use to you.
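As an illustration, the sketch below shows two small functions written the way '-Wall' nudges us to write them. The comments describe the slip that gcc would have warned about. The function names are invented for this example, and the exact warning text differs between compilers and versions.

```c
#include <stdio.h>   /* snprintf (C99), size_t */

/* Returns 1 if value is zero, 0 otherwise.
 * Had we typed "if (value = 0)" by mistake, the code would still
 * compile, but it would assign instead of compare and always return 0.
 * With '-Wall', gcc warns about an assignment used as a condition. */
int is_zero(int value)
{
    if (value == 0)
        return 1;
    return 0;
}

/* Writes "<count> items" into buf (at most size bytes, including the
 * terminating '\0') and returns the length that snprintf reports.
 * 'count' is a long, so the format must be "%ld". With "%d" the call
 * would still compile; the format check that '-Wall' turns on in gcc
 * is what reports the mismatch. */
int format_count(char *buf, size_t size, long count)
{
    return snprintf(buf, size, "%ld items", count);
}
```

Both slips are legal C, which is why they are reported as warnings rather than errors.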
Now that we saw how to compile C programs, the transition to C++ programs is
rather simple. All we need to do is use a C++ compiler, in place of the C
compiler we used so far. So, if our program source is in a file named
'single_main.cc' ('cc' to denote C++ code. Some programmers prefer a suffix
of 'C' for C++ code), we will use a command such as the following:
g++ single_main.cc -o single_main
Or on some systems you'll use "CC" instead of "g++" (for example, with
Sun's compiler for Solaris), or "aCC" (HP's compiler), and so on. You will
note that with C++ compilers there is less uniformity regarding command
line options, partially because until recently the language was evolving and
had no agreed standard. But still, at least with g++, you will use "-g" for
debug information in the code, and "-O" for optimization.
So you have learned how to compile a single-source program properly (hopefully by now you have played a little with the compiler and tried out a few examples of your own). Yet, sooner or later you'll see that having all the source in a single file is rather limiting.
There are two possible ways to compile a multi-source C program. The first
is to use a single command line to compile all the files. Suppose that we
have a program whose source is found in files "main.c", "a.c" and "b.c"
(found in directory "multi-source" of this tutorial). We could compile it
this way:
cc main.c a.c b.c -o hello_world
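This will cause the compiler to compile each of the given files separately, and
then link them all together to one executable file named "hello_world". A
comment about this program: a function or variable that is defined in one file
and used in another has to be declared in the file that uses it, which is done
with the C "extern" keyword.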
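The role of "extern" can be sketched as follows. The listing is a single file so that it can be compiled on its own, but the comments mark which part would sit in which source file. The contents shown are hypothetical, not the actual "main.c" and "a.c" of this tutorial, and the names call_count, func_a and run_twice are invented.

```c
/* ---- this part would live in main.c (hypothetical contents) ---- */

/* Declarations only: they tell the compiler that these symbols exist
 * somewhere and what their types are. The linker finds the actual
 * definitions later, when the object files are put together. */
extern int call_count;      /* variable defined in the "a.c" part */
extern int func_a(void);    /* for functions, 'extern' is the default */

/* Calls func_a twice and reports how many calls were counted. */
int run_twice(void)
{
    func_a();
    func_a();
    return call_count;
}

/* ---- this part would live in a.c (hypothetical contents) ---- */

int call_count = 0;         /* the one and only definition */

int func_a(void)
{
    call_count++;
    return call_count;
}
```

Without the two "extern" lines, the "main.c" part would not compile when placed in a file of its own, because the compiler sees each source file separately.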
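The problem with this single command is that every source file is re-compiled
each time, even if only one of them changed. In order to overcome this
limitation, we could divide the compilation process into two phases -
compiling, and linking. Let's first see how this is done, and then explain: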
cc -c main.c
cc -c a.c
cc -c b.c
cc main.o a.o b.o -o hello_world
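Each of the first three commands compiles one source file into an "object file"
with the same name and a ".o" suffix; the "-c" flag tells the compiler to stop
there and not link. The last command links the three object files into the
executable. To see why this complexity actually helps us, we should note that
normally the link phase is much faster than the compilation phase. This is
especially true when doing optimizations, since that step is done before
linking. Now, let's assume we change the source file "a.c", and we want to
re-compile the program. We'll now need only two commands: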
cc -c a.c
cc main.o a.o b.o -o hello_world
Now that we've learned that compilation is not just a simple process, let's look at the complete list of steps taken by the compiler in order to compile a C program.
Driver - what we invoked as "cc". This is actually
the "engine", that drives the whole set of tools the compiler is made of.
We invoke it, and it begins to invoke the other tools one by one, passing
the output of each tool as an input to the next tool.
C Pre-Processor - normally called "cpp". It takes
a C source file, and handles all the pre-processor definitions (#include
files, #define macros, conditional source code inclusion with #ifdef, etc.)
You can invoke it separately on your program, usually with a command like:
cc -E single_source.c
The C Compiler - normally called "cc1". This is the
actual compiler, that translates the input file into assembly language.
As you saw, we used the "-c" flag to invoke it, along with the C
Pre-Processor, (and possibly the optimizer too, read on), and the
assembler.
Optimizer - sometimes comes as a separate module
and sometimes is found inside the compiler module. This one
handles the optimization on a representation of the code that is
language-neutral. This way, you can use the same optimizer for compilers
of different programming languages.
Assembler - sometimes called "as". This takes
the assembly code generated by the compiler, and translates it into
machine language code kept in object files. With gcc, you could tell the
driver to generate only the assembly code, by a command like:
cc -S single_source.c
Linker-Loader - This is the tool that takes all
the object files (and C libraries), and links them together, to form
one executable file, in a format the operating system supports.
A common format these days is known as "ELF". On SunOS systems,
and other older systems, a format named "a.out" was used. This format
defines the internal structure of the executable file - location of
data segment, location of code segment, location of debug
information and so on.
As you see, the compilation is split into many different phases. Not all compilers employ exactly the same phases, and sometimes (e.g. for C++ compilers) the situation is even more complex. But the basic idea is quite similar - split the compiler into many different parts, to give the programmer more flexibility, and to allow the compiler developers to re-use as many modules as possible in different compilers for different languages (by replacing the preprocessor and compiler modules), or for different architectures (by replacing the assembler and linker-loader parts).
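The work of the first stage, the C Pre-Processor, can be illustrated with a short sketch. The macro names BUFFER_SIZE, SQUARE and VERBOSE and the function buffer_area are invented for this example.

```c
#include <stdio.h>   /* replaced by the contents of the stdio.h header */

#define BUFFER_SIZE 16             /* every BUFFER_SIZE below becomes 16 */
#define SQUARE(x) ((x) * (x))      /* a macro that takes a parameter */

int buffer_area(void)
{
#ifdef VERBOSE
    /* This line reaches the compiler proper only if VERBOSE was
     * defined, either in the source or on the command line. */
    printf("computing %d * %d\n", BUFFER_SIZE, BUFFER_SIZE);
#endif
    /* After pre-processing this reads: return ((16) * (16)); */
    return SQUARE(BUFFER_SIZE);
}
```

The later stages never see the names BUFFER_SIZE or SQUARE, only the text they expand to. Most compilers also accept a '-D' flag to define a macro from the command line, for example 'cc -DVERBOSE -c' followed by the file name, which would keep the printf line in.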
Taken from: http://users.actcom.co.il/~choo/lupg/tutorials