Pointer Assignment Segmentation Fault Ubuntu

I'm really stumped on an issue I'm having with a project for class. I'm a third year CS student and part of our graduation requirements entail us doing a fairly in depth development project. Mine and a partner's task is to implement an updated version of the Accelerated GCD algorithm in the GMP MP Library. It should be a really rewarding project and our goal is to ascertain whether or not this updated algorithm will perform better under RSA sized inputs (since the Extended GCD algorithm is commonly used in RSA encryption and decryption).

Anyways, I'm having an issue trying to return a value from a function in our gcd.c file.

Here is what I want to do:
I have a method called find_a that returns a mp_limb_t value (which is equivalent in precision to a standard int type). What I need to do though is modify that function to not only return that value, but also a correlating value so I plan to do that through an out-mode parameter in the function parameter list. I am using the following syntax:

mp_limb_t find_a (mp_ptr n1p, ....) where mp_ptr is defined as mp_limb_t * in the gmp.h header file.

Inside of the function body I am trying to dereference the pointer and assign it a limb value so I am saying:

*n1p = n1_l;

When I run this program I get a segmentation fault and I really don't know what the problem could be.

I know I should be able to do this because I wrote a small dummy program to test what I wanted to do and it worked fine. Here is the code:

#include <stdio.h>

typedef int * int_ptr;

int ptrFunc(int_ptr r, int n) {
*r = n * 2;
return n - 10;
}

int main(void) {
int x, z;
int_ptr y;

x = 45;

printf("Before ptrFunc: x=%d, y=%d, z=%d\n", x, *y, z);
z = ptrFunc(y, x);
printf("After ptrFunc: x=%d, y=%d, z=%d\n", x, *y, z);
}

And my output is:
Before ptrFunc: x=45, y=0, z=0
After ptrFunc: x=45, y=90, z=35
23905:
23905: calling fini: /lib/tls/i686/cmov/libc.so.6 [0]
23905:

Now that last part is confusing me too, and perhaps that is where my error is coming from. I don't know what the problem is but I have stared at it all day and maybe you guys can help me out. Two eyes are always better than one.

As if you can't tell, C is not my native language. I started programming in Java and I am one of the people who believes object-orientated style is the most natural way to program. C is nothing more than glorified assembly language to me but it's a lot of fun to learn and experiment in an unfamiliar language. Thanks for any help anyone can provide.

Debugging

Code debugging is the process of detecting and eliminating the logical errors in the code. The debugger can be used in combination with valgrind like utility to detect and eliminate the source of segmentation faults. Each program (process) has access to certain portion of memory (user space or program space). During run time if the process tries to access the memory outside its allocated space, the segmentation fault occurs. Segmentation faults are referred to as segfault, access violation or bus error. Hardware notifies the operating system about memory access violation. The OS kernel then sends the segmentation fault signal (SIGSEGV in Linux) to the process which caused the exception. The process then dump core and terminates. The core dump refers to the recorded state of the working memory of a computer program at a specific time when the program crashed. The state contains processor registers which may include the program counter and stack pointer, memory management information, and other processor and operating system flags and information. Core dumps are often used to assist in diagnosing and debugging errors in computer programs. Before proceeding to the Section B, the overview in Section A and Appendix A to D can be helpful.

For parallel Debugging, look for TotalView which has been installed in HPCC.

Quick Guide

Compile with a ‘g’(debug) flag:

Find the source file hello.c in Appendix A

(Note: debug mode is necessary to get the line number)

Debugging with Valgrind utlity:

A. Process Flow for Execution of a Code

Case Study:  Printing “Hello World” on the console – Linux Environment, x86_64 Architecture






Type the following in a bash shell:

The linker creates an executable in a container format understood by the target Linux System – the Executable and Linking Format (ELF). The bash process makes a system call (Fig. 1) and the mode is changed from user to system (kernel). The fork system call creates a new task cloning the current process. The new process (child) gets unique PID and has the PID of the old parent process (PPID). The exec () system call replaces the entire current process with a new program i.e. bash replaced by a.out as showed in Fig. 1.

Fig. 1: Typical fork()/exec() operations where the bash shell is used to execute a.out [8]

At the end of the exec() system call, there is a new process entry of type task_struct in a process table, a circular doubly linked list showed in Fig. 2, waiting for scheduler to run; the wait time depending on the start timestamp in task_struct assigned by the kernel scheduling algorithm.

Fig. 2. Circular linked list representation of processes; init task’s process descriptor (pointer to task_struct) is statically allocated.

Each process is assigned its own virtual memory as showed in Fig. 3. The major portion is allocated for user space and rest will be for kernel space. At any given period, process will not be using all of the code, data, and libraries. So, demand paging is used to map the portion of the virtual memory into physical memory when a process attempts to use. The process’s page table is altered marking virtual areas as existing but not in memory. The pager only loads those pages that it expects the process to use right away into the physical memory; the other data in the address space are not available in the main memory. If the program attempts to access data not currently located in the main memory, there will be a page fault and operating system resolves it allowing the program to continue operation.

Fig.3: Process Virtual Memory space [9]

After the hardware timer interrupts when the current process (a.out) is assigned a CPU slice, the instruction cycle starts. The Program Control Block (PCB) data structure, consisting of the program counter and all of the CPU’s registers as well as the process stacks containing temporary data, is loaded into CPU registers along with Program Counter (PC). In the fetch state, the data pointed by the address in the program counter is fetched to the current instruction register (CIR).  Initially, CPU fetch unit looks for data in the cache.  If it is the first time, there will be no data in the cache and after the cache miss, the data is obtained from the system memory (DRAM).  While loading the instruction in the instruction register, a circuit called memory cache controller loads a small block of data below the current position (address) that the CPU has just loaded (usually program flow is sequential) and the next position the CPU request will probably be the address immediately below the recent one. So, the next required data will probably reside in that block in the cache and more chances of cache hit.

Upon fetching the instruction, the program counter is incremented by one "address value" (to the location of the next instruction). Now, the instruction (op-code) in the CIR is decoded and the required data (operands) as interpreted are fetched to the data registers. The control unit passes decoded instruction to different parts of CPU (e.g. ALU) for execution. The results are stored in the main memory or sent to I/O devices.

If the peripheral components are involved to transfer their I/O data to and from main memory, DMA (Direct Memory Access) mechanism allows it without involving the CPU. The process is put to sleep during the transfer of data to DMA buffer and interrupt is used to awaken the process when it is time to read the data.

B. Debugging Segmentation Faults

The example code (test.cpp; Appendix E ) has been compiled with GNU compiler and the segmentation fault has been detected using the valgrind utility. The valgrind can be used for intel compilers as well. Valgrind is a memory mismanagement detector which shows memory leaks, deallocation errors etc. The caveat is Valgrind can consume twice as much memory and takes longer to run the code. It can not detect the buffer overflow unless it causes stack overflow or corrupt the return address of the function.

Compile with a ‘g’(debug) flag: g++ -g –o debug test.cpp

(Note: debug mode is necessary to get the line number)

Debugging with Valgrind utlity: valgrind --tool=memcheck --leak-check=yes -v --leak-check=full --show-reachable=yes ./debug

Stack Overflow

The variables, arrays, pointers, and arguments of a code reside in the General Purpose Registers and stack assigned for a process (program or user space). The stack has a fixed size. For example, if the array is too big or the recursive function is consuming lots of stack space, the stack overflow occurs. The crossing of the boundary of the stack limit generates signal SIGSEGV causing segfault. As a solution, you may increase the size of the stack, however, dynamic allocation is the better way to resolve the issue.

Though Java or Java like languages throw “array out of bound” exception, C expects user to take care of it. It saves the space for the user that has been asked but does not check if the user is going out of bound. Array in the stack can overflow producing undefined values as long as there is not stack overflow or overwriting of the return address of the function leading to arbitrary code execution. There is a possibility of security threat as the return address can be replaced by the return address of another malicious program.

In debug.cpp, when the size of the array for b is huge (b[100000] = 10) at line 16, the following message is displayed:

If you use smaller size of array, let’s say (b[100] = 10), there will be no segfault and you will get the output as 10. So, C/C++ does not check the array bound. Here, the left string prefixes ==27229== represents each line of Valgrind-output and the number 277229 is the PID of the processor.

Invalid Read 

You are trying to access the value after it is freed. Here, in the first iteration, the node is freed "free(node)", however, it is tried to be accessed through "node->next" as showed:






Also, you may have run out of the mapped region of memory address and hence the data can no longer be accessible. One scenario is showed below. The code snippet is changed from:








into:








The valgrind output looks similar to:









Not Enough Memory Allocated in the heap

In the debug.cpp code, the memory is allocated for two integers (a = new int[2]) but its 100th integer location has been assigned a value (a[100]=4) for which no memory is allocated. The error is in line 15 of the code.

Memory Leak

The memory leaks occur if the heap memory is not properly freed. In the test.cpp code, pointer ‘c’ is set to point to pointer ‘a’. So, “delete []c” cannot free the memory allocated (2* 4bytes) of memory, and hence the message below:

Again, the command delete []c is again trying to free the allocated memory which has already been freed using delete [] a, and hence the Invalid free() as showed below:

Uninitialized Values

If you try to print the value the variable contains which has not been previously assigned, the message looks like this:

In line 20 (printf("%d %d\n", a[0],b[5]), the value for b[5] which has never been assigned before, is attempted to be printed out. Again in line 23 (if (b[5]<2) b[4] = 5), the decision is based on the unassigned value, so you will be getting the following message:

Appendices

Appendix A

hello.c

Appendix B

Appendix C:

Memory Management

Fig. Memory Hierarchy of a Modern Computer System [10]

CPU Registers:

Only few registers are available on the processor. For example, Intel Chips processor have 6 general purpose registers and specialized registers including a base register, stack register, flag register, program counter, and addressing registers. Same speed as the rest of the CPU. They store the address of the currently executed instructions as well as data.

Fig. A1. The x86_64 General Purpose Register; in x86, prefix r is replaced by e i.e. %rax => %eax; the 7th argument and onwards are passed on to the stack.















Fig. A2: Status of registers and stack frame for the C function myfunc; rbp pointing to the base of the frame whereas rsp pointing to the top of the frame; each register is 8bytes (64bits) long; stack growing from higher address to lower address [2].

Cache:

The cache works on two principles: temporal locality and spatial locality. The former is the idea that if you recently used a certain chunk of data, you'll probably need it again soon. The latter means that if you recently used the data at address X, you'll probably soon need address X+1. It means that you'll always want your own code to exploit these two forms of locality as much as possible without jumping all over memory. Working on one small area, and then move on to the next, improve the performance.

The cache is there to reduce the number of times the CPU would stall waiting for a memory request to be fulfilled (avoiding the memory latency), and as a second effect, possibly to reduce the overall amount of data that needs to be transferred (preserving memory bandwidth). Whenever data is to be read from main memory, the system hardware first checks for that data in the cache to speed up access. Different areas of RAMs are cached at different times through mapping as cache is much smaller space than main memory. When writing data from the CPU, the data is first written to cache before being written on main memory. Different levels of Cache – L1 located directly on the CPU chip, L2 is the part of CPU module, and L3 is the part of the system motherboard. The cache is made out of SRAM, a solid state device. It is smaller but faster but requires more transistors per bit. SRAM does not need to be refreshed as the transistors (flip flops) inside would continue to hold the data as long as the power supply is not cut off. L1 cache is integrated in a CPU chip as is faster because of shorter data/address path.

Performance Improvement [14]:

·Use smaller data types

·Organize your data to avoid alignment holes (sorting your struct members by decreasing size is one way)

·Beware of the standard dynamic memory allocator, which may introduce holes and spread your data around in memory as it warms up.

·Make sure all adjacent data is actually used in the hot loops. Otherwise, consider breaking up data structures into hot and cold components, so that the hot loops use hot data.

·Avoid algorithms and data structures that exhibit irregular access patterns, and favor linear data structures.

Modern CPUs often have one or more hardware prefetchers. They train on the misses in a cache and try to spot regularities. For instance, after a few misses to subsequent cache lines, the hardware prefetcher will start fetching cache lines into the cache, anticipating the application's needs. If you have a regular access pattern, the hardware prefetcher is usually doing a very good job. And if your program doesn't display regular access patterns, you may improve things by adding prefetch instructions yourself.

Regrouping instructions in such a way that those that always miss in the cache occur close to each other, the CPU can sometimes overlap these fetches so that the application only sustain one latency hit (Memory level parallelism). To reduce the overall memory bus pressure, you have to start addressing what is called temporal locality. This means that you have to reuse data while it still hasn't been evicted from the cache. Merging loops that touch the same data (loop fusion), and employing rewriting techniques known as tiling or blocking all strive to avoid those extra memory fetches. While there are some rules of thumb for this rewrite exercise, you typically have to carefully consider loop carried data dependencies, to ensure that you don't affect the semantics of the program.

ROM:

ROM is a non-volatile memory where the BIOS resides i.e. ROM stores the initial program.  BIOS contains all the codes required to control input/output devices (keyboard, disk drives, communication) and miscellaneous functions. It makes it possible for the computer to boot; BIOS is copied from ROM to main memory known as shadowing as the main memory is faster than ROM.

Main Memory:

The data and programs that are being used are stored in main memory. It is a solid state device which is made of DRAM chips that have power connection, data connection, read/write signal connection, and address connection. Unlike SRAM, DRAM using capacitors requires data to be refreshed periodically in order to retain the data. The additional circuitry and timing for refreshing data makes DRAM memory slower than SRAM.

Virtual Memory:

There is never enough main memory so Virtual memory is the concept of combining main memory with the slower storage (hard drive) giving the system the appearance of having much more main memory than is actually available. The machine code for the application consumes some bytes on top of additional bytes for data storage and I/O buffers called application’s address space. If the address space is more than the main memory, the application would not have run if there were no virtual memory.

The memory management hardware called Memory Management Unit (MMU) divides main memory into pages – contiguous sections of memory of a set size known as paging. The actual physical layout is controlled by process’s page table. When a program is executed, process (with unique PID) is assigned for that program by the Kernel/OS.  The executable image file consists of both executable code and data along with the information necessary to load the executable code and associated data into the virtual memory of the process. During the execution, processor can allocate memory to use which needs to be linked into processor’s existing virtual memory. The shared libraries are also linked into the process’s virtual address space along with other processes’ virtual space. The great whole in the middle of the address space may never be used (e.g. arrays are often oversized and certain portions of the code are rarely used) unless the stack/heap grows to fill the gap.

At any given period, process will not be using all of the code, data, and libraries. So, demand paging is used to map the portion of the virtual memory into physical memory when a process attempts to use. The process’s page table is altered marking virtual areas as existing but not in memory. The pager only loads those pages that it expects the process to need right away into the physical memory; the other data in the address space are not available in the main memory. If the program attempts to access data not currently located in the main memory, there will be a page fault and operating system resolves it allowing the program to continue operation. This new page will be included in a working set, a group of main memory pages currently dedicated to the specific process. So, the working set grows with more page faults but shrinks when the pages are turned into free pages writing them into swapping space of mass storage device. The condition of excessive swapping is known as thrashing and indicates insufficient main memory for the present workload.

Each process has its own virtual memory (about 4gb) which maps to the physical memory through page tables. The virtual memory will split into major portion (about 3gb) for user space and small portion (about 1gb) for the use of kernel space. Kernel space is mapped to the starting locations of physical memory where the kernel image is loaded at the boot time. Besides the program instructions and data, the process also includes the program counter and all of the CPU’s registers as well as the process stacks containing temporary data (routine parameters, return addresses, saved variables).  This current context data is stored as process control block (PCB) data structure in Kernel space. When the process resumes (switching context), the program counter from PCB is loaded and execution continues. Each process, when created, is allocated with a new task_struct  data structure in system (main) memory and is added into the task vector, an array of pointers pointing to every task_struct. The size of the task vector determines the maximum number of processes.

Hard Drives:

Hard drive is non-volatile in nature i.e. data remains in it even after the power is removed. So, the programs and data for longer-use are stored in it. But the program has to be read into the main memory from hard drive to be executed. The hard drive is electro mechanical in nature and consists of phases – access arm movement, disk rotation, head reading/writing, and data transfer.

Off-line Backup Storage (tape, optical Discs):

This storage is usually for archiving data. The access time will be in seconds and capacity around Tera Byte to Peta Byte.

Operating System and Kernels

The kernel is the core of the operating system, a logic or program which has full access to all memory and hardware and acts as an interface between the user application and the hardware. User space program can not directly access system resources but Kernel handle the access on the program’s behalf through system calls (e.g. printing data on the console; printf in C) allowing only well trusted code to run in the kernel mode/space. System call checks the arguments in the user program, builds the data structure to copy the arguments to the kernel, and executes the special instruction called software interrupt. The interrupt hardware of the CPU then saves the state of the user’s program and switches the privilege level to the kernel mode. The kernel is responsible for memory management, process management, device management, interrupt handling, I/O communication, file systems, and scheduling process. The Kernel build along with the user friendly applications and utilities constitute the operating system.

The flow control during a boot is from BIOS, to boot loader, to kernel. The first stage is loaded and executed by the BIOS from the Master Boot Record (MBR) or another boot loader. The second stage of boot loader, when loaded and executed, displays GRUB startup menu allowing user to choose the operating system or examine/edit startup parameters. The boot loader loads the operating system (presented with boot options), sets up system functions (hardware, memory paging), and starts the kernel (start_kernel). The kernel then starts the program init (/sbin/init), the father of all processes, responsible mostly for running startup scripts for run level (/etc/rc.d/rc#.d with S (boot) or K(shutdown) prefix running in numerical order) presenting the user with a user space (login screen). Init creates processes from a script stored in the file /etc/inittab and then goes dormant waiting for three events to happen – processes end or die, power failure signal, or a request via /sbin/telinit to change the runlevel. The new processes may go on creating new processes (child process). For an example, the getty process creates a login process when a user attempts to login. All of the processes in the system are descended from the init kernel thread.

Fig. C1: Process States [7]

Direct Memory Access (DMA):

DMA allows certain peripheral devices or hardware subsystem within the computer to access the main memory without the need to involve the system processor eliminating computational (CPU) overhead. During time consuming I/O operations, CPU can perform other operations while the transfer is in the process with the process in sleep mode. DMA issues an interrupt when the operation is done to awaken the process. Hardware systems like disk drive controller, graphics cards, networks cards, and sound cards use DMA. DMA is also being used for intra-chip data transfer in multi-core processors and memory-to-memory copy.

Cached Memory in Linux:

Linux system makes use of various caches between the filesystems abstracted through Virtual Files System (VFS) and user level processes.  VFS provides a uniform interface for the kernel for various I/O requests; the most important service is providing a uniform I/O data cache. The four caches of I/O data that Linux maintains are page cache, i-node cache, buffer cache, and directory cache as showed in Fig C2.


Fig. C2: Linux Kernel IO Cache [15]

Inode , an index node, is a data structure in Linux, which stores all the information (metadata information such as file ownership, access mode, file type etc) about a file system object (file, device node, socket, pipe, etc) but the data content and filename. Directory Entry (dentry): /usr/src/kernels/2.6.18-348.3.1.el5-x86_64/include/linux/dcache.h, manages the hierarchical nature of the file system with a root dentry and its child entries.

Page Cache:

Page cache accelerates the file access by storing the data in unused areas of system memory called cached memory during first file read from hard drives or write to them. When the data is read again later, it can be read quickly from the cache. Page cache combines virtual memory and file data.

Example:

Execution time for link list (ll) command (time ll) for the first time is shorter than the subsequent ones as showed below:






Also, the amount of cache increases after the first ll as showed below (Cached Memory in Ganglia):

output:           


output:  


If the data is written to a file, it is first written to the page cache stored as dirty pages which is periodically transferred or transferred through system call such as sync or fsync.

Inode Cache:

Iode cache speeds up the access of the mounted file system. When the mounted file systems are navigated, the VFS inodes are continually read and written (in some cases).

Directory Cache:

When the directories are looked up by the real file system, their details are added to the directory cache to speed up the access the next time. Since only the short directory entries are cached (up to 15 characters, short directory names are helpful).

Buffer Cache:

Device file or the block device is an interface for the device driver (computer program that operates or controls a particular device attached to the computer such as flash drive) that appears in a file system (/dev) as if it were an ordinary file. In case of a flash drive, if there is a need to access the data, all block data read and write requests are given to the flash drivers in the form of data structures via standard kernel system calls. The device identifier uniquely identifies the device and the block number tells the driver which block to read. To speed up access, Linux maintains the buffer cache to save the information.

Appendix D

Object File

Objdump command in Linux is used to provide thorough information on object files. It can be a very handy tool for normal programmers during debugging.

Command: $ objdump -x -d -S a.out

Output:

Here,

Size: size of the loaded section

 VMA: Virtual Memory Address

LMA: Logical Memory Address

off: offset from the beginning of the file

Appendix E:

test.cpp


References:

One thought on “Pointer Assignment Segmentation Fault Ubuntu

Leave a Reply

Your email address will not be published. Required fields are marked *